Files
obitools4/autodoc/docmd/pkg_obitools_obiconvert.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

3.2 KiB

obiconvert: Semantic Overview of Public Functionalities

The obiconvert package provides a robust, CLI-driven framework for converting and managing biological sequence data within the OBITools4 ecosystem. It enables format-agnostic input parsing, standardized output generation (FASTA/FASTQ/JSON), and configurable preprocessing—while preserving metadata semantics.

Input Handling

  • ExpandListOfFiles(check_ext bool, filenames ...string) []string
    Expands file paths into a deduplicated list of eligible files. Supports local directories, symlinks (resolved), and remote URLs (http(s)://, ftp://).
    When check_ext=true, filters files by extension: .fasta[.gz], .fastq[.fq][.gz], .seq[.gz], .gb[| gbff | dat ][.gz], and .ecopcr[.gz].

  • CLIReadBioSequences(filenames ...string) obiiter.IBioSequence
    Returns a lazy, streaming iterator over biological sequences from files or stdin. Automatically selects parsing strategy based on CLI flags:

    • JSON-style (--input-json-header)
    • OBI-compliant headers (--input-OBI-header, --input-obi)
    • Heuristic auto-detection (default).

    Configurable via CLI options:

    • Parallel workers (nworkers ≥ 2)
    • Batch size and memory limits
    • U→T conversion for RNA (--u-to-t)
    • Skip empty sequences (--skip-empty)

    Handles:

    • Single/multiple files (with batched parallel reading)
    • Paired-end input via --paired-with
    • Fallback readers: FASTA, FASTQ, GenBank/EMBL, ecoPCR output, CSV
  • OpenSequenceDataErrorMessage(args ...string, err error)
    Formats and logs user-friendly errors for input failures (stdin-only / single-file / multi-file), then exits with status 1.

Output Handling

  • CLIWriteBioSequences(iter obiiter.IBioSequence, filenames ...string)
    Writes sequences from an IBioSequence iterator to stdout or files, based on CLI options:

    • Format: FASTQ (if quality scores present), FASTA, JSON (default), or generic sequence.
    • Header style: Configured via CLIOutputFastHeaderFormat()"json" or "obi".
    • Compression: Optional gzip (--gzip).
    • Paired-end output: Automatically splits into _R1, _R2 files via BuildPairedFileNames.
    • Parallelism: Uses configurable workers (WriteParallelWorkers()).
  • BuildPairedFileNames(filename string) (string, string)
    Generates paired-end filenames: sample.fastq → sample_R1.fastq, sample_R2.fastq.

Configuration & Integration

  • OptionSet(allow_paired bool)
    Centralized CLI option setter. Enables modular setup for paired-end support and shared flags.

  • Taxonomy Integration:
    Supports loading taxonomy via obioptions.LoadTaxonomyOptionSet.

  • Progress Reporting:
    Displays a progress bar unless stderr is redirected or stdout pipes to another process.

Design Principles

Lazy evaluation via iterators for memory efficiency
Automatic format inference and parallel I/O scaling
Symlink resolution, recursive globbing with extension filtering
CLI-integrated configuration (header parsing mode, workers, batch size)

All functionality is exposed through public functions and designed for composability with obiformats, obiiter, and obidefault.