# `obiconvert`: Semantic Overview of Public Functionalities

The `obiconvert` package provides the CLI-driven layer for reading, converting, and writing biological sequence data within the OBITools4 ecosystem. It supports format-agnostic input parsing, standardized output generation (FASTA/FASTQ/JSON), and configurable preprocessing, while preserving sequence metadata.

## Input Handling

- **`ExpandListOfFiles(check_ext bool, filenames ...string) []string`**
  Expands file paths into a deduplicated list of eligible files. Supports local directories, symlinks (resolved), and remote URLs (`http(s)://`, `ftp://`).
  When `check_ext=true`, filters files by extension: `.fasta[.gz]`, `.fastq[.gz]`, `.fq[.gz]`, `.seq[.gz]`, `.gb[.gz]`, `.gbff[.gz]`, `.dat[.gz]`, and `.ecopcr[.gz]`.

- **`CLIReadBioSequences(filenames ...string) obiiter.IBioSequence`**
  Returns a lazy, streaming iterator over biological sequences from files or stdin. Automatically selects parsing strategy based on CLI flags:
  - JSON-style (`--input-json-header`)
  - OBI-compliant headers (`--input-OBI-header`, `--input-obi`)
  - Heuristic auto-detection (default).

  Configurable via CLI options:
  - Parallel workers (`nworkers ≥ 2`)
  - Batch size and memory limits
  - `U→T` conversion for RNA (`--u-to-t`)
  - Skip empty sequences (`--skip-empty`)

  Handles:
  - Single/multiple files (with batched parallel reading)
  - Paired-end input via `--paired-with`
  - Fallback readers: FASTA, FASTQ, GenBank/EMBL, ecoPCR output, CSV

- **`OpenSequenceDataErrorMessage(args []string, err error)`**
  Formats and logs user-friendly errors for input failures (stdin-only / single-file / multi-file), then exits with status `1`.
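
The extension filter described for `ExpandListOfFiles` can be sketched as follows. This is a minimal illustration, not the package's implementation: `hasSequenceExtension` and `sequenceExtensions` are hypothetical names, and matching is assumed to be case-sensitive with an optional trailing `.gz`.

```go
package main

import (
	"fmt"
	"strings"
)

// sequenceExtensions lists the base extensions named in this document.
// The variable itself is hypothetical, introduced for illustration only.
var sequenceExtensions = []string{
	".fasta", ".fastq", ".fq", ".seq", ".gb", ".gbff", ".dat", ".ecopcr",
}

// hasSequenceExtension reports whether path ends with one of the known
// extensions, optionally followed by ".gz". It is a hypothetical helper,
// not part of the obiconvert API.
func hasSequenceExtension(path string) bool {
	name := strings.TrimSuffix(path, ".gz")
	for _, ext := range sequenceExtensions {
		if strings.HasSuffix(name, ext) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"reads.fastq.gz", "refs.gbff", "notes.txt"} {
		fmt.Printf("%-16s eligible=%v\n", p, hasSequenceExtension(p))
	}
}
```

Stripping `.gz` first keeps the list of base extensions short, since each format then needs only one entry.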

## Output Handling

- **`CLIWriteBioSequences(iter obiiter.IBioSequence, filenames ...string)`**
  Writes sequences from an `IBioSequence` iterator to stdout or files, based on CLI options:
  - **Format**: FASTQ (if quality scores present), FASTA, JSON (default), or generic sequence.
  - **Header style**: Configured via `CLIOutputFastHeaderFormat()` → `"json"` or `"obi"`.
  - **Compression**: Optional gzip (`--gzip`).
  - **Paired-end output**: Automatically splits into `_R1`, `_R2` files via `BuildPairedFileNames`.
  - **Parallelism**: Uses configurable workers (`WriteParallelWorkers()`).

- **`BuildPairedFileNames(filename string) (string, string)`**
  Generates paired-end filenames: `sample.fastq → sample_R1.fastq`, `sample_R2.fastq`.
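
The naming rule above can be sketched in a few lines of Go. The function `pairedFileNames` is a hypothetical stand-in for `BuildPairedFileNames`. Splitting at the first dot of the base name, so that a compound suffix such as `.fastq.gz` is kept whole, is an assumption that this document does not confirm.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pairedFileNames inserts "_R1" and "_R2" before the file suffix.
// Hypothetical sketch: the suffix is assumed to start at the first dot
// of the base name (a leading dot, as in hidden files, is ignored).
func pairedFileNames(filename string) (string, string) {
	dir, name := filepath.Split(filename)
	stem, suffix := name, ""
	if i := strings.Index(name, "."); i > 0 {
		stem, suffix = name[:i], name[i:]
	}
	return dir + stem + "_R1" + suffix, dir + stem + "_R2" + suffix
}

func main() {
	forward, reverse := pairedFileNames("sample.fastq")
	fmt.Println(forward, reverse)
}
```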

## Configuration & Integration

- **`OptionSet(allow_paired bool)`**
  Registers the package's shared CLI options in one place. When `allow_paired` is true, the paired-end options (such as `--paired-with`) are registered as well.

- **Taxonomy Integration**:
  Supports loading taxonomy via `obioptions.LoadTaxonomyOptionSet`.

- **Progress Reporting**:
  Displays a progress bar unless stderr is redirected or stdout pipes to another process.

## Design Principles

- ✅ Lazy evaluation via iterators for memory efficiency
- ✅ Automatic format inference and parallel I/O scaling
- ✅ Symlink resolution, recursive globbing with extension filtering
- ✅ CLI-integrated configuration (header parsing mode, workers, batch size)
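
The lazy-evaluation principle can be illustrated with a channel-based pipeline. This is a generic Go sketch under stated assumptions, not the `obiiter.IBioSequence` implementation: `record`, `streamRecords`, and `uToT` are hypothetical names, and the `U→T` step mirrors the `--u-to-t` preprocessing option by replacing both upper- and lower-case bases.

```go
package main

import (
	"fmt"
	"strings"
)

// record is a hypothetical, minimal stand-in for a biological sequence.
type record struct {
	ID  string
	Seq string
}

// streamRecords emits records one at a time over an unbuffered channel,
// so a record is only produced when the consumer is ready for it.
func streamRecords(records []record) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for _, r := range records {
			out <- r
		}
	}()
	return out
}

// uToT is a pipeline stage that converts RNA bases to DNA (U to T),
// handling both upper and lower case.
func uToT(in <-chan record) <-chan record {
	replacer := strings.NewReplacer("U", "T", "u", "t")
	out := make(chan record)
	go func() {
		defer close(out)
		for r := range in {
			r.Seq = replacer.Replace(r.Seq)
			out <- r
		}
	}()
	return out
}

func main() {
	input := []record{{ID: "seq1", Seq: "ACGU"}, {ID: "seq2", Seq: "uuga"}}
	for r := range uToT(streamRecords(input)) {
		fmt.Println(r.ID, r.Seq)
	}
}
```

Because each stage holds at most one record at a time, memory use stays flat regardless of how many sequences flow through the pipeline.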

All functionality is exposed through public functions and designed for composability with `obiformats`, `obiiter`, and `obidefault`.