mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
This commit is contained in:
@@ -0,0 +1,62 @@
|
||||
# `obiconvert`: Semantic Overview of Public Functionalities
|
||||
|
||||
The `obiconvert` package provides a robust, CLI-driven framework for converting and managing biological sequence data within the OBITools4 ecosystem. It enables format-agnostic input parsing, standardized output generation (FASTA/FASTQ/JSON), and configurable preprocessing—while preserving metadata semantics.
|
||||
|
||||
## Input Handling
|
||||
|
||||
- **`ExpandListOfFiles(check_ext bool, filenames ...string) []string`**
|
||||
Expands file paths into a deduplicated list of eligible files. Supports local directories, symlinks (resolved), and remote URLs (`http(s)://`, `ftp://`).
|
||||
When `check_ext=true`, filters files by extension: `.fasta[.gz]`, `.fastq[.fq][.gz]`, `.seq[.gz]`, `.gb[| gbff | dat ][.gz]`, and `.ecopcr[.gz]`.
|
||||
|
||||
- **`CLIReadBioSequences(filenames ...string) obiiter.IBioSequence`**
|
||||
Returns a lazy, streaming iterator over biological sequences from files or stdin. Automatically selects parsing strategy based on CLI flags:
|
||||
- JSON-style (`--input-json-header`)
|
||||
- OBI-compliant headers (`--input-OBI-header`, `--input-obi`)
|
||||
- Heuristic auto-detection (default).
|
||||
|
||||
Configurable via CLI options:
|
||||
- Parallel workers (`nworkers ≥ 2`)
|
||||
- Batch size and memory limits
|
||||
- `U→T` conversion for RNA (`--u-to-t`)
|
||||
- Skip empty sequences (`--skip-empty`)
|
||||
|
||||
Handles:
|
||||
- Single/multiple files (with batched parallel reading)
|
||||
- Paired-end input via `--paired-with`
|
||||
- Fallback readers: FASTA, FASTQ, GenBank/EMBL, ecoPCR output, CSV
|
||||
|
||||
- **`OpenSequenceDataErrorMessage(args ...string, err error)`**
|
||||
Formats and logs user-friendly errors for input failures (stdin-only / single-file / multi-file), then exits with status `1`.
|
||||
|
||||
## Output Handling
|
||||
|
||||
- **`CLIWriteBioSequences(iter obiiter.IBioSequence, filenames ...string)`**
|
||||
Writes sequences from an `IBioSequence` iterator to stdout or files, based on CLI options:
|
||||
- **Format**: FASTQ (if quality scores present), FASTA, JSON (default), or generic sequence.
|
||||
- **Header style**: Configured via `CLIOutputFastHeaderFormat()` → `"json"` or `"obi"`.
|
||||
- **Compression**: Optional gzip (`--gzip`).
|
||||
- **Paired-end output**: Automatically splits into `_R1`, `_R2` files via `BuildPairedFileNames`.
|
||||
- **Parallelism**: Uses configurable workers (`WriteParallelWorkers()`).
|
||||
|
||||
- **`BuildPairedFileNames(filename string) (string, string)`**
|
||||
Generates paired-end filenames: `sample.fastq → sample_R1.fastq`, `sample_R2.fastq`.
|
||||
|
||||
## Configuration & Integration
|
||||
|
||||
- **`OptionSet(allow_paired bool)`**
|
||||
Centralized CLI option setter. Enables modular setup for paired-end support and shared flags.
|
||||
|
||||
- **Taxonomy Integration**:
|
||||
Supports loading taxonomy via `obioptions.LoadTaxonomyOptionSet`.
|
||||
|
||||
- **Progress Reporting**:
|
||||
Displays a progress bar unless stderr is redirected or stdout pipes to another process.
|
||||
|
||||
## Design Principles
|
||||
|
||||
✅ Lazy evaluation via iterators for memory efficiency
|
||||
✅ Automatic format inference and parallel I/O scaling
|
||||
✅ Symlink resolution, recursive globbing with extension filtering
|
||||
✅ CLI-integrated configuration (header parsing mode, workers, batch size)
|
||||
|
||||
All functionality is exposed through public functions and designed for composability with `obiformats`, `obiiter`, and `obidefault`.
|
||||
Reference in New Issue
Block a user