# `obiformats` Package: Semantic Overview

The **`obiformats`** package provides a unified, extensible framework for parsing and writing biological sequence data in standard bioinformatics formats (FASTA/FASTQ, EMBL, GenBank, CSV, EcoPCR), while supporting streaming, batching, parallelism, and format-agnostic workflows.

## Core Objectives

1. **Format-Agnostic Input**: Automatically detect and parse diverse sequence formats via MIME-type inference.
2. **Streaming & Scalability**: Enable memory-efficient ingestion of large NGS datasets through chunked, concurrent parsing.
3. **Structured Output**: Support flexible export to FASTA/FASTQ, JSON, CSV, Newick, and taxonomy-aware formats.
4. **Interoperability**: Integrate seamlessly with OBITools4 abstractions (`obiseq.BioSequence`, `obiiter.IBioSequence`, `obitax.Taxon`).
5. **Extensibility**: Allow new readers/writers to be plugged in via functional interfaces and options.

---

## Public Functionalities (Grouped by Domain)

### 📥 **Sequence Reading & Parsing**

| Function | Format(s) Supported |
|---------|---------------------|
| `ReadSequencesFromFile`, `ReadSequencesFromStdin` | Auto-detected (FASTA/FASTQ/EMBL/GenBank/EcoPCR/CSV) |
| `ReadFasta`, `ReadFastq` | FASTA, FASTQ (with rope/buffered variants) |
| `ReadEMBL`, `ReadGenbank` | EMBL, GenBank (rope-aware for large files) |
| `ReadCSV`, `ReadEcoPCR` | Tabular/amplicon outputs (e.g., EcoPCR v1/v2) |
| `LoadCSVTaxonomy`, `LoadNCBITaxDump` | Taxonomic data (CSV, NCBI dump dir/tar) |

- **Concurrent Parsing**: Configurable worker count (`OptionsParallelWorkers`) with ordered batch output.
- **Rope-Based Parsing**: Zero-copy parsing for large files (`FastaChunkParserRope`, `EmblChunkParserRope`).
- **Header Parsing**: JSON (`ParseFastSeqJsonHeader`) and legacy OBI-style (`ParseOBIFeatures`).
- **Quality Handling**: Phred offset adjustment, optional `U→T` conversion.
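The quality-handling bullet above can be illustrated with a minimal, self-contained sketch. It does not call the `obiformats` API: `decodePhred` and `rnaToDNA` are hypothetical helper names, and the offsets 33 and 64 are the conventional FASTQ encodings, assumed here rather than taken from the package.

```go
package main

import "fmt"

// decodePhred converts an ASCII-encoded FASTQ quality string into numeric
// Phred scores by subtracting the encoding offset (conventionally 33, or 64
// for some older Illumina pipelines). Hypothetical helper, not package API.
func decodePhred(ascii []byte, offset byte) ([]byte, error) {
	scores := make([]byte, len(ascii))
	for i, c := range ascii {
		if c < offset {
			return nil, fmt.Errorf("quality character %q at position %d is below offset %d", c, i, offset)
		}
		scores[i] = c - offset
	}
	return scores, nil
}

// rnaToDNA replaces U/u with T/t in place and returns the same slice.
// Hypothetical helper illustrating the optional U→T conversion.
func rnaToDNA(seq []byte) []byte {
	for i, c := range seq {
		switch c {
		case 'U':
			seq[i] = 'T'
		case 'u':
			seq[i] = 't'
		}
	}
	return seq
}

func main() {
	scores, err := decodePhred([]byte("II5!"), 33)
	if err != nil {
		panic(err)
	}
	fmt.Println(scores)                               // [40 40 20 0]
	fmt.Println(string(rnaToDNA([]byte("ACGUacgu")))) // ACGTacgt
}
```

Rejecting characters below the offset, instead of letting the subtraction wrap around, is one way to catch a wrong offset early.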
### 📤 **Sequence Writing & Formatting**

| Function | Format(s) Supported |
|---------|---------------------|
| `WriteFasta`, `FormatFastq` | FASTA/FASTQ (single/batch, parallel I/O) |
| `WriteJSON` | Structured JSON with annotations (batched + ordered writes) |
| `FormatFastaBatch`, `WriteFastqToFile` | Optimized batch formatting with compression |
| `CSVTaxaIterator`, `CSVSequenceRecord` | Taxonomic/sequence CSV export (configurable columns) |
| `WriteNewick`, `Tree.Newick` | Taxonomy → Newick tree (with optional annotations) |

- **Compression Support**: Automatic gzip/bgzip via `obiutils.CompressStream`.
- **Paired-End Handling**: Split forward/reverse reads to separate files.
- **Ordered Output**: Preserves sequence order across parallel writes (`WriteFileChunk`).
- **Format-Aware Dispatching**: `WriteSequence()` auto-selects FASTQ/FASTA based on quality presence.

### 🧬 **Taxonomy & Metadata Handling**

| Function | Purpose |
|---------|--------|
| `LoadCSVTaxonomy`, `LoadNCBITarTaxDump` | Load taxonomies from CSV/NCBI dumps |
| `DetectTaxonomyFormat`, `LoadTaxonomy` | Auto-detect and load taxonomy from diverse sources |
| `CSVTaxaIterator`, `WriteNewick` | Export taxonomies to CSV or Newick |
| Taxon annotation extraction (e.g., `taxid`, path, rank) | via structured metadata fields |

- **Root Enforcement**: Ensures presence of NCBI root (`taxid=1`) during loading.
- **Alias Resolution**: Merged taxids mapped to current IDs (`AddAlias`).
- **Flexible Output Fields**: CSV/Newick support configurable metadata (scientific name, taxid, rank, path).

### ⚙️ **Configuration & Options**

- `Options` encapsulates all runtime settings via functional setters (`WithOption`, e.g., `BatchSize(1024)`, `OptionsCompressed(true)`).
- Key options include:
  - I/O: file append/truncate, compression (`OptionsCompressed`)
  - Parsing: header parser toggle, quality read flag
  - Export: CSV columns (`CSVId`, `CSVTaxid`), NA value, separator
  - Taxonomy: include path/root/rank (`OptionWithoutRootPath`, `WithTaxid`)
  - Performance: parallel workers, buffer size
- Defaults ensure safe behavior; options are composable and immutable.

### 🧵 **Streaming & Chunking Primitives**

| Type/Function | Purpose |
|---------------|---------|
| `PieceOfChunk`, `FileChunk` | Rope-based buffers for zero-copy streaming |
| `ReadFileChunk()` | Chunk file by record boundaries (not fixed size) |
| `EndOfLastFastaEntry`, `EndOfLastFastqEntry` | Find last complete record in buffer (for safe splitting) |
| `ropeScanner`, `_readline__` | Line-by-line scanner over ropes (no full materialization) |
| `WriteFileChunk()` | Ordered, thread-safe chunk reassembly |

- Designed for **large-file resilience**: avoids full file load; splits only at valid boundaries.
- Integrates with `obiiter` for push-style streaming iterators.

### 🔍 **Format Detection & Discovery**

| Function | Role |
|---------|------|
| `OBIMimeTypeGuesser`, `NGSFilterCsvDetector` | Content-based MIME detection (e.g., FASTA via `>`, EcoPCR via `#@ecopcr-v2`) |
| `DetectTaxonomyFormat` | Detects NCBI dump, CSV, FASTA/FASTQ as taxonomy sources |
| `OBIMimeNGSFilterTypeGuesser` | Distinguishes legacy vs. CSV NGS filter configs |

- Uses `github.com/gabriel-vasile/mimetype` for robust format sniffing.
- Preserves unread bytes to allow downstream parsers.

### 📋 **Specialized Parsers & Writers**

- `ReadCSVFromStdin`, `_ParseFastqFile`: Convenience wrappers for stdin/file I/O.
- `JSONRecord()`, `FormatFastaBatch()`: Optimized serialization with minimal allocations.
- `_parse_json_*` helpers: High-performance JSON parsing using `jsonparser`.
- `WriteFastaToFile`, `_UnescapeUnicodeCharactersInJSON()`: Robust output handling.
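The record-boundary splitting described under Streaming & Chunking Primitives can be sketched as follows. This is an illustration under stated assumptions, not the package's `EndOfLastFastaEntry` implementation: `lastFastaRecordStart` and `splitAtLastFastaRecord` are hypothetical names, and the sketch assumes `\n` line endings. It scans backwards for a `>` that begins a line, since a `>` inside a header must not be mistaken for a record start; everything before that position consists of whole records, and the remainder is carried over to the next chunk.

```go
package main

import (
	"bytes"
	"fmt"
)

// lastFastaRecordStart returns the index of the last '>' that begins a line
// in buf, or -1 if there is none. Hypothetical helper, not package API.
func lastFastaRecordStart(buf []byte) int {
	end := len(buf)
	for end > 0 {
		i := bytes.LastIndexByte(buf[:end], '>')
		if i < 0 {
			return -1
		}
		// A record starts only at the buffer start or right after a newline.
		if i == 0 || buf[i-1] == '\n' {
			return i
		}
		end = i // '>' inside a header line: keep searching to the left
	}
	return -1
}

// splitAtLastFastaRecord separates the whole records at the front of buf
// from the trailing, possibly incomplete record. When no safe split point
// exists, complete is nil and the caller should read more data first.
func splitAtLastFastaRecord(buf []byte) (complete, rest []byte) {
	i := lastFastaRecordStart(buf)
	if i <= 0 {
		return nil, buf
	}
	return buf[:i], buf[i:]
}

func main() {
	buf := []byte(">a\nACGT\n>b desc>x\nAC")
	complete, rest := splitAtLastFastaRecord(buf)
	fmt.Printf("%q\n", complete) // ">a\nACGT\n"
	fmt.Printf("%q\n", rest)     // ">b desc>x\nAC"
}
```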

---

## Design Principles

- **Streaming First**: All parsers return `obiiter.IBioSequence` (lazy, batched iterators).
- **Functional Abstraction**: Format handling via `IBatchReader`, `FormatHeader`, decoupled from core logic.
- **Extensibility**: New formats added via `ReadSequencesFromFile()` extension points and MIME registration.
- **Fail-Safe Defaults**: Empty files → empty iterator; missing root taxon → fatal error.
- **Ordered Semantics**: Despite parallelism, batches preserve global order via atomic counters (`nextCounter`).

---

## Integration Highlights

- **Dependencies**: Uses `obiseq`, `obiiter`, `obitax`, and utilities (`obiutils`/`obidefault`) for core data models.
- **Logging**: Structured logs via `logrus` (format detection, errors, progress).
- **Error Handling**: Panics on unrecoverable issues; graceful fallbacks (e.g., `ReadEmptyFile`).
- **Performance**: Rope-based parsing, zero-copy where possible (`unsafe.String`, buffered writes).

> ✅ `obiformats` enables scalable, reproducible NGS data processing, from raw ingestion to structured export.