Mirror of https://github.com/metabarcoding/obitools4.git (synced 2026-04-30 03:50:39 +00:00)
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
# Semantic Description of `ReadSequencesBatchFromFiles`

This function implements **concurrent, batched streaming** of biological sequences from multiple input files.

## Core Functionality

- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.

## Concurrency Model

- Launches `concurrent_readers` goroutines to process files in parallel.
- Files are distributed via a shared channel (`filenameChan`), ensuring fair load balancing.

## Streaming Interface

- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.

## Error Handling & Logging

- Panics on file-open failure (via `log.Panicf`).
- Logs the start and end of reading for each file using structured logging (`log.Printf`, `log.Println`).

## Resource Management

- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.

## Design Intent

- Enables scalable, memory-efficient ingestion of large NGS datasets.
- Decouples *reading logic* (via `IBatchReader`) from orchestration, supporting pluggable formats.
- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.

## Key Abstractions

| Type/Interface | Role |
|----------------|------|
| `IBatchReader` | Reader factory: `(filename, options...) → SequenceIterator` |
| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
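The concurrency model above (a shared filename channel, N reader goroutines, an atomic counter numbering batches, and a finalizer that closes the stream) can be sketched with stdlib primitives alone. All names here (`readBatches`, `batch`, the `read` callback) are illustrative stand-ins, not the actual obitools4 code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// batch pairs a globally ordered ID with a slice of records.
type batch struct {
	id   int64
	recs []string
}

// readBatches fans filenames out to nReaders goroutines over a shared
// channel, numbers every emitted batch with an atomic counter, and
// closes the output once all readers are done (barrier pattern).
func readBatches(files []string, nReaders int, read func(string) []string) <-chan batch {
	filenameChan := make(chan string)
	out := make(chan batch)
	var next int64 = -1
	var wg sync.WaitGroup

	for i := 0; i < nReaders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range filenameChan {
				out <- batch{id: atomic.AddInt64(&next, 1), recs: read(name)}
			}
		}()
	}
	go func() { // feed the shared channel: fair, pull-based load balancing
		for _, f := range files {
			filenameChan <- f
		}
		close(filenameChan)
	}()
	go func() { // finalizer: wait for all readers, then close the stream
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	fake := func(name string) []string { return []string{name + ":seq1", name + ":seq2"} }
	total := 0
	for b := range readBatches([]string{"a.fasta", "b.fasta", "c.fasta"}, 2, fake) {
		total += len(b.recs)
		_ = b.id
	}
	fmt.Println(total)
}
```

The atomic counter guarantees batch IDs are unique and sequential even though arrival order is nondeterministic, which is exactly what a downstream reordering step relies on.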
# `obiformats` Package — Semantic Overview

The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.

## Core Abstraction

- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:

  ```go
  func(string, ...WithOption) (obiiter.IBioSequence, error)
  ```

- It accepts:
  - A file path (`string`)
  - Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
- It returns:
  - An iterator over biological sequences (`obiiter.IBioSequence`)
  - Or an error if the file cannot be opened/parsed

## Semantic Intent

- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, SAM/BAM, etc., via a single entry point.
- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
- **Enables lazy, streaming access**: Sequences are yielded on demand via the iterator, keeping memory use low for large datasets.

## Typical Usage Pattern

1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
2. Call it with a file path and optional options.
3. Iterate over the returned `IBioSequence` to process sequences one by one.

## Design Principles

- **Functional, minimal API**: Single responsibility: reading and iteration.
- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.

> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
# CSV Import Module for Biological Sequences (`obiformats`)

This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBITools4 framework.

## Core Features

- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to the corresponding biological sequence fields.
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
- **Metadata Handling**:
  - Special handling for taxonomic IDs (`taxid`, `*_taxid`).
  - Generic attributes are parsed as JSON when possible, falling back to the raw string otherwise.
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
- **Multiple Entry Points**:
  - `ReadCSV`: From any `io.Reader`.
  - `ReadCSVFromFile`: Loads from a file (with source naming derived from the filename).
  - `ReadCSVFromStdin`: Reads from standard input.
- **Error & Edge Handling**:
  - Gracefully handles empty files/streams via `ReadEmptyFile`.
  - Uses structured logging (Logrus) for fatal and informational messages.

## Integration

Designed to integrate with OBITools4’s core types:

- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.

## Use Case

Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
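The parsing behavior described above (comment lines, flexible field counts, leading-space trimming, header-driven column mapping) maps directly onto Go's `encoding/csv` configuration. This is a minimal stdlib-only sketch; `rec` and `parseSequenceCSV` are hypothetical stand-ins for the package's real types:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// rec is a minimal stand-in for obiseq.BioSequence.
type rec struct {
	ID, Sequence string
	Attrs        map[string]string
}

// parseSequenceCSV configures encoding/csv the way the module
// describes: '#' comment lines, variable field counts, and
// leading-space trimming. Columns named "id" and "sequence" map to
// dedicated fields; everything else lands in the attribute map.
func parseSequenceCSV(data string) ([]rec, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.Comment = '#'
	r.FieldsPerRecord = -1 // tolerate variable field counts
	r.TrimLeadingSpace = true

	rows, err := r.ReadAll()
	if err != nil || len(rows) == 0 {
		return nil, err
	}
	header := rows[0]
	out := make([]rec, 0, len(rows)-1)
	for _, row := range rows[1:] {
		s := rec{Attrs: map[string]string{}}
		for i, v := range row {
			if i >= len(header) {
				break
			}
			switch header[i] {
			case "id":
				s.ID = v
			case "sequence":
				s.Sequence = v
			default:
				s.Attrs[header[i]] = v
			}
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	csvText := "# comment line\nid,sequence,count\nseq1,acgt,3\n"
	recs, _ := parseSequenceCSV(csvText)
	fmt.Println(recs[0].ID, recs[0].Sequence, recs[0].Attrs["count"])
}
```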
# CSVSequenceRecord Function Description

The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.

## Core Features

- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
- **Taxonomic Information**: Adds both the NCBI taxid and the scientific name (retrieved from attributes, with `opt.CSVNAValue()` as fallback).
- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends the corresponding attribute values (or NA if missing).
- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
- **Quality Scores**: Converts Phred quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.

## Design Highlights

- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
- Handles missing data consistently via `opt.CSVNAValue()`.
- Supports both standard and user-defined metadata fields.
- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.

This function enables interoperable, configurable export of sequence data to tabular formats.
# `CSVTaxaIterator` Function — Semantic Description

The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.

### Core Functionality:

- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.

### Configurable Output Fields (via options):

- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
- `taxid`: Either the raw node ID (e.g., string pointer) or the formatted taxon path (`WithRawTaxid` toggle).
- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
- `scientific_name`: Full scientific name of the taxon.
- Custom metadata fields: Specified via `WithMetadata`, extracted from the taxon metadata store.
- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).

### Implementation Highlights:

- Uses **goroutines** for non-blocking pushes of batches and clean shutdown (`WaitAndClose`, `Done`).
- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
- Dynamically builds CSV headers based on the selected options before processing begins.

### Use Case:

Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
## CSV Taxonomy Loader for OBITools4

This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.

### Key Features:

- **Robust CSV Parsing**: Uses Go’s `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
- **Column Mapping**: Dynamically identifies the required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
- **Error Handling**: Validates the presence of all required columns; fails early with descriptive errors.
- **Taxonomy Construction**:
  - Builds a hierarchical taxonomy using `obitax.Taxon` objects.
  - Ensures the existence of a root node; returns an error otherwise.
- **Metadata Extraction**:
  - Derives the taxonomy name and short code (e.g., the prefix before `:` in the first taxid).
  - Logs key metadata for traceability.
- **Scalable Design**:
  - Processes records line by line (memory-efficient).
  - Supports large datasets via streaming CSV reading.

### Input Format:

The CSV must contain exactly four columns (case-sensitive headers):

- `taxid`: Unique taxon identifier.
- `parent`: Parent taxonomic node ID (empty for root).
- `scientific_name`: Binomial or descriptive name.
- `taxonomic_rank`: e.g., *species*, *genus*.

### Output:

Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
# Semantic Description of `obiformats.WriterDispatcher`

The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.

- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations such as sample + region), parsing JSON-encoded classifier values when needed.
- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.

In summary, `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
# EcoPCR File Parser for Biological Sequences

This Go package (`obiformats`) provides functionality to parse EcoPCR output files: `|`-delimited, CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.

## Key Features

- **Version Detection**: Automatically detects the EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies the amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
  - Name (with deduplication support)
  - Nucleotide/protein sequence
  - Comment field
- **Structured Annotation**: Populates rich annotations including:
  - Taxonomic hierarchy (taxid, rank, species/genus/family names)
  - Primer matching info (`forward_match`, `reverse_mismatch`)
  - Melting temperatures (if present in v2)
  - Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and a `ReadEcoPCRFromFile` convenience function.

## Implementation Highlights

- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with the `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent, goroutine-based streaming to decouple I/O and processing.

This module integrates with the broader *OBITools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
# EMBL Format Parser for OBITools4

This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:

- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
- **Two Parsing Modes**:
  - `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
  - `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
- **Configurable Options**:
  - `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
  - `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and the taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
- **Sequence Handling**: Parses multi-line EMBL sequences (10 bases per group, with position numbers), skipping digits and whitespace.
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
- **Integration**: Outputs are compatible with OBITools4’s iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.

Designed for scalability, the module handles large EMBL files efficiently, making it well suited to metagenomic and biodiversity data pipelines.
## `ReadEmptyFile` Function — Semantic Description

- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
- **Signature**:
  `func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
- **Behavior**:
  - Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
  - Immediately closes the stream using `.Close()`, indicating that no data will be yielded.
- **Output**:
  - Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
  - The error return is always `nil`, since no I/O occurs and the operation is deterministic.

### Semantic Role & Use Cases

- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.

### Design Notes

- Reflects a *fail-safe* pattern: no side effects, deterministic behavior.
- Aligns with the iterator-based I/O design principles of OBITools4 (lazy, composable streams).
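The create-then-close idea is the same trick as returning a closed channel in plain Go: consumers can range over it safely and get zero elements. A minimal sketch, with `seqIter` standing in for the real iterator type:

```go
package main

import "fmt"

// seqIter is a minimal stand-in for obiiter.IBioSequence: a channel of
// sequence batches that consumers range over.
type seqIter struct{ ch chan []string }

// emptyIter mirrors the ReadEmptyFile pattern: build a well-formed
// iterator and close it immediately, so callers can iterate without
// nil checks, and the error is always nil.
func emptyIter() (seqIter, error) {
	it := seqIter{ch: make(chan []string)}
	close(it.ch) // terminal: ranging yields zero batches
	return it, nil
}

func main() {
	it, err := emptyIter()
	n := 0
	for range it.ch { // loop body never runs
		n++
	}
	fmt.Println(n, err == nil)
}
```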
# FASTA Parser Module (`obiformats`)

This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.

## Core Functionalities

- **`FastaChunkParser(UtoT bool)`**
  Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.

- **`FastaChunkParserRope(...)`**
  Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.

- **`ReadFasta(reader io.Reader, ...)`**
  High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.

- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
  Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.

- **`EndOfLastFastaEntry(...)`**
  Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.

## Key Features

- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and use only the allowed sequence characters (`a-z`, `- . [ ]`).
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
- **Whitespace handling**: Ignores spaces/tabs in sequences; preserves line breaks only for parsing structure.
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.

## Design Highlights

- Minimal allocations via rope-based parsing (`extractFastaSeq`).
- Graceful error reporting with context (source, identifier, invalid character position).
- Extensible via the `WithOption` pattern for header parsing and batching behavior.
# FASTQ Parsing Module (`obiformats`)

This Go package provides robust, streaming-capable parsing of FASTQ files, a standard format for storing nucleotide sequences along with quality scores.

## Core Functionalities

- **`EndOfLastFastqEntry(buffer []byte) int`**
  Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry is found.

- **`FastqChunkParser(...)`**
  Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
  - Header parsing (`@id [definition]`)
  - Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
  - Quality score shifting (`quality_shift`)
  - Strict validation (e.g., the `+` line, matching sequence/quality lengths)

- **`FastqChunkParserRope(...)`**
  Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.

- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
  Enables concurrent, chunked parsing of large files:
  - Splits input into chunks using `ReadFileChunk`
  - Uses configurable parallel workers (`nworker`)
  - Pushes parsed batches to an iterator interface

- **Convenience I/O Wrappers**
  - `ReadFastqFromFile(filename, ...)`: Parses a file by name.
  - `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.

## Key Options & Features

- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
- **Uracil-to-thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation

## Design Highlights

- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid characters, length mismatches)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
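The validation rules listed for the chunk parser (an `@` header, a `+` separator, equal sequence/quality lengths, case normalization, and a quality shift) can be shown on a single four-line record. This is an illustrative sketch, not the obitools4 parser:

```go
package main

import (
	"fmt"
	"strings"
)

// fastqRecord holds one parsed entry; Quals are numeric Phred scores.
type fastqRecord struct {
	ID, Seq string
	Quals   []int
}

// parseFastqRecord validates one 4-line FASTQ record: '@' header,
// '+' separator, equal sequence and quality lengths, and applies a
// configurable quality shift (33 for Sanger encoding).
func parseFastqRecord(lines []string, shift int) (fastqRecord, error) {
	var r fastqRecord
	if len(lines) != 4 || !strings.HasPrefix(lines[0], "@") {
		return r, fmt.Errorf("malformed record header")
	}
	if !strings.HasPrefix(lines[2], "+") {
		return r, fmt.Errorf("missing '+' separator line")
	}
	if len(lines[1]) != len(lines[3]) {
		return r, fmt.Errorf("sequence/quality length mismatch")
	}
	fields := strings.Fields(lines[0][1:]) // "@id [definition]"
	if len(fields) == 0 {
		return r, fmt.Errorf("empty identifier")
	}
	r.ID = fields[0]
	r.Seq = strings.ToLower(lines[1]) // case normalization
	for _, c := range []byte(lines[3]) {
		r.Quals = append(r.Quals, int(c)-shift)
	}
	return r, nil
}

func main() {
	rec, err := parseFastqRecord([]string{"@seq1 sample=A", "ACGT", "+", "IIII"}, 33)
	fmt.Println(rec.ID, rec.Seq, rec.Quals, err)
}
```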
## Semantic Description of `obiformats` Package

The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:

- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.
- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.

Two main constructor functions enable flexible formatting:

- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.
- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.

The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors; unsupported formats, for example, trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, which makes it well suited to pipeline integration in NGS data-processing tools.
# Semantic Description of `obiformats` Package

The `obiformats` package provides utilities for parsing sequence headers in the OBITools4 framework, supporting two distinct formats:

- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
- **Legacy OBI format** (plain text, e.g., `>seq1 description`): Used when no JSON prefix is present.

## Core Functions

- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**
  Dynamically routes header parsing based on the first character of the sequence definition:
  - Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
  - Otherwise invokes `ParseFastSeqOBIHeader`.

- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**
  Applies header parsing to a *batch* of sequences:
  - Takes an iterator over `BioSequence`s.
  - Uses optional configuration (e.g., parallelism, parsing behavior).
  - Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.

## Design Principles

- **Format agnosticism**: Automatically detects the header type.
- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
- **Extensibility**: The options pattern (`WithOption`) supports runtime customization.

This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
# `FormatHeader` Function Type in `obiformats`

The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.

- **Package**: `obiformats`
  Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).

- **Type Definition**:

  ```go
  type FormatHeader func(sequence *obiseq.BioSequence) string
  ```

  A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.

- **Semantic Role**:
  Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.
  Decouples header formatting from the core data structures (`BioSequence`), enabling modular and reusable format adapters.

- **Usage Context**:
  - Used by writers/formatters to produce standardized headers when exporting sequences.
  - Allows custom header generation (e.g., for MIxS-compliant metadata or user-defined tags).
  - Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.

- **Dependencies**:
  - Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).

- **Design Intent**:
  Promotes a clean separation of concerns: data (sequence) ↔ formatting logic.
  Facilitates extensibility to new output formats without modifying core types.
This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.

- **JSON Parsing Helpers**:
  It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.

- **Header Interpretation**:
  `_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
  - Core fields (`id`, `definition`, `count`)
  - Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
  - Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.

- **Sequence Annotation Enrichment**:
  `ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing the non-JSON text as the new definition.

- **Serialization Support**:
  `WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON, appending them to a buffer or returning them as a string, enabling round-trip compatibility for annotated sequences.

- **Error Handling**:
  Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.

In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
@@ -0,0 +1,31 @@
|
||||
# OBIFormats Package: Semantic Description
|
||||
|
||||
The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
|
||||
|
||||
- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
- Four core parsing helpers detect keys and value types:
  - `__match__key__`: Identifies assignment patterns (`Key = ...`).
  - `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).
  - `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).
  - `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
- Boolean detection (`__is_true__`/`__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`**, iteratively extracts key-value pairs from a header string and populates an `Annotation` map.
  - Numeric values are stored as integers if they have no fractional part.
  - Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:
    - `*_count` → `map[string]int`,
    - `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`),
    - `*_status`/`*_mutation` → `map[string]string`.
- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence’s definition line, moving annotations into its metadata map and preserving leftover text.
- **`WriteFastSeqOBIHeader(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:
  - Strings and booleans use `key=value;`.
  - Maps/dicts are JSON-encoded, then single-quoted for compatibility.
  - Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
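The `key=value;` extraction described above can be sketched with a single simplified regex standing in for the package's separate key, numeric, string, and boolean patterns. All names here are illustrative, not the package's API, and nested dictionaries are omitted:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// keyValRe is a simplified stand-in for the package's key/value patterns:
// it captures `key = value;` pairs (real OBI headers also allow nested dicts).
var keyValRe = regexp.MustCompile(`([A-Za-z_][A-Za-z0-9_]*)\s*=\s*([^;]+);`)

// parseOBIHeaderLite converts matched values to bool, int, float64 or string,
// mirroring the boolean/numeric/string detection described above.
func parseOBIHeaderLite(text string) map[string]interface{} {
	annotations := map[string]interface{}{}
	for _, m := range keyValRe.FindAllStringSubmatch(text, -1) {
		key, raw := m[1], strings.TrimSpace(m[2])
		switch {
		case strings.EqualFold(raw, "true"):
			annotations[key] = true
		case strings.EqualFold(raw, "false"):
			annotations[key] = false
		default:
			if i, err := strconv.Atoi(raw); err == nil {
				annotations[key] = i // integral numbers stay integers
			} else if f, err := strconv.ParseFloat(raw, 64); err == nil {
				annotations[key] = f
			} else {
				annotations[key] = strings.Trim(raw, "'\"")
			}
		}
	}
	return annotations
}

func main() {
	ann := parseOBIHeaderLite("count=3; score=1.5; sample='A01'; merged=True;")
	fmt.Println(ann["count"], ann["score"], ann["sample"], ann["merged"])
}
```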
@@ -0,0 +1,26 @@
# FastSeq Reader Module — Semantic Description

This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.

## Core Features

- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
- **Stdin & File I/O**: Two entry points:
  - `ReadFastSeqFromFile(filename, ...)` for regular files.
  - `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.

## Integration

Built on top of `obitools4`’s core abstractions:

- `obiiter.IBioSequence`: Iterator interface for biological sequences.
- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
- `obiutils`, `obidefault`: Utilities for path handling and defaults.

Designed for scalability in high-throughput metabarcoding pipelines.
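The batched-iteration model above can be sketched as a goroutine that groups sequences into fixed-size batches and streams them on a channel; the `seq` type and `batchSeqs` helper are illustrative stand-ins for `obiseq.BioSequence` and the cgo-backed reader:

```go
package main

import "fmt"

// seq is a minimal stand-in for obiseq.BioSequence (id + sequence only).
type seq struct {
	id  string
	dna string
}

// batchSeqs groups incoming sequences into slices of batchSize and streams
// them on a channel, decoupling reading from consumption as described above.
func batchSeqs(in []seq, batchSize int) <-chan []seq {
	out := make(chan []seq)
	go func() {
		defer close(out)
		for start := 0; start < len(in); start += batchSize {
			end := start + batchSize
			if end > len(in) {
				end = len(in)
			}
			out <- in[start:end]
		}
	}()
	return out
}

func main() {
	seqs := []seq{{"s1", "ACGT"}, {"s2", "GGTA"}, {"s3", "TTAA"}}
	for batch := range batchSeqs(seqs, 2) {
		fmt.Println(len(batch))
	}
}
```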
@@ -0,0 +1,35 @@
# `obiformats` Package Overview

The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.

## Core Formatting Functions

- **`FormatFasta(seq, formater)`**
  Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.

- **`FormatFastaBatch(batch, formater, skipEmpty)`**
  Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.

## File Writing Functions

- **`WriteFasta(iterator, file, options...)`**
  Writes a stream of sequences to any `io.WriteCloser`. Supports:
  - Parallel workers (`ParallelWorkers`)
  - Chunked writing via `WriteFileChunk`
  - Optional compression (e.g., gzip)

  Returns a new iterator mirroring the input for pipeline chaining.

- **`WriteFastaToStdout(iterator, options...)`**
  Convenience wrapper to output FASTA directly to `stdout`, with file-closing behavior configurable.

- **`WriteFastaToFile(iterator, filename, options...)`**
  Writes to a named file with:
  - Truncation or append mode (`AppendFile`)
  - Automatic paired-end output if `HaveToSavePaired()` is enabled (writes reverse reads to a secondary file specified via `PairedFileName`)

## Key Design Highlights

- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
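A minimal sketch of the `FormatFasta` behavior described above (header line plus sequence wrapped at 60 characters); `formatFastaLite` is a hypothetical name, not the package's function:

```go
package main

import (
	"bytes"
	"fmt"
)

// formatFastaLite renders a `>id description` header followed by the
// sequence wrapped at 60 characters per line, growing the buffer once
// up front as the package does with bytes.Buffer.Grow().
func formatFastaLite(id, description, sequence string) string {
	var buf bytes.Buffer
	buf.Grow(len(sequence) + len(sequence)/60 + len(id) + len(description) + 3)
	buf.WriteByte('>')
	buf.WriteString(id)
	if description != "" {
		buf.WriteByte(' ')
		buf.WriteString(description)
	}
	for i := 0; i < len(sequence); i += 60 {
		end := i + 60
		if end > len(sequence) {
			end = len(sequence)
		}
		buf.WriteByte('\n')
		buf.WriteString(sequence[i:end])
	}
	buf.WriteByte('\n')
	return buf.String()
}

func main() {
	fmt.Print(formatFastaLite("seq1", "demo", "ACGTACGTACGT"))
}
```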
@@ -0,0 +1,35 @@
# FASTQ Output Module (`obiformats`)

This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.

## Core Functionality

- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into a FASTQ string.
- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.

## Header Customization

- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.

## Writing to Streams/Files

- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
  - Append/truncate modes
  - Paired-end output (splits iterator and writes to two files)
  - Automatic compression via `obiutils.CompressStream`

## Parallelization & Robustness

- Uses goroutines to parallelize formatting/writing across multiple workers.
- Handles empty sequences gracefully: logs a warning or raises a fatal error depending on the `skipEmpty` option.
- Ensures ordered output via batch tracking (`Order()`) and chunked writing.

## Integration

Designed to work seamlessly with the `obitools4` ecosystem:

- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
- Extensible through functional options (`WithOption`) for configuration.

> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
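The four-line FASTQ record produced by `FormatFastq` can be sketched as follows; the helper name and the fixed Phred+33 shift are assumptions (the real code takes the shift from `obidefault.ReadQualitiesShift()`):

```go
package main

import "fmt"

// formatFastqLite renders a four-line FASTQ record: @id, sequence, "+",
// and qualities encoded as Phred+33 ASCII (an assumed fixed shift).
func formatFastqLite(id, sequence string, qualities []byte) string {
	encoded := make([]byte, len(qualities))
	for i, q := range qualities {
		encoded[i] = q + 33 // Phred score -> printable ASCII
	}
	return fmt.Sprintf("@%s\n%s\n+\n%s\n", id, sequence, encoded)
}

func main() {
	fmt.Print(formatFastqLite("r1", "ACGT", []byte{40, 40, 30, 20}))
}
```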
@@ -0,0 +1,19 @@
# `obiformats` Package Overview

The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:

- **Format Abstraction**: Defines common interfaces and base classes for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.

The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
@@ -0,0 +1,25 @@
# Semantic Description of `obiformats` Package Functionalities

The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.

- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after a full FASTQ entry), ensuring chunks split only at valid boundaries.
- **`ReadFileChunk()`**: Core function that:
  - Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
  - Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
  - Extends chunks incrementally (e.g., +1 MB) until a full record boundary is found via `splitter`;
  - Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
  - Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
- **Key semantics**:
  - *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.
  - *Lazy evaluation*: only reads ahead when needed to find record boundaries.
  - *Streaming-first design* — supports large files without full loading into memory.

This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
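A `LastSeqRecord`-style callback, as described above, can be sketched for FASTA: it reports where the last (possibly incomplete) record begins so the chunk can be cut at a valid boundary. The helper name is illustrative:

```go
package main

import (
	"bytes"
	"fmt"
)

// lastFastaRecord is a sketch of a LastSeqRecord callback: it returns the
// index where the last FASTA record begins, i.e. the position just after
// the final "\n>" marker, or -1 if no boundary exists in this buffer.
func lastFastaRecord(buf []byte) int {
	i := bytes.LastIndex(buf, []byte("\n>"))
	if i < 0 {
		return -1
	}
	return i + 1 // keep the '>' with the trailing partial record
}

func main() {
	chunk := []byte(">s1\nACGT\n>s2\nGG")
	cut := lastFastaRecord(chunk)
	// Everything before cut holds complete records; the tail is carried
	// over into the next chunk so no record is split.
	fmt.Printf("%q %q\n", chunk[:cut], chunk[cut:])
}
```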
@@ -0,0 +1,26 @@
# `WriteFileChunk` Function — Semantic Description

The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.

- **Input**:
  - `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.
  - `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.

- **Core Behavior**:
  - Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).
  - Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).
  - If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.
  - Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.

- **Buffer Management**:
  - After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., if order 2 arrives, it triggers writing of buffered orders 3, 4, ... as available).

- **Error Handling**:
  - Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.

- **Cleanup & Lifecycle**:
  - Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.
  - Returns the input channel, enabling external producers to stream `FileChunk` structs.

- **Use Case**:
  Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
@@ -0,0 +1,34 @@
# GenBank Parser Module (`obiformats`)

This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.

## Core Functionalities

- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
- **Parallel streaming I/O**:
  - `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
  - Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.

## Key Design Decisions

- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.

## Output

Returns a batched iterator of `BioSequence` objects, each containing:

- Identifier (`id`)
- Compact nucleotide sequence
- Definition line (as description)
- Source file origin
- Optional feature table bytes
- Annotations: `scientific_name`, `taxid`

Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
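The taxid capture mentioned above, from `/db_xref="taxon:..."` feature lines, can be sketched with a single regex; the helper is illustrative, not the parser's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// taxonRe matches GenBank feature lines such as `/db_xref="taxon:9606"`
// and captures the numeric taxon ID.
var taxonRe = regexp.MustCompile(`/db_xref="taxon:(\d+)"`)

// extractTaxid returns the first taxon ID found, or "" when absent.
func extractTaxid(featureLine string) string {
	if m := taxonRe.FindStringSubmatch(featureLine); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(extractTaxid(`                     /db_xref="taxon:9606"`))
}
```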
@@ -0,0 +1,27 @@
# JSON Output Module for Biological Sequences (`obiformats`)

This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.

- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
  - `"id"`: Sequence identifier.
  - `"sequence"` (optional): Nucleotide/protein sequence string if present.
  - `"qualities"` (optional): Quality scores as a string if available.
  - `"annotations"` (optional): Metadata annotations map.
- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
  - Parallel workers (configurable via options).
  - Automatic compression (`gzip`/`bgzip`) if enabled.
  - Proper JSON array wrapping: `[`, chunked batches, and final `]`.
  - Atomic ordering to preserve sequence integrity across parallel writes.
- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
  - Outputs to stdout or a file (with append/truncate control).
  - Supports paired-end data: writes both forward and reverse reads to separate files when configured.
- **Internal helpers**:
  - `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
  - Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.

Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
@@ -0,0 +1,17 @@
# NCBI Taxonomy Loader Module (`obiformats`)

This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:

- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.

Key features:

- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
- Efficient buffered reading (`bufio.Reader`) for large files.
- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).

The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
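The `|`-delimited, whitespace-trimmed parsing described above can be sketched as follows (NCBI dump rows separate fields with `\t|\t`); `parseDmpLine` is an illustrative name:

```go
package main

import (
	"fmt"
	"strings"
)

// parseDmpLine splits one dump-file row on '|' and trims the surrounding
// tabs/spaces from each field, as the custom CSV parsing described above does.
func parseDmpLine(line string) []string {
	raw := strings.Split(line, "|")
	fields := make([]string, 0, len(raw))
	for _, f := range raw {
		fields = append(fields, strings.TrimSpace(f))
	}
	return fields
}

func main() {
	// A nodes.dmp-style row: taxid | parent taxid | rank |
	fields := parseDmpLine("9606\t|\t9605\t|\tspecies\t|")
	fmt.Println(fields[0], fields[1], fields[2])
}
```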
@@ -0,0 +1,31 @@
## NCBI Taxonomy Archive Support in `obiformats`

This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.

### Core Functionalities

1. **Archive Validation (`IsNCBITarTaxDump`)**
   - Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
   - Returns a boolean indicating if the archive is a complete NCBI tax dump.

2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
   - Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
   - Steps include:
     - **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
     - **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
     - **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
   - Sets the root taxon to NCBI’s default (`taxid = 1`, i.e., *root*).

3. **Integration with Other Modules**
   - Uses `obiutils.Ropen` and `TarFileReader` for robust file handling.
   - Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.

### Key Parameters

- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
- `seqAsTaxa`: Reserved for future use; currently unused.

### Logging & Error Handling

- Uses `logrus` to log loading progress and counts.
- Returns descriptive errors if required files or the root taxon are missing.

> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
@@ -0,0 +1,31 @@
# Newick Format Export Functionality in `obiformats`

This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.

## Core Components

- `Tree`: A struct modeling a node in a Newick tree, containing:
  - `Children`: list of child nodes (nested trees),
  - `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
  - `Length`: optional branch length (evolutionary distance).

- **`Newick()` methods**:
  - `Tree.Newick(...)`: Recursively generates a Newick string for the subtree. Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
  - Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.

- **Writing Functions**:
  - `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
    - Accepts an iterator over taxa (`*obitax.ITaxon`).
    - Validates single-taxonomy input.
    - Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
  - `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
  - `WriteNewickToStdout(...)`: Outputs the Newick tree to standard output.

## Configuration Options

Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).

## Semantic Summary

The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
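The recursive `(child,child)label` rendering behind `Tree.Newick` can be sketched as follows, leaving out the optional taxid, rank, and branch-length annotations; the `node` type is an illustrative stand-in for `Tree`:

```go
package main

import (
	"fmt"
	"strings"
)

// node is a minimal stand-in for the Tree struct described above.
type node struct {
	name     string
	children []*node
}

// newick recursively renders "(child,child)label" notation; the real
// method additionally appends taxids, ranks and branch lengths.
func newick(n *node) string {
	if len(n.children) == 0 {
		return n.name
	}
	parts := make([]string, len(n.children))
	for i, c := range n.children {
		parts[i] = newick(c)
	}
	return "(" + strings.Join(parts, ",") + ")" + n.name
}

func main() {
	root := &node{name: "root", children: []*node{
		{name: "A"},
		{name: "B", children: []*node{{name: "C"}, {name: "D"}}},
	}}
	fmt.Println(newick(root) + ";")
}
```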
@@ -0,0 +1,47 @@
# NGSFilter Configuration Parser — Semantic Overview

This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats: a legacy line-based text format (`ReadOldNGSFilter`) and modern CSV-based configuration files with parameter headers.

## Core Functionality

- **Format Detection**:
  `OBIMimeNGSFilterTypeGuesser` detects the MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text. A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).

- **Dual Input Parsing**:
  - `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
    - Primer pairs (`forward`, `reverse`)
    - Tag pairs (with optional `-` for untagged direction)
    - Experiment/sample metadata
    - OBIFeatures annotations (via `ParseOBIFeatures`)
  - `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:
    `"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`.
    Additional columns are stored as annotations.

- **Parameter Configuration**:
  A rich set of `@param` lines (in CSV or legacy format) configures global/primer-specific settings:
  - `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
  - `tag_delimiter` / directional variants: Symbol separating tags in sequences
  - `matching`: Tag matching algorithm (e.g., exact, fuzzy)
  - Error tolerance: `primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches); `tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
  - Indel handling: `indels` / directional variants (`true/false`) to enable/disable indels in primer matching

- **Validation & Integrity Checks**:
  - `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
  - Duplicate tag-pair detection per marker (error on reuse).
  - Strict column/field validation with informative error messages.

- **Logging & Observability**:
  Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).

## Design Highlights

- **Extensibility**: New parameters can be added via the `library_parameter` map.
- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.

> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
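Parsing one legacy line of the shape quoted above (`EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r`) can be sketched with `strings.Cut`. This follows only that illustrative example, not the full grammar handled by `ReadOldNGSFilter`; all names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// ngsEntry holds the fields pulled from one legacy line, following the
// illustrative "EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r" shape.
type ngsEntry struct {
	experiment, sample           string
	forwardTag, reverseTag       string
	forwardPrimer, reversePrimer string
}

func parseLegacyLine(line string) (ngsEntry, error) {
	var e ngsEntry
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return e, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	head, tags, ok := strings.Cut(fields[0], ":")
	if !ok {
		return e, fmt.Errorf("missing ':' tag separator")
	}
	exp, sample, ok := strings.Cut(head, "@")
	if !ok {
		return e, fmt.Errorf("missing '@' between experiment and sample")
	}
	fwd, rev, ok := strings.Cut(tags, "-")
	if !ok {
		rev = fwd // simplification: no '-' means one tag for both directions
	}
	e = ngsEntry{exp, sample, fwd, rev, fields[1], fields[2]}
	return e, nil
}

func main() {
	e, err := parseLegacyLine("EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r")
	fmt.Println(e, err)
}
```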
@@ -0,0 +1,14 @@
# Semantic Description of `obiformats` Package Functionalities

The Go package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).

Key capabilities include:

- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).
- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).
- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).
- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).
- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion (RNA→DNA normalization).
- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.
- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).

All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
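The functional-setter pattern described above can be sketched as follows; field and setter names are illustrative, not the package's actual `Options` layout:

```go
package main

import "fmt"

// options mirrors the configuration pattern described above: a private
// struct mutated only through functional setters.
type options struct {
	batchSize       int
	parallelWorkers int
	compressed      bool
}

type withOption func(*options)

func optionsBatchSize(n int) withOption {
	return func(o *options) { o.batchSize = n }
}

func optionsCompressed(c bool) withOption {
	return func(o *options) { o.compressed = c }
}

// makeOptions applies defaults first, then each setter in order.
func makeOptions(setters []withOption) *options {
	o := &options{batchSize: 5000, parallelWorkers: 4} // assumed defaults
	for _, set := range setters {
		set(o)
	}
	return o
}

func main() {
	o := makeOptions([]withOption{optionsBatchSize(100), optionsCompressed(true)})
	fmt.Println(o.batchSize, o.parallelWorkers, o.compressed)
}
```

Callers only name the settings they change; everything else keeps its default, which is what makes the pattern composable across pipelines.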
@@ -0,0 +1,27 @@
# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure

The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *Rope* (a tree-based immutable string representation, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.

## Core Functionality

- **`newRopeScanner(rope *PieceOfChunk)`**
  Constructs a new scanner starting at the root of the rope.

- **`ReadLine() []byte`**
  Returns the next line (without the trailing `\n` or `\r\n`) as a byte slice.
  - Returns `nil` when the end of the rope is reached.
  - Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
  - The returned slice aliases rope data and is only valid until the next call.

- **`skipToNewline()`**
  Advances the internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.

## Implementation Highlights

- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope’s underlying data.

## Use Case

Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
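The carry-over logic of `ReadLine` can be sketched over a plain slice of chunks standing in for rope nodes; `chunkScanner` is illustrative, not the package's `ropeScanner`:

```go
package main

import (
	"bytes"
	"fmt"
)

// chunkScanner yields lines from a sequence of byte chunks, carrying
// partial lines across chunk boundaries in `carry` as described above.
type chunkScanner struct {
	chunks [][]byte
	carry  []byte
}

// readLine returns the next line without the trailing '\n' (or "\r\n"),
// or nil at end of input.
func (s *chunkScanner) readLine() []byte {
	for {
		if len(s.chunks) == 0 {
			if len(s.carry) > 0 {
				line := s.carry
				s.carry = nil
				return line
			}
			return nil
		}
		cur := s.chunks[0]
		if i := bytes.IndexByte(cur, '\n'); i >= 0 {
			line := append(s.carry, cur[:i]...)
			s.carry = nil
			s.chunks[0] = cur[i+1:]
			return bytes.TrimSuffix(line, []byte("\r"))
		}
		// Line spans into the next chunk: accumulate and continue.
		s.carry = append(s.carry, cur...)
		s.chunks = s.chunks[1:]
	}
}

func main() {
	s := &chunkScanner{chunks: [][]byte{[]byte("AC"), []byte("GT\nTT\n")}}
	for line := s.readLine(); line != nil; line = s.readLine() {
		fmt.Println(string(line))
	}
}
```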
@@ -0,0 +1,34 @@
# Taxonomy Loading Module (`obiformats`)

This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.

## Core Features

1. **Format Detection**
   - `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME type), filename patterns, or structure.
   - Supports:
     - NCBI Taxdump (both directory and `.tar` archive)
     - CSV files (`text/csv`)
     - FASTA/FASTQ sequences (via `mimetype` detection)

2. **Modular Loaders**
   - Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).
   - Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).

3. **Sequence-Based Taxonomy Extraction**
   - For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.

4. **Integration with OBITools Ecosystem**
   - Leverages `obitax.Taxonomy` as the canonical output structure.
   - Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.

5. **Error Handling & Logging**
   - Graceful failure with descriptive errors; informative logging via `logrus`.

## Usage Flow

```go
tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn=true, seqAsTaxa=false
```

The module enables interoperability across taxonomic data sources in metabarcoding workflows.
@@ -0,0 +1,26 @@
# OBIFORMATS Package: Semantic Description

The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.

It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:

- **FASTA** (`text/fasta`): identified by lines starting with `>`.
- **FASTQ** (`text/fastq`): detected via leading `@` characters.
- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
- **EMBL** (`text/embl`): detected by lines starting with `ID `.
- **GenBank** (`text/genbank`): identified by either `LOCUS ` or legacy `"Genetic Sequence Data Bank"` headers.
- **CSV** (`text/csv`): generic tabular support.

Core functionality is exposed through:

- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer the MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
- `ReadSequencesFromFile()`: reads sequences from a file path, infers the format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as the filename and auto-closing the stream.

Internally leverages:

- `obiutils.Ropen()` for unified file opening (including stdin handling).
- Path extension stripping and source tagging via `OptionsSource()`.
- Logging (`logrus`) for format diagnostics.
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.

The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.

Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
@@ -0,0 +1,29 @@
# `obiformats` Package: Sequence Writing Utilities

This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.

## Core Functionality

- **`WriteSequence()`**:
  Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities are present) or FASTA.
  - Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.
  - Preserves iterator state via `PushBack()` to allow chaining.

- **`WriteSequencesToStdout()`**:
  Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.

- **`WriteSequencesToFile()`**:
  Writes sequences to a specified file. Supports:
  - File creation/truncation or append mode (`OptionAppendFile()`).
  - Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.

## Design Highlights

- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on the presence of quality scores (`HasQualities()`).
- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
- **Configurable Options**: Extensible via the `WithOption` pattern (e.g., append mode, paired-end handling).

## Integration

Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
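The quality-based dispatch rule can be sketched in a few lines; `record` and `chooseFormat` are illustrative stand-ins for `obiseq.BioSequence` and the `WriteSequence` dispatcher:

```go
package main

import "fmt"

// record is a minimal stand-in for obiseq.BioSequence.
type record struct {
	id        string
	sequence  string
	qualities []byte
}

func (r record) hasQualities() bool { return len(r.qualities) > 0 }

// chooseFormat mirrors the dispatch described above: FASTQ when quality
// scores are present, FASTA otherwise.
func chooseFormat(r record) string {
	if r.hasQualities() {
		return "fastq"
	}
	return "fasta"
}

func main() {
	fmt.Println(chooseFormat(record{id: "a", sequence: "ACGT", qualities: []byte{40}}))
	fmt.Println(chooseFormat(record{id: "b", sequence: "ACGT"}))
}
```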