⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,22 @@
+# `obialign` Package: Sequence Alignment Utilities
+
+The `obialign` package provides core functions for pairwise biological sequence alignment in Go, designed to work with `obiseq.BioSequence` objects.
+
+- **Core Alignment Construction**: `_BuildAlignment()` and `BuildAlignment()` reconstruct aligned sequences from a precomputed alignment path (e.g., the output of dynamic programming). They support gap characters and reuse buffers for efficiency.
+
+- **Quality-Aware Consensus Building**: `BuildQualityConsensus()` generates a consensus sequence from an alignment and per-base quality scores:
+  - At mismatches, it retains the higher-quality base.
+  - When qualities are equal and bases differ, an IUPAC ambiguity code is used (via `_FourBitsBaseCode`/`_Decode`).
+  - Quality values are combined and adjusted for mismatches using a Phred-like error probability model.
+  - Optionally records mismatch statistics in sequence attributes.
+
+- **Performance & Memory Efficiency**: Uses preallocated buffers (via `PEAlignArena`) or fallback allocation, with slice recycling to minimize GC pressure.
+
+- **Metadata Handling**: Preserves sequence IDs and definitions in output; supports optional mismatch reporting for downstream analysis.
+
+- **Alignment Path Format**: The path is a sequence of signed integers encoding:
+  - Negative steps → deletions in seqB (insertion in A),
+  - Positive steps → insertions in B,
+  - Consecutive pairs encode match/mismatch runs.
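
The path format above can be illustrated with a small decoder. This is a minimal sketch, not the package's `_BuildAlignment`: it assumes the path is a flat list of `(gapStep, diagRun)` pairs, and the function name `buildAlignment` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// buildAlignment expands a run-length path into two gapped strings.
// Assumed layout (hypothetical): the path is a flat list of pairs
// (gapStep, diagRun). gapStep < 0 consumes -gapStep bases of seqA against
// gaps in seqB; gapStep > 0 consumes gapStep bases of seqB against gaps in
// seqA; diagRun bases are then copied from both sequences.
func buildAlignment(seqA, seqB string, path []int) (string, string) {
	var a, b strings.Builder
	posA, posB := 0, 0
	for i := 0; i+1 < len(path); i += 2 {
		step, diag := path[i], path[i+1]
		switch {
		case step < 0:
			a.WriteString(seqA[posA : posA-step])
			b.WriteString(strings.Repeat("-", -step))
			posA -= step
		case step > 0:
			a.WriteString(strings.Repeat("-", step))
			b.WriteString(seqB[posB : posB+step])
			posB += step
		}
		// Match/mismatch run: one step in both sequences per position.
		a.WriteString(seqA[posA : posA+diag])
		b.WriteString(seqB[posB : posB+diag])
		posA += diag
		posB += diag
	}
	return a.String(), b.String()
}

func main() {
	a, b := buildAlignment("ACGTTA", "ACTTAG", []int{0, 2, -1, 3, 1, 0})
	fmt.Println(a) // ACGTTA-
	fmt.Println(b) // AC-TTAG
}
```

Under this assumed layout, the path `[0, 2, -1, 3, 1, 0]` reads as: two aligned columns, one base of A against a gap, three aligned columns, then one base of B against a gap.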
+
+This package is part of the OBITools4 ecosystem, targeting high-throughput amplicon or metagenomic data processing.
@@ -0,0 +1,30 @@
+# Semantic Description of `obialign` Backtracking Module
+
+The `_Backtracking` function implements a **traceback algorithm** for sequence alignment, reconstructing the optimal path through an alignment matrix.
+
+## Core Functionality
+
+- **Input**:  
+  - `pathMatrix`: Encodes alignment decisions (match/mismatch/gap) as integers.  
+  - `lseqA`, `lseqB`: Lengths of sequences A and B.  
+  - `path`: Pre-allocated slice to store the traceback path.
+
+- **Output**: A compact representation of alignment steps, alternating between:
+  - Diagonal moves (`ldiag`): Matches/mismatches (one step in both sequences).
+  - Horizontal/vertical moves (`lleft` or `lup`): Gaps in sequence B (horizontal) or A (vertical).
+
+## Algorithm Highlights
+
+- **Reverse traversal** from `(lseqA−1, lseqB−1)` to origin.
+- **Batching logic**: Consecutive gaps in the same direction are aggregated (e.g., `lleft += step`), which yields a run-length encoding of the path.
+- **Path reconstruction**: Steps are pushed *backwards* into the `path` slice using a moving pointer `p`.
+- **Memory efficiency**: Uses `slices.Grow()` to preallocate space and logs resizing for debugging.
+
+## Encoded Path Semantics
+
+Each pair in the returned slice encodes:
+- `[diag_count, move_type]`, where `move_type` is either a gap length (`lleft > 0`: horizontal, or `lup < 0`: vertical) or zero (end of diagonal run).
+
+## Use Case
+
+Enables efficient reconstruction and serialization of alignment paths—ideal for tools requiring low-level control over dynamic programming backtracking (e.g., pairwise aligners, edit-distance decompositions).
@@ -0,0 +1,26 @@
+# Semantic Description of `obialign` Package
+
+This Go package provides core utilities for **DNA sequence alignment scoring**, leveraging probabilistic models and log-space computations to ensure numerical stability.
+
+## Key Functionalities
+
+- **Four-bit nucleotide encoding**: Uses `_FourBitsBaseCode` (defined elsewhere in the package) to encode DNA bases as 4-bit values, enabling bitwise operations for fast comparison.
+
+- **Bitwise match ratio (`_MatchRatio`)**: Computes a normalized overlap score between two encoded bases by counting shared bits, adjusting for presence/absence in each operand.
+
+- **Log-space arithmetic helpers**:
+  - `_Logaddexp`: Stable computation of `log(exp(a) + exp(b))`.
+  - `_Log1mexp`, `_Logdiffexp`: Accurate log-domain operations for `log(1 − exp(a))` and `log(exp(a) − exp(b))`, critical for probability transformations.
+
+- **Match/mismatch scoring (`_MatchScoreRatio`)**:
+  - Derives log-probability-based scores for observed matches/mismatches using Phred-quality inputs (`QF`, `QR`).
+  - Incorporates base composition priors (e.g., a uniform distribution over the four nucleotides, via `log(3)`, `log(4)`).
+
+- **Precomputed scoring matrices**:
+  - `_NucPartMatch`: Precomputes match ratios for all base-pair combinations.
+  - `_NucScorePartMatch{Match,Mismatch}`: Stores integer-scaled alignment scores (×10) for all Phred-quality pairs, enabling fast lookup during dynamic programming.
+
+- **Thread-safe initialization**:
+  - `_InitDNAScoreMatrix` ensures one-time setup of all matrices using a mutex guard, preventing race conditions.
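
The log-space helpers listed above follow well-known numerically stable formulas. The sketch below shows one common way to write them; the names `logAddExp`, `log1mExp`, `logDiffExp` and `near` are illustrative and are not claimed to match the package's private functions.

```go
package main

import (
	"fmt"
	"math"
)

// logAddExp returns log(exp(a) + exp(b)) without overflowing.
func logAddExp(a, b float64) float64 {
	if a < b {
		a, b = b, a
	}
	if math.IsInf(a, -1) {
		return a // both terms are log(0)
	}
	return a + math.Log1p(math.Exp(b-a))
}

// log1mExp returns log(1 - exp(a)) for a < 0, switching formula around
// -ln(2) to keep precision at both ends of the range.
func log1mExp(a float64) float64 {
	if a > -math.Ln2 {
		return math.Log(-math.Expm1(a))
	}
	return math.Log1p(-math.Exp(a))
}

// logDiffExp returns log(exp(a) - exp(b)) for a > b.
func logDiffExp(a, b float64) float64 {
	return a + log1mExp(b-a)
}

// near reports whether x and y agree to within 1e-12.
func near(x, y float64) bool { return math.Abs(x-y) < 1e-12 }

func main() {
	quarter, half := math.Log(0.25), math.Log(0.5)
	fmt.Println(near(logAddExp(quarter, quarter), half))
	fmt.Println(near(log1mExp(quarter), math.Log(0.75)))
	fmt.Println(near(logDiffExp(half, quarter), quarter))
}
```

Working in log space this way lets very small error probabilities (high Phred scores) be combined without underflowing to zero.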
+
+All computations are designed for high performance and numerical robustness in large-scale sequence alignment tasks.
@@ -0,0 +1,23 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides low-level utilities for efficiently encoding, decoding, and manipulating alignment-related metrics—specifically **score**, **path length**, and an **out-flag**—within compact 64-bit integers. This design supports high-performance operations in sequence alignment pipelines (e.g., OBITools4).
+
+- **Core Encoding Strategy**:  
+  A `uint64` encodes three fields: a *score* (upper bits), an inverted path *length*, and a single-bit flag indicating whether the value represents an "out" (i.e., terminal/invalid) state.
+
+- **`encodeValues(score, length int, out bool)`**:  
+  Packs `score`, `-length-1` (to preserve ordering via unsigned comparison), and the `out` flag into one integer. A dedicated flag bit (bit 32) marks out-values.
+
+- **`decodeValues(value uint64)`**:  
+  Reverses encoding: extracts score, reconstructs original length via `((value + 1) ^ mask)`, and checks the out-flag.
+
+- **Utility Bitwise Helpers**:
+  - `_incpath(value)`: decrements stored length (since it's negated, subtraction increases actual path).
+  - `_incscore(value)`: increments score by `1 << wsize`.
+  - `_setout(value)`: clears the out-flag, marking value as *not* terminal.
+
+- **Predefined Constants**:
+  - `_empty`: neutral state (score=0, length=0).
+  - `_out`/`_notavail`: sentinel values for invalid or unavailable paths (high length, score=0).
+
+This compact representation enables fast comparisons and updates during dynamic programming or alignment graph traversal—critical for scalability in large-scale metabarcoding analyses.
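
The packing idea can be sketched as follows. The bit layout here is an assumption made for illustration (16-bit inverted length in the low bits, 16-bit score above it, flag at bit 32); the names `pack`, `unpack`, `incPath` and `incScore` are hypothetical and the real constants may differ.

```go
package main

import "fmt"

const (
	wsize  = 16                       // bits per field (assumed)
	mask   = uint64(1)<<wsize - 1     // selects one field
	outBit = uint64(1) << (2 * wsize) // bit 32 flags an "out" value
)

// pack stores score above an inverted length, so that among equal scores
// the shorter path yields the larger unsigned value.
func pack(score, length int, out bool) uint64 {
	v := uint64(score)<<wsize | uint64(-length-1)&mask
	if out {
		v |= outBit
	}
	return v
}

// unpack reverses pack; -length-1 equals ^length, so inverting the low
// field restores the length.
func unpack(v uint64) (score, length int, out bool) {
	score = int(v >> wsize & mask)
	length = int(^v & mask)
	out = v&outBit != 0
	return
}

// incPath lengthens the path by one: the stored field is inverted,
// so subtracting one adds one to the real length.
func incPath(v uint64) uint64 { return v - 1 }

// incScore adds one to the score field.
func incScore(v uint64) uint64 { return v + 1<<wsize }

func main() {
	fmt.Println(pack(5, 10, false) > pack(5, 12, false)) // same score, shorter path wins
	fmt.Println(pack(6, 20, false) > pack(5, 10, false)) // higher score wins
	fmt.Println(unpack(incScore(incPath(pack(7, 3, false)))))
}
```

With a layout like this, choosing the best of several predecessor cells in the dynamic programming matrix reduces to a single unsigned `max`.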
@@ -0,0 +1,42 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
+
+## Core Algorithm
+
+- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
+- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
+- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
+
+## Scoring Scheme
+
+- **Match**: +1 point  
+- **Mismatch or gap (indel)**: 0 points  
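
With this scoring scheme the alignment score is exactly the LCS length. The sketch below is a plain quadratic dynamic program using two rows; it deliberately omits the banding, bit-packing, error bound and IUPAC handling described in this document, and `lcsScore` is an illustrative name.

```go
package main

import "fmt"

// lcsScore returns the length of the longest common subsequence of a and b,
// scoring +1 for a match and 0 for a mismatch or gap.
func lcsScore(a, b string) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			best := prev[j] // gap in b
			if cur[j-1] > best {
				best = cur[j-1] // gap in a
			}
			diag := prev[j-1]
			if a[i-1] == b[j-1] {
				diag++ // match earns one point
			}
			if diag > best {
				best = diag
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(lcsScore("ACGT", "AGT"))           // 3
	fmt.Println(lcsScore("ACGTACGT", "ACGTTCGT")) // 7
}
```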
+
+## Key Functions
+
+1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`  
+   - Computes LCS score and alignment length between raw byte sequences.  
+   - If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).  
+   - Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.  
+   - Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
+
+2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`  
+   - Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.  
+   - Designed for standard biosequence inputs.
+
+3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`  
+   - Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
+
+## Features
+
+- **Error-bounded**: Accepts `maxError = -1` (unlimited) or a fixed maximum number of mismatches plus gaps.
+- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
+- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
+- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
+
+## Use Cases
+
+- Molecular barcode/UMI clustering  
+- Read-to-reference alignment in amplicon sequencing  
+- Similarity filtering of biological sequences
@@ -0,0 +1,15 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
+
+- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
+- **Binary encoding scheme**: Each bit in a byte corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).  
+- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
+- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
+- **Memory efficiency**: The encoding avoids allocations via optional buffer reuse.
+- **Lookup tables**:
+  - `_FourBitsBaseCode`: Maps ASCII nucleotide characters to their binary code. The index is `nuc & 31`, which folds case: both `A` and `a` map to index 1.
+  - `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
+- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
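
The encoding scheme above can be sketched with a 32-entry table. This is an illustration only: `fourBits` and `encode4bits` are hypothetical names, and the sketch takes a plain string rather than an `obiseq.BioSequence`. Gap symbols are tested explicitly because `'-' & 31` and `'.' & 31` would otherwise collide with the indices of `M` and `N`.

```go
package main

import "fmt"

// fourBits maps nuc&31 to a 4-bit code: A=bit 0, C=bit 1, G=bit 2, T=bit 3.
var fourBits [32]byte

func init() {
	codes := map[byte]byte{
		'a': 1, 'c': 2, 'g': 4, 't': 8, 'u': 8,
		'm': 3, 'r': 5, 'w': 9, 's': 6, 'y': 10, 'k': 12,
		'v': 7, 'h': 11, 'd': 13, 'b': 14, 'n': 15,
	}
	for nuc, code := range codes {
		fourBits[nuc&31] = code // &31 makes the index case-insensitive
	}
}

// encode4bits encodes seq into buffer (reused when capacity allows).
// Only '.' and '-' are special-cased here; other non-letters are not
// filtered in this sketch.
func encode4bits(seq string, buffer []byte) []byte {
	buffer = buffer[:0]
	for i := 0; i < len(seq); i++ {
		nuc := seq[i]
		if nuc == '.' || nuc == '-' {
			buffer = append(buffer, 0)
			continue
		}
		buffer = append(buffer, fourBits[nuc&31])
	}
	return buffer
}

func main() {
	fmt.Println(encode4bits("ACGT", nil)) // [1 2 4 8]
	codes := encode4bits("rN-", nil)
	fmt.Println(codes)                                 // [5 15 0]
	fmt.Println(codes[0]&encode4bits("A", nil)[0] != 0) // R is compatible with A
}
```

Two encoded symbols are compatible when their bitwise AND is non-zero, which is why a gap (`0b0000`) is compatible with nothing.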
+
+The `Encode4bits` function enables fast, space-efficient sequence processing—ideal for high-throughput sequencing data where alignment speed and memory usage are critical.
@@ -0,0 +1,19 @@
+## `obialign` Package: Semantic Overview
+
+The `obialign` package provides a lightweight, high-performance utility for **detecting single-edit-distance relationships** between biological sequences (`obiseq.BioSequence`). Its core function, `D1Or0`, determines whether two sequences are either **identical** or differ by exactly **one substitution, insertion, or deletion (indel)**.
+
+- `abs[k]`: A generic helper computing absolute values for integers or floats (via Go generics).
+- `D1Or0(...)`: Returns a 4-tuple:
+  - **`int` (first)**: `0` if identical, `1` if differing by one edit, `-1` otherwise.
+  - **`int` (second)**: Position of the differing site (`-1` if identical).
+  - **`byte`, `byte`**: Mismatched characters (or `'-'` for gaps indicating indels).
+
+**Algorithmic strategy:**
+1. Early rejection if length difference exceeds 1.
+2. Forward scan until first mismatch → identifies left boundary of divergence.
+3. Backward scan from ends to find rightmost match boundary.
+4. Validates whether the mismatch region allows exactly one edit:
+   - Single substitution: equal lengths, single divergent position.
+   - Insertion/deletion: length differs by 1 and only one non-overlapping character remains.
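
The four steps above can be sketched as follows. This is a simplified illustration on plain strings that returns only the distance class and the position; `oneEditOrEqual` is a hypothetical name, not the signature of `D1Or0`.

```go
package main

import "fmt"

// oneEditOrEqual returns (0, -1) when a and b are identical, (1, pos) when
// they differ by exactly one substitution or indel at pos, and (-1, -1)
// otherwise.
func oneEditOrEqual(a, b string) (int, int) {
	la, lb := len(a), len(b)
	if la-lb > 1 || lb-la > 1 {
		return -1, -1 // step 1: lengths too different
	}
	// Step 2: forward scan to the first mismatch.
	i := 0
	for i < la && i < lb && a[i] == b[i] {
		i++
	}
	if i == la && i == lb {
		return 0, -1
	}
	// Step 3: backward scan, never crossing the left boundary.
	ea, eb := la, lb
	for ea > i && eb > i && a[ea-1] == b[eb-1] {
		ea--
		eb--
	}
	// Step 4: the unmatched cores must describe a single edit.
	ra, rb := ea-i, eb-i
	if ra <= 1 && rb <= 1 {
		return 1, i
	}
	return -1, -1
}

func main() {
	fmt.Println(oneEditOrEqual("ACGT", "ACGT")) // 0 -1
	fmt.Println(oneEditOrEqual("ACGT", "AGGT")) // 1 1
	fmt.Println(oneEditOrEqual("ACGT", "ACT"))  // 1 2
	fmt.Println(oneEditOrEqual("ACGT", "TGCA")) // -1 -1
}
```

Both scans are linear, so the check costs O(n) time and no extra memory, in contrast to a full alignment.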
+
+Designed for speed in **OTU/ASV dereplication or error correction** pipelines (e.g., metabarcoding), where rapid filtering of near-identical sequences is critical. Does *not* compute full alignments; optimized for binary decision-making under strict edit constraints.
@@ -0,0 +1,29 @@
+# `LocatePattern` Functionality Overview
+
+The `obialign.LocatePattern` function implements a **local alignment algorithm** to find the best approximate match of a short DNA pattern (e.g., primer) within a longer biological sequence, using **dynamic programming**.
+
+- **Input**:  
+  - `id`: identifier for logging/error reporting.  
+  - `pattern []byte`: the query sequence (e.g., primer).  
+  - `sequence []byte`: the target read/contig.  
+
+- **Constraints**:  
+  - Pattern must be strictly shorter than the sequence (`len(pattern) < len(sequence)`).  
+
+- **Scoring Scheme**:  
+  - Match: `+0` (using IUPAC compatibility via `obiseq.SameIUPACNuc`).  
+  - Mismatch/Gap: `-1`.  
+
+- **Algorithm Features**:  
+  - End-gap free alignment (no penalty for gaps at sequence ends), enabling flexible primer positioning.  
+  - Uses a flattened buffer (`buffIndex`) for memory-efficient matrix storage (width × height).  
+  - Tracks alignment path via `path` array: diagonal (`0`, match/mismatch), up (`+1`, deletion in pattern/left gap), left (`-1`, insertion/deletion).  
+  - Backtracks from the bottom-right to find optimal local alignment start/end coordinates.  
+
+- **Output**:  
+  - `start`: starting index in `sequence`.  
+  - `end+1`: ending index (exclusive) of best match.  
+  - Error count: `-score`, i.e., number of mismatches/gaps in alignment.  
+
+- **Use Case**:  
+  Designed for high-throughput amplicon processing (e.g., primer trimming in metabarcoding pipelines like OBITools4).
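
A compact way to see the end-gap-free idea is the classic approximate-matching recurrence, where the first row costs nothing so the pattern may start anywhere in the sequence. The sketch below counts errors as positive costs instead of negative scores, compares bytes exactly rather than through `obiseq.SameIUPACNuc`, and carries start positions forward instead of backtracking through a path matrix; `locatePattern` is an illustrative name.

```go
package main

import "fmt"

// locatePattern returns the half-open interval [start, end) of seq that
// best matches pattern, with the number of mismatches and gaps.
func locatePattern(pattern, seq string) (start, end, errs int) {
	n := len(seq)
	cost := make([]int, n+1) // row 0: starting anywhere is free
	from := make([]int, n+1) // start position carried along each path
	for j := range from {
		from[j] = j
	}
	for i := 1; i <= len(pattern); i++ {
		nextCost := make([]int, n+1)
		nextFrom := make([]int, n+1)
		nextCost[0] = i // pattern prefix aligned before the sequence
		for j := 1; j <= n; j++ {
			best, origin := cost[j-1], from[j-1] // diagonal
			if pattern[i-1] != seq[j-1] {
				best++
			}
			if c := cost[j] + 1; c < best { // pattern base against a gap
				best, origin = c, from[j]
			}
			if c := nextCost[j-1] + 1; c < best { // sequence base against a gap
				best, origin = c, nextFrom[j-1]
			}
			nextCost[j], nextFrom[j] = best, origin
		}
		cost, from = nextCost, nextFrom
	}
	// The last row is also free at the end: take the cheapest end position.
	end = 0
	for j := 1; j <= n; j++ {
		if cost[j] < cost[end] {
			end = j
		}
	}
	return from[end], end, cost[end]
}

func main() {
	fmt.Println(locatePattern("ACGT", "TTACGTTT")) // 2 6 0
	_, _, errs := locatePattern("ACGT", "TTAGGTTT")
	fmt.Println(errs) // 1
}
```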
@@ -0,0 +1,37 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
+
+## Core Functionalities
+
+### 1. **Memory Arena Management**
+- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
+- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
+
+### 2. **Dynamic Programming Alignment Functions**
+Implements three specialized global alignment variants using Needleman–Wunsch with affine gap penalties (scaled per mismatch):
+
+- **`PELeftAlign`**: Free gaps at the *start* of `seqB` and end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
+- **`PERightAlign`**: Free gaps at start of `seqA` and end of `seqB`. Suited when the second read extends beyond the first.
+- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA ≥ seqB`. Designed for full overlap scenarios (e.g., merging paired-end reads).
+
+All use column-major matrix storage and efficient index arithmetic via helper functions `_GetMatrix`, `_SetMatrices`, etc.
+
+### 3. **Scoring & Quality Integration**
+- Pairwise base/quality scores computed by `_PairingScorePeAlign`, combining:
+  - Nucleotide compatibility (via precomputed `_NucPartMatch`)
+  - Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
+  - A user-defined `scale` factor to modulate mismatch penalties.
+
+### 4. **Fast Heuristic Pre-Alignment**
+The main `PEAlign` function integrates a kmer-based fast pre-screening:
+- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
+- If overlap is significant (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
+- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
+
+### 5. **Backtracking & Path Output**
+- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
+- Paths encoded as alternating `(offset, length)` pairs for aligned segments (diagonal = 0), with gaps encoded as `-1`/`+1`.
+
+### Use Case
+Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., the OBITools4 ecosystem). Arena reuse keeps large batch processing efficient and scalable.
@@ -0,0 +1,58 @@
+# Semantic Description of `obialign.ReadAlign`
+
+The `ReadAlign` function performs **paired-end read alignment** with quality-aware scoring, optimized for overlapping consensus construction in NGS data processing.
+
+## Core Functionality
+
+- **Input**: Two biological sequences (`seqA`, `seqB`) as `BioSequence` objects, plus alignment parameters:  
+  - `gap`: gap penalty (linear)  
+  - `scale`: scaling factor for quality scores  
+  - `delta`: extension buffer around initial overlap estimate  
+  - `fastScoreRel`: use relative vs absolute k-mer matching score  
+
+## Algorithm Overview
+
+1. **Preprocessing & Initialization**  
+   - Ensures DNA scoring matrix is initialized (`_InitDNAScoreMatrix`).  
+
+2. **Fast Overlap Estimation via 4-mer Indexing**  
+   - Builds a k-mer index of `seqA` using `obikmer.Index4mer`.  
+   - Computes optimal shift via `_FastShiftFourMer` in both forward and reverse-complement orientations.  
+   - Selects orientation (direct or reversed) yielding highest k-mer match count (`fastCount`) and score (`fastScore`).  
+
+3. **Overlap Computation**  
+   - Determines overlap length `over` based on shift:  
+     ```text
+       over = |seqA| - shift       if shift > 0  
+              |seqB| + shift       if shift < 0  
+              min(|seqA|, |seqB|)  otherwise
+     ```
+
+4. **Dynamic Programming Alignment**  
+   - If the overlap is *not* near-identical (`fastCount + 3 < over`; a perfect overlap of length `over` shares `over - 3` 4-mers):  
+     - Extracts subregions with `delta`-buffered boundaries.  
+     - Calls either `_FillMatrixPeLeftAlign` (left-aligned case) or `_FillMatrixPERightAlign`.  
+     - Backtracks via `_Backtracking` to produce alignment path.  
+   - Else (near-perfect overlap):  
+     - Skips DP; computes score directly from quality scores using `_NucScorePartMatchMatch`.  
+     - Returns trivial path `[extra5, partLen]`.
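
Steps 3 and 4 hinge on two small pieces of arithmetic, sketched below. The function names `overlapLength` and `needsDP` are illustrative. The threshold follows from counting: a stretch of `over` bases contains `over - 3` overlapping 4-mers, so `fastCount + 3 == over` means every 4-mer of the overlap matched.

```go
package main

import "fmt"

// overlapLength derives the overlap from the estimated shift between reads.
func overlapLength(lenA, lenB, shift int) int {
	switch {
	case shift > 0:
		return lenA - shift
	case shift < 0:
		return lenB + shift
	}
	if lenA < lenB {
		return lenA
	}
	return lenB
}

// needsDP reports whether the shared 4-mer count falls short of what a
// perfect overlap would produce, in which case dynamic programming runs.
func needsDP(fastCount, over int) bool {
	return fastCount+3 < over
}

func main() {
	over := overlapLength(100, 100, 20)
	fmt.Println(over)              // 80
	fmt.Println(needsDP(77, over)) // false: all 77 possible 4-mers matched
	fmt.Println(needsDP(70, over)) // true: some 4-mers differ
}
```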
+
+## Output
+
+Returns:  
+
+| Index | Type      | Meaning |
+|-------|-----------|---------|
+| 0     | `int`     | Final alignment score (weighted by quality) |
+| 1     | `[]int`   | Alignment path in the run-length encoding produced by `_Backtracking` (the trivial path is `[extra5, partLen]`) |
+| 2     | `int`     | K-mer match count (`fastCount`) |
+| 3     | `int`     | Overlap length (`over`) |
+| 4     | `float64` | K-mer-based score (`fastScore`) |
+| 5     | `bool`    | Whether alignment was performed in direct orientation (`true`) or on the reverse complement of `seqB` (`false`) |
+
+## Key Design Highlights
+
+- **Efficient pre-filtering** using 4-mers avoids full DP for nearly identical reads.  
+- **Quality-aware scoring**, leveraging Phred scores via `_NucScorePartMatchMatch`.  
+- Supports **asymmetric overlaps** (left/right alignment) with boundary padding (`delta`).  
+- Uses preallocated memory arenas to minimize GC pressure in high-throughput pipelines.
@@ -0,0 +1,25 @@
+# Apat Package: Pattern Matching for Biological Sequences
+
+The `obiapat` Go package provides high-performance pattern matching over biological sequences using the **Apat algorithm**, a C-based implementation wrapped in Go. It supports fuzzy matching (with mismatches and indels), reverse-complement patterns, memory-safe resource management via finalizers, and efficient filtering of non-overlapping matches.
+
+## Core Types
+
+- `ApatPattern`: Represents a compiled pattern (up to 64 bp), supporting IUPAC ambiguity codes (`W`, `[AT]`), negated bases (`!A`), and fixed positions (`#`).  
+- `ApatSequence`: Wraps a biological sequence (from `obiseq.BioSequence`) for fast matching, with optional circular topology support and memory recycling.
+
+## Key Functions & Methods
+
+- `MakeApatPattern(pattern string, errormax int, allowsIndel bool)`: Compiles a pattern with max error tolerance and optional indels.  
+- `ReverseComplement()`: Returns the reverse-complemented pattern (useful for DNA strand symmetry).  
+- `FindAllIndex(...)`: Returns all matches as `[start, end, errors]`, supporting partial sequence searches.  
+- `IsMatching(...)`: Boolean check for presence of at least one match in a region.  
+- `BestMatch(...)`: Finds the *best* (lowest-error) match, with local realignment for indel-containing patterns.  
+- `FilterBestMatch(...)`: Returns *non-overlapping* matches, prioritizing lower-error occurrences.  
+- `AllMatches(...)`: Filters and refines all valid matches (including indel-aware alignment).  
+- `Free()`, `Len()`: Explicit memory cleanup and length queries.
+
+## Implementation Notes
+
+Internally, the package uses `cgo` to interface with C structures (`Pattern`, `Seq`) allocated via custom memory management. Finalizers ensure safe deallocation, while unsafe pointer arithmetic avoids data copying during search (e.g., `unsafe.SliceData`). Logging is integrated via Logrus.
+
+This package enables scalable, low-level pattern mining in NGS data preprocessing pipelines (e.g., primer detection, adapter trimming).
@@ -0,0 +1,32 @@
+# Semantic Description of `obiapat` Package Functionality
+
+The `obiapat` package provides utilities for constructing and representing **approximate sequence patterns**—flexible biological or symbolic string templates supporting mismatches, insertions, and deletions.
+
+## Core Functionality
+
+- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool)`**  
+  Parses a pattern specification (e.g., `"A[T]C!GT"`) and returns an internal representation (`*ApatPattern`) suitable for approximate matching.
+
+  - `pattern`: A string where:
+    - Standard characters (e.g., `'A'`, `'C'`) denote exact matches.
+    - Brackets such as `[AT]` denote a set of bases allowed at that position (an explicit alternative to an IUPAC ambiguity code).
+    - An exclamation mark negates the base that follows it (`!G` matches any base except `G`).
+  - `errormax`: Maximum number of allowed errors (mismatches or indels, depending on flags).
+  - `allowsIndel`: Boolean flag enabling/disabling insertion/deletion operations.
+
+## Behavior & Semantics
+
+- Returns a compiled pattern object (non-nil) on success; errors may arise from malformed input or invalid parameters.
+- Supports three modes:
+  - **Exact matching** (`errormax = 0`, `allowsIndel = false`).
+  - **Substitution-only approximation** (`errormax > 0`, `allowsIndel = false`).
+  - **Full approximate matching with indels** (`errormax > 0`, `allowsIndel = true`).
+
+## Testing Coverage
+
+The provided test suite validates:
+- Valid pattern parsing across different configurations.
+- Correct handling of `nil` vs. non-nil output pointers.
+- Robustness against error conditions (e.g., invalid inputs would trigger expected errors).
+
+In summary, `obiapat` enables efficient definition and handling of *approximate regular expressions* tailored for sequence analysis in bioinformatics or pattern recognition contexts.
@@ -0,0 +1,27 @@
+# PCR Simulation Module (`obiapat`)
+
+This Go package implements a **PCR (Polymerase Chain Reaction) simulation algorithm** for biological sequence analysis. It supports flexible primer matching, amplicon extraction with optional flanking extensions, and handles both linear and circular DNA topologies.
+
+## Key Functionalities
+
+- **Primer Matching**: Accepts forward/reverse primers with configurable mismatch tolerance (`OptionForwardPrimer`, `OptionReversePrimer`). Internally builds pattern objects and their reverse complements.
+- **Amplicon Extraction**: Identifies valid amplicons bounded by primer pairs, respecting user-defined length constraints (`OptionMinLength`, `OptionMaxLength`).
+- **Extension Support**: Optionally adds fixed-length flanking regions (`OptionWithExtension`) — either strict full-extension only or partial trimming allowed.
+- **Topology Handling**: Supports linear (`Circular: false`) and circular DNA sequences via `OptionCircular`.
+- **Batch & Parallel Processing**: Configurable batch size (`OptionBatchSize`) and parallel workers count (`OptionParallelWorkers`), enabling efficient processing of large datasets.
+- **Annotation-Rich Output**: Each amplicon includes detailed annotations (primer sequences, match positions, errors, direction), preserving original sequence metadata.
+
+## Core API
+
+- `PCRSim(sequence, options...)`: Simulates PCR on a single sequence.
+- `PCRSlice(sequencesSlice, options...)`: Applies simulation across multiple sequences in a slice.
+- `PCRSliceWorker(options...)`: Returns a reusable worker function for parallel execution via `obiseq.MakeISliceWorker`.
+
+## Implementation Details
+
+- Uses pattern-matching (`ApatPattern`) with fuzzy search to locate primers.
+- Handles circular topology by wrapping indices around sequence boundaries.
+- Reuses internal memory via `MakeApatSequence`/`Free`, supporting efficient GC and large-scale processing.
+- Logs critical errors with `logrus`; debug-level details for amplicon generation.
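
The core of amplicon extraction can be sketched in a few lines. This sketch is heavily simplified and rests on stated assumptions: primers are matched exactly (no mismatches), the topology is linear, only the forward strand is scanned, and the amplicon is reported without its primers. The names `amplicons` and `reverseComplement` are illustrative, not part of the `obiapat` API.

```go
package main

import "fmt"

// reverseComplement returns the reverse complement of an upper-case DNA
// string; unknown symbols become N.
func reverseComplement(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// amplicons returns every region lying between an occurrence of fwd and a
// later occurrence of the reverse complement of rev, keeping only regions
// whose length is within [minLen, maxLen].
func amplicons(seq, fwd, rev string, minLen, maxLen int) []string {
	rc := reverseComplement(rev)
	var out []string
	for i := 0; i+len(fwd) <= len(seq); i++ {
		if seq[i:i+len(fwd)] != fwd {
			continue
		}
		begin := i + len(fwd)
		for j := begin; j+len(rc) <= len(seq); j++ {
			if seq[j:j+len(rc)] != rc {
				continue
			}
			if l := j - begin; l >= minLen && l <= maxLen {
				out = append(out, seq[begin:j])
			}
		}
	}
	return out
}

func main() {
	seq := "TTACGTGGGGGGAACCTT"
	fmt.Println(amplicons(seq, "ACGT", "GGTT", 1, 10)) // [GGGGGG]
	fmt.Println(len(amplicons(seq, "ACGT", "GGTT", 1, 5)))
}
```

The reverse primer is searched as its reverse complement because it anneals to the opposite strand.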
+
+Designed to integrate within the OBITools4 ecosystem, this module enables high-fidelity *in silico* PCR for metabarcoding and NGS data validation workflows.
@@ -0,0 +1,23 @@
+## Semantic Description of `IsPatternMatchSequence`
+
+The function `IsPatternMatchSequence` defines a **sequence predicate** for pattern-based matching in biological sequences (e.g., DNA/RNA), supporting fuzzy and strand-aware search.
+
+### Core Functionality:
+- **Input Parameters**  
+  - `pattern`: A regular expression-like string describing the target pattern.  
+  - `errormax`: Maximum allowed mismatches (substitutions only by default).  
+  - `bothStrand`: If true, also search on the reverse-complement strand.  
+  - `allowIndels`: Enables insertion/deletion errors (beyond mismatches) when set to true.
+
+- **Internal Workflow**  
+  - Parses the pattern into an automaton (`apat`) via `MakeApatPattern`.  
+  - Computes its reverse complement for dual-strand matching.  
+  - Returns a closure (`SequencePredicate`) that tests whether a given `BioSequence` matches the pattern (or its RC), within error tolerance.
+
+- **Matching Logic**  
+  - Converts input sequence to `apat` format.  
+  - Checks match on forward strand first; if failed and `bothStrand=true`, tries reverse complement.  
+  - Uses automaton-based matching (`IsMatching`) for efficient fuzzy search.
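
The predicate-building workflow above can be sketched as a closure. To stay self-contained, this sketch replaces the Apat matcher with a naive sliding-window mismatch count (substitutions only, no indels, no IUPAC codes); `isPatternMatch`, `matchesWithin` and `reverseComplement` are hypothetical names.

```go
package main

import "fmt"

// reverseComplement returns the reverse complement of an upper-case DNA
// string; unknown symbols become N.
func reverseComplement(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// matchesWithin reports whether pattern occurs in seq with at most maxErr
// substitutions.
func matchesWithin(pattern, seq string, maxErr int) bool {
	for i := 0; i+len(pattern) <= len(seq); i++ {
		errs := 0
		for j := 0; j < len(pattern) && errs <= maxErr; j++ {
			if pattern[j] != seq[i+j] {
				errs++
			}
		}
		if errs <= maxErr {
			return true
		}
	}
	return false
}

// isPatternMatch prepares the pattern and its reverse complement once and
// returns a predicate that can be applied to many sequences.
func isPatternMatch(pattern string, maxErr int, bothStrand bool) func(string) bool {
	rc := reverseComplement(pattern)
	return func(seq string) bool {
		if matchesWithin(pattern, seq, maxErr) {
			return true
		}
		return bothStrand && matchesWithin(rc, seq, maxErr)
	}
}

func main() {
	both := isPatternMatch("ACGG", 0, true)
	fmt.Println(both("TTCCGTAA")) // true: CCGT is the reverse complement of ACGG
	forward := isPatternMatch("ACGG", 0, false)
	fmt.Println(forward("TTCCGTAA")) // false
}
```

Building the closure once and reusing it keeps the per-sequence cost limited to the matching itself.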
+
+### Semantic Use Case:
+Enables flexible, error-tolerant detection of sequence motifs (e.g., primers, barcodes) in high-throughput sequencing data—supporting both *in silico* primer design validation and read filtering in metagenomic pipelines.
@@ -0,0 +1,15 @@
+# `ISequenceChunk` Function — Semantic Description
+
+The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.
+
+- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
+- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
+- Optional parameters allow fine-tuning:
+  - `dereplicate`: enables deduplication of identical sequences.
+  - `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
+  - `statsOn`: configures what metadata (e.g., description fields) are tracked for statistics.
+  - `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.
+
+The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.
+
+This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
@@ -0,0 +1,18 @@
+# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences
+
+The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
+
+Key capabilities include:
+
+- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
+- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
+- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
+- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
+- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
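
Per-batch dereplication with a composite key can be sketched as below. The `record` type and `dereplicate` function are hypothetical stand-ins for `obiseq` types; here the "aggregated statistics" are reduced to a single count.

```go
package main

import "fmt"

// record is a minimal stand-in for a sequence with one category and a count.
type record struct {
	Sequence string
	Category string
	Count    int
}

// dereplicate merges records sharing both sequence and category, summing
// their counts and keeping first-seen order.
func dereplicate(batch []record) []record {
	index := map[[2]string]int{} // composite key -> position in out
	var out []record
	for _, r := range batch {
		key := [2]string{r.Sequence, r.Category}
		if i, seen := index[key]; seen {
			out[i].Count += r.Count
			continue
		}
		index[key] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	batch := []record{
		{"ACGT", "s1", 1},
		{"ACGT", "s1", 2},
		{"ACGT", "s2", 1},
		{"TTTT", "s1", 1},
	}
	for _, r := range dereplicate(batch) {
		fmt.Println(r.Sequence, r.Category, r.Count)
	}
}
```

Because the key includes the category, identical sequences from different categories stay separate.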
+
+Internally, `ISequenceChunkOnDisk` uses:
+- `obiiter.MakeIBioSequence()` to build the output iterator,
+- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
+- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
+
+Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
@@ -0,0 +1,21 @@
+# `ISequenceChunkOnMemory` Function — Semantic Description
+
+The function `ISequenceChunkOnMemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.
+
+It consumes an iterator over `BioSequence` objects and distributes them into **heterogeneous batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.
+
+Key features:
+- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
+- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to shared `chunks` and `sources` maps.
+- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
+- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
+- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.
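
The concurrency pattern described by these features can be sketched with channels, a mutex and a wait group. This is an illustration on plain strings: `chunkOnMemory` is a hypothetical name, and sorting the class keys stands in for the `order` index used to make emission deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// chunkOnMemory routes each item to a per-class channel (one "flux" per
// class), collects every class in its own goroutine, and returns the
// groups only once all collectors have finished.
func chunkOnMemory(items []string, classify func(string) string) [][]string {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		chunks = map[string][]string{}
		fluxes = map[string]chan string{}
	)
	for _, item := range items {
		key := classify(item)
		ch, ok := fluxes[key]
		if !ok {
			ch = make(chan string)
			fluxes[key] = ch
			wg.Add(1)
			go func(key string, ch <-chan string) {
				defer wg.Done()
				var group []string
				for s := range ch {
					group = append(group, s)
				}
				mu.Lock() // the shared map is written by several goroutines
				chunks[key] = group
				mu.Unlock()
			}(key, ch)
		}
		ch <- item
	}
	for _, ch := range fluxes {
		close(ch)
	}
	wg.Wait() // lazy emission: nothing is returned before all groups are done

	keys := make([]string, 0, len(chunks))
	for k := range chunks {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	out := make([][]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, chunks[k])
	}
	return out
}

func main() {
	firstLetter := func(s string) string { return s[:1] }
	fmt.Println(chunkOnMemory([]string{"a1", "b1", "a2", "c1", "b2"}, firstLetter))
	// [[a1 a2] [b1 b2] [c1]]
}
```

Each channel has a single sender, so input order is preserved inside every group even though the groups are collected concurrently.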
+
+Input:
+- An iterator (`obiiter.IBioSequence`) of raw sequences.
+- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.
+
+Output:
+- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.
+
+Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
@@ -0,0 +1,26 @@
+# Semantic Description of `obichunk` Package
+
+The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.
+
+## Core Concepts
+
+- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
+- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.
+
+## Supported Functionalities
+
+- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
+- **Missing Value Handling**: `OptionNAValue(na string)` customizes placeholder for missing data (default: `"NA"`).
+- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
+- **Batch Processing Control**:
+  - `OptionBatchCount(number)` sets the number of batches.
+  - `OptionsBatchSize(size)` defines how many items per batch (default from `obidefault`).
+- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures concurrency level (default from environment).
+- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
+- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).
+
+## Design Highlights
+
+- Functional options pattern for extensibility and readability.
+- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
+- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
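The functional options pattern described above can be sketched as follows. The names mirror those mentioned in this description, but the field layout, the signatures, and the empty-string result of `PopCategories()` on an empty list are assumptions made for illustration.

```go
package main

import "fmt"

// options holds the concrete settings and stays private to the package.
type options struct {
	categories []string
	naValue    string
}

// Options wraps a pointer so that copies share the same settings.
type Options struct{ pointer *options }

// WithOption is a setter applied by MakeOptions.
type WithOption func(Options)

// MakeOptions starts from defaults and applies each setter in order.
func MakeOptions(setters []WithOption) Options {
	o := Options{&options{naValue: "NA"}}
	for _, set := range setters {
		set(o)
	}
	return o
}

// OptionSubCategory appends category labels to the internal list.
func OptionSubCategory(keys ...string) WithOption {
	return func(o Options) {
		o.pointer.categories = append(o.pointer.categories, keys...)
	}
}

// OptionNAValue replaces the placeholder used for missing values.
func OptionNAValue(na string) WithOption {
	return func(o Options) { o.pointer.naValue = na }
}

// NAValue returns the current placeholder for missing values.
func (o Options) NAValue() string { return o.pointer.naValue }

// PopCategories removes and returns the first category ("" when none is left).
func (o Options) PopCategories() string {
	if len(o.pointer.categories) == 0 {
		return ""
	}
	c := o.pointer.categories[0]
	o.pointer.categories = o.pointer.categories[1:]
	return c
}

func main() {
	opts := MakeOptions([]WithOption{OptionSubCategory("sample", "marker")})
	fmt.Println(opts.PopCategories(), opts.PopCategories(), opts.NAValue())
}
```

New settings can be added as further `WithOption` constructors without changing the signature of `MakeOptions`.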
@@ -0,0 +1,29 @@
+# Semantic Description of `obichunk.ISequenceSubChunk`
+
+The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.
+
+## Core Functionality
+
+- **Input**:  
+  - An iterator over `BioSequence` batches (`obiiter.IBioSequence`)  
+  - A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code  
+  - A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism  
+
+- **Processing**:  
+  - Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.  
+  - For each incoming `BioSequenceBatch`:  
+    - If the batch has >1 sequence: sequences are extracted, classified into `code`, and sorted *in-place* by class code.  
+    - Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon code change.  
+    - If the batch has ≤1 sequence, it’s passed through unchanged (but reordered with a new order ID).  
+
+- **Ordering Mechanism**:  
+  - Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.  
+  - Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).  
+
+- **Output**:  
+  - Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.  
+  - Workers are coordinated via `newIter.Done()`/`Wait()`/`Close()`, ensuring clean termination.
+
+## Semantic Purpose
+
+Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires sorted/class-ordered input — e.g., consensus building, alignment, or read merging per group.
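The per-batch processing step can be illustrated with a short sketch. It assumes strings in place of sequences and a hypothetical `subChunk` helper. For brevity it uses `sort.SliceStable` on a copy instead of the custom `sort.Interface` and in-place sort mentioned above.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// batch is a hypothetical stand-in for a BioSequenceBatch.
type batch struct {
	order int32
	code  int
	seqs  []string
}

// subChunk sorts one batch by class code, then emits one new batch per run
// of equal codes. Order IDs come from a shared atomic counter so that
// several workers could call subChunk concurrently.
func subChunk(seqs []string, classify func(string) int, nextOrder *int32) []batch {
	sorted := append([]string(nil), seqs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return classify(sorted[i]) < classify(sorted[j])
	})

	var out []batch
	start := 0
	for i := 1; i <= len(sorted); i++ {
		// Close the current run at the end of input or when the code changes.
		if i == len(sorted) || classify(sorted[i]) != classify(sorted[start]) {
			out = append(out, batch{
				order: atomic.AddInt32(nextOrder, 1) - 1,
				code:  classify(sorted[start]),
				seqs:  sorted[start:i],
			})
			start = i
		}
	}
	return out
}

func main() {
	var nextOrder int32
	byLength := func(s string) int { return len(s) }
	for _, b := range subChunk([]string{"ccc", "a", "bb", "d", "ee"}, byLength, &nextOrder) {
		fmt.Println(b.order, b.code, b.seqs)
	}
}
```

The stable sort keeps the input order of sequences that share a class code, which matches the "preserving input order within each batch" property stated above.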
@@ -0,0 +1,45 @@
+# Semantic Description of `IUniqueSequence` Functionality
+
+The `IUniqueSequence` function performs **dereplication** of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.
+
+## Core Workflow
+
+1. **Input Processing**  
+   Accepts an input sequence iterator and optional configuration via `WithOption`.
+
+2. **Parallelization Strategy**  
+   Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
+
+3. **Data Splitting Phase**  
+   - Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).  
+   - Ensures deterministic chunking for reproducibility.
+
+4. **Storage Choice**  
+   - *In-memory*: via `ISequenceChunkOnMemory`.  
+   - *Disk-based*: via `ISequenceSubChunk` + external sorting (requires single worker).
+
+5. **Uniqueness Classification**  
+   - Builds a composite classifier combining:
+     - Sequence identity (`SequenceClassifier`)
+     - Optional annotation categories (e.g., sample, primer), with NA handling.
+   - If no annotations are specified, only raw sequence identity is used.
+
+6. **Singleton Filtering**  
+   Optionally excludes singleton reads (count = 1) via `NoSingleton()` option.
+
+7. **Parallel Dereplication**  
+   - Spawns worker goroutines to process chunks.
+   - Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
+
+8. **Output Merging**  
+   - Aggregates results using `IMergeSequenceBatch`, preserving:
+     - Sequence counts
+     - Statistics (if enabled)
+     - NA handling and ordering
+
+## Key Features
+
+- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
+- **Configurable**: Via functional options (`Options`).
+- **Thread-safe**: Uses `sync.Mutex` for deterministic ordering.
+- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).
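The core idea of dereplication with a composite key can be sketched in a few lines. This is a single-threaded illustration under stated assumptions: the `read` and `unique` types and the `dereplicate` helper are hypothetical, and the chunking, parallel workers, and merging steps listed above are left out.

```go
package main

import "fmt"

// read pairs a sequence with one annotation category (here: the sample).
type read struct{ seq, sample string }

// unique is a dereplicated read together with its abundance.
type unique struct {
	read
	count int
}

// dereplicate groups reads that share both sequence and sample, keeping the
// order of first appearance. With noSingleton set, groups seen only once
// are dropped.
func dereplicate(reads []read, noSingleton bool) []unique {
	index := make(map[read]int) // composite key -> position in out
	var out []unique
	for _, r := range reads {
		if i, ok := index[r]; ok {
			out[i].count++
			continue
		}
		index[r] = len(out)
		out = append(out, unique{read: r, count: 1})
	}
	if !noSingleton {
		return out
	}
	kept := out[:0]
	for _, u := range out {
		if u.count > 1 {
			kept = append(kept, u)
		}
	}
	return kept
}

func main() {
	reads := []read{{"acgt", "s1"}, {"acgt", "s1"}, {"acgt", "s2"}, {"ttga", "s1"}}
	for _, u := range dereplicate(reads, false) {
		fmt.Println(u.seq, u.sample, u.count)
	}
}
```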
@@ -0,0 +1,28 @@
+# Aho-Corasick-Based Sequence Analysis in `obicorazick`
+
+This Go package provides efficient pattern-matching utilities for biological sequence data, leveraging the Aho-Corasick algorithm.
+
+## Core Components
+
+- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**  
+  Builds *multiple* Aho-Corasick matchers in parallel (batched to manage memory), then returns a `SeqWorker` function.  
+  - Scans each sequence *forward* and its reverse complement.
+  - Counts total matches (`slot`), forward-only (`_Fwd`) and reverse-complement-specific (`_Rev`) matches.
+  - Attaches match counts as sequence attributes.
+
+- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**  
+  Compiles a *single* matcher and returns a predicate function.  
+  - Returns `true` if the number of matches ≥ `minMatches`.
+  - Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
+
+## Technical Highlights
+
+- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns/batch) to avoid memory overload.
+- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
+- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
+- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
+
+## Use Cases
+
+- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
+- Filtering or annotating sequences based on pattern presence/abundance.
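The forward and reverse-complement annotation logic can be illustrated as follows. This sketch deliberately replaces the Aho-Corasick matcher with a naive substring search, so it shows only how the three counters relate. The attribute key layout (`slot`, `slot_Fwd`, `slot_Rev`) and the helper names are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// reverseComplement returns the reverse complement of a lowercase DNA string;
// any symbol other than a, c, g, t becomes 'n'.
func reverseComplement(s string) string {
	comp := map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'n'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// countMatches counts all (possibly overlapping) occurrences of every pattern.
// A naive scan is used here in place of an Aho-Corasick automaton.
func countMatches(seq string, patterns []string) int {
	n := 0
	for _, p := range patterns {
		if p == "" {
			continue
		}
		for from := 0; ; {
			i := strings.Index(seq[from:], p)
			if i < 0 {
				break
			}
			n++
			from += i + 1
		}
	}
	return n
}

// annotate returns the total, forward, and reverse-complement match counts.
func annotate(seq, slot string, patterns []string) map[string]int {
	fwd := countMatches(seq, patterns)
	rev := countMatches(reverseComplement(seq), patterns)
	return map[string]int{slot: fwd + rev, slot + "_Fwd": fwd, slot + "_Rev": rev}
}

func main() {
	fmt.Println(annotate("acgtaa", "hits", []string{"acg", "tt"}))
}
```

A predicate in the style of `AhoCorazickPredicate` would compare one of these counts against `minMatches`.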
@@ -0,0 +1,34 @@
+# ObiDefault Package: Batch Configuration Module
+
+This Go module provides centralized configuration for sequence batching in OBITools, supporting both **count-based** and **memory-aware** batch processing.
+
+## Core Features
+
+- `_BatchSize` / `SetBatchSize()`  
+  Defines and configures the *minimum* number of sequences per batch (default: `1`).  
+  Used internally as `minSeqs` in `RebatchBySize`.
+
+- `_BatchSizeMax` / `SetBatchSizeMax()`  
+  Sets the *maximum* sequences per batch (default: `2000`). Batches are flushed upon reaching this limit, regardless of memory.
+
+- **CLI & Environment Integration**  
+  Batch size is determined by `--batch-size` CLI flag and/or the `OBIBATCHSIZE` environment variable (via parsing logic not shown here but implied by comments).
+
+- `_BatchMem` / `SetBatchMem(n int)`  
+  Configures the *maximum memory per batch* (default: `128 MB`). A value of `0` disables memory-based batching, falling back to pure count-based logic.
+
+- `_BatchMemStr`  
+  Stores the *raw CLI string* passed to `--batch-mem` (e.g., `"256M"`, `"1G"`), enabling human-readable input parsing elsewhere.
+
+## Utility Functions
+
+- `BatchSizePtr()`, `BatchMemPtr()`  
+  Expose pointers to internal variables for direct modification (e.g., by command-line option parsers).
+
+- `BatchSizeMaxPtr()`, `BatchMemStrPtr()`  
+  Provide read/write access to max-size and raw memory string values.
+
+## Design Intent
+
+- Separates **configuration** (defaults, CLI/env parsing) from **processing logic**, enabling modular and testable batch handling.
+- Supports both scalable, large-scale processing (via count limits) and memory-constrained environments (via soft RAM caps).
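How the three settings might interact can be sketched as a flush policy. The exact behaviour of `RebatchBySize` is not shown in this description, so the rule below is an assumption: flush at the maximum count, or at the memory cap once the minimum count is reached. `rebatch` is a hypothetical helper working on item sizes in bytes.

```go
package main

import "fmt"

// rebatch groups item sizes into batches. A batch is flushed when it holds
// maxSeqs items, or when its total size reaches maxMem and it already holds
// at least minSeqs items. maxMem == 0 disables the memory rule.
func rebatch(sizes []int, minSeqs, maxSeqs, maxMem int) [][]int {
	var out [][]int
	var cur []int
	mem := 0
	for _, s := range sizes {
		cur = append(cur, s)
		mem += s
		full := len(cur) >= maxSeqs
		heavy := maxMem > 0 && mem >= maxMem && len(cur) >= minSeqs
		if full || heavy {
			out = append(out, cur)
			cur, mem = nil, 0
		}
	}
	if len(cur) > 0 {
		out = append(out, cur) // flush the remainder at end of input
	}
	return out
}

func main() {
	sizes := []int{60, 60, 60, 10}
	fmt.Println(rebatch(sizes, 1, 3, 100)) // memory-aware batching
	fmt.Println(rebatch(sizes, 1, 3, 0))   // pure count-based batching
}
```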
@@ -0,0 +1,35 @@
+# Output Compression Control Module
+
+This Go package (`obidefault`) provides a simple, global configuration mechanism for toggling output compression behavior across an application.
+
+## Core Features
+
+- **Global Compression Flag**: A package-level boolean variable `__compress__` (default: `false`) controls whether output should be compressed.
+- **Read Access**:  
+  - `CompressOutput()` returns the current compression setting as a boolean.
+- **Write Access**:  
+  - `SetCompressOutput(b bool)` updates the compression flag to a new value.
+- **Pointer Access**:  
+  - `CompressOutputPtr()` returns a pointer to the internal flag, enabling indirect modification (e.g., for UI bindings or reflection-based updates).
+
+## Design Intent
+
+- Minimal, side-effect-free API.
+- Thread-safety *not* guaranteed — intended for use in single-threaded initialization or controlled environments.
+- Encapsulation via unexported variable `__compress__`, enforced through accessor functions.
+
+## Typical Usage
+
+```go
+// Enable compression globally:
+obidefault.SetCompressOutput(true)
+
+if obidefault.CompressOutput() {
+    // Apply compression logic (e.g., gzip, brotli)
+}
+```
+
+## Notes
+
+- `__compress__` is unexported because its name does not start with an uppercase letter, so the Go compiler prevents access from other packages; the double underscore prefix is only a naming style.
+- Designed for runtime configurability without recompilation.
@@ -0,0 +1,38 @@
+# `obidefault` Package — Semantic Overview
+
+This minimal Go package provides a centralized, mutable global flag for controlling warning verbosity across an application.
+
+## Core Functionality
+
+- **`__silent_warning__`**:  
+  A package-level boolean variable (unexported) that determines whether warnings should be suppressed.
+
+- **`SilentWarning() bool`**:  
+  A read-only accessor returning the current state of `__silent_warning__`. Enables safe, non-mutating checks elsewhere in the codebase.
+
+- **`SilentWarningPtr() *bool`**:  
+  Returns a pointer to `__silent_warning__`, allowing external code (e.g., CLI parsers, config loaders) to directly mutate the flag — e.g., `*SilentWarningPtr() = true`.
+
+## Design Intent
+
+- **Simplicity & Centralization**:  
+  Avoids scattering warning-control logic; provides a single source of truth.
+
+- **Flexibility**:  
+  Supports both *read-only* inspection (via `SilentWarning()`) and *global mutation* (via pointer), useful for early initialization phases.
+
+- **Explicit Semantics**:  
+  When `SilentWarning()` returns `true`, all warning-generating code *should* suppress output (implementation responsibility lies outside this package).
+
+## Usage Example
+
+```go
+// Suppress warnings globally:
+*obidefault.SilentWarningPtr() = true
+
+if !obidefault.SilentWarning() {
+    log.Println("⚠️ Warning: something happened")
+}
+```
+
+> **Note**: `__silent_warning__` is unexported (its name does not begin with an uppercase letter), so other packages can reach it only through `SilentWarning()` and `SilentWarningPtr()`.
@@ -0,0 +1,33 @@
+# Progress Bar Control Module (`obidefault`)
+
+This Go package provides a simple, global mechanism to enable or disable progress bar display across an application.
+
+## Core Functionality
+
+- **`ProgressBar()`**: Returns `true` if progress bars are *enabled* (i.e., when `__no_progress_bar__` is `false`).  
+- **`NoProgressBar()`**: Returns the current state of `__no_progress_bar__`, i.e., whether progress bars are *disabled*.  
+- **`SetNoProgressBar(b bool)`**: Sets the global flag `__no_progress_bar__`. Passing `true` disables progress bars; passing `false` enables them.  
+- **`NoProgressBarPtr()`**: Returns a pointer to the internal `__no_progress_bar__` variable, allowing direct read/write access (e.g., for reflection or UI binding).
+
+## Design Intent
+
+- Centralizes progress bar visibility control in one place.
+- Supports both boolean query/set and pointer-based manipulation for flexibility (e.g., CLI flags, config binding).
+- Uses a *negative* flag name (`__no_progress_bar__`) internally to default progress bars **on** (i.e., `false` → enabled).
+
+## Usage Example
+
+```go
+// Disable progress bars globally:
+obidefault.SetNoProgressBar(true)
+
+// Check status:
+if !obidefault.ProgressBar() {
+    log.Println("Progress bars are disabled.")
+}
+```
+
+## Notes
+
+- Thread-safety is *not* guaranteed; concurrent access should be externally synchronized.
+- `__no_progress_bar__` is unexported, which the Go compiler enforces; the double underscore prefix is a project naming style rather than a Go convention.
@@ -0,0 +1,26 @@
+# Quality Shift and Read/Write Control Module
+
+This Go package (`obidefault`) provides configurable controls over quality score handling in sequence data processing (e.g., FASTQ files). It defines three global variables and corresponding accessor/mutator functions:
+
+- `_Quality_Shift_Input`: Input quality score offset (default: `33`, i.e., Phred+33/Sanger format).
+- `_Quality_Shift_Output`: Output quality score offset (default: `33`), allowing format conversion.
+- `_Read_Qualities`: Boolean flag indicating whether quality scores should be parsed/processed (`true` by default).
+
+## Public API
+
+| Function | Purpose |
+|---------|--------|
+| `SetReadQualitiesShift(shift byte)` | Sets the quality score offset for *input* data (e.g., when reading FASTQ). |
+| `ReadQualitiesShift() byte` | Returns the current input quality offset. |
+| `SetWriteQualitiesShift(shift byte)` | Sets the quality score offset for *output* data (e.g., when writing FASTQ). |
+| `WriteQualitiesShift() byte` | Returns the current output quality offset. |
+| `SetReadQualities(read bool)` | Enables/disables reading/processing of quality scores. |
+| `ReadQualities() bool` | Returns whether qualities are currently being read/used. |
+
+## Semantic Use Cases
+
+- **Format Interoperability**: Allows seamless conversion between Phred+33 (Sanger), Phred+64, or other quality encodings.
+- **Performance Optimization**: Disabling `ReadQualities` skips parsing of quality strings, useful when only sequences are needed.
+- **Centralized Configuration**: Global state enables consistent behavior across modules without passing parameters.
+
+All functions are thread-unsafe by design—intended for initialization before concurrent processing begins.
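The role of the two offsets can be shown with a small conversion sketch. The helper names are hypothetical; the arithmetic is the standard ASCII encoding of Phred scores, where the stored character is the score plus the offset.

```go
package main

import "fmt"

// decodeQualities turns an ASCII quality string into Phred scores by
// subtracting the input offset (33 for Sanger, 64 for older Illumina files).
func decodeQualities(ascii string, shift byte) []byte {
	q := make([]byte, len(ascii))
	for i := 0; i < len(ascii); i++ {
		q[i] = ascii[i] - shift
	}
	return q
}

// encodeQualities turns Phred scores back into ASCII using the output offset.
func encodeQualities(q []byte, shift byte) string {
	out := make([]byte, len(q))
	for i, v := range q {
		out[i] = v + shift
	}
	return string(out)
}

func main() {
	scores := decodeQualities("II5", 33) // Phred+33 input
	fmt.Println(scores)
	fmt.Println(encodeQualities(scores, 64)) // Phred+64 output
}
```

Using different input and output offsets is what makes format conversion possible: scores are decoded with one shift and re-encoded with the other.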
@@ -0,0 +1,21 @@
+# `obidefault` Package: Configuration State Management
+
+This Go package provides a centralized, thread-safe(ish) configuration layer for taxonomy-related settings in the OBIDMS (Open Biological and Biomedical Data Management System) framework. It exposes simple getters, setters, and pointer accessors for five core boolean/string flags that control how taxonomic identifiers (taxids) are handled during data processing.
+
+## Core Configuration Flags
+
+- `__taxonomy__`: Stores the currently selected taxonomy (e.g., `"NCBI"`, `"UNIPROT"`).  
+- `__alternative_name__`: Enables/disables use of alternative taxonomic names (e.g., synonyms).  
+- `__fail_on_taxonomy__`: If true, processing halts on taxonomy mismatches/errors.  
+- `__update_taxid__`: If true, taxids are auto-updated to current NCBI/DB versions.  
+- `__raw_taxid__`: If true, raw (unprocessed) taxids are preserved instead of normalized.
+
+## Public API
+
+- **Getters**: `UseRawTaxids()`, `SelectedTaxonomy()`, `HasSelectedTaxonomy()`, etc., return current values.  
+- **Pointer Accessors**: e.g., `SelectedTaxonomyPtr()` returns a pointer for direct mutation (advanced use).  
+- **Setters**: `SetSelectedTaxonomy()`, `SetAlternativeNamesSelected()`, etc., update state.
+
+## Use Case
+
+Typically used at application startup to configure global behavior (e.g., `SetSelectedTaxonomy("NCBI")`, `SetUpdateTaxid(true)`), then referenced by downstream modules during data import, validation, or mapping. Minimalist and explicit—no external dependencies.
@@ -0,0 +1,35 @@
+# Obidefault: Parallelism Configuration Module
+
+This Go package (`obidefault`) provides a centralized, configurable interface for managing parallel execution parameters, particularly useful in I/O- and CPU-bound workloads.
+
+## Core Concepts
+
+- **CPU-aware defaults**: Automatically detects available cores via `runtime.NumCPU()`.
+- **Configurable workers per core**:
+  - General: `_WorkerPerCore` (default `1.0`)
+  - Read-specific: `_ReadWorkerPerCore` (`0.25`, i.e., ~1 reader per 4 cores)
+  - Write-specific: `_WriteWorkerPerCore` (`0.25`)
+- **Strict overrides**: Allow hardcoding worker counts via `SetStrictReadWorker()`/`Write...`, bypassing per-core scaling.
+
+## Public API
+
+| Function | Purpose |
+|---------|--------|
+| `ParallelWorkers()` | Total workers = `MaxCPU() × WorkerPerCore` |
+| `Read/WriteParallelWorkers()` | Resolves to strict count if set, else per-core calculation (min 1) |
+| `ParallelFilesRead()` | Files read in parallel: defaults to `ReadParallelWorkers()`, overridable |
+| Getters (`MaxCPU`, `WorkerPerCore`, etc.) | Expose current settings safely |
+| Setters (`Set*`) | Dynamically adjust behavior at runtime |
+
+## Configuration Sources
+
+- **Command-line flags**: e.g., `--max-cpu` or `-m`
+- **Environment variable**: `OBIMAXCPU`
+
+## Design Highlights
+
+✅ Decouples resource discovery from policy  
+✅ Supports both *proportional* (per-core) and *absolute* (strict) worker definitions  
+✅ Ensures non-zero defaults for critical paths (`ReadParallelWorkers` ≥ 1)  
+
+⚠️ **Note**: `WriteParallelWorkers()` contains a likely bug: it returns `_StrictReadWorker` in the else branch instead of `_StrictWriteWorker`.
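The resolution rule for read and write workers can be sketched as one shared helper. This is an illustration with assumed details: `resolveWorkers` is hypothetical, a strict value of `0` is taken to mean "not set", and truncation toward zero is assumed for the per-core product.

```go
package main

import "fmt"

// resolveWorkers returns the strict count when one is set; otherwise it
// scales the CPU count by the per-core factor, with a minimum of 1.
func resolveWorkers(maxCPU int, perCore float64, strict int) int {
	if strict > 0 {
		return strict
	}
	n := int(float64(maxCPU) * perCore)
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	const maxCPU = 8
	strictRead, strictWrite := 0, 5

	// Each direction passes its own strict value, which avoids mixing up
	// the read and write overrides.
	fmt.Println("read: ", resolveWorkers(maxCPU, 0.25, strictRead))
	fmt.Println("write:", resolveWorkers(maxCPU, 0.25, strictWrite))
}
```

Routing both directions through one function that takes the strict value as a parameter is one way to avoid the copy-paste error flagged in the note above.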
@@ -0,0 +1,28 @@
+# `obidist` Package: Efficient Symmetric Distance/Similarity Matrix Management
+
+The `*DistMatrix` type provides a memory-efficient, symmetric matrix implementation for distance or similarity data.
+
+- **Storage Strategy**: Only the upper triangle (i < j) is stored, reducing the number of stored values from *n²* to *n(n−1)/2*.
+- **Diagonal Handling**: Diagonal entries are fixed (0.0 for distances, 1.0 for similarities); assignments to diagonal indices are silently ignored.
+- **Symmetry Guarantee**: `Get(i, j)` and `Set(i, j, v)` automatically handle both (i,j) and (j,i), ensuring consistency.
+
+## Constructors
+
+| Function | Description |
+|---------|-------------|
+| `NewDistMatrix(n)` / `WithLabels(labels)` | Creates *n×n* distance matrix (diag = 0). |
+| `NewSimilarityMatrix(n)` / `WithLabels(labels)` | Creates *n×n* similarity matrix (diag = 1). |
+
+## Core Operations
+
+- `Get(i, j)` / `Set(i, j, v)`: Access/update symmetric entries.
+- `Size() int`, `GetLabel(i)` / `SetLabel(i, label)`: Query/mutate element labels.
+- `Labels() []string`, `GetRow(i)` / `GetColumn(j)`: Retrieve full rows/columns (as copies).
+
+## Analysis Helpers
+
+- `MinDistance()`, `MaxDistance()` → `(value, i, j)` of the extremal off-diagonal entry.
+- `Copy() *DistMatrix`: Deep copy for immutability-safe operations.
+- `ToFullMatrix()` → `[][]float64`: Converts to dense representation (use sparingly).
+
+Designed for clustering, phylogenetics, or any domain requiring fast symmetric matrix access with minimal footprint.
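The upper-triangle storage can be sketched with a row-major index mapping. The `triMatrix` type and its layout are hypothetical; the package may order its entries differently. For a pair `i < j`, the entry is stored at position `i*n - i*(i+1)/2 + (j - i - 1)`.

```go
package main

import "fmt"

// triMatrix stores only the entries with i < j of a symmetric n×n matrix.
type triMatrix struct {
	n    int
	diag float64 // fixed diagonal value: 0 for distances, 1 for similarities
	data []float64
}

func newTriMatrix(n int, diag float64) *triMatrix {
	return &triMatrix{n: n, diag: diag, data: make([]float64, n*(n-1)/2)}
}

// index maps an off-diagonal pair to its slot in data; it panics when an
// index is out of range.
func (m *triMatrix) index(i, j int) int {
	if i < 0 || j < 0 || i >= m.n || j >= m.n {
		panic("triMatrix: index out of range")
	}
	if i > j {
		i, j = j, i // symmetry: (j,i) shares the slot of (i,j)
	}
	return i*m.n - i*(i+1)/2 + (j - i - 1)
}

// Get returns the fixed diagonal value for i == j, the stored value otherwise.
func (m *triMatrix) Get(i, j int) float64 {
	if i == j && i >= 0 && i < m.n {
		return m.diag
	}
	return m.data[m.index(i, j)]
}

// Set stores v for both (i,j) and (j,i); writes to the diagonal are ignored.
func (m *triMatrix) Set(i, j int, v float64) {
	if i == j && i >= 0 && i < m.n {
		return
	}
	m.data[m.index(i, j)] = v
}

func main() {
	m := newTriMatrix(4, 0)
	m.Set(3, 1, 2.5)
	fmt.Println(m.Get(1, 3), m.Get(2, 2), len(m.data))
}
```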
@@ -0,0 +1,28 @@
+# `obidist` Package: Semantic Feature Overview
+
+The `obidist` Go package provides two core data structures for managing **distance** and **similarity matrices**, with built-in guarantees suitable for scientific computing (e.g., clustering, phylogenetics). Key features include:
+
+- **`DistMatrix`**: A symmetric `n×n` matrix representing pairwise distances, where:
+  - Diagonal entries are *always* `0.0` (self-distance).
+  - Off-diagonals obey symmetry: `dist(i, j) == dist(j, i)`.
+  - Automatic enforcement via dedicated `Set()`/`Get()` methods.
+  
+- **`SimilarityMatrix`**: A symmetric matrix where:
+  - Diagonal entries are *always* `1.0`.
+  - Off-diagonals represent similarity scores (e.g., between `0` and `1`, though not enforced).
+  - Symmetry is similarly guaranteed.
+
+Both matrix types support:
+- **Optional labels**: Associate human-readable identifiers (e.g., sample names) with rows/columns.
+- **Safe bounds checking**: Panics on out-of-range access (tested via `defer/recover`).
+- **Deep copy support**: Ensures isolation between original and copied instances.
+- **Utility methods**:
+  - `MinDistance()` / `MaxDistance()`: Return extremal values and their indices.
+  - `GetRow(i)`: Retrieve a full row as a slice (symmetric copy).
+  - `ToFullMatrix()`: Export the matrix as an immutable 2D slice.
+
+Edge cases are rigorously handled:
+- Empty (`n=0`) and singleton (`n=1`) matrices return `(0.0, -1, -1)` for min/max.
+- Label mutations do not affect internal state via defensive copying.
+
+All behaviors are validated through comprehensive unit tests, emphasizing correctness and robustness.
@@ -0,0 +1,43 @@
+# Semantic Description of `ReadSequencesBatchFromFiles`
+
+This function implements **concurrent, batched streaming** of biological sequences from multiple input files.
+
+## Core Functionality
+
+- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
+- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.
+
+## Concurrency Model
+
+- Launches `concurrent_readers` goroutines to process files in parallel.
+- Files are distributed via a shared channel (`filenameChan`) — ensuring fair load balancing.
+
+## Streaming Interface
+
+- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
+- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.
+
+## Error Handling & Logging
+
+- Panics on file-open failure (via `log.Panicf`).
+- Logs the start and end of reading for each file (`log.Printf`, `log.Println`).
+
+## Resource Management
+
+- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
+- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.
+
+## Design Intent
+
+- Enables scalable, memory-efficient ingestion of large NGS datasets.
+- Decouples *reading logic* (via `IBatchReader`) from orchestration — supporting pluggable formats.
+- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.
+
+## Key Abstractions
+
+| Type/Interface | Role |
+|----------------|------|
+| `IBatchReader` | Reader factory: `(filename, options...) → SequenceIterator` |
+| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
+| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
+
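The concurrency model above can be sketched with a small worker pool. This is an illustration under stated assumptions: `readAll` and the `batch` type are hypothetical, the `read` callback stands in for an `IBatchReader`, and the result is collected into a sorted slice instead of a streaming iterator.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// batch is a hypothetical stand-in for a sequence batch with an order ID.
type batch struct {
	order  int64
	source string
	seqs   []string
}

// readAll starts `readers` goroutines that pull file names from a shared
// channel. Each produced batch receives a unique ID from an atomic counter.
func readAll(files []string, readers int, read func(string) []string) []batch {
	filenameChan := make(chan string)
	out := make(chan batch)
	var next int64
	var wg sync.WaitGroup

	for r := 0; r < readers; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range filenameChan {
				id := atomic.AddInt64(&next, 1) - 1
				out <- batch{order: id, source: name, seqs: read(name)}
			}
		}()
	}
	go func() { // feed the file names, then signal that no more will come
		for _, f := range files {
			filenameChan <- f
		}
		close(filenameChan)
	}()
	go func() { // finalizer: close the output once every reader is done
		wg.Wait()
		close(out)
	}()

	var batches []batch
	for b := range out {
		batches = append(batches, b)
	}
	sort.Slice(batches, func(i, j int) bool { return batches[i].order < batches[j].order })
	return batches
}

func main() {
	fake := func(name string) []string { return []string{name + ":seq1"} }
	for _, b := range readAll([]string{"a.fa", "b.fa", "c.fa"}, 2, fake) {
		fmt.Println(b.order, b.source, b.seqs)
	}
}
```

Which file receives which order ID depends on goroutine scheduling; only the uniqueness and contiguity of the IDs are guaranteed by the counter.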
@@ -0,0 +1,36 @@
+# `obiformats` Package — Semantic Overview
+
+The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.
+
+## Core Abstraction
+
+- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:
+  ```go
+  func(string, ...WithOption) (obiiter.IBioSequence, error)
+  ```
+- It accepts:
+  - A file path (`string`)
+  - Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
+- Returns:
+  - An iterator over biological sequences (`obiiter.IBioSequence`)
+  - Or an error if the file cannot be opened/parsed
+
+## Semantic Intent
+
+- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, SAM/BAM, etc., via a single entry point.
+- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
+- **Enables lazy, streaming access**: Sequences are yielded on-demand via the iterator—memory-efficient for large datasets.
+
+## Typical Usage Pattern
+
+1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
+2. Call it with a file path and optional options.
+3. Iterate over the returned `IBioSequence` to process sequences one-by-one.
+
+## Design Principles
+
+- **Functional, minimal API**: Single responsibility—reading and iteration.
+- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
+- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.
+
+> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
@@ -0,0 +1,30 @@
+# CSV Import Module for Biological Sequences (`obiformats`)
+
+This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBITools4 framework.
+
+## Core Features
+
+- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
+- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
+- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
+- **Metadata Handling**:
+  - Special handling for taxonomic IDs (`taxid`, `*_taxid`).
+  - Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
+- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
+- **Multiple Entry Points**:
+  - `ReadCSV`: From any `io.Reader`.
+  - `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
+  - `ReadCSVFromStdin`: Reads from standard input.
+- **Error & Edge Handling**:
+  - Gracefully handles empty files/streams via `ReadEmptyFile`.
+  - Uses structured logging (Logrus) for fatal and informational messages.
+
+## Integration
+
+Designed to integrate with OBITools4’s core types:
+- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
+- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
+
+## Use Case
+
+Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
@@ -0,0 +1,22 @@
+# CSVSequenceRecord Function Description
+
+The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.
+
+## Core Features
+
+- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
+- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
+- **Taxonomic Information**: Adds both NCBI taxid and scientific name (retrieved from attributes or fallback via `opt.CSVNAValue()`).
+- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
+- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends corresponding attribute values (or NA if missing).
+- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
+- **Quality Scores**: Converts Phred-quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.
+
+## Design Highlights
+
+- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
+- Handles missing data consistently via `opt.CSVNAValue()`.
+- Supports both standard and user-defined metadata fields.
+- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.
+
+This function enables interoperable, configurable export of sequence data to tabular formats.
@@ -0,0 +1,24 @@
+# `CSVTaxaIterator` Function — Semantic Description
+
+The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.
+
+### Core Functionality:
+- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
+- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.
+
+### Configurable Output Fields (via options):
+- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
+- `taxid`: Either raw node ID (e.g., string pointer) or formatted taxon path (`WithRawTaxid` toggle).
+- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
+- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
+- `scientific_name`: Full scientific name of the taxon.
+- Custom metadata fields: Specified via `WithMetadata`, extracted from taxon metadata store.
+- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).
+
+### Implementation Highlights:
+- Uses **goroutines** for non-blocking push of batches and clean shutdown (`WaitAndClose`, `Done`).
+- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
+- Dynamically builds CSV headers based on selected options before processing begins.
+
+### Use Case:
+Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
@@ -0,0 +1,27 @@
+## CSV Taxonomy Loader for OBITools4
+
+This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.
+
+### Key Features:
+- **Robust CSV Parsing**: Uses Go’s `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
+- **Column Mapping**: Dynamically identifies required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
+- **Error Handling**: Validates presence of all required columns; fails early with descriptive errors.
+- **Taxonomy Construction**:
+  - Builds a hierarchical taxonomy using `obitax.Taxon` objects.
+  - Ensures existence of a root node; returns error otherwise.
+- **Metadata Extraction**:
+  - Derives taxonomy name and short code (e.g., prefix before `:` in first taxid).
+  - Logs key metadata for traceability.
+- **Scalable Design**:
+  - Processes records line-by-line (memory-efficient).
+  - Supports large datasets via streaming CSV reading.
+
+### Input Format:
+CSV must contain exactly four columns (case-sensitive headers):
+- `taxid`: Unique taxon identifier.
+- `parent`: Parent taxonomic node ID (empty for root).
+- `scientific_name`: Binomial or descriptive name.
+- `taxonomic_rank`: e.g., *species*, *genus*.
+
+### Output:
+Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
@@ -0,0 +1,14 @@
+# Semantic Description of `obiformats.WriterDispatcher`
+
+The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.
+
+- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
+- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
+- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
+- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations like sample + region), parsing JSON-encoded classifier values when needed.
+- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
+- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
+- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
+- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.
+
+In summary: `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
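+
+The pattern of one goroutine per classification key, joined with a `sync.WaitGroup`, can be sketched as follows. This is an illustration under stated assumptions: the printf-style template, the helper names `outputName` and `dispatch`, and the in-memory `groups` map are hypothetical and stand in for the real dispatcher and writer.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"sync"
+)
+
+// outputName builds a file name from a printf-style template and a key,
+// appending ".gz" when compression is requested.
+func outputName(template, key string, compressed bool) string {
+	name := fmt.Sprintf(template, key)
+	if compressed {
+		name += ".gz"
+	}
+	return name
+}
+
+// dispatch starts one goroutine per key, waits for all of them,
+// and returns the sorted list of file names that would be written.
+func dispatch(template string, groups map[string][]string, compressed bool) []string {
+	var (
+		wg    sync.WaitGroup
+		mu    sync.Mutex
+		names []string
+	)
+	for key, seqs := range groups {
+		wg.Add(1)
+		go func(key string, seqs []string) {
+			defer wg.Done()
+			name := outputName(template, key, compressed)
+			// A real writer would format seqs and write them to name here.
+			_ = seqs
+			mu.Lock()
+			names = append(names, name)
+			mu.Unlock()
+		}(key, seqs)
+	}
+	wg.Wait()
+	sort.Strings(names)
+	return names
+}
+
+func main() {
+	groups := map[string][]string{"sampleA": {"acgt"}, "sampleB": {"ttga", "ccat"}}
+	for _, n := range dispatch("out_%s.fasta", groups, true) {
+		fmt.Println(n)
+	}
+}
+```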
@@ -0,0 +1,29 @@
+# EcoPCR File Parser for Biological Sequences
+
+This Go package (`obiformats`) provides functionality to parse EcoPCR output files: pipe-delimited (`|`), CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.
+
+## Key Features
+
+- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
+- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
+- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
+- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
+  - Name (with deduplication support)
+  - Nucleotide/protein sequence
+  - Comment field
+- **Structured Annotation**: Populates rich annotations including:
+  - Taxonomic hierarchy (taxid, rank, species/genus/family names)
+  - Primer matching info (`forward_match`, `reverse_mismatch`)
+  - Melting temperatures (if present in v2)
+  - Amplicon length and strand orientation
+- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
+- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
+
+## Implementation Highlights
+
+- Custom line reader (`__readline__`) for robust header parsing.
+- CSV parser configured with `|` delimiter and comment support (`#`).
+- Deduplication of sequence names using a running count suffix.
+- Concurrent goroutine-based streaming to decouple I/O and processing.
+
+This module integrates with the broader *OBITools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
@@ -0,0 +1,17 @@
+# EMBL Format Parser for OBITools4
+
+This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:
+
+- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
+- **Two Parsing Modes**:
+  - `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
+  - `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
+- **Configurable Options**:
+  - `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
+  - `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
+- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
+- **Sequence Handling**: Parses multi-line EMBL sequences (10-bases-per-group, with position numbers), skipping digits and whitespace.
+- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
+- **Integration**: Outputs are compatible with OBITools4’s iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.
+
+Designed for scalability, the module handles large EMBL files efficiently—ideal for metagenomic or biodiversity data pipelines.
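+
+The entry-boundary detection described above can be approximated as below. This is a sketch of the idea only: `endOfLastEntry` is a hypothetical helper, and the real `EndOfLastFlatFileEntry()` may use a different scanning strategy and return convention.
+
+```go
+package main
+
+import (
+	"bytes"
+	"fmt"
+)
+
+// endOfLastEntry returns the index just past the last complete entry
+// terminator ("//" alone at the start of a line, followed by LF or CRLF),
+// or -1 if the buffer holds no complete entry.
+func endOfLastEntry(buf []byte) int {
+	end := len(buf)
+	for end > 0 {
+		i := bytes.LastIndex(buf[:end], []byte("//"))
+		if i < 0 {
+			return -1
+		}
+		atLineStart := i == 0 || buf[i-1] == '\n'
+		rest := buf[i+2:]
+		switch {
+		case atLineStart && bytes.HasPrefix(rest, []byte("\r\n")):
+			return i + 4
+		case atLineStart && bytes.HasPrefix(rest, []byte("\n")):
+			return i + 3
+		}
+		end = i + 1 // false match (e.g. inside a URL): keep searching backwards
+	}
+	return -1
+}
+
+func main() {
+	buf := []byte("ID   A\nSQ\n     acgt 4\n//\nID   B\nSQ\n")
+	n := endOfLastEntry(buf)
+	fmt.Printf("complete: %q\nleftover: %q\n", buf[:n], buf[n:])
+}
+```
+
+Everything before the returned index can be handed to a parser, while the leftover bytes are prepended to the next chunk.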
@@ -0,0 +1,22 @@
+## `ReadEmptyFile` Function — Semantic Description
+
+- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
+- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
+- **Signature**:  
+  `func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
+- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
+- **Behavior**:
+  - Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
+  - Immediately closes the stream using `.Close()` — indicating no data will be yielded.
+- **Output**:
+  - Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
+  - Error return is always `nil`, since no I/O occurs and the operation is deterministic.
+
+### Semantic Role & Use Cases
+- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
+- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
+- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.
+
+### Design Notes
+- Reflects a *pure-functional* and *fail-safe* pattern: no side effects, deterministic behavior.
+- Aligns with iterator-based I/O design principles in OBITools4 (lazy, composable streams).
@@ -0,0 +1,34 @@
+# FASTA Parser Module (`obiformats`)
+
+This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
+
+## Core Functionalities
+
+- **`FastaChunkParser(UtoT bool)`**  
+  Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
+
+- **`FastaChunkParserRope(...)`**  
+  Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
+
+- **`ReadFasta(reader io.Reader, ...)`**  
+  High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
+
+- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**  
+  Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
+
+- **`EndOfLastFastaEntry(...)`**  
+  Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
+
+## Key Features
+
+- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
+- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
+- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
+- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
+- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
+
+## Design Highlights
+
+- Minimal allocations via rope-based parsing (`extractFastaSeq`).
+- Graceful error reporting with context (source, identifier, invalid char position).
+- Extensible via `WithOption` pattern for header parsing and batching behavior.
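+
+The validation and normalization rules listed under Key Features can be sketched as a single pass over the sequence bytes. The function `normalize` is hypothetical and only mirrors the rules stated above (lowercasing, optional `U→T`, whitespace skipping, allowed characters); it is not the package's parser.
+
+```go
+package main
+
+import "fmt"
+
+// normalize lowercases sequence bytes, drops spaces, tabs and line breaks,
+// optionally maps u to t, and rejects anything outside a-z and "-.[]".
+func normalize(raw []byte, uToT bool) ([]byte, error) {
+	out := make([]byte, 0, len(raw))
+	for i, c := range raw {
+		switch {
+		case c == ' ' || c == '\t' || c == '\n' || c == '\r':
+			continue
+		case c >= 'A' && c <= 'Z':
+			c += 'a' - 'A'
+		}
+		if uToT && c == 'u' {
+			c = 't'
+		}
+		valid := (c >= 'a' && c <= 'z') || c == '-' || c == '.' || c == '[' || c == ']'
+		if !valid {
+			return nil, fmt.Errorf("invalid character %q at offset %d", c, i)
+		}
+		out = append(out, c)
+	}
+	return out, nil
+}
+
+func main() {
+	seq, err := normalize([]byte("ACGU acg-\nNNu"), true)
+	fmt.Println(string(seq), err)
+
+	_, err = normalize([]byte("acg7"), false)
+	fmt.Println(err)
+}
+```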
@@ -0,0 +1,41 @@
+# FASTQ Parsing Module (`obiformats`)
+
+This Go package provides robust, streaming-capable parsing of FASTQ files — a standard format for storing nucleotide sequences along with quality scores.
+
+## Core Functionalities
+
+- **`EndOfLastFastqEntry(buffer []byte) int`**  
+  Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry found.
+
+- **`FastqChunkParser(...)`**  
+  Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
+  - Header parsing (`@id [definition]`)
+  - Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
+  - Quality score shifting (`quality_shift`)
+  - Strict validation (e.g., presence of the `+` separator line, sequence and quality lines of equal length)
+
+- **`FastqChunkParserRope(...)`**  
+  Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.
+
+- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**  
+  Enables concurrent, chunked parsing of large files:
+  - Splits input into chunks using `ReadFileChunk`
+  - Uses configurable parallel workers (`nworker`)
+  - Pushes parsed batches to an iterator interface
+
+- **Convenience I/O Wrappers**
+  - `ReadFastqFromFile(filename, ...)`: Parses a file by name.
+  - `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.
+
+## Key Options & Features
+
+- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
+- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
+- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
+- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation
+
+## Design Highlights
+
+- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
+- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid chars, length mismatch)
+- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
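+
+Quality shifting and the length check mentioned above amount to a few lines. The sketch below assumes the common Phred+33 encoding for its sample data; `decodeQualities` is a hypothetical helper, not the package's API.
+
+```go
+package main
+
+import "fmt"
+
+// decodeQualities subtracts shift from each ASCII quality character and
+// checks that the quality line is as long as the sequence.
+func decodeQualities(seq, qual []byte, shift byte) ([]byte, error) {
+	if len(seq) != len(qual) {
+		return nil, fmt.Errorf("sequence length %d != quality length %d", len(seq), len(qual))
+	}
+	out := make([]byte, len(qual))
+	for i, q := range qual {
+		if q < shift {
+			return nil, fmt.Errorf("quality character %q below shift %d", q, shift)
+		}
+		out[i] = q - shift
+	}
+	return out, nil
+}
+
+func main() {
+	scores, err := decodeQualities([]byte("acgt"), []byte("II5!"), 33)
+	fmt.Println(scores, err)
+
+	_, err = decodeQualities([]byte("acgt"), []byte("II"), 33)
+	fmt.Println(err)
+}
+```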
@@ -0,0 +1,11 @@
+## Semantic Description of `obiformats` Package
+
+The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:  
+- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.  
+- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.
+
+Two main constructor functions enable flexible formatting:  
+- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.  
+- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.
+
+The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors—e.g., unsupported formats trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, making it ideal for pipeline integration in NGS data processing tools.
@@ -0,0 +1,27 @@
+# Semantic Description of `obiformats` Package
+
+The `obiformats` package provides utilities for parsing sequence headers in the OBITools4 framework, supporting two distinct formats:
+
+- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
+- **Legacy OBI format** (plain-text `key=value;` annotations, e.g., `count=3; description`): Used when no JSON prefix is present.
+
+## Core Functions
+
+- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**  
+  Dynamically routes header parsing based on the first character of the sequence definition:
+  - Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
+  - Otherwise invokes `ParseFastSeqOBIHeader`.
+
+- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**  
+  Applies header parsing to a *batch* of sequences:
+  - Takes an iterator over `BioSequence`s.
+  - Uses optional configuration (e.g., parallelism, parsing behavior).
+  - Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.
+
+## Design Principles
+
+- **Format agnosticism**: Automatically detects header type.
+- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
+- **Extensibility**: Options pattern (`WithOption`) supports runtime customization.
+
+This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
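+
+The routing on the first character can be illustrated with the standard `encoding/json` decoder. This is a sketch under assumptions: `parseGuessed` is a hypothetical stand-in for `ParseGuessedFastSeqHeader`, it works on a plain string rather than a `BioSequence`, and the legacy branch is left as a placeholder.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"strings"
+)
+
+// parseGuessed picks a parser from the first character of the definition.
+// It returns the detected format, the annotations, and the leftover text.
+func parseGuessed(definition string) (string, map[string]any, string) {
+	text := strings.TrimSpace(definition)
+	if strings.HasPrefix(text, "{") {
+		dec := json.NewDecoder(strings.NewReader(text))
+		var ann map[string]any
+		if err := dec.Decode(&ann); err == nil {
+			// InputOffset marks the end of the JSON object just decoded.
+			rest := strings.TrimSpace(text[int(dec.InputOffset()):])
+			return "json", ann, rest
+		}
+	}
+	// Placeholder for the legacy key=value parser: annotations are left empty.
+	return "obi", map[string]any{}, text
+}
+
+func main() {
+	for _, def := range []string{`{"count":3,"taxid":"tx:9606"} human read`, "count=3; human read"} {
+		format, ann, rest := parseGuessed(def)
+		fmt.Println(format, len(ann), rest)
+	}
+}
+```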
@@ -0,0 +1,28 @@
+# `FormatHeader` Function Type in `obiformats`
+
+The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.
+
+- **Package**: `obiformats`  
+  Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).
+
+- **Type Definition**:  
+  ```go
+  type FormatHeader func(sequence *obiseq.BioSequence) string
+  ```
+  - A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.
+
+- **Semantic Role**:  
+  Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.  
+  Decouples header formatting from core data structures (`BioSequence`), enabling modular and reusable format adapters.
+
+- **Usage Context**:  
+  - Used by writers/formatters to produce standardized headers when exporting sequences.  
+  - Allows custom header generation (e.g., for MIxS-compliant metadata, user-defined tags).  
+  - Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.
+
+- **Dependencies**:  
+  - Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).
+
+- **Design Intent**:  
+  Promotes clean separation of concerns: data (sequence) ↔ formatting logic.  
+  Facilitates extensibility for new output formats without modifying core types.
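+
+Swapping header formatters can be sketched as follows. Because the example must be self-contained, `BioSequence` here is a reduced stand-in for `obiseq.BioSequence`, and `plainHeader`, `jsonishHeader`, and `headerLine` are hypothetical names introduced for illustration.
+
+```go
+package main
+
+import "fmt"
+
+// BioSequence is a stand-in for obiseq.BioSequence, reduced to what the sketch needs.
+type BioSequence struct {
+	ID         string
+	Definition string
+	Count      int
+}
+
+// FormatHeader mirrors the function type quoted above, applied to the stand-in type.
+type FormatHeader func(sequence *BioSequence) string
+
+// plainHeader renders "id definition".
+func plainHeader(s *BioSequence) string { return s.ID + " " + s.Definition }
+
+// jsonishHeader renders the id followed by a small JSON annotation block.
+func jsonishHeader(s *BioSequence) string {
+	return fmt.Sprintf(`%s {"count":%d}`, s.ID, s.Count)
+}
+
+// headerLine prefixes the formatted header with ">" as in FASTA.
+func headerLine(s *BioSequence, format FormatHeader) string {
+	return ">" + format(s)
+}
+
+func main() {
+	seq := &BioSequence{ID: "seq1", Definition: "demo", Count: 2}
+	for _, f := range []FormatHeader{plainHeader, jsonishHeader} {
+		fmt.Println(headerLine(seq, f))
+	}
+}
+```
+
+The writer code (`headerLine` in this sketch) never changes; only the function value passed to it does.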
@@ -0,0 +1,21 @@
+This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.
+
+- **JSON Parsing Helpers**:  
+  It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.
+
+- **Header Interpretation**:  
+  `_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
+  - Core fields (`id`, `definition`, `count`)
+  - Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
+  - Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.
+
+- **Sequence Annotation Enrichment**:  
+  `ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing non-JSON text as the new definition.
+
+- **Serialization Support**:  
+  `WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON format, appending them to a buffer or returning as string — enabling round-trip compatibility for annotated sequences.
+
+- **Error Handling**:  
+  Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.
+
+In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
@@ -0,0 +1,31 @@
+# OBIFormats Package: Semantic Description
+
+The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
+
+- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
+- Four internal helpers match keys and detect value types:  
+  - `__match__key__`: Identifies assignment patterns (`Key = ...`).  
+  - `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).  
+  - `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).  
+  - `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
+
+- Boolean detection (`__is_true__/__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
+
+- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`**,  
+  iteratively extracts key-value pairs from a header string and populates an `Annotation` map.  
+  - Numeric values are stored as integers if they have no fractional part.  
+  - Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:  
+    - `*_count` → `map[string]int`,  
+    - `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`).  
+    - `*_status`/`*_mutation` → `map[string]string`.  
+
+- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence’s definition line, moving annotations into its metadata map and preserving leftover text.
+
+- **`WriteFastSeqOBIHeade(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:  
+  - Strings and booleans use `key=value;`.  
+  - Maps/dicts are JSON-encoded, then single-quoted for compatibility.  
+  - Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
+
+- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
+
+- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
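+
+A simplified version of the `key=value;` extraction loop is sketched below. It is not the package's implementation: `parseFeatures` and `typedValue` are hypothetical, nested dictionaries are not handled, and string splitting replaces the pattern-based matching described above.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+	"strconv"
+	"strings"
+)
+
+// parseFeatures extracts leading "key=value;" pairs from text and returns
+// them together with the leftover free text.
+func parseFeatures(text string) (map[string]any, string) {
+	ann := map[string]any{}
+	rest := text
+	for {
+		semi := strings.IndexByte(rest, ';')
+		if semi < 0 {
+			break
+		}
+		key, value, ok := strings.Cut(rest[:semi], "=")
+		key = strings.TrimSpace(key)
+		if !ok || key == "" || strings.ContainsAny(key, " \t") {
+			break // not an annotation: the remainder is free text
+		}
+		ann[key] = typedValue(strings.TrimSpace(value))
+		rest = rest[semi+1:]
+	}
+	return ann, strings.TrimSpace(rest)
+}
+
+// typedValue converts a raw value to int, float64, bool or string.
+// Numbers without a fractional part are stored as integers.
+func typedValue(v string) any {
+	if f, err := strconv.ParseFloat(v, 64); err == nil {
+		if f == math.Trunc(f) && !math.IsInf(f, 0) {
+			return int(f)
+		}
+		return f
+	}
+	switch v {
+	case "true", "True", "TRUE":
+		return true
+	case "false", "False", "FALSE":
+		return false
+	}
+	return strings.Trim(v, "'")
+}
+
+func main() {
+	ann, rest := parseFeatures("count=42.0; ratio=0.5; ok=True; name='x y'; free text here")
+	fmt.Println(ann["count"], ann["ratio"], ann["ok"], ann["name"])
+	fmt.Println(rest)
+}
+```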
@@ -0,0 +1,26 @@
+# FastSeq Reader Module — Semantic Description
+
+This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.
+
+## Core Features
+
+- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
+- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
+- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
+- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
+- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
+- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
+- **Stdin & File I/O**: Two entry points:
+  - `ReadFastSeqFromFile(filename, ...)` for regular files.
+  - `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
+- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
+- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.
+
+## Integration
+
+Built on top of `obitools4`’s core abstractions:
+- `obiiter.IBioSequence`: Iterator interface for biological sequences.
+- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
+- `obiutils`, `obidefault`: Utilities for path handling and defaults.
+
+Designed for scalability in high-throughput metabarcoding pipelines.
@@ -0,0 +1,35 @@
+# `obiformats` Package Overview
+
+The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.
+
+## Core Formatting Functions
+
+- **`FormatFasta(seq, formater)`**  
+  Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.
+
+- **`FormatFastaBatch(batch, formater, skipEmpty)`**  
+  Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.
+
+## File Writing Functions
+
+- **`WriteFasta(iterator, file, options...)`**  
+  Writes a stream of sequences to any `io.WriteCloser`. Supports:
+  - Parallel workers (`ParallelWorkers`)
+  - Chunked writing via `WriteFileChunk`
+  - Optional compression (e.g., gzip)  
+  Returns a new iterator mirroring the input for pipeline chaining.
+
+- **`WriteFastaToStdout(iterator, options...)`**  
+  Convenience wrapper that writes FASTA directly to `stdout`, with configurable file-closing behavior.
+
+- **`WriteFastaToFile(iterator, filename, options...)`**  
+  Writes to a named file with:
+  - Truncation or append mode (`AppendFile`)
+  - Automatic paired-end output if `HaveToSavePaired()` is enabled  
+    (writes reverse reads to a secondary file specified via `PairedFileName`)
+
+## Key Design Highlights
+
+- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
+- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
+- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
@@ -0,0 +1,35 @@
+# FASTQ Output Module (`obiformats`)
+
+This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.
+
+## Core Functionality
+
+- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into FASTQ string.  
+- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.
+
+## Header Customization
+
+- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.
+
+## Writing to Streams/Files
+
+- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
+- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
+- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
+  - Append/truncate modes
+  - Paired-end output (splits iterator and writes to two files)
+  - Automatic compression via `obiutils.CompressStream`
+
+## Parallelization & Robustness
+
+- Uses goroutines to parallelize formatting/writing across multiple workers.
+- Handles empty sequences according to the `skipEmpty` option: they are either skipped with a logged warning or abort the run with a fatal error.
+- Ensures ordered output via batch tracking (`Order()`) and chunked writing.
+
+## Integration
+
+Designed to work seamlessly with the `obitools4` ecosystem:  
+- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
+- Extensible through functional options (`WithOption`) for configuration.
+
+> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
@@ -0,0 +1,19 @@
+# `obiformats` Package Overview  
+
+The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:
+
+- **Format Abstraction**: Defines common interfaces and base types for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
+  
+- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
+
+- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
+
+- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
+
+- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
+
+- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
+
+- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.
+
+The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
@@ -0,0 +1,25 @@
+# Semantic Description of `obiformats` Package Functionalities
+
+The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.
+
+- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
+
+- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
+
+- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
+
+- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries.
+
+- **`ReadFileChunk()`**: Core function that:
+  - Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
+  - Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
+  - Extends chunks incrementally (e.g., +1 MB) until a full record boundary is found via `splitter`;
+  - Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
+  - Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
+
+- **Key semantics**:  
+  - *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.  
+  - *Lazy evaluation*: only reads ahead when needed to find record boundaries.  
+  - *Streaming-first design* — supports large files without full loading into memory.
+
+This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
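+
+Chunking on record boundaries can be sketched as below. This is a simplified illustration, not `ReadFileChunk()` itself: it returns a slice instead of streaming on a channel, uses a flat byte slice instead of a rope, and `lastFastaStart`, `readChunks`, and `splitString` are hypothetical names.
+
+```go
+package main
+
+import (
+	"bytes"
+	"fmt"
+	"io"
+	"strings"
+)
+
+// lastFastaStart is a hypothetical splitter: it returns the offset of the
+// last line starting with '>', i.e. where the final, possibly partial,
+// record begins, or -1 if there is no such line after the first one.
+func lastFastaStart(buf []byte) int {
+	if i := bytes.LastIndex(buf, []byte("\n>")); i >= 0 {
+		return i + 1
+	}
+	return -1
+}
+
+// readChunks reads r in blocks of size bytes and returns chunks that always
+// end on a record boundary; the incomplete tail is carried to the next chunk.
+func readChunks(r io.Reader, size int, splitter func([]byte) int) ([]string, error) {
+	var chunks []string
+	var pending []byte
+	block := make([]byte, size)
+	for {
+		n, err := r.Read(block)
+		pending = append(pending, block[:n]...)
+		if cut := splitter(pending); cut > 0 {
+			chunks = append(chunks, string(pending[:cut]))
+			pending = append([]byte(nil), pending[cut:]...)
+		}
+		if err == io.EOF {
+			if len(pending) > 0 {
+				chunks = append(chunks, string(pending))
+			}
+			return chunks, nil
+		}
+		if err != nil {
+			return nil, err
+		}
+	}
+}
+
+// splitString is a small wrapper used by the example.
+func splitString(data string, size int) []string {
+	chunks, err := readChunks(strings.NewReader(data), size, lastFastaStart)
+	if err != nil {
+		panic(err)
+	}
+	return chunks
+}
+
+func main() {
+	for i, c := range splitString(">a\nacgt\n>b\nttga\n>c\ncc\n", 10) {
+		fmt.Printf("chunk %d: %q\n", i, c)
+	}
+}
+```
+
+Although the reader is consumed in fixed 10-byte blocks, no chunk ends in the middle of a record.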
@@ -0,0 +1,26 @@
+# `WriteFileChunk` Function — Semantic Description
+
+The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.
+
+- **Input**:  
+  - `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.  
+  - `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.
+
+- **Core Behavior**:  
+  - Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).  
+  - Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).  
+  - If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.  
+  - Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.
+
+- **Buffer Management**:  
+  - After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., if order 2 arrives, it triggers writing of buffered orders 3,4,... as available).
+
+- **Error Handling**:  
+  - Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.
+
+- **Cleanup & Lifecycle**:  
+  - Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.  
+  - Returns the input channel, enabling external producers to stream `FileChunk` structs.
+
+- **Use Case**:  
+  Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
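+
+The reordering logic described above can be sketched as follows. This is an illustration under assumptions, not the package's code: the `chunk` type stands in for `FileChunk`, numbering is assumed to start at 0, and the sketch panics where the described function logs a fatal error.
+
+```go
+package main
+
+import (
+	"bytes"
+	"fmt"
+	"io"
+	"sync"
+)
+
+// chunk is a stand-in for FileChunk: a payload tagged with its position.
+type chunk struct {
+	Order int
+	Data  []byte
+}
+
+// writeOrdered consumes chunks from the returned channel and writes them to w
+// in increasing Order, buffering any chunk that arrives early.
+func writeOrdered(w io.Writer, done *sync.WaitGroup) chan<- chunk {
+	in := make(chan chunk)
+	done.Add(1)
+	go func() {
+		defer done.Done()
+		pending := map[int][]byte{}
+		next := 0
+		for c := range in {
+			pending[c.Order] = c.Data
+			// Flush every chunk that has become consecutive.
+			for data, ok := pending[next]; ok; data, ok = pending[next] {
+				if _, err := w.Write(data); err != nil {
+					panic(err)
+				}
+				delete(pending, next)
+				next++
+			}
+		}
+	}()
+	return in
+}
+
+// reorder feeds chunks in the given arrival order and returns what was written.
+func reorder(chunks []chunk) string {
+	var out bytes.Buffer
+	var wg sync.WaitGroup
+	in := writeOrdered(&out, &wg)
+	for _, c := range chunks {
+		in <- c
+	}
+	close(in)
+	wg.Wait()
+	return out.String()
+}
+
+func main() {
+	arrival := []chunk{{2, []byte("C")}, {0, []byte("A")}, {1, []byte("B")}}
+	fmt.Println(reorder(arrival))
+}
+```
+
+Buffering in a map keyed by order keeps the writer simple: producers may finish in any order, and memory use is bounded by how far ahead the fastest producer runs.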
@@ -0,0 +1,34 @@
+# GenBank Parser Module (`obiformats`)
+
+This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.
+
+## Core Functionalities
+
+- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
+- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
+- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
+- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
+- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
+- **Parallel streaming I/O**:
+  - `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
+  - Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.
+
+## Key Design Decisions
+
+- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
+- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
+- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
+- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
+- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.
+
+## Output
+
+Returns a batched iterator of `BioSequence` objects, each containing:
+- Identifier (`id`)
+- Compact nucleotide sequence
+- Definition line (as description)
+- Source file origin
+- Optional feature table bytes
+- Annotations: `scientific_name`, `taxid`
+
+Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
@@ -0,0 +1,27 @@
+# JSON Output Module for Biological Sequences (`obiformats`)
+
+This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.
+
+- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
+  - `"id"`: Sequence identifier.
+  - `"sequence"` (optional): Nucleotide/protein sequence string if present.
+  - `"qualities"` (optional): Quality scores as a string if available.
+  - `"annotations"` (optional): Metadata annotations map.
+
+- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
+
+- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
+  - Parallel workers (configurable via options).
+  - Automatic compression (`gzip`/`bgzip`) if enabled.
+  - Proper JSON array wrapping: `[`, chunked batches, and final `]`.
+  - Ordered output: chunks are emitted by batch number, so parallel workers cannot interleave or reorder records.
+
+- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
+  - Outputs to stdout or a file (with append/truncate control).
+  - Supports paired-end data: writes both forward and reverse reads to separate files when configured.
+
+- **Internal helpers**:
+  - `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
+  - Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.
+
+Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
@@ -0,0 +1,17 @@
+# NCBI Taxonomy Loader Module (`obiformats`)
+
+This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:
+
+- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
+- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
+- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.
+
+Key features:
+- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
+- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
+- Efficient buffered reading (`bufio.Reader`) for large files.
+- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
+- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
+- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).
+
+The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
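As a rough illustration of the `|`-delimited parsing with comment support and whitespace trimming listed above, the sketch below splits one `nodes.dmp` record. The function name `parseNodeLine` is hypothetical, and the real loader relies on its own CSV reader configuration rather than `strings.Split`.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeLine (hypothetical) extracts taxid, parent taxid and rank from
// one nodes.dmp record. Fields are separated by '|' and padded with tabs.
// Comment lines (starting with '#') and short records report ok == false.
func parseNodeLine(line string) (taxid, parent, rank string, ok bool) {
	if strings.HasPrefix(line, "#") {
		return "", "", "", false
	}
	fields := strings.Split(line, "|")
	if len(fields) < 3 {
		return "", "", "", false
	}
	return strings.TrimSpace(fields[0]),
		strings.TrimSpace(fields[1]),
		strings.TrimSpace(fields[2]),
		true
}

func main() {
	taxid, parent, rank, ok := parseNodeLine("9606\t|\t9605\t|\tspecies\t|\tHS\t|")
	fmt.Println(taxid, parent, rank, ok)
}
```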
@@ -0,0 +1,31 @@
+## NCBI Taxonomy Archive Support in `obiformats`
+
+This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.
+
+### Core Functionalities
+
+1. **Archive Validation (`IsNCBITarTaxDump`)**  
+   - Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.  
+   - Returns a boolean indicating if the archive is a complete NCBI tax dump.
+
+2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**  
+   - Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.  
+   - Steps include:
+     - **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
+     - **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
+     - **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
+   - Sets the root taxon to NCBI’s default (`taxid = 1`, i.e., *root*).
+
+3. **Integration with Other Modules**  
+   - Uses `obiutils.Ropen` and `TarFileReader` for robust file handling.
+   - Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.
+
+### Key Parameters
+- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
+- `seqAsTaxa`: Reserved for future use; currently unused.
+
+### Logging & Error Handling  
+- Uses `logrus` to log loading progress and counts.
+- Returns descriptive errors if required files or the root taxon are missing.
+
+> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
@@ -0,0 +1,31 @@
+# Newick Format Export Functionality in `obiformats`
+
+This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.
+
+## Core Components
+
+- `Tree`: A struct modeling a node in a Newick tree, containing:
+  - `Children`: list of child nodes (nested trees),
+  - `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
+  - `Length`: optional branch length (evolutionary distance).
+
+- **`Newick()` methods**:
+  - `Tree.Newick(...)`: Recursively generates a Newick string for the subtree.
+    Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
+  - Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.
+
+- **Writing Functions**:
+  - `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
+    - Accepts an iterator over taxa (`*obitax.ITaxon`).
+    - Validates single-taxonomy input.
+    - Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
+  - `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
+  - `WriteNewickToStdout(...)`: Outputs Newick tree to standard output.
+
+## Configuration Options
+
+Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).
+
+## Semantic Summary
+
+The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
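The recursive generation performed by `Tree.Newick(...)` can be pictured with the sketch below. It uses a hypothetical `node` type with a plain string label instead of `obitax.TaxNode`, and it does not quote labels, so it is only valid for names without spaces or Newick metacharacters.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for the Tree struct described above.
type node struct {
	Name     string
	Length   *float64 // optional branch length
	Children []*node
}

// write emits "(child,child,...)name:length" for the subtree rooted at n.
func (n *node) write(b *strings.Builder) {
	if len(n.Children) > 0 {
		b.WriteByte('(')
		for i, c := range n.Children {
			if i > 0 {
				b.WriteByte(',')
			}
			c.write(b)
		}
		b.WriteByte(')')
	}
	b.WriteString(n.Name)
	if n.Length != nil {
		fmt.Fprintf(b, ":%g", *n.Length)
	}
}

// newick renders the whole tree and adds the terminating semicolon.
func newick(root *node) string {
	var b strings.Builder
	root.write(&b)
	b.WriteByte(';')
	return b.String()
}

func main() {
	l := 1.5
	root := &node{Name: "root", Children: []*node{
		{Name: "A", Length: &l},
		{Name: "B", Children: []*node{{Name: "C"}}},
	}}
	fmt.Println(newick(root))
}
```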
@@ -0,0 +1,47 @@
+# NGSFilter Configuration Parser — Semantic Overview
+
+This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats, one legacy and one modern: a line-based text format (`ReadOldNGSFilter`) and CSV-based configuration files with parameter headers.
+
+## Core Functionality
+
+- **Format Detection**:  
+  `OBIMimeNGSFilterTypeGuesser` detects MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text.  
+  A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).
+
+- **Dual Input Parsing**:
+  - `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
+    - Primer pairs (`forward`, `reverse`)
+    - Tag pairs (with optional `-` for untagged direction)
+    - Experiment/sample metadata
+    - OBIFeatures annotations (via `ParseOBIFeatures`)
+  - `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:  
+    `"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`  
+    Additional columns are stored as annotations.
+
+- **Parameter Configuration**:
+  A rich set of `@param` lines (in CSV or legacy format) configures global or primer-specific settings:
+  - `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
+  - `tag_delimiter` / directional variants: Symbol separating tags in sequences
+  - `matching`: Tag matching algorithm (e.g., exact, fuzzy)
+  - Error tolerance:  
+    `primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches)  
+    `tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
+  - Indel handling:  
+    `indels` / directional variants (`true/false`) to enable/disable indels in primer matching
+
+- **Validation & Integrity Checks**:
+  - `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
+  - Duplicate tag-pair detection per marker (error on reuse).
+  - Strict column/field validation with informative error messages.
+
+- **Logging & Observability**:
+  Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).
+
+## Design Highlights
+
+- **Extensibility**: New parameters can be added via `library_parameter` map.
+- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
+- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
+- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.
+
+> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
@@ -0,0 +1,14 @@
+# Semantic Description of `obiformats` Package Functionalities  
+
+The Go package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).  
+
+Key capabilities include:  
+- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).  
+- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).  
+- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).  
+- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).  
+- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion (uracil to thymine, for RNA→DNA normalization).  
+- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.  
+- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).  
+
+All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
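The functional-setter pattern described above can be sketched in a few lines. All identifiers below (`options`, `withOption`, `makeOptions`, `batchSize`, and so on) are hypothetical lower-case stand-ins, and the default values are arbitrary. The sketch shows only the composition mechanism, not the real `obiformats` signatures.

```go
package main

import "fmt"

// options holds the settings; every name here is hypothetical.
type options struct {
	batchSize       int
	parallelWorkers int
	compressed      bool
}

// withOption is a functional setter applied to an options value.
type withOption func(*options)

func batchSize(n int) withOption       { return func(o *options) { o.batchSize = n } }
func parallelWorkers(n int) withOption { return func(o *options) { o.parallelWorkers = n } }
func compressed(b bool) withOption     { return func(o *options) { o.compressed = b } }

// makeOptions starts from defaults (arbitrary values for this sketch) and
// applies the setters in order, so later setters override earlier ones.
func makeOptions(setters ...withOption) options {
	o := options{batchSize: 1000, parallelWorkers: 4}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	o := makeOptions(batchSize(500), compressed(true))
	fmt.Printf("%+v\n", o)
}
```

Because callers only pass the setters they care about, new settings can be added without changing existing call sites.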
@@ -0,0 +1,27 @@
+# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure
+
+The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *Rope* (a tree-based immutable string representation, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.
+
+## Core Functionality
+
+- **`newRopeScanner(rope *PieceOfChunk)`**  
+  Constructs a new scanner starting at the root of the rope.
+
+- **`ReadLine() []byte`**  
+  Returns the next line (without its trailing `\n` or `\r\n`) as a byte slice.  
+  - Returns `nil` when the end of the rope is reached.
+  - Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
+  - The returned slice aliases rope data and is only valid until the next call.
+
+- **`skipToNewline()`**  
+  Advances internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.
+
+## Implementation Highlights
+
+- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
+- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
+- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope’s underlying data.
+
+## Use Case
+
+Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
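The carry-buffer logic described above can be sketched as follows. The `piece` and `scanner` types are simplified, hypothetical stand-ins for `PieceOfChunk` and `ropeScanner`: a singly linked list of byte slices is assumed, and the field names are illustrative only.

```go
package main

import (
	"bytes"
	"fmt"
)

// piece is a hypothetical rope node: a data block plus a link to the next.
type piece struct {
	data []byte
	next *piece
}

type scanner struct {
	cur   *piece
	pos   int    // read offset inside cur.data
	carry []byte // reused buffer for lines spanning several pieces
}

func trimCR(line []byte) []byte {
	if n := len(line); n > 0 && line[n-1] == '\r' {
		return line[:n-1]
	}
	return line
}

// ReadLine returns the next line without its terminator, or nil at the end.
// The result may alias rope data or the carry buffer, so it is only valid
// until the next call.
func (s *scanner) ReadLine() []byte {
	s.carry = s.carry[:0]
	for s.cur != nil {
		rest := s.cur.data[s.pos:]
		if i := bytes.IndexByte(rest, '\n'); i >= 0 {
			s.pos += i + 1
			line := rest[:i] // zero-copy when nothing was carried over
			if len(s.carry) > 0 {
				s.carry = append(s.carry, line...)
				line = s.carry
			}
			return trimCR(line)
		}
		// No newline in this piece: stash the tail and move on.
		s.carry = append(s.carry, rest...)
		s.cur, s.pos = s.cur.next, 0
	}
	if len(s.carry) == 0 {
		return nil
	}
	return trimCR(s.carry) // final line without a trailing newline
}

func main() {
	p3 := &piece{data: []byte("e\nf")}
	p2 := &piece{data: []byte("\ncd"), next: p3}
	p1 := &piece{data: []byte("ab\r"), next: p2}
	s := &scanner{cur: p1}
	for line := s.ReadLine(); line != nil; line = s.ReadLine() {
		fmt.Printf("%q\n", line)
	}
}
```

The example deliberately splits a `\r\n` pair across two pieces to show that the carriage return is stripped after the line has been reassembled.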
@@ -0,0 +1,34 @@
+# Taxonomy Loading Module (`obiformats`)
+
+This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.
+
+## Core Features
+
+1. **Format Detection**  
+   - `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME-type), filename patterns, or structure.
+   - Supports:  
+     • NCBI Taxdump (both directory and `.tar` archive)  
+     • CSV files (`text/csv`)  
+     • FASTA/FASTQ sequences (via `mimetype` detection)  
+
+2. **Modular Loaders**  
+   - Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).  
+   - Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).
+
+3. **Sequence-Based Taxonomy Extraction**  
+   - For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.
+
+4. **Integration with OBITools Ecosystem**  
+   - Leverages `obitax.Taxonomy` as the canonical output structure.  
+   - Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.
+
+5. **Error Handling & Logging**  
+   - Graceful failure with descriptive errors; informative logging via `logrus`.
+
+## Usage Flow
+
+```go
+tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn = true, seqAsTaxa = false
+```
+
+The module enables interoperability across taxonomic data sources in metabarcoding workflows.
@@ -0,0 +1,26 @@
+# OBIFORMATS Package: Semantic Description
+
+The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.
+
+It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:
+- **FASTA** (`text/fasta`): identified by lines starting with `>`.
+- **FASTQ** (`text/fastq`): detected via leading `@` characters.
+- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
+- **EMBL** (`text/embl`): detected by lines starting with `ID   `.
+- **GenBank** (`text/genbank`): identified by either `LOCUS       ` or legacy `"Genetic Sequence Data Bank"` headers.
+- **CSV** (`text/csv`): generic tabular support.
+
+Core functionality is exposed through:
+- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
+- `ReadSequencesFromFile()`: reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
+- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as filename and auto-closing the stream.
+
+Internally leverages:
+- `obiutils.Ropen()` for unified file opening (including stdin handling).
+- Path extension stripping and source tagging via `OptionsSource()`.
+- Logging (`logrus`) for format diagnostics.
+- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.
+
+The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.
+
+Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
@@ -0,0 +1,29 @@
+# `obiformats` Package: Sequence Writing Utilities
+
+This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.
+
+## Core Functionality
+
+- **`WriteSequence()`**:  
+  Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities present) or FASTA.  
+  - Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.  
+  - Preserves iterator state via `PushBack()` to allow chaining.
+
+- **`WriteSequencesToStdout()`**:  
+  Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.
+
+- **`WriteSequencesToFile()`**:  
+  Writes sequences to a specified file. Supports:
+    - File creation/truncation or append mode (`OptionAppendFile()`).
+    - Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.
+
+## Design Highlights
+
+- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on presence of quality scores (`HasQualities()`).
+- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
+- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
+- **Configurable Options**: Extensible via `WithOption` pattern (e.g., append mode, paired-end handling).
+
+## Integration
+
+Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
@@ -0,0 +1,13 @@
+## Uint128 Type in `obifp`: Semantic Overview
+
+This Go package defines a custom 128-bit unsigned integer type (`Uint128`) composed of two `uint64` limbs (high and low). It provides comprehensive arithmetic, comparison, bitwise operations, and type conversions.
+
+- **Basic Constructors**: `Zero()`, `MaxValue()` initialize the smallest/largest possible values.
+- **State Checks**: `IsZero()`, plus equality and comparison methods (`Equals`, `Cmp`, `LessThan`, `GreaterThan`, etc.) enable conditional logic.
+- **Type Casting**: Safe conversions to/from smaller (`Uint64`, `uint64`) and larger (`Uint256`) integer types, with overflow warnings where applicable.
+- **Arithmetic**: Full support for addition (`Add`, `Add64`), subtraction (`Sub`), multiplication (`Mul`, `Mul64`) — with panic on overflow.
+- **Division & Modulo**: Integer division (`Div`, `Div64`) and remainder (`Mod`, `Mod64`), implemented via optimized quotient-remainder pairs (`QuoRem`, `QuoRem64`) using hardware-assisted 64-bit operations.
+- **Bit Manipulation**: Left/right shifts (`LeftShift`, `RightShift`), and bitwise logic: AND, OR, XOR, NOT.
+- **Utility**: Direct access to low limb via `AsUint64()`.
+
+All operations preserve 128-bit precision, with strict overflow checking for correctness in high-precision contexts (e.g., bioinformatics counting).
@@ -0,0 +1,17 @@
+# `obifp.Uint128` Package — Semantic Feature Overview
+
+This Go package provides a 128-bit unsigned integer type (`Uint128`) with comprehensive arithmetic, comparison, and bitwise operations. Internally represented as two `uint64` limbs (`w1`: high, `w0`: low), it supports:
+
+- **Arithmetic Operations**  
+  - `Add`, `Sub`, `Mul` (128×128), and `Mul64` (scalar multiplication)  
+  - Division: `Div`, `Mod`, and combined quotient/remainder via `QuoRem` (and their 64-bit variants)  
+- **Comparison & Equality**  
+  - `Cmp`, `Equals`, `LessThan`/`GreaterThan`, and their inclusive variants (`≤`, `≥`)  
+  - Support for comparing against both `Uint128` and native `uint64` values  
+- **Bitwise Operations**  
+  - Logical AND (`And`), OR (`Or`), XOR (`Xor`) between two `Uint128`s  
+  - Bitwise NOT (`Not`) — inverts all bits of the value  
+- **Conversion & Utility**  
+  - `AsUint64()` returns the lower 64 bits; the result is lossless only when the upper limb is zero  
+
+All operations handle overflow/underflow correctly, including carry propagation in addition and borrow handling in subtraction. Tests cover edge cases: zero values, max `uint64` boundaries (e.g., wrapping in addition/subtraction), and large multiplications. Designed for cryptographic or high-precision numeric use where native integer types are insufficient.
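Carry propagation across the two limbs can be illustrated with `math/bits`. The `u128` type below is a hypothetical stand-in that reuses the `w1`/`w0` limb names. Panicking on overflow follows the description of `Add` and `Mul64` given earlier and is an assumption about the exact behaviour.

```go
package main

import (
	"fmt"
	"math/bits"
)

// u128 is a hypothetical two-limb integer: w1 is the high word, w0 the low.
type u128 struct{ w1, w0 uint64 }

// add propagates the carry from the low limb into the high limb and panics
// if the sum does not fit in 128 bits.
func (u u128) add(v u128) u128 {
	w0, carry := bits.Add64(u.w0, v.w0, 0)
	w1, carry := bits.Add64(u.w1, v.w1, carry)
	if carry != 0 {
		panic("u128 overflow in add")
	}
	return u128{w1, w0}
}

// mul64 multiplies by a 64-bit scalar using two 64x64->128 products.
func (u u128) mul64(v uint64) u128 {
	hi0, lo0 := bits.Mul64(u.w0, v) // low limb times v
	hi1, lo1 := bits.Mul64(u.w1, v) // high limb times v
	w1, carry := bits.Add64(lo1, hi0, 0)
	if hi1 != 0 || carry != 0 {
		panic("u128 overflow in mul64")
	}
	return u128{w1, lo0}
}

func main() {
	max64 := u128{0, ^uint64(0)}
	fmt.Println(max64.add(u128{0, 1}))  // carry moves into the high limb
	fmt.Println(u128{0, 1 << 63}.mul64(4))
}
```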
@@ -0,0 +1,30 @@
+# Uint256 Type and Operations — Semantic Overview
+
+The `obifp` package provides a custom 256-bit unsigned integer type (`Uint256`) implemented in Go, composed of four 64-bit limbs (`w0` to `w3`). It supports arithmetic, comparison, bitwise operations, and safe casting with overflow detection.
+
+- **Core Representation**: `Uint256` stores values as four 64-bit words, enabling fixed-width unsigned integers up to $2^{256} - 1$.
+
+- **Utility Methods**:
+  - `Zero()` / `MaxValue()`: Return the neutral and maximum values.
+  - `IsZero()`, `Equals(v)`, comparison methods (`LessThan`, etc.): Enable logical and ordering checks.
+
+- **Casting & Conversion**:
+  - `Uint64()`, `Uint128()` downcast with warnings on overflow.
+  - `Set64(v)`: Initializes from a standard `uint64`.
+  - `AsUint64()`: Direct access to least-significant limb.
+
+- **Bitwise Operations**:
+  - `And`, `Or`, `Xor`, `Not`: Standard bitwise logic per limb.
+
+- **Shifts**:
+  - `LeftShift(n)` / `RightShift(n)`: Multi-limb shifts with carry propagation.
+
+- **Arithmetic**:
+  - `Add(v)`, `Sub(v)`, `Mul(v)`: Use Go’s `math/bits` for carry-aware operations; panic on overflow.
+  - `Div(v)`: Implements long division via repeated subtraction of shifted multiples; panics on zero divisor.
+
+- **Safety & Logging**:
+  - Warnings via `obilog.Warnf` for silent overflows during narrowing casts.
+  - Panics on arithmetic overflow or division-by-zero using `log.Panicf`.
+
+This type is suitable for cryptographic, genomic (OBITools), or high-precision counting use cases requiring precise control over large unsigned integers.
@@ -0,0 +1,34 @@
+# Uint64 Type Functionalities Overview
+
+The `obifp` package provides a custom `Uint64` type wrapping Go’s native 64-bit unsigned integer (`uint64`) to support arithmetic, bitwise operations, and type conversions in a structured way.
+
+## Core Operations
+
+- **`Zero()` / `MaxValue()`**: Returns the zero and maximum representable values, respectively.
+- **`IsZero()` / `Equals(v)`**: Checks if the value is zero or equal to another.
+- **`Cmp(v)`, `LessThan(v)`**, etc.: Standard comparison operations returning `-1/0/+1` or boolean results.
+
+## Arithmetic with Overflow Detection
+
+- **Add/Sub/Mul**: Performs 64-bit addition, subtraction, and multiplication.
+  - Uses `math/bits` for low-level operations (`bits.Add64`, etc.).
+  - Panics on overflow (carry ≠ 0), enforcing strict safety.
+
+## Bitwise Operations
+
+- **`And`, `Or`, `Xor`, `Not()`**: Standard bitwise logic operations.
+- **`LeftShift(n)` / `RightShift(n)`**:
+  - Shifts bits left/right by *n* positions.
+  - Uses internal `LeftShift64`/`RightShift64`, supporting *carry-in* for multi-word arithmetic.
+
+## Extended Precision Conversions
+
+- **`Uint128()` / `Uint256()`**: Casts the 64-bit value into larger unsigned integer types (zero-extended).
+- **`Set64(v)`**: Reassigns the internal value from a raw `uint64`.
+
+## Utility & Logging
+
+- **`AsUint64()`**: Extracts the underlying `uint64`.
+- **Warning on overflow in shift operations** (e.g., shifts ≥ 128 bits) via `obilog.Warnf`.
+
+> Designed for use in high-precision or cryptographic contexts where explicit overflow handling and type safety are critical.
@@ -0,0 +1,32 @@
+# Obifp Package: Generic Fixed-Width Unsigned Integer Operations
+
+This Go package (`obifp`) provides a generic, type-safe interface for fixed-width unsigned integer arithmetic over three size variants: `Uint64`, `Uint128`, and `Uint256`.
+
+## Core Interface: `FPUint[T]`
+
+The interface defines a unified API for unsigned integer types, supporting:
+
+- **Initialization & Conversion**:  
+  - `Zero()`, `Set64(v)`: Create zero or set from a `uint64`.  
+  - `AsUint64()`: Downcast to standard `uint64`.
+
+- **Logical Operations**:  
+  - Bitwise: `And`, `Or`, `Xor`, `Not`.  
+  - Shifts: `LeftShift(n)`, `RightShift(n)`.
+
+- **Arithmetic**:  
+  - Addition (`Add`), subtraction (`Sub`), multiplication (`Mul`). Division is commented out—likely reserved for future implementation.
+
+- **Comparison**:  
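+  - Full ordering (less than, less than or equal, greater than, greater than or equal), exposed as methods because Go has no operator overloading.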
+
+- **Utility Predicates**:  
+  - `IsZero()` for zero-checking.
+
+## Helper Functions
+
+- `ZeroUint[T]`: Returns the neutral element (zero) for type `T`.  
+- `OneUint[T]`: Constructs value 1 via `Set64(1)`.  
+- `From64[T]`: Converts a standard Go `uint64` into the generic type.
+
+All operations are **method-chaining friendly** (return `T`, not pointers), enabling fluent syntax. The design promotes correctness and performance in cryptographic or financial contexts where large, fixed-size integers are required.
@@ -0,0 +1,30 @@
+# `obigraph` Package: Semantic Overview
+
+The `obigraph` package provides a generic, type-safe undirected/directed graph implementation in Go. Its core features include:
+
+- **Generic Graph Structure**: Parametrized over vertex type `V` and edge data type `T`, enabling flexible use with arbitrary user-defined types.
+- **Bidirectional Edge Tracking**: Maintains both forward (`Edges`) and reverse (`ReverseEdges`) adjacency maps for efficient neighbor/parent queries.
+- **Edge Management**:
+  - `AddEdge`: Adds an *undirected* edge (inserted in both directions).
+  - `AddDirectedEdge`: Adds a *directed* edge (only one direction).
+  - `SetAsDirectedEdge`: Converts an existing undirected edge into a directed one by removing the reverse link.
+- **Graph Queries**:
+  - `Neighbors(v)`: Returns all adjacent vertices (outgoing in directed case).
+  - `Parents(v)`: Returns incoming neighbors via reverse adjacency.
+  - `Degree(v)` / `ParentDegree(v)`: Compute vertex degrees (total or incoming).
+- **Customizable Vertex/Edge Properties**:
+  - `VertexWeight`, `EdgeWeight`: Funcs to assign weights (default: constant weight = 1.0).
+  - `VertexId`: Custom vertex label generator (default: `"V%d"`).
+
+- **GML Export**:
+  - `Gml(...)` / `WriteGml(...)`: Generates or writes a Graph Modelling Language (GML) representation.
+  - Supports directed/undirected modes, degree-based filtering (`min_degree`), and visual styling:
+    - Vertex shape: `circle` if weight ≥ threshold, else `rectangle`.
+    - Size scaled by square root of vertex weight.
+  - Uses Go’s `text/template` for rendering.
+
+- **File I/O**: Directly writes GML to file via `WriteGmlFile(...)`.
+
+- **Logging & Safety**: Uses Logrus for bounds-checking errors; panics on template parsing/writing failures.
+
+The package is designed for lightweight, high-performance graph modeling and visualization-ready export.
@@ -0,0 +1,14 @@
+# `obigraph.GraphBuffer` Feature Overview
+
+The `GraphBuffer[V, T]` type provides a **thread-safe graph construction interface** using buffered edge insertion via Go channels.
+
+- **Asynchronous Edge Addition**: Edges are enqueued through a `chan Edge[T]`, processed in the background by a goroutine that updates an underlying static graph (`Graph[V, T]`).  
+- **Non-blocking API**: `AddEdge` and `AddDirectedEdge` are asynchronous: they send to the channel without waiting for the graph mutation, enabling high-throughput edge ingestion.  
+- **Graph Initialization**: `NewGraphBuffer` initializes both the graph and a dedicated worker goroutine to consume edges.  
+- **GML Export Support**: Full support for exporting the final graph in [Graph Modelling Language (GML)](https://en.wikipedia.org/wiki/Graph_Modelling_Language), with optional filtering (`min_degree`) and layout parameters (`threshold`, `scale`).  
+- **File & Stream Output**: Methods `WriteGml` and `WriteGmlFile` allow writing GML to any `io.Writer`, including files.  
+- **Resource Cleanup**: The explicit `Close()` method terminates the worker goroutine by closing the channel, ensuring clean shutdown.  
+- **Generic Design**: Fully generic over vertex (`V`) and edge data types (`T`), supporting arbitrary value semantics.  
+
+> ⚠️ **Note**: The buffer is *not* safe for concurrent `AddEdge` calls without external synchronization beyond channel semantics.  
+> ✅ Ideal for producer-consumer patterns where edges are streamed from multiple goroutines into a single graph.
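The channel-plus-worker arrangement can be sketched as follows. The `graphBuffer` type, its fields and the adjacency-list representation are hypothetical. The sketch assumes a buffered channel, so a send only blocks when the buffer is full, and it makes `close` wait for the worker so that every queued edge is applied before the graph is read.

```go
package main

import (
	"fmt"
	"sync"
)

type edge struct{ from, to int }

// graphBuffer is a hypothetical miniature of the type described above.
type graphBuffer struct {
	ch   chan edge
	done chan struct{}
	adj  map[int][]int // touched only by the worker until close returns
}

func newGraphBuffer(capacity int) *graphBuffer {
	b := &graphBuffer{
		ch:   make(chan edge, capacity),
		done: make(chan struct{}),
		adj:  map[int][]int{},
	}
	go func() {
		defer close(b.done)
		for e := range b.ch { // single consumer: no lock needed on adj
			b.adj[e.from] = append(b.adj[e.from], e.to)
		}
	}()
	return b
}

// addDirectedEdge enqueues an edge; it returns before the graph is updated.
func (b *graphBuffer) addDirectedEdge(from, to int) { b.ch <- edge{from, to} }

// close stops the worker, waits for the queue to drain, and returns the graph.
func (b *graphBuffer) close() map[int][]int {
	close(b.ch)
	<-b.done
	return b.adj
}

func main() {
	b := newGraphBuffer(64)
	var wg sync.WaitGroup
	for p := 0; p < 4; p++ {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				b.addDirectedEdge(p, i)
			}
		}(p)
	}
	wg.Wait() // all producers must finish before the channel is closed
	adj := b.close()
	fmt.Println(len(adj[0]), len(adj[3]))
}
```

In this sketch the graph is mutated by one goroutine only, so producers need no lock; they must still all return before `close` is called, since sending on a closed channel panics.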
@@ -0,0 +1,29 @@
+# BioSequenceBatch: A Container for Ordered Biological Sequences
+
+`BioSequenceBatch` is a structured data type encapsulating an ordered collection of biological sequences (`obiseq.BioSequenceSlice`) along with metadata: a `source` identifier and an integer `order`. It serves as a lightweight, immutable-friendly container for batch processing in bioinformatics pipelines.
+
+## Core Properties
+- **`source`**: String identifying the origin (e.g., file, pipeline stage).
+- **`order`**: Integer defining processing sequence or priority.
+- **`slice`**: Holds the actual sequences via `obiseq.BioSequenceSlice`.
+
+## Key Functionalities
+- **Construction**:  
+  `MakeBioSequenceBatch(source, order, sequences)` creates a new batch.
+- **Accessors**:  
+  `Source()`, `Order()` return metadata; `Slice()` exposes the sequence slice.
+- **Mutation (via copy)**:  
+  `Reorder(newOrder)` returns a new batch with updated order.
+- **Size & emptiness**:  
+  `Len()` gives sequence count; `NotEmpty()` checks non-emptiness.
+- **Consumption**:  
+  `Pop0()` removes and returns the first sequence (FIFO behavior).
+- **Safety**:  
+  `IsNil()` detects uninitialized batches; a global `NilBioSequenceBatch` sentinel exists.
+
+## Design Notes
+- Instances are value types (struct), enabling safe copying.
+- Operations follow Go idioms: methods return updated values rather than mutating in place (except internal slice mutation via `Pop0`).
+- Designed for interoperability with the OBITools4 ecosystem (`obiseq` package).
+
+This abstraction supports modular, traceable sequence processing workflows—ideal for pipeline stages where ordering and provenance matter.
@@ -0,0 +1,47 @@
+# `obiiter`: Stream-Based Biosequence Iterator Library
+
+This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
+
+## Core Functionality
+
+- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
+- **Thread Safety**: Uses `sync.WaitGroup`, RWMutex, and atomic flags for safe concurrent access.
+- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
+
+## Iterator Management
+
+- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
+- **Lifecycle Control**:
+  - `Add(n)`, `Done()`: Track active workers, in the manner of `sync.WaitGroup`.
+  - `Lock/RLock` and `Unlock/RUnlock`: Explicit synchronization.
+  - `Wait()` / `Close()`, `WaitAndClose()`: Graceful shutdown.
+
+## Batch Transformation & Reorganization
+
+- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
+- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
+- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
+- **Concatenation & Pooling**:
+  - `Concat(...)`: Sequentially merges multiple iterators.
+  - `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
+
+## Filtering & Predicate-Based Processing
+
+- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
+- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
+- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
+
+## Utility & Analysis
+
+- **`Load()`**: Collects all sequences into a single slice (for small datasets).
+- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
+- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
+- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.
+
+## Additional Features
+
+- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
+- Batch ordering preserved for downstream reproducibility.
+- Integrates with OBITools4’s `obidefault`, `obiutils` for config and resource management.
+
+> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.
@@ -0,0 +1,32 @@
+# `IDistribute`: Semantic Description of Biosequence Distribution Functionality
+
+The `IDistribute` type implements a thread-safe mechanism for distributing biosequences into classified, batched outputs.
+
+- **Core Purpose**: Enables concurrent processing of sequences by routing them to dedicated output channels based on classification keys.
+
+- **Key Fields**:
+  - `outputs`: A map from integer class codes to output streams (`IBioSequence`).
+  - `news`: An unbuffered channel emitting class codes when new output streams are created.
+  - `classifier`: A pointer to a sequence classifier used to assign sequences to keys during distribution.
+
+- **Thread Safety**: All access to shared state (`outputs`, `slices`) is synchronized via a mutex.
+
+- **Batching Strategy**:
+  - Sequences are accumulated per class key until either `BatchSizeMax()` sequences or `BatchMem()` bytes (per key) are reached.
+  - Batches are flushed automatically and on finalization.
+
+- **Asynchronous Processing**:
+  - The `Distribute()` method launches a goroutine that consumes the input iterator, classifies each sequence, and feeds batches to per-key outputs.
+  - Outputs are closed only after all sequences have been processed.
+
+- **Notifications**:
+  - The `News()` channel allows consumers to be notified of newly created output streams (i.e., when a new class key appears).
+
+- **Error Handling**:
+  - `Outputs(key)` returns an error if the requested key has no associated output.
+
+- **Integration**:
+  - Leverages `obidefault.BatchSizeMax()` and `BatchMem()` for configurable batch limits.
+  - Uses `SortBatches()` on the input iterator to ensure ordered processing.
+
+In summary, `IDistribute` provides a scalable, concurrent pipeline for classifying and batching biosequences based on user-defined classification logic.
@@ -0,0 +1,24 @@
+# `ExtractTaxonomy` Function — Semantic Description
+
+The `ExtractTaxonomy` method is a core utility in the `obiiter` package, designed to aggregate taxonomic information across biological sequences processed by an iterator.
+
+- **Input**:  
+  - A pointer to `IBioSequence`, representing a sequence iterator over biological data.  
+  - A boolean flag `seqAsTaxa`: if true, each full sequence is treated as a single taxonomic unit; otherwise, individual elements within slices are processed separately.
+
+- **Process**:  
+  - Iterates through all sequences via `iterator.Next()` and retrieves each current slice using `Get().Slice()`.  
+  - For every slice, it calls the underlying `.ExtractTaxonomy()` method (from `obitax`), progressively building or updating a shared `*obitax.Taxonomy` object.  
+  - Stops and returns immediately upon encountering the first error during taxonomy extraction.
+
+- **Output**:  
+  - Returns a fully populated `*obitax.Taxonomy` object (or partial result if early failure occurs).  
+  - Returns `nil` error on success; otherwise, returns the first encountered error.
+
+- **Semantic Role**:  
+  Enables scalable taxonomic profiling of high-throughput sequencing data by delegating per-slice extraction logic to the `obitax` module, while ensuring robust iteration and error handling.
+
+- **Dependencies**:  
+  Relies on `obitax.Taxonomy` for structured taxonomic representation and assumes slices implement the `.ExtractTaxonomy()` interface.
+
+This function exemplifies a *map-reduce*-style pattern: mapping taxonomy extraction over slices, and reducing results into a unified taxonomic summary.
@@ -0,0 +1,28 @@
+# `IFragments` Functionality Overview
+
+The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.
+
+## Core Parameters
+- `minsize`: Length threshold; sequences no longer than `minsize` are left unfragmented.
+- `length`: Desired fragment size (in bases/amino acids).
+- `overlap`: Number of overlapping residues between consecutive fragments.
+- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).
+
+## Workflow
+1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
+2. **Parallel Fragmentation**:
+   - Each worker processes a subset of batches independently using goroutines.
+   - Each sequence longer than `minsize` is split into overlapping fragments of length `length`, advancing by a step of `length - overlap`.
+   - The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces.
+3. **Resource Management**:
+   - Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
+   - Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
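+
+The fragmentation rule in step 2 can be sketched in isolation. This is a minimal illustration, not the `obiiter` implementation: `fragmentBounds` is a hypothetical helper, and the exact fusion condition (the last fragment absorbs the tail when fewer than one step of residues would remain) is an assumption about how "fusion mode" behaves.
+
+```go
+package main
+
+import "fmt"
+
+// fragmentBounds returns half-open [start, end) intervals covering a
+// sequence of seqLen residues. Hypothetical helper, not part of obiiter.
+func fragmentBounds(seqLen, length, overlap int) [][2]int {
+	step := length - overlap
+	if step <= 0 || seqLen <= length {
+		// Degenerate parameters or a short sequence: keep it whole.
+		return [][2]int{{0, seqLen}}
+	}
+	var out [][2]int
+	for start := 0; start < seqLen; start += step {
+		end := start + length
+		if end > seqLen {
+			end = seqLen
+		}
+		// Assumed fusion rule: if fewer than `step` residues would remain
+		// after this fragment, extend it to the end and stop.
+		if seqLen-end < step {
+			out = append(out, [2]int{start, seqLen})
+			break
+		}
+		out = append(out, [2]int{start, end})
+	}
+	return out
+}
+
+func main() {
+	fmt.Println(fragmentBounds(11, 4, 1)) // prints [[0 4] [3 7] [6 11]]
+}
+```
+
+Under these assumptions, consecutive fragments share `overlap` residues and the last interval always ends at the sequence end, so no short trailing piece is produced.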
+
+## Key Features
+- **Overlap handling**: Ensures contiguous coverage without gaps.
+- **Memory efficiency**: Uses recycling and batched output.
+- **Scalability**: Leverages Go concurrency via `nworkers`.
+- **Error safety**: Panics on subsequence errors (e.g., invalid indices).
+
+## Use Case
+Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).
@@ -0,0 +1,29 @@
+# Memory-Limited Biosequence Iterator
+
+This Go function extends an `IBioSequence` iterator with memory-aware throttling to prevent excessive heap allocation during data processing.
+
+## Core Functionality
+
+- **`LimitMemory(fraction float64)`**  
+  Returns a new iterator that respects an upper bound on heap usage relative to total system memory.
+
+- **Memory Monitoring**  
+  Uses `runtime.ReadMemStats()` and `github.com/pbnjay/memory.TotalMemory()` to compute the current heap fraction (`Alloc / TotalMemory`) dynamically.
+
+- **Backpressure Mechanism**  
+  While the memory fraction exceeds `fraction`, the producer goroutine yields control (`runtime.Gosched()`) until sufficient memory becomes available.
+
+- **Logging**  
+  Warns via `obilog.Warnf` when:
+  - Memory pressure persists (every ~1000 yields),
+  - Or wait duration becomes unusually long (>10,000 yielding cycles).
+
+- **Concurrency Model**  
+  - A producer goroutine consumes from the original iterator and pushes items to `newIter`, pausing as needed.
+  - A dedicated consumer goroutine calls `WaitAndClose()` to ensure graceful termination and resource cleanup.
+
+## Semantic Behavior
+
+- **Non-blocking consumer**: Downstream consumers are not stalled; they read from an internal buffered channel (`newIter`).
+- **Adaptive rate control**: The iterator automatically slows down when memory pressure rises, avoiding OOM conditions.
+- **Predictable resource use**: Ensures heap usage stays below the specified `fraction` (e.g., 0.5 → ≤ 50% of total RAM).
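+
+The backpressure loop described above can be approximated with the standard library alone. The sketch below is illustrative only: `heapFraction` and `waitForMemory` are hypothetical names, the total memory is passed in as a plain number instead of being read through `github.com/pbnjay/memory`, and the yield cap is an added safeguard that the description does not mention.
+
+```go
+package main
+
+import (
+	"fmt"
+	"runtime"
+)
+
+// heapFraction reports the live heap (MemStats.Alloc) as a fraction of total.
+func heapFraction(total uint64) float64 {
+	var m runtime.MemStats
+	runtime.ReadMemStats(&m)
+	return float64(m.Alloc) / float64(total)
+}
+
+// waitForMemory yields the processor while the heap fraction exceeds limit,
+// giving up after maxYields attempts. It returns the number of yields made.
+func waitForMemory(limit float64, total uint64, maxYields int) int {
+	yields := 0
+	for heapFraction(total) > limit && yields < maxYields {
+		runtime.Gosched()
+		yields++
+	}
+	return yields
+}
+
+func main() {
+	const total = 1 << 40 // assumed total RAM for the example: 1 TiB
+	fmt.Println(waitForMemory(0.5, total, 1000))
+}
+```
+
+A producer goroutine would call something like `waitForMemory` before pushing each batch. Note that `runtime.ReadMemStats` briefly stops the world, so polling it in a tight loop has a cost of its own.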
@@ -0,0 +1,19 @@
+# Semantic Description of `IMergeSequenceBatch` and `MergePipe`
+
+This code defines two related functions in the `obiiter` package for batch-wise merging of biological sequences during iteration.
+
+- **`IMergeSequenceBatch(na, statsOn, sizes...) IBioSequence → IBioSequence`**  
+  - Consumes an input sequence iterator (`IBioSequence`) and returns a new one.
+  - Groups incoming sequences into batches (default size: `100`, configurable via variadic argument).
+  - For each batch:
+    - Collects up to `batchsize` sequences via the input iterator.
+    - Applies `.Merge(na, statsOn)` on each sequence group (presumably merging reads based on `na`, e.g., nucleotide alignment or overlap).
+    - Wraps merged results into a `BioSequenceBatch` with ordering metadata.
+  - Emits batches asynchronously via goroutines; the output iterator is closed when input finishes.
+
+- **`MergePipe(na, statsOn, sizes...) Pipeable → func(IBioSequence) IBioSequence`**  
+  - A *pipeline combinator* (higher-order function), enabling functional composition.
+  - Returns a `Pipeable` — i.e., a transformation function compatible with iterator pipelines.
+
+**Semantic Purpose**:  
+Enables efficient, memory-smoothed merging of biological sequence reads (e.g., paired-end merges) in streaming fashion, with optional statistics tracking (`statsOn`) and configurable batching.
@@ -0,0 +1,35 @@
+# `NumberSequences` Function — Semantic Description
+
+The `NumberSequences` method assigns a unique sequential identifier (`seq_number`) to each biological sequence in an `IBioSequence` iterator, preserving consistency for paired-end reads.
+
+## Core Functionality
+
+- **Sequential numbering**: Assigns consecutive integers to sequences, beginning at `start` (0 by default, or a user-supplied value).
+- **Thread-safe**: Uses `sync.Mutex` and `atomic.Int64` to safely manage the global counter during concurrent processing.
+- **Paired-read support**: When input is paired (`IsPaired()`), both reads in a pair receive the *same* `seq_number`, ensuring alignment between mates.
+
+## Parallelization Strategy
+
+- **Default mode**: Uses multiple workers (`ParallelWorkers()`) for performance; batches are processed concurrently.
+- **Reordering mode**: If `forceReordering` is true:
+  - Input iterator is batch-sorted (`SortBatches()`).
+  - Parallelism disabled (1 worker) to ensure deterministic numbering order.
+
+## Implementation Details
+
+- Each goroutine processes its own split of the input iterator.
+- A shared `next_first` counter tracks the next available sequence number globally.
+- Locking ensures atomic increment and assignment, preventing race conditions.
+
+## Output
+
+Returns a new `IBioSequence` iterator:
+- Contains the same sequence batches (possibly reordered if sorted).
+- Each `BioSequence` object now carries a `"seq_number"` attribute.
+- Paired sequences are co-numbered and marked accordingly.
+
+## Use Cases
+
+- Preparing data for downstream tools requiring unique sequence IDs.
+- Maintaining cross-read identity in paired-end workflows (e.g., assembly, mapping).
+- Reproducible numbering across pipeline stages or restarts.
@@ -0,0 +1,17 @@
+# Paired-End Sequence Handling in `obiiter`
+
+This Go package provides semantic functionality for managing **paired-end biological sequences** within batched iterators.
+
+- `BioSequenceBatch` methods:
+  - **`IsPaired()`**: Checks whether the batch contains paired reads.
+  - **`PairedWith()`**: Returns a new batch containing only the mate (partner) of each read in the current batch.
+  - **`PairTo(*BioSequenceBatch)`**: Synchronizes and pairs reads between two batches *of identical order*; fails if orders differ.
+  - **`UnPair()`**: Removes pairing metadata, treating reads as unpaired.
+
+- `IBioSequence` (iterator) methods:
+  - **`MarkAsPaired()`**: Marks the iterator as producing paired-end data.
+  - **`PairTo(IBioSequence)`**: Combines two iterators into a new paired-end iterator by aligning corresponding batches and calling `PairTo` on each pair.
+  - **`PairedWith()`**: Generates a new iterator yielding only the mate reads (i.e., second ends) from an existing paired-end stream.
+  - **`IsPaired()`**: Returns whether the iterator was explicitly marked as paired.
+
+All operations preserve batched processing and concurrency via goroutines, ensuring efficient handling of large NGS datasets while maintaining semantic correctness for paired-end workflows.
@@ -0,0 +1,17 @@
+# Semantic Description of `obiiter` Package Features
+
+This Go package provides functional-style utilities for processing biological sequence data (e.g., FASTQ/FASTA), modeled via the `IBioSequence` interface.
+
+- **`Pipeable`**: A function type representing a unary transformation on an `IBioSequence`.  
+- **`Pipeline(start, parts...)`**: Composes a sequence of `Pipeable` operations into a single executable pipeline. It applies transformations sequentially: input → start → part₁ → … → output.
+
+- **`(IBioSequence).Pipe(start, parts...)`**: A convenience method enabling fluent chaining of transformations directly on a sequence object.
+
+- **`Teeable`**: A function type for operations that split input into two independent output streams (e.g., filtering + logging).
+
+- **`(IBioSequence).CopyTee()`**: A high-level tee operation that duplicates the input stream into two identical, concurrently readable `IBioSequence` instances.  
+  - Uses goroutines to ensure non-blocking parallel consumption.
+  - Ensures proper lifecycle management: closing the second stream when the first is closed.  
+  - Preserves paired-end status (`MarkAsPaired`) if applicable.
+
+Together, these features support modular, composable, and concurrent biosequence processing pipelines—ideal for scalable NGS data workflows.
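+
+The composition performed by `Pipeline` can be illustrated with a toy version that works on string slices instead of iterators. Everything here is a stand-in: `Stage` plays the role of `Pipeable`, and `upper` and `dropShort` are hypothetical transformations; the real types operate on `IBioSequence` and are not reproduced.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// Stage is a stand-in for Pipeable: a unary transformation on a stream.
+type Stage func([]string) []string
+
+// Pipeline chains start and parts into a single Stage, applied left to right.
+func Pipeline(start Stage, parts ...Stage) Stage {
+	return func(in []string) []string {
+		out := start(in)
+		for _, p := range parts {
+			out = p(out)
+		}
+		return out
+	}
+}
+
+// upper converts every sequence to upper case.
+func upper(in []string) []string {
+	out := make([]string, len(in))
+	for i, s := range in {
+		out[i] = strings.ToUpper(s)
+	}
+	return out
+}
+
+// dropShort removes sequences shorter than four residues.
+func dropShort(in []string) []string {
+	var out []string
+	for _, s := range in {
+		if len(s) >= 4 {
+			out = append(out, s)
+		}
+	}
+	return out
+}
+
+func main() {
+	p := Pipeline(upper, dropShort)
+	fmt.Println(p([]string{"acgt", "ac", "ttgca"})) // prints [ACGT TTGCA]
+}
+```
+
+Because the composed value is itself a `Stage`, pipelines built this way can be nested or reused as parts of larger pipelines.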
@@ -0,0 +1,28 @@
+# `MakeSetAttributeWorker` Functionality Overview
+
+The function `MakeSetAttributeWorker(rank string) obiiter.SeqWorker` constructs a reusable sequence-processing worker for taxonomic annotation.
+
+- **Input validation**: It first verifies that the provided `rank` is part of a predefined taxonomic hierarchy (`taxonomy.RankList()`). If invalid, it terminates execution with an informative error.
+
+- **Worker construction**: It returns a closure (`obiiter.SeqWorker`) — essentially a function that transforms biological sequences.
+
+- **Core behavior**: For each input `*obiseq.BioSequence`, it calls `taxonomy.SetTaxonAtRank(sequence, rank)`. This likely assigns or updates the taxonomic label (e.g., species, genus) at the specified rank in the sequence’s metadata.
+
+- **Purpose**: Enables modular, pipeline-friendly taxonomic annotation — e.g., in bioinformatics workflows where sequences must be annotated hierarchically (e.g., from phylum down to species).
+
+- **Design pattern**: Follows the *functional factory* and *worker interface* patterns, promoting composability in sequence processing pipelines.
+
+- **Side effects**: Modifies the input `BioSequence` *in-place* (via mutation of its taxonomic metadata), then returns it.
+
+- **Use case example**:  
+  ```go
+  worker := MakeSetAttributeWorker("species")
+  seq = worker(seq) // annotates `seq` with species-level taxon
+  ```
+
+- **Assumptions**:  
+   - `taxonomy.SetTaxonAtRank` exists and handles rank-specific taxon assignment.  
+   - Taxonomic ranks are ordered, finite, and validated (e.g., `["domain", "phylum", ..., "species"]`).  
+   - Sequences carry mutable taxonomic metadata.
+
+- **Error handling**: Fails fast on invalid rank input, preventing silent misannotation.
@@ -0,0 +1,31 @@
+# `Speed` Functionality Description
+
+The provided Go code defines a method and helper function to add **real-time progress tracking** to biosequence iterators in the OBITools4 framework.
+
+## Core Features
+
+- **Non-intrusive progress bar**:  
+  The `Speed()` method wraps an existing iterator and displays a visual progress indicator on stderr, using the [`progressbar`](https://github.com/schollz/progressbar) library.
+
+- **Conditional rendering**:  
+  The progress bar is only shown when:
+    - `--no-progressbar` flag is *not* set (via `obidefault.ProgressBar()`),
+    - stderr is connected to a terminal (`os.ModeCharDevice`),
+    - stdout is *not* piped (to avoid interfering with file output).
+
+- **Batch-aware counting**:  
+  Progress is updated per batch (`batch.Len()`), not item-by-item, for efficiency and smoother UI updates (throttled to ≥100ms).
+
+- **Paired-end support**:  
+  If the input iterator is paired (`IsPaired()`), this property is preserved in the returned iterator.
+
+- **Pipeable wrapper**:  
+  `SpeedPipe()` enables integration into functional pipelines (e.g., `.Map(...).Filter(...)`) by returning a `Pipeable` function.
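+
+The conditional-rendering test can be sketched with `os.File.Stat`. This is an assumption-laden illustration: `isCharDevice`, `isPipe`, and `showProgress` are hypothetical helpers, and the `enabled` argument stands in for the value returned by `obidefault.ProgressBar()`.
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+)
+
+// isCharDevice reports whether f is a character device such as a terminal.
+func isCharDevice(f *os.File) bool {
+	info, err := f.Stat()
+	if err != nil {
+		return false
+	}
+	return info.Mode()&os.ModeCharDevice != 0
+}
+
+// isPipe reports whether f is a pipe (for example `prog | other`).
+func isPipe(f *os.File) bool {
+	info, err := f.Stat()
+	if err != nil {
+		return false
+	}
+	return info.Mode()&os.ModeNamedPipe != 0
+}
+
+// showProgress combines the three conditions listed above.
+func showProgress(enabled bool, stderr, stdout *os.File) bool {
+	return enabled && isCharDevice(stderr) && !isPipe(stdout)
+}
+
+func main() {
+	fmt.Println(showProgress(true, os.Stderr, os.Stdout))
+}
+```
+
+The result depends on how the program is launched: it is typically `true` in an interactive shell and `false` when stderr is redirected to a file.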
+
+## Implementation Highlights
+
+- Uses goroutines to decouple iteration and progress updates.
+- Automatically closes the output iterator when input ends (`WaitAndClose()`).
+- Prints a final newline to stderr upon completion.
+
+This utility enhances user experience during long-running sequence processing (e.g., FASTQ parsing, alignment), without affecting correctness or performance in non-interactive contexts.
@@ -0,0 +1,20 @@
+# Semantic Description of `obiiter` Package Functionalities
+
+This Go package (`obiiter`) provides utilities for applying functional transformations to biological sequence iterators, supporting parallel execution and modular piping.
+
+- **`MakeIWorker(worker, breakOnError bool, sizes ...int)`**:  
+  Applies a `SeqWorker` (sequence-to-sequence transformation) to each sequence in the iterator. Supports configurable parallelism (`nworkers`) and optional channel buffering via `sizes`. Uses internal conversion to slice-based workers.
+
+- **`MakeIConditionalWorker(predicate, worker, breakOnError bool, sizes ...int)`**:  
+  Applies a `SeqWorker` only to sequences satisfying a given boolean `predicate`. Enables conditional, parallelized processing while preserving iterator semantics.
+
+- **`MakeISliceWorker(worker, breakOnError bool, sizes ...int)`**:  
+  Core method applying a `SeqSliceWorker` (batch-level transformation) across slices of sequences. Implements multi-goroutine parallelism using `nworkers`. Handles errors optionally via fatal logging (`breakOnError`). Preserves paired-end metadata.
+
+- **`WorkerPipe(worker, breakOnError bool, sizes ...int)`**:  
+  Returns a `Pipeable` closure wrapping `MakeIWorker`, enabling composition in pipeline chains (e.g., for CLI or DSL-style workflows).
+
+- **`SliceWorkerPipe(worker, breakOnError bool, sizes ...int)`**:  
+  Similar to `WorkerPipe`, but for slice-level workers (`SeqSliceWorker`). Facilitates modular, reusable pipeline stages.
+
+All methods support optional size arguments to override default parallelism (from `obidefault`). Internally, they rely on Go concurrency primitives (`go`, channels) and structured batch processing via `IBioSequence` interface.
@@ -0,0 +1,33 @@
+# `obiitercsv`: CSV Record Iterator for Streaming and Batch Processing
+
+This Go package provides a thread-safe, channel-based iterator (`ICSVRecord`) for streaming and processing CSV records in batches. It supports ordered batch handling, concurrent access via mutexes, and dynamic header management.
+
+## Core Types
+
+- **`CSVHeader`**: A slice of strings representing column names.
+- **`CSVRecord`**: A map from field name to value (`map[string]interface{}`).
+- **`CSVRecordBatch`**: A batch of records with metadata: `source`, `order`, and the actual data slice.
+
+## Key Features
+
+- **Streaming via Channels**: Records are consumed as `CSVRecordBatch` items through a channel, enabling asynchronous producers/consumers.
+- **Ordered Processing**: Batches include an `order` field, used by `SortBatches()` to reconstruct sequential order even when received out-of-order.
+- **Thread Safety**: Uses `sync.RWMutex`, atomic operations (`batch_size`), and `abool.AtomicBool` for flags like `finished`.
+- **Iterator Protocol**: Implements standard methods:  
+  - `Next()` to advance,  
+  - `Get()` to retrieve current batch,  
+  - `PushBack()` for re-queuing the last record.
+- **Batch Management**:  
+  - `SetHeader()` / `AppendField()`: dynamic header updates.  
+  - `Split()`: creates a new iterator sharing the same channel but with independent locking.
+- **Lifecycle Control**:  
+  - `Add()` / `Done()`: track active goroutines (via `sync.WaitGroup`).  
+  - `WaitAndClose()` ensures all data is flushed before closing the channel.
+
+## Utility Methods
+
+- **`NotEmpty()`, `IsNil()`**: Check batch validity.
+- **`Consume()`**: Drains the iterator (e.g., for side-effect processing).
+- **`SortBatches()`**: Reorders batches by `order`, buffering out-of-sequence ones.
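+
+The buffering strategy behind `SortBatches()` can be sketched as follows. The `batch` type and `sortBatches` function are simplified, hypothetical stand-ins for `CSVRecordBatch` and the real method, and the sketch assumes that `order` values run 0, 1, 2, … without gaps.
+
+```go
+package main
+
+import "fmt"
+
+// batch is a toy stand-in for CSVRecordBatch.
+type batch struct {
+	order int
+	data  []string
+}
+
+// sortBatches re-emits batches in increasing order, holding back any batch
+// that arrives before its predecessors.
+func sortBatches(in <-chan batch) <-chan batch {
+	out := make(chan batch)
+	go func() {
+		defer close(out)
+		pending := make(map[int]batch) // early arrivals, keyed by order
+		next := 0                      // order value expected next
+		for b := range in {
+			pending[b.order] = b
+			// Flush every batch that is now contiguous with the output.
+			for {
+				nb, ok := pending[next]
+				if !ok {
+					break
+				}
+				delete(pending, next)
+				out <- nb
+				next++
+			}
+		}
+	}()
+	return out
+}
+
+func main() {
+	in := make(chan batch, 3)
+	for _, o := range []int{2, 0, 1} {
+		in <- batch{order: o}
+	}
+	close(in)
+	for b := range sortBatches(in) {
+		fmt.Println(b.order)
+	}
+}
+```
+
+The memory held in `pending` grows with the distance between the earliest missing batch and the latest arrival, which is the price paid for strict ordering.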
+
+Designed for bioinformatics pipelines (e.g., OBITools4), it enables scalable, memory-efficient CSV processing with strict ordering guarantees.
@@ -0,0 +1,36 @@
+# Semantic Description of `obikmer` Package
+
+This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.
+
+## Core Functionalities
+
+1. **`Count4Mer(seq, buffer, counts)`**  
+   Counts occurrences of each of the 256 possible 4-mers (4-nucleotide subsequences) in a `BioSequence`.  
+   - Encodes each 4-mer into an integer (0–255) using `Encode4mer`.  
+   - Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.  
+   - Reuses or allocates the `counts` buffer as needed.
+
+2. **`Common4Mer(count1, count2)`**  
+   Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.  
+   Used to measure shared content between sequences.
+
+3. **`Sum4Mer(count)`**  
+   Returns the total number of 4-mers in a profile (i.e., sum over all entries).
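+
+The counting and intersection steps can be sketched on plain strings. The helpers below (`encodeNuc`, `count4mer`, `common4mer`) are hypothetical and simplified: they take a `string` instead of a `BioSequence`, skip the buffer-reuse parameters, and map any non-ACGT/U byte to `0`.
+
+```go
+package main
+
+import "fmt"
+
+// encodeNuc maps a base to 2 bits; unknown bytes are treated as A (0).
+func encodeNuc(b byte) byte {
+	switch b {
+	case 'C', 'c':
+		return 1
+	case 'G', 'g':
+		return 2
+	case 'T', 't', 'U', 'u':
+		return 3
+	}
+	return 0
+}
+
+// count4mer fills a 256-entry table with the count of every 4-mer in seq.
+func count4mer(seq string) [256]uint16 {
+	var counts [256]uint16
+	var code byte // rolling 8-bit code: the last four bases, 2 bits each
+	for i := 0; i < len(seq); i++ {
+		code = code<<2 | encodeNuc(seq[i]) // the oldest base falls off the top
+		if i >= 3 {
+			counts[code]++
+		}
+	}
+	return counts
+}
+
+// common4mer sums min(a[k], b[k]) over all 256 codes.
+func common4mer(a, b [256]uint16) int {
+	sum := 0
+	for k := range a {
+		if a[k] < b[k] {
+			sum += int(a[k])
+		} else {
+			sum += int(b[k])
+		}
+	}
+	return sum
+}
+
+func main() {
+	a := count4mer("ACGTACGT")
+	b := count4mer("ACGTA")
+	// ACGT encodes to 0b00011011 = 27.
+	fmt.Println(a[27], common4mer(a, b)) // prints 2 2
+}
+```
+
+A `uint16` counter saturates the design at 65,535 occurrences per 4-mer, which is ample for typical reads; this sketch does not guard against overflow on longer inputs.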
+
+## Distance & Similarity Bounds
+
+4. **`LCS4MerBounds(count1, count2)`**  
+   Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:  
+   - **Lower bound**: `common_kmers + (3 if common_kmers > 0 else 0)`  
+   - **Upper bound**: `min_total + 3 − ceil((min_total − common_kmers)/4)`, where `min_total = min(total1, total2)`  
+   Leverages the fact that overlapping k-mers constrain possible alignments.
+
+5. **`Error4MerBounds(count1, count2)`**  
+   Estimates bounds for *alignment errors* (e.g., mismatches + indels):  
+   - **Upper bound**: `max_total − common_kmers + 2 * floor((common_kmers + 5)/8)`  
+   - **Lower bound**: `ceil(upper_bound / 4)`  
+   Provides fast, approximate error estimates without full alignment.
+
+## Use Case
+
+Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.
@@ -0,0 +1,44 @@
+# Semantic Description of the `obikmer` Package
+
+This Go package implements a **De Bruijn graph** for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).
+
+### Core Functionalities
+
+- **K-mer Encoding**: K-mers are encoded as `uint64` using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the `iupac` map.
+- **Reverse Complement Handling**: The `revcompnuc` table enables nucleotide-wise reverse complementation.
+- **Graph Construction**: The `DeBruijnGraph` struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (`kmermask`, `prevc/g/t`).
+
+### Graph Operations
+
+- **Node Queries**:  
+  - `Previouses()` / `Nexts()`: Return predecessor/successor k-mers in the graph.  
+  - `MaxNext()` / `MaxHead()`: Find neighbors or heads (sources) with maximum weight.
+- **Path Exploration**:  
+  - `MaxPath()`: Greedily traces the highest-weight path from a head.  
+  - `LongestPath()`: Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).  
+  - `HaviestPath()`: Uses a Dijkstra-like priority queue to find the *heaviest* (sum-weight) path, with cycle detection via DFS (`HasCycle()`).
+
+### Consensus & Filtering
+
+- **Consensus Generation**:  
+  - `BestConsensus()` returns a sequence from the greedy max-weight path.  
+  - `LongestConsensus(id, min_cov)` trims low-coverage ends using a coverage threshold (mode-based).
+- **Weight Statistics**:  
+  - `MaxWeight()`, `WeightMean()`, `WeightMode()` provide distribution summaries.  
+  - `FilterMinWeight(min)` removes low-count nodes.
+- **Decoding**:  
+  - `DecodeNode()` converts a k-mer index to its DNA string.  
+  - `DecodePath()` reconstructs the full consensus from a path.
+
+### I/O & Diagnostics
+
+- **GML Export**: `WriteGml()` outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
+- **Hamming Distance**: `HammingDistance()` counts the positions at which two encoded k-mers differ, using bit operations.
+- **Sequence Insertion**: `Push()` adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.
+
+### Dependencies & Design
+
+- Leverages `obiseq` for sequence representation, the third-party `logrus` library for logging, and the Go `slices` and `container/heap` packages.
+- Designed for scalability and speed, using bit-level operations to minimize memory footprint.
+
+Overall: a robust k-mer graph engine for *de novo* assembly, error correction, and consensus recovery in high-throughput sequencing data.
@@ -0,0 +1,35 @@
+# Semantic Description of `obikmer` Package
+
+The `obikmer` package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.
+
+## Core Functionalities
+
+1. **Nucleotide Encoding**  
+   - `EncodeNucleotide(b byte)`: Maps DNA bases (A, C, G, T/U) to 2-bit values:  
+     `A→0`, `C→1`, `G→2`, `T/U→3`.  
+     Ambiguous or non-standard characters (e.g., N, R, Y) default to `A` (`0`).  
+     Uses a lookup table for O(1) performance.
+
+2. **4-mer Encoding**  
+   - `Encode4mer(seq, buffer)`: Converts a biological sequence into overlapping 4-mers.  
+     Each k-mer is encoded as an unsigned byte (0–255), where each nucleotide contributes 2 bits.  
+     Supports optional buffer reuse for memory efficiency.
+
+3. **4-mer Indexing**  
+   - `Index4mer(seq, index, buffer)`: Builds an inverted index mapping each 4-mer code (0–255) to all its occurrence positions in the sequence.  
+     Returns `[][]int`, where rows correspond to k-mer codes and columns list positions.
+
+4. **Fast Sequence Comparison**  
+   - `FastShiftFourMer(...)`: Compares two sequences using a FASTA-like shift-scoring algorithm.  
+     - Uses precomputed 4-mer index of a reference sequence and encodes the query.  
+     - Counts co-occurring 4-mers across all possible shifts (`refpos − queryPos`).  
+     - Computes raw and relative scores (normalized by alignment length).  
+     - Returns optimal shift, count of matching 4-mers, and maximum score (raw or relative).
+
+## Design Highlights
+
+- **Memory-aware**: Supports buffer reuse to minimize allocations.  
+- **Robustness**: Non-canonical bases handled gracefully (defaulting to A).  
+- **Performance-oriented**: O(n) encoding and indexing; efficient hash-based shift counting.  
+
+Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.
@@ -0,0 +1,39 @@
+# Semantic Description of `obikmer` Package
+
+The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
+
+## Core Encoding & Decoding
+
+- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact `uint64` representations (2 bits/base, at most 62 bits of sequence), reserving the top 2 bits for optional error markers.
+- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
+
+## Iterators (Memory-Efficient Streaming)
+
+- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
+- **`IterCanonicalKmersWithErrors`**: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for **odd k ≤ 31**.
+
+## Error Handling & Markers
+
+- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
+
+## Reverse Complement & Circular Normalization
+
+- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
+- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
+- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
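+
+The reverse-complement form of canonicalization can be sketched with 2-bit arithmetic. The functions below are hypothetical, simplified counterparts of the package functions: they take the k-mer as a `string`, ignore the top-2-bit error markers, and use a plain loop for the reverse complement.
+
+```go
+package main
+
+import "fmt"
+
+// encodeKmer packs a k-mer into a uint64, 2 bits per base (A=0, C=1, G=2, T/U=3).
+// Bytes outside ACGTU are treated as A in this sketch.
+func encodeKmer(kmer string) uint64 {
+	var code uint64
+	for i := 0; i < len(kmer); i++ {
+		var nuc uint64
+		switch kmer[i] {
+		case 'C', 'c':
+			nuc = 1
+		case 'G', 'g':
+			nuc = 2
+		case 'T', 't', 'U', 'u':
+			nuc = 3
+		}
+		code = code<<2 | nuc
+	}
+	return code
+}
+
+// reverseComplement reads bases from the low end and appends their
+// complements, which reverses and complements in one pass.
+func reverseComplement(code uint64, k int) uint64 {
+	var rc uint64
+	for i := 0; i < k; i++ {
+		rc = rc<<2 | (3 - code&3) // complement of a 2-bit base x is 3 - x
+		code >>= 2
+	}
+	return rc
+}
+
+// canonical returns the smaller of a k-mer and its reverse complement.
+func canonical(code uint64, k int) uint64 {
+	if rc := reverseComplement(code, k); rc < code {
+		return rc
+	}
+	return code
+}
+
+func main() {
+	acg := encodeKmer("ACG")
+	fmt.Println(acg, reverseComplement(acg, 3), canonical(encodeKmer("CGT"), 3)) // prints 6 27 6
+}
+```
+
+With the A < C < G < T encoding, numeric order on codes of equal `k` matches lexicographic order on the strings, so taking the smaller integer selects the lexicographically smaller k-mer.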
+
+## Counting & Math Utilities
+
+- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau’s necklace formula**, with Euler's totient function and divisor enumeration.
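+
+For reference, the necklace-counting formula is `N(n, a) = (1/n) · Σ_{d | n} φ(d) · a^(n/d)`, where `a` is the alphabet size and `φ` is Euler's totient. The sketch below implements that textbook formula for rotation classes only; `totient`, `ipow`, and `necklaces` are hypothetical names, and whether the package functions compute exactly this quantity is an assumption.
+
+```go
+package main
+
+import "fmt"
+
+// totient returns Euler's φ(n) by trial division.
+func totient(n int) int {
+	result := n
+	for p := 2; p*p <= n; p++ {
+		if n%p == 0 {
+			for n%p == 0 {
+				n /= p
+			}
+			result -= result / p
+		}
+	}
+	if n > 1 {
+		result -= result / n
+	}
+	return result
+}
+
+// ipow returns base raised to exp for small non-negative exponents.
+func ipow(base, exp int) int {
+	r := 1
+	for i := 0; i < exp; i++ {
+		r *= base
+	}
+	return r
+}
+
+// necklaces counts the rotation classes of words of length n over an
+// alphabet of the given size.
+func necklaces(n, alphabet int) int {
+	sum := 0
+	for d := 1; d <= n; d++ {
+		if n%d == 0 {
+			sum += totient(d) * ipow(alphabet, n/d)
+		}
+	}
+	return sum / n
+}
+
+func main() {
+	fmt.Println(necklaces(1, 4), necklaces(2, 4), necklaces(3, 4), necklaces(4, 4)) // prints 4 10 24 70
+}
+```
+
+For DNA (`alphabet = 4`), the 256 possible 4-mers therefore collapse into 70 rotation classes, which bounds the size of a lookup table indexed by circular canonical form.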
+
+## Performance & Safety
+
+- All functions avoid heap allocations where possible (reusing buffers).
+- Panics on invalid `k` or length mismatches for correctness.
+- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
+
+## Use Cases
+
+- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
+- Error-aware k-mer filtering in sequencing pipelines
+- Low-complexity region detection via circular entropy normalization
@@ -0,0 +1,36 @@
+# Obikmer: Efficient K-mer Encoding and Manipulation in Go
+
+This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
+
+## Core Functionalities
+
+### K-mer Encoding (`EncodeKmers`, `IterKmers`)
+Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Handles k-mers up to `k = 31` (62 bits), with validation for invalid `k` values.
+
+### Reverse Complement (`ReverseComplement`)
+Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: `RC(RC(x)) = x`.
+
+### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
+Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
+
+### Super *k*-mers Extraction (`ExtractSuperKmers`)
+Groups overlapping *k*-mers sharing the same minimizer (minimal *m*-mer in sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and minimizer values, all canonicalized.
+
+### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
+Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
+
+## Key Features
+
+- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
+- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
+- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
+- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
+
+## Use Cases
+
+- Genome assembly & de Bruijn graph (DBG) construction  
+- Minimizer-based sketching (e.g., *Mash*, *Sourmash*)  
+- Error-aware k-mer counting & filtering  
+- Strand-unbiased sequence comparison  
+
+All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.
@@ -0,0 +1,31 @@
+# Semantic Description of `obikmer` Entropy Functions
+
+The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
+
+## Core Functionality
+
+- **`KmerEntropy(kmer, k, levelMax)`**:  
+  Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.  
+  - Decodes the encoded *k*-mer (2 bits/base) into a DNA string.  
+  - For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.  
+  - Normalized entropy = `(log(N) − Σ(nᵢ log nᵢ)/N) / emax`, where `emax` is the theoretical max entropy given sequence length and alphabet constraints.  
+  - Returns min entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
+
+- **`KmerEntropyFilter`**:  
+  A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:  
+  - Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.  
+  - Avoids repeated allocations — critical for performance in pipelines (e.g., read filtering).  
+  - **Not goroutine-safe**: each goroutine must instantiate its own filter.
+
+- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:  
+  Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.  
+
+- **`Accept(kmer)` / `Entropy(kmer)`**:  
+  - `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).  
+  - `Entropy()` computes entropy using precomputed tables — ~10× faster than standalone calls.
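+
+A rough sketch of the entropy computation described for `KmerEntropy`, operating on a decoded string. It makes two simplifying assumptions that differ from the description above: words are counted as-is (no circular canonical normalization), and `emax` is approximated by `log(min(N, 4^ws))`. The function names are hypothetical.
+
+```go
+package main
+
+import (
+    "fmt"
+    "math"
+)
+
+// wordEntropy computes (log(N) - sum(n_i*log(n_i))/N) / emax for the
+// overlapping words of size ws in seq, where N is the number of words.
+func wordEntropy(seq string, ws int) float64 {
+    n := len(seq) - ws + 1
+    if n < 2 {
+        return 1 // too few words to measure repetition
+    }
+    counts := make(map[string]int)
+    for i := 0; i < n; i++ {
+        counts[seq[i:i+ws]]++
+    }
+    sum := 0.0
+    for _, c := range counts {
+        sum += float64(c) * math.Log(float64(c))
+    }
+    h := math.Log(float64(n)) - sum/float64(n)
+    // Assumed upper bound: at most min(N, 4^ws) distinct words can occur.
+    emax := math.Log(math.Min(float64(n), math.Pow(4, float64(ws))))
+    return h / emax
+}
+
+// minEntropy returns the minimum normalized entropy over word sizes 1..levelMax.
+func minEntropy(seq string, levelMax int) float64 {
+    best := 1.0
+    for ws := 1; ws <= levelMax && ws < len(seq); ws++ {
+        best = math.Min(best, wordEntropy(seq, ws))
+    }
+    return best
+}
+
+func main() {
+    fmt.Printf("homopolymer: %.3f\n", minEntropy("AAAAAAAAAAAA", 3))
+    fmt.Printf("mixed:       %.3f\n", minEntropy("ACGTACGGTCAT", 3))
+}
+```
+
+A filter in the spirit of `Accept()` would then compare this value against the configured threshold.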
+
+## Design Highlights
+
+- **Circular canonical normalization** ensures symmetry (e.g., `AT` ≡ `TA`).  
+- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.  
+- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
@@ -0,0 +1,37 @@
+# K-Way Merge for Sorted k-mer Streams
+
+This Go package implements a **k-way merge** over multiple sorted streams of *k*-mer values (`uint64`). It leverages a **min-heap** to efficiently produce the globally sorted sequence while aggregating duplicate counts across input streams.
+
+## Core Components
+
+- **`mergeItem`**: Stores a value and its source reader index for heap operations.
+- **`mergeHeap`**: Implements `heap.Interface` as a min-heap, so the smallest pending value is always retrieved first.
+- **`KWayMerge`**: Main struct managing the heap and input readers.
+
+## Key Functionality
+
+- **Initialization (`NewKWayMerge`)**:
+  - Takes a slice of `*KdiReader`, each expected to yield sorted values.
+  - Preloads the heap with one value from each reader.
+
+- **Streaming Output (`Next`)**:
+  - Returns the next smallest *k*-mer, its frequency across readers (i.e., how many input streams contained it), and a success flag.
+  - Handles duplicates: pops *all* items equal to the current minimum before advancing readers.
+
+- **Cleanup (`Close`)**:
+  - Closes all underlying `KdiReader`s and returns the first encountered error.
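+
+The merge loop can be sketched with the standard `container/heap` package. In this illustration in-memory sorted slices stand in for `KdiReader` streams, and the `merger` type and its methods are hypothetical names rather than the package's actual implementation.
+
+```go
+package main
+
+import (
+    "container/heap"
+    "fmt"
+)
+
+// item is a value plus the index of the stream it came from.
+type item struct {
+    value  uint64
+    source int
+}
+
+type minHeap []item
+
+func (h minHeap) Len() int           { return len(h) }
+func (h minHeap) Less(i, j int) bool { return h[i].value < h[j].value }
+func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
+func (h *minHeap) Push(x any)        { *h = append(*h, x.(item)) }
+func (h *minHeap) Pop() any {
+    old := *h
+    last := old[len(old)-1]
+    *h = old[:len(old)-1]
+    return last
+}
+
+// merger merges sorted slices and counts how many streams hold each value.
+type merger struct {
+    streams [][]uint64
+    pos     []int // next unread index per stream
+    h       minHeap
+}
+
+func newMerger(streams [][]uint64) *merger {
+    m := &merger{streams: streams, pos: make([]int, len(streams))}
+    for i, s := range streams {
+        if len(s) > 0 { // preload one value per non-empty stream
+            m.h = append(m.h, item{s[0], i})
+            m.pos[i] = 1
+        }
+    }
+    heap.Init(&m.h)
+    return m
+}
+
+// next returns the smallest pending value and the number of streams holding it.
+func (m *merger) next() (uint64, int, bool) {
+    if m.h.Len() == 0 {
+        return 0, 0, false
+    }
+    lowest, count := m.h[0].value, 0
+    for m.h.Len() > 0 && m.h[0].value == lowest {
+        it := heap.Pop(&m.h).(item)
+        count++
+        if p := m.pos[it.source]; p < len(m.streams[it.source]) {
+            heap.Push(&m.h, item{m.streams[it.source][p], it.source})
+            m.pos[it.source]++
+        }
+    }
+    return lowest, count, true
+}
+
+func main() {
+    m := newMerger([][]uint64{{1, 3, 5}, {3, 4}, {}, {5}})
+    for v, c, ok := m.next(); ok; v, c, ok = m.next() {
+        fmt.Println(v, c)
+    }
+}
+```
+
+Because each stream is strictly increasing, a value pushed while draining duplicates is always larger than the current minimum, so refilling inside the loop does not change the result.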
+
+## Use Case
+
+Ideal for merging sorted *k*-mer databases (e.g., from multiple files or processes), enabling:
+- Efficient deduplication with multiplicity tracking.
+- Scalable union/intersection operations on large *k*-mer sets.
+
+## Complexity
+
+| Operation | Time |
+|-----------|------|
+| `Next()`  | *O(c log k)*, where *c* is the number of readers holding the returned value |
+| Init      | *O(k)* |
+
+Where `k` = number of input readers (not the *k*-mer length).
@@ -0,0 +1,27 @@
+# K-Way Merge Functionality in `obikmer`
+
+This Go package provides utilities for merging sorted k-mer streams stored in `.kdi` files. Its core component is `KWayMerge`, which performs a k-way merge of multiple sorted input streams and aggregates duplicate k-mers by counting their occurrences.
+
+## Key Features
+
+- **Sorted K-Mer Input**: Reads k-mers from `.kdi` files via `KdiReader`, assuming each file contains *sorted* 64-bit unsigned integers (`uint64`).
+- **K-Way Merge**: Merges multiple sorted streams into a single globally sorted stream using an efficient priority queue (min-heap) internally.
+- **Count Aggregation**: When identical k-mers appear across multiple streams, the merge counts how many times each unique k-mer occurs.
+- **Memory-Efficient Streaming**: Processes data incrementally, avoiding full loading of all streams into memory.
+- **Robust Test Coverage**: Includes unit tests for:
+  - Basic merging with overlapping and non-overlapping values.
+  - Single-stream input (degenerate case).
+  - Empty streams handling.
+  - All identical k-mers across inputs.
+
+## API Highlights
+
+- `NewKdiReader(path)` — opens a `.kdi` file for reading.
+- `writeKdi(...)` (test helper) — writes sorted k-mers to a `.kdi` file.
+- `NewKWayMerge([]*KdiReader)` — constructs the merger from multiple readers.
+- `.Next()` → `(kmer uint64, count int, ok bool)` — yields next merged k-mer and its frequency; `ok=false` signals end-of-stream.
+- `.Close()` — cleans up resources.
+
+## Use Case
+
+Ideal for aggregating k-mer counts across multiple sequencing samples, where each sample's k-mers are pre-sorted and persisted. This enables scalable distributed counting without full in-memory deduplication.
@@ -0,0 +1,27 @@
+# KDI Reader: Streaming Delta-Varint Decoding for k-mers
+
+The `obikmer` package provides a high-performance, streaming reader for `.kdi` files—binary containers storing *sorted* k-mers (typically DNA substrings encoded as 64-bit integers). It supports both sequential and indexed access.
+
+## Core Features
+
+- **Streaming decoding**: K-mers are read incrementally using delta-varint compression to minimize I/O and memory footprint.
+- **Delta encoding**: After the first absolute `uint64`, subsequent values are stored as *deltas* (difference from previous), encoded via custom `DecodeVarint`.
+- **Magic & format validation**: A 4-byte magic header identifies the file format; a little-endian `uint64` stores the total count.
+- **Sparse indexing**: When paired with a `.kdx` index, `SeekTo(target)` enables fast forward-only jumps to positions ≥ target k-mer.
+- **Graceful fallback**: If `.kdx` is missing or invalid, the reader automatically degrades to sequential mode.
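+
+A minimal decoding sketch for the layout described above (magic, count, first absolute value, then deltas). It assumes the magic bytes `KDI\x01` and uses the standard library's unsigned varint as a stand-in for the package's custom `DecodeVarint`, whose exact encoding may differ. `decodeKdi` and `sample` are hypothetical helpers; a streaming reader would wrap a file in a `bufio.Reader` instead of decoding a byte slice.
+
+```go
+package main
+
+import (
+    "bytes"
+    "encoding/binary"
+    "errors"
+    "fmt"
+    "io"
+)
+
+// decodeKdi decodes magic, count, first value, and varint deltas.
+func decodeKdi(data []byte) ([]uint64, error) {
+    r := bytes.NewReader(data)
+    head := make([]byte, 4)
+    if _, err := io.ReadFull(r, head); err != nil {
+        return nil, err
+    }
+    if !bytes.Equal(head, []byte("KDI\x01")) {
+        return nil, errors.New("bad magic")
+    }
+    var count uint64
+    if err := binary.Read(r, binary.LittleEndian, &count); err != nil {
+        return nil, err
+    }
+    if count == 0 {
+        return nil, nil
+    }
+    var prev uint64
+    if err := binary.Read(r, binary.LittleEndian, &prev); err != nil {
+        return nil, err
+    }
+    out := []uint64{prev}
+    for uint64(len(out)) < count {
+        delta, err := binary.ReadUvarint(r)
+        if err != nil {
+            return nil, err
+        }
+        prev += delta // each value is the previous one plus its delta
+        out = append(out, prev)
+    }
+    return out, nil
+}
+
+// sample builds a tiny stream holding 100, 105, 405.
+func sample() []byte {
+    data := []byte("KDI\x01")
+    data = binary.LittleEndian.AppendUint64(data, 3)   // count
+    data = binary.LittleEndian.AppendUint64(data, 100) // first value
+    data = binary.AppendUvarint(data, 5)               // 105
+    data = binary.AppendUvarint(data, 300)             // 405
+    return data
+}
+
+func main() {
+    fmt.Println(decodeKdi(sample())) // [100 105 405] <nil>
+}
+```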
+
+## Key API
+
+- `NewKdiReader(path)` → opens `.kdi` for streaming (no index).
+- `NewKdiIndexedReader(path)` → opens with optional `.kdx` for random access.
+- `Next()` → returns `(nextKmer, true)` or `(0, false)` when exhausted.
+- `SeekTo(target uint64) error` → jumps to first k-mer ≥ target using index (no backward seek).
+- `Count()` / `Remaining()` → total and unread k-mers.
+- `Close()` → releases file handle.
+
+## Design Highlights
+
+- Uses 64 KB buffer for efficient I/O.
+- Index entries record `(kmer, byteOffset)` at fixed strides (e.g., every 1024 k-mers).
+- `SeekTo` is idempotent and safe: no-op if target ≤ current position or index unavailable.
+- Designed for large-scale genomic k-mer catalogs (e.g., from minimizers or de Bruijn graphs).
@@ -0,0 +1,34 @@
+# KDI File Format and API
+
+The `obikmer` package implements a compact, sorted k-mer storage format (`.kdi`) with delta compression for efficient disk persistence and retrieval.
+
+## Core Features
+
+- **Sorted k-mer serialization**: K-mers (as `uint64`) are written in ascending order.
+- **Delta encoding**: Consecutive differences (deltas) between k-mers are stored using variable-length integers (`varint`), drastically reducing size for dense sequences.
+- **Round-trip integrity**: Full write/read cycles preserve exact k-mer values and counts.
+
+## File Structure
+
+A `.kdi` file contains:
+1. **Magic header** (4 bytes): Identifies the format.
+2. **Count field** (8 bytes, `uint64`): Number of stored k-mers.
+3. **First value** (8 bytes, `uint64`): Initial k-mer.
+4. **Delta-encoded tail**: `(n−1)` deltas, each encoded as a `varint`.
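+
+The four fields above can be sketched as a small encoder. Assumptions of this sketch: the magic bytes are `KDI\x01`, integers are little-endian, and deltas use the standard library's unsigned varint, which may not match the package's own varint exactly. `encodeKdi` and `evens` are hypothetical helpers.
+
+```go
+package main
+
+import (
+    "encoding/binary"
+    "fmt"
+)
+
+// encodeKdi serializes strictly increasing values as
+// magic | count | first value | varint deltas.
+func encodeKdi(values []uint64) []byte {
+    out := []byte("KDI\x01")
+    out = binary.LittleEndian.AppendUint64(out, uint64(len(values)))
+    for i, v := range values {
+        if i == 0 {
+            out = binary.LittleEndian.AppendUint64(out, v) // absolute value
+            continue
+        }
+        out = binary.AppendUvarint(out, v-values[i-1]) // delta from previous
+    }
+    return out
+}
+
+// evens returns the first n even numbers: 0, 2, 4, ...
+func evens(n int) []uint64 {
+    out := make([]uint64, n)
+    for i := range out {
+        out[i] = uint64(2 * i)
+    }
+    return out
+}
+
+func main() {
+    fmt.Println(len(encodeKdi([]uint64{42})))  // 4 + 8 + 8 = 20
+    fmt.Println(len(encodeKdi(evens(10000)))) // 20 + 9999 one-byte deltas = 10019
+}
+```
+
+Small deltas fit in a single byte, which is why dense sorted sets compress well under this scheme.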
+
+## API
+
+- **`NewKdiWriter(path string)`**: Creates a writer; `Write(v uint64)` appends k-mers.
+- **`Writer.Count()`**: Returns the number of written items before closing.
+- **`NewKdiReader(path string)`**: Opens a reader; `Next() (uint64, bool)` yields k-mers in order.
+- **`Reader.Count()`**: Returns total stored count.
+
+## Tests Validate
+
+1. Basic round-trip with diverse values (including large `uint64`s).
+2. Empty and single-k-mer files.
+3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
+4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
+5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
+
+The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.
@@ -0,0 +1,38 @@
+# KDI File Format and Writer
+
+The `obikmer` package implements a compact, sorted sequence storage format for 64-bit k-mers using delta encoding and sparse indexing.
+
+## Core Format (`.kdi`)
+
+- **Magic header**: `KDI\x01` (4 bytes) identifies the file type.
+- **Count field**: `uint64 LE`, total number of k-mers (patched at close).
+- **First value**: `uint64 LE`, the initial k-mer stored as an absolute integer.
+- **Deltas**: Subsequent values encoded via *delta-varint* (difference from previous k-mer), enabling high compression for sorted sequences.
+
+## Writer (`KdiWriter`)
+
+- **Strict ordering**: K-mers must be written in *strictly increasing order*.
+- Efficient buffering via `bufio.Writer` (64 KB buffer).
+- Internally tracks:
+  - Current k-mer count,
+  - Previous value (for delta computation),
+  - Bytes written in data section.
+- **Sparse indexing**: Every `defaultKdxStride` k-mers, an entry is recorded in memory for random access.
+
+## Companion Index (`.kdx`)
+
+- Written automatically on `Close()` if indexing entries exist.
+- Stores `(kmer, file_offset)` pairs for fast seek-to-position lookups (e.g., binary search on k-mer range).
+- Enables efficient random access without full file scan.
+
+## Usage Pattern
+
+```go
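+w, err := obikmer.NewKdiWriter("data.kdi")
+if err != nil {
+    log.Fatal(err) // requires the standard "log" package
+}
+for _, kmer := range sortedKMers { // must be strictly increasing
+    w.Write(kmer)
+}
+w.Close() // finalizes header, writes .kdx index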
+```
+
+The format is optimized for memory-efficient storage and fast retrieval of sorted uint64 k-mers in genomic or sequence analysis pipelines.
@@ -0,0 +1,29 @@
+# KDX Index Format and Functionality
+
+The `obikmer` package provides a sparse indexing mechanism for `.kdi` files, which store sorted k-mers with delta encoding. The **`.kdx` file** serves as a fast lookup table to accelerate k-mer searches.
+
+## Core Concepts
+
+- **Magic bytes**: `KDX\x01` identifies the file type.
+- **Stride-based sparsity**: One index entry every *N* k-mers (default: 4096), balancing memory vs. search speed.
+- **Entry structure**: Each entry stores:
+  - `kmer`: the k-mer value at that index position.
+  - `offset`: absolute byte offset in the corresponding `.kdi` file.
+
+## Key Operations
+
+- **Loading**: `LoadKdxIndex()` reads and validates a `.kdx` file; returns `(nil, nil)` if missing (graceful degradation).
+- **Searching**: `FindOffset(target uint64)` performs binary search over index entries to find the *best jump point*:
+  - Returns `offset`, `skipCount` (k-mer count already passed), and a boolean success flag.
+  - Enables efficient seeking: after `offset`, only up to *stride* k-mers need linear scanning.
+- **Writing**: `WriteKdxIndex()` serializes an in-memory index to disk (for building indexes).
+- **Helper**: `KdxPathForKdi()` derives the `.kdx` path from a given `.kdi` file.
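+
+The lookup described for `FindOffset` can be sketched with `sort.Search`. This is an illustration under stated assumptions: entry `i` is taken to describe k-mer number `i × stride`, and the `index`, `entry`, and `findOffset` names are hypothetical rather than the package's types.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// entry mirrors the (kmer, offset) pair stored in the index.
+type entry struct {
+    kmer   uint64
+    offset uint64
+}
+
+type index struct {
+    stride  int
+    entries []entry // assumed: entries[i] describes k-mer number i*stride
+}
+
+// findOffset returns the byte offset of the last indexed k-mer <= target and
+// the number of k-mers that precede it. ok is false when no entry qualifies.
+func (ix index) findOffset(target uint64) (offset uint64, skip int, ok bool) {
+    // Binary search for the first entry strictly greater than target.
+    i := sort.Search(len(ix.entries), func(i int) bool {
+        return ix.entries[i].kmer > target
+    })
+    if i == 0 {
+        return 0, 0, false
+    }
+    return ix.entries[i-1].offset, (i - 1) * ix.stride, true
+}
+
+func main() {
+    ix := index{stride: 4096, entries: []entry{
+        {kmer: 10, offset: 20},
+        {kmer: 5000, offset: 9000},
+        {kmer: 90000, offset: 17000},
+    }}
+    fmt.Println(ix.findOffset(6000)) // 9000 4096 true
+    fmt.Println(ix.findOffset(5))    // 0 0 false
+}
+```
+
+After jumping to the returned offset, a reader scans forward linearly, which takes at most one stride of k-mers.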
+
+## Performance
+
+- Search complexity: **O(log M)** for the binary search (where *M* = #index entries), plus ≤ stride linear steps.
+- Memory footprint: Linear in index size (16 bytes per entry), highly scalable for large k-mer sets.
+
+## Design Philosophy
+
+Minimalist, binary-safe format with explicit endianness (little-endian), no external dependencies beyond `encoding/binary`, and robust error handling.
@@ -0,0 +1,14 @@
+# Semantic Description of `obikmer` Package
+
+The `obikmer` package implements efficient k-mer matching between query sequences and an indexed reference using **canonical k-mers** partitioned by minimizer-based hashing.
+
+- `QueryEntry` represents a single canonical k‑mer with its origin: sequence index and 1-based position.
+- `PreparedQueries` groups queries into sorted buckets per partition, enabling batched and parallelized matching.
+- `PrepareQueries` scans input sequences using *super-kmers* (with minimizer size `m`) to compute minimizers, assigns each k‑mer to a partition via modulo hashing, and sorts buckets by k‑mer value.
+- `MergeQueries` combines two sets of prepared queries across batches using a merge-sort strategy, correctly offsetting sequence indices to preserve global ordering.
+- `MatchBatch` performs parallel matching per partition: each goroutine runs a **merge-scan** between sorted queries and the corresponding KDI (K-mer Disk Index) stream.
+  - Efficient seeking is used only when beneficial, avoiding costly syscalls for small skips.
+  - Matches are recorded with thread-safe per-sequence mutexes; final positions within each sequence are sorted post-match.
+- `matchPartition` implements the core merge-scan: it opens a KDI reader, seeks to relevant regions of the index, and walks both query list and k‑mer stream in lockstep.
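+
+The core merge-scan can be sketched on two sorted slices. This simplified version omits seeking, partitions, and locking; the name `mergeScan` and its signature are hypothetical.
+
+```go
+package main
+
+import "fmt"
+
+// mergeScan walks a sorted query list and a sorted k-mer stream in lockstep
+// and returns the indices of the queries found in the stream.
+func mergeScan(queries, stream []uint64) []int {
+    var hits []int
+    qi, si := 0, 0
+    for qi < len(queries) && si < len(stream) {
+        switch {
+        case queries[qi] < stream[si]:
+            qi++ // query absent from the index
+        case queries[qi] > stream[si]:
+            si++ // advance the stream; an indexed reader could seek here
+        default:
+            hits = append(hits, qi)
+            qi++ // keep si: the next query may hold the same k-mer
+        }
+    }
+    return hits
+}
+
+func main() {
+    queries := []uint64{2, 5, 5, 9, 12}
+    stream := []uint64{1, 5, 7, 9, 10}
+    fmt.Println(mergeScan(queries, stream)) // [1 2 3]
+}
+```
+
+Both inputs are consumed once, so the scan costs time linear in the combined length of the query bucket and the stream portion that is read.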
+
+The design supports **large-scale batch processing**, incremental query accumulation, and high-performance parallel lookup—ideal for metagenomic or biodiversity sequencing workflows.
@@ -0,0 +1,49 @@
+# `obikmer` K-mer Set Group Builder — Functional Overview
+
+The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).
+
+## Core Features
+
+- **K-mer & Minimizer Configuration**:  
+  Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).
+
+- **Functional Options for Filtering**:  
+  - `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).  
+  - `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.  
+  - `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).  
+  - `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.
+
+- **Concurrent & Pipeline-Aware Processing**:  
+  Uses a two-stage pipeline: *I/O-bound readers* (2–4 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.
+
+- **Partitioned I/O & Thread Safety**:  
+  Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
+
+## Workflow
+
+1. **Build Phase**:  
+   - Input sequences → super-kmers extracted via minimizer-based partitioning.  
+   - Super-kmers written to `.build/set_*/part_*.skm`.
+
+2. **Finalization (`Close()`)**:  
+   - `.skm` files loaded → canonical k-mers extracted.  
+   - K-mers sorted, counted (frequency spectrum), and filtered per config.  
+   - Final `.kdi` files are written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.  
+   - Metadata (`metadata.toml`) generated; `.build/` cleaned.
+
+3. **Append Mode**:  
+   `AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
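+
+The sort, count, and filter step of finalization can be illustrated on an in-memory slice. This is a sketch, not the builder's implementation: `countAndFilter` is a hypothetical helper, and treating `maxFreq <= 0` as "no upper bound" is an assumption made here.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// countAndFilter sorts k-mers, collapses duplicates, and keeps those whose
+// frequency lies in [minFreq, maxFreq]; maxFreq <= 0 disables the upper bound.
+// It also returns the frequency spectrum (count -> number of distinct k-mers).
+func countAndFilter(kmers []uint64, minFreq, maxFreq int) ([]uint64, map[int]int) {
+    sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
+    spectrum := make(map[int]int)
+    var kept []uint64
+    for i := 0; i < len(kmers); {
+        j := i
+        for j < len(kmers) && kmers[j] == kmers[i] {
+            j++ // extend the run of identical k-mers
+        }
+        freq := j - i
+        spectrum[freq]++
+        if freq >= minFreq && (maxFreq <= 0 || freq <= maxFreq) {
+            kept = append(kept, kmers[i])
+        }
+        i = j
+    }
+    return kept, spectrum
+}
+
+func main() {
+    kept, spectrum := countAndFilter([]uint64{7, 3, 7, 1, 3, 7, 9}, 2, 0)
+    fmt.Println(kept, spectrum) // [3 7] map[1:2 2:1 3:1]
+}
+```
+
+The kept slice is already sorted and deduplicated, which is the order a `.kdi` writer requires.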
+
+## Output Artifacts
+
+- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.  
+- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).  
+- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.  
+- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).
+
+## Design Highlights
+
+- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.  
+- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.  
+- **Robust error handling**: Early termination on first failure; cleanup of partial state.
+
@@ -0,0 +1,44 @@
+# K-mer Set Group Builder — Semantic Description
+
+This Go module (`obikmer`) provides a **disk-backed builder and accessor** for managing *k-mer sets* across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical *k*-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
+
+### Core Functionalities
+
+- **K-mer Set Group Construction**:  
+  `NewKmerSetGroupBuilder` creates a builder configured with:
+  - *k* (k-mer length),
+  - *m* (minimizer length used for partitioning),
+  - number of sets (`nSets`),
+  - and optional parameters like `WithMinFrequency`.
+
+- **Sequence Ingestion**:  
+  Sequences are added per set via `AddSequence(setID, bioseq)`. Internally:
+  - Canonical *k*-mers are extracted (using `IterCanonicalKmers`),
+  - Deduplicated and optionally filtered by occurrence frequency.
+
+- **Persistence & Round-Trip**:  
+  `builder.Close()` materializes the *k*-mer sets to disk (in temp or specified directory).  
+  `OpenKmerSetGroup(dir)` reloads them — preserving all metadata and structure.
+
+- **Metadata & Attributes**:  
+  Supports custom identifiers (`SetId`) and key-value attributes (e.g., `"organism": "test"`), saved to disk via `SaveMetadata`.
+
+- **Efficient Iteration**:  
+  The iterator (`ksg.Iterator(setID)`) yields *sorted*, deduplicated canonical *k*-mers — using a k-way merge across internal partitions.
+
+- **Frequency Filtering**:  
+  `WithMinFrequency(n)` ensures only *k*-mers appearing ≥*n* times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering).
+
+- **Multi-set Support**:  
+  Handles multiple independent *k*-mer sets (e.g., per sample or taxonomic group), verified via `Size()` and indexed access (`Len(setID)`).
+
+### Testing Coverage
+
+Comprehensive unit tests validate:
+- Basic construction & correctness,
+- Multi-sequence ingestion and deduplication,
+- Frequency-based inclusion/exclusion logic,
+- Cross-set isolation (`nSets > 1`),
+- Metadata round-trip integrity.
+
+This module is designed for scalable, reproducible *k*-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).