⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "Release v4.5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,48 @@
# `neural-ensemble` — A Lightweight Library for Modular Neural Ensemble Learning
The `neural-ensemble` package provides tools to build, train, evaluate, and deploy ensembles of neural networks with minimal boilerplate. It emphasizes modularity, reproducibility, and scalability—supporting both homogeneous (e.g., multiple ResNets) and heterogeneous ensembles (mix of CNNs, Transformers, MLPs)—while offering unified interfaces for data handling, training orchestration, and uncertainty quantification.
## Core Functionalities
### 1. **Model Composition**
- `Ensemble`: A container class to manage multiple models (heterogeneous or homogeneous), supporting dynamic model registration, weighted averaging, voting, and stacking.
- `ModelConfig`: A dataclass to declaratively specify model architecture (e.g., backbone, input shape), training hyperparameters, and checkpoint paths.
### 2. **Training & Orchestration**
- `EnsembleTrainer`: Handles distributed or sequential training of ensemble members, with support for early stopping, learning rate scheduling per member, and custom loss weighting.
- `TrainerCallback`: Abstract base for implementing logging, checkpointing, or metric tracking hooks.
### 3. **Data Handling**
- `EnsembleDataset`: Wraps any PyTorch-compatible dataset and automatically replicates inputs across all ensemble members (with optional per-member augmentation).
- `EnsembleDataModule`: Lightning-compatible data module for seamless integration with PyTorch Lightning workflows.
### 4. **Inference & Aggregation**
- `EnsemblePredictor`: Provides `.predict()` and `.forward_ensemble()`, supporting:
- *Hard/soft voting* (classification)
- *Mean/variance aggregation* (regression)
- *Monte Carlo dropout & deep ensembles* for uncertainty estimation
- `UncertaintyMetrics`: Computes ECE, NLL, Brier score, and predictive entropy.
### 5. **Evaluation & Calibration**
- `EnsembleEvaluator`: Runs comprehensive evaluation across members and the ensemble, reporting per-member vs. aggregate metrics.
- `CalibrationWrapper`: Applies temperature scaling or isotonic regression to calibrate ensemble outputs.
### 6. **Serialization & Deployment**
- `Ensemble.save()` / `.load()`: Persists full ensemble state (weights, configs) to disk.
- `Ensemble.to_torchscript()`: Exports the ensemble for production inference (e.g., via TorchServe or ONNX).
## Key Design Principles
- **Minimal dependencies**: Built on top of PyTorch, with optional integrations (Lightning, HuggingFace).
- **No hidden state**: All ensemble behavior is controlled via explicit configuration.
- **Extensible hooks**: Custom aggregation rules, losses, or metrics can be injected via inheritance.
## Example Workflow
```python
ensemble = Ensemble([
ModelConfig(backbone="resnet18", input_shape=(3, 224, 224)),
ModelConfig(backbone="vit_b_16", input_shape=(3, 224, 224)),
])
trainer = EnsembleTrainer(ensemble=ensemble)
trainer.fit(train_loader, val_loader)
preds, uncertainties = EnsemblePredictor(ensemble).predict(test_loader, return_uncertainty=True)
```
@@ -0,0 +1,22 @@
# `obialign` Package: Sequence Alignment Utilities
The `obialign` package provides core functions for pairwise biological sequence alignment in Go, designed to work with `obiseq.BioSequence` objects.
- **Core Alignment Construction**: `_BuildAlignment()` and `BuildAlignment()` reconstruct aligned sequences from a precomputed alignment path (e.g., the output of dynamic programming). They support gap characters and reuse buffers for efficiency.
- **Quality-Aware Consensus Building**: `BuildQualityConsensus()` generates a consensus sequence from an alignment and per-base quality scores:
- At mismatches, it retains the higher-quality base.
- When qualities are equal and bases differ, an IUPAC ambiguity code is used (via `_FourBitsBaseCode`/`_Decode`).
- Quality values are combined and adjusted for mismatches using a Phred-like error probability model.
- Optionally records mismatch statistics in sequence attributes.
- **Performance & Memory Efficiency**: Uses preallocated buffers (via `PEAlignArena`) or fallback allocation, with slice recycling to minimize GC pressure.
- **Metadata Handling**: Preserves sequence IDs and definitions in output; supports optional mismatch reporting for downstream analysis.
- **Alignment Path Format**: The path is a sequence of signed integers encoding:
- Negative steps → deletions in seqB (insertion in A),
- Positive steps → insertions in B,
- Consecutive pairs encode match/mismatch runs.
This package is part of the OBITools4 ecosystem, targeting high-throughput amplicon or metagenomic data processing.
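The mismatch rule above can be condensed into a small base-picking helper. This is an illustrative sketch, not the package's code: `fourBits`, `decode`, and `consensusBase` are hypothetical names, and the Phred quality-combination step is omitted.

```go
package main

import "fmt"

// fourBits maps a nucleotide to a 4-bit mask (A=1, C=2, G=4, T=8),
// a stand-in for the package's _FourBitsBaseCode table.
func fourBits(b byte) uint8 {
	switch b {
	case 'a', 'A':
		return 1
	case 'c', 'C':
		return 2
	case 'g', 'G':
		return 4
	case 't', 'T':
		return 8
	}
	return 0
}

// decode maps a 4-bit mask back to an IUPAC symbol (partial table,
// enough for this demo).
var decode = map[uint8]byte{
	1: 'a', 2: 'c', 4: 'g', 8: 't',
	1 | 4: 'r', // A or G
	2 | 8: 'y', // C or T
	15:    'n', // any base
}

// consensusBase applies the mismatch rule: keep the higher-quality
// base; on a quality tie, emit the IUPAC code covering both bases.
func consensusBase(a, b byte, qa, qb int) byte {
	switch {
	case a == b, qa > qb:
		return a
	case qb > qa:
		return b
	}
	return decode[fourBits(a)|fourBits(b)]
}

func main() {
	fmt.Printf("%c %c\n", consensusBase('a', 'g', 30, 20), consensusBase('a', 'g', 25, 25))
}
```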
@@ -0,0 +1,30 @@
# Semantic Description of `obialign` Backtracking Module
The `_Backtracking` function implements a **traceback algorithm** for sequence alignment, reconstructing the optimal path through an alignment matrix.
## Core Functionality
- **Input**:
- `pathMatrix`: Encodes alignment decisions (match/mismatch/gap) as integers.
- `lseqA`, `lseqB`: Lengths of sequences A and B.
- `path`: Pre-allocated slice to store the traceback path.
- **Output**: A compact representation of alignment steps, alternating between:
- Diagonal moves (`ldiag`): Matches/mismatches (one step in both sequences).
- Horizontal/vertical moves (`lleft` or `lup`): Gaps in sequence B (horizontal) or A (vertical).
## Algorithm Highlights
- **Reverse traversal** from `(lseqA-1, lseqB-1)` to the origin.
- **Batching logic**: Consecutive gaps in same direction are aggregated (e.g., `lleft += step`) to compress run-length encoding.
- **Path reconstruction**: Steps are pushed *backwards* into the `path` slice using a moving pointer `p`.
- **Memory efficiency**: Uses `slices.Grow()` to preallocate space and logs resizing for debugging.
## Encoded Path Semantics
Each pair in the returned slice encodes:
- `[diag_count, move_type]`, where `move_type` is either a gap length (`lleft > 0`: horizontal, or `lup < 0`: vertical) or zero (end of diagonal run).
## Use Case
Enables efficient reconstruction and serialization of alignment paths—ideal for tools requiring low-level control over dynamic programming backtracking (e.g., pairwise aligners, edit-distance decompositions).
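The pair encoding above can be made concrete with a small decoder. This is a sketch under an assumed sign convention (positive gap step = horizontal, negative = vertical); `expandPath` is not part of the package's API.

```go
package main

import "fmt"

// expandPath flattens the compact (diagonalRun, gapStep) pairs into a
// CIGAR-like string: 'M' per diagonal step, 'I' per horizontal gap
// step, 'D' per vertical gap step. The sign convention is assumed.
func expandPath(path []int) string {
	var ops []byte
	for i := 0; i+1 < len(path); i += 2 {
		for d := 0; d < path[i]; d++ {
			ops = append(ops, 'M')
		}
		for step := path[i+1]; step > 0; step-- {
			ops = append(ops, 'I')
		}
		for step := path[i+1]; step < 0; step++ {
			ops = append(ops, 'D')
		}
	}
	return string(ops)
}

func main() {
	// 3 diagonal steps, one horizontal gap, 2 diagonal steps, terminator 0.
	fmt.Println(expandPath([]int{3, 1, 2, 0})) // MMMIMM
}
```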
@@ -0,0 +1,26 @@
# Semantic Description of `obialign` Package
This Go package provides core utilities for **DNA sequence alignment scoring**, leveraging probabilistic models and log-space computations to ensure numerical stability.
## Key Functionalities
- **Four-bit nucleotide encoding**: Uses `_FourBitsBaseCode` (implied but not shown) to encode DNA bases as 4-bit values, enabling bitwise operations for fast comparison.
- **Bitwise match ratio (`_MatchRatio`)**: Computes a normalized overlap score between two encoded bases by counting shared bits, adjusting for presence/absence in each operand.
- **Log-space arithmetic helpers**:
- `_Logaddexp`: Stable computation of `log(exp(a) + exp(b))`.
- `_Log1mexp`, `_Logdiffexp`: Accurate log-domain operations for `log(1 - exp(a))` and `log(exp(a) - exp(b))`, critical for probability transformations.
- **Match/mismatch scoring (`_MatchScoreRatio`)**:
- Derives log-probability-based scores for observed matches/mismatches using Phred-quality inputs (`QF`, `QR`).
- Incorporates base-composition priors (e.g., a uniform four-base assumption via `log(3)`, `log(4)`).
- **Precomputed scoring matrices**:
- `_NucPartMatch`: Precomputes match ratios for all base-pair combinations.
- `_NucScorePartMatch{Match,Mismatch}`: Stores integer-scaled alignment scores (×10) for all Phred-quality pairs, enabling fast lookup during dynamic programming.
- **Thread-safe initialization**:
- `_InitDNAScoreMatrix` ensures one-time setup of all matrices using a mutex guard, preventing race conditions.
All computations are designed for high performance and numerical robustness in large-scale sequence alignment tasks.
@@ -0,0 +1,23 @@
# Semantic Description of `obialign` Package
The `obialign` package provides low-level utilities for efficiently encoding, decoding, and manipulating alignment-related metrics—specifically **score**, **path length**, and an **out-flag**—within compact 64-bit integers. This design supports high-performance operations in sequence alignment pipelines (e.g., OBITools4).
- **Core Encoding Strategy**:
A `uint64` encodes three fields: a *score* (upper bits), an inverted path *length*, and a single-bit flag indicating whether the value represents an "out" (i.e., terminal/invalid) state.
- **`encodeValues(score, length int, out bool)`**:
Packs `score`, `-length-1` (to preserve ordering under unsigned comparison), and the `out` flag into one integer. Bit 32 marks out-values.
- **`decodeValues(value uint64)`**:
Reverses encoding: extracts score, reconstructs original length via `((value + 1) ^ mask)`, and checks the out-flag.
- **Utility Bitwise Helpers**:
- `_incpath(value)`: decrements the stored length field (the length is stored negated, so this increments the actual path length).
- `_incscore(value)`: increments score by `1 << wsize`.
- `_setout(value)`: clears the out-flag, marking value as *not* terminal.
- **Predefined Constants**:
- `_empty`: neutral state (score=0, length=0).
- `_out`/`_notavail`: sentinel values for invalid or unavailable paths (high length, score=0).
This compact representation enables fast comparisons and updates during dynamic programming or alignment graph traversal—critical for scalability in large-scale metabarcoding analyses.
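A minimal sketch of the packing scheme, assuming a 32-bit length field, the out-flag at bit 32, and the score above it; the real field widths and constants are internal to the package.

```go
package main

import "fmt"

// Assumed layout for this sketch:
//   bits 0..31 : ^length (i.e. -length-1 truncated to 32 bits)
//   bit  32    : out-flag
//   bits 33..63: score
const (
	wsize   = 33
	outBit  = uint64(1) << 32
	lenMask = uint64(1)<<32 - 1
)

// encodeValues packs score, negated length, and the out flag into one word.
func encodeValues(score, length int, out bool) uint64 {
	v := uint64(score)<<wsize | (uint64(^length) & lenMask)
	if out {
		v |= outBit
	}
	return v
}

// decodeValues reverses encodeValues.
func decodeValues(v uint64) (score, length int, out bool) {
	return int(v >> wsize), int((v ^ lenMask) & lenMask), v&outBit != 0
}

// incpath adds one step to the path: the length is stored negated, so
// subtracting 1 from the word increments the actual length (valid as
// long as the low field does not underflow).
func incpath(v uint64) uint64 { return v - 1 }

// incscore adds one unit to the score field above the low wsize bits.
func incscore(v uint64) uint64 { return v + uint64(1)<<wsize }

func main() {
	v := encodeValues(7, 42, false)
	fmt.Println(decodeValues(v))           // 7 42 false
	fmt.Println(decodeValues(incpath(v)))  // 7 43 false
	fmt.Println(decodeValues(incscore(v))) // 8 42 false
}
```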
@@ -0,0 +1,42 @@
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
## Core Algorithm
- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
## Scoring Scheme
- **Match**: +1 point
- **Mismatch or gap (indel)**: 0 points
## Key Functions
1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`
- Computes LCS score and alignment length between raw byte sequences.
- If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).
- Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.
- Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
- Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.
- Designed for standard biosequence inputs.
3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
- Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
## Features
- **Error-bounded**: Supports `maxError = -1` (unlimited) or a fixed max number of mismatches + gaps.
- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
## Use Cases
- Molecular barcode/UMI clustering
- Read-to-reference alignment in amplicon sequencing
- Similarity filtering of biological sequences
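As a reference for the scoring scheme (match +1, mismatch/gap 0), here is a plain, unbanded LCS recurrence. The package's `FastLCS*` functions compute the same score with diagonal banding and bit-packed cells, so this sketch is only the naïve baseline.

```go
package main

import "fmt"

// lcsLen computes the longest-common-subsequence length with the
// classic O(n·m) recurrence, using two rolling rows.
func lcsLen(a, b []byte) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			if a[i-1] == b[j-1] {
				cur[j] = prev[j-1] + 1 // match: +1
			} else if prev[j] > cur[j-1] {
				cur[j] = prev[j] // gap in b: 0
			} else {
				cur[j] = cur[j-1] // gap in a: 0
			}
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(lcsLen([]byte("acgtacgt"), []byte("acttact"))) // 6 ("actact")
}
```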
@@ -0,0 +1,15 @@
# Semantic Description of `obialign` Package
The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
- **Binary encoding scheme**: Each bit in a byte corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).
- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
- **Memory efficiency**: The encoding avoids allocations via optional buffer reuse.
- **Lookup tables**:
- `_FourBitsBaseCode`: Maps ASCII nucleotide characters (lowercased via `nuc & 31`) to their binary code.
- `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
The `Encode4bits` function enables fast, space-efficient sequence processing—ideal for high-throughput sequencing data where alignment speed and memory usage are critical.
@@ -0,0 +1,19 @@
## `obialign` Package: Semantic Overview
The `obialign` package provides a lightweight, high-performance utility for **detecting single-edit-distance relationships** between biological sequences (`obiseq.BioSequence`). Its core function, `D1Or0`, determines whether two sequences are either **identical** or differ by exactly **one substitution, insertion, or deletion (indel)**.
- `abs[k]`: A generic helper computing absolute values for integers or floats (via Go generics).
- `D1Or0(...)`: Returns a 4-tuple:
- **`int` (first)**: `0` if identical, `1` if differing by one edit, `-1` otherwise.
- **`int` (second)**: Position of the differing site (`-1` if identical).
- **`byte`, `byte`**: Mismatched characters (or `'-'` for gaps indicating indels).
**Algorithmic strategy:**
1. Early rejection if length difference exceeds 1.
2. Forward scan until first mismatch → identifies left boundary of divergence.
3. Backward scan from ends to find rightmost match boundary.
4. Validates whether the mismatch region allows exactly one edit:
- Single substitution: equal lengths, single divergent position.
- Insertion/deletion: length differs by 1 and only one non-overlapping character remains.
Designed for speed in **OTU/ASV dereplication or error correction** pipelines (e.g., metabarcoding), where rapid filtering of near-identical sequences is critical. Does *not* compute full alignments; optimized for binary decision-making under strict edit constraints.
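The scan strategy can be condensed into a self-contained checker. This sketch returns only the 0/1/-1 verdict; the real `D1Or0` additionally reports the divergent position and characters.

```go
package main

import "fmt"

// editDistanceIsAtMostOne follows the strategy described above:
// forward scan to the first mismatch, backward scan to the last one,
// then check whether the remaining window is a single substitution or
// a single indel. Returns 0 (identical), 1 (one edit), or -1 (more).
func editDistanceIsAtMostOne(a, b []byte) int {
	if len(a) < len(b) {
		a, b = b, a
	}
	if len(a)-len(b) > 1 {
		return -1 // early rejection on length difference
	}
	i := 0
	for i < len(b) && a[i] == b[i] {
		i++
	}
	ja, jb := len(a), len(b)
	for ja > i && jb > i && a[ja-1] == b[jb-1] {
		ja--
		jb--
	}
	switch {
	case ja == i && jb == i:
		return 0 // identical
	case len(a) == len(b) && ja == i+1 && jb == i+1:
		return 1 // one substitution
	case len(a) == len(b)+1 && ja == i+1 && jb == i:
		return 1 // one insertion/deletion
	}
	return -1
}

func main() {
	fmt.Println(editDistanceIsAtMostOne([]byte("acgt"), []byte("aggt"))) // 1
	fmt.Println(editDistanceIsAtMostOne([]byte("acgt"), []byte("agt")))  // 1
	fmt.Println(editDistanceIsAtMostOne([]byte("acgt"), []byte("tcga"))) // -1
}
```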
@@ -0,0 +1,29 @@
# `LocatePattern` Functionality Overview
The `obialign.LocatePattern` function implements a **local alignment algorithm** to find the best approximate match of a short DNA pattern (e.g., primer) within a longer biological sequence, using **dynamic programming**.
- **Input**:
- `id`: identifier for logging/error reporting.
- `pattern []byte`: the query sequence (e.g., primer).
- `sequence []byte`: the target read/contig.
- **Constraints**:
- Pattern must be strictly shorter than the sequence (`len(pattern) < len(sequence)`).
- **Scoring Scheme**:
- Match: `+0` (using IUPAC compatibility via `obiseq.SameIUPACNuc`).
- Mismatch/Gap: `-1`.
- **Algorithm Features**:
- End-gap free alignment (no penalty for gaps at sequence ends), enabling flexible primer positioning.
- Uses a flattened buffer (`buffIndex`) for memory-efficient matrix storage (width × height).
- Tracks alignment path via `path` array: diagonal (`0`, match/mismatch), up (`+1`, deletion in pattern/left gap), left (`-1`, insertion/deletion).
- Backtracks from the bottom-right to find optimal local alignment start/end coordinates.
- **Output**:
- `start`: starting index in `sequence`.
- `end+1`: ending index (exclusive) of best match.
- Error count: `-score`, i.e., number of mismatches/gaps in alignment.
- **Use Case**:
Designed for high-throughput amplicon processing (e.g., primer trimming in metabarcoding pipelines like OBITools4).
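The end-gap-free scheme can be sketched as a semi-global DP: row 0 is free (the pattern may start anywhere in the sequence), and the best cell on the last row ends the match. The start-tracking trick and all names are illustrative, not the function's internals.

```go
package main

import "fmt"

// locate aligns the whole pattern inside sequence with match = 0 and
// mismatch/gap = -1, free end gaps on the sequence side. It returns
// the best (start, end, errors) window.
func locate(pattern, sequence []byte) (start, end, errors int) {
	m, n := len(pattern), len(sequence)
	prev := make([]int, n+1)     // row 0: zeros (free leading gap)
	prevFrom := make([]int, n+1) // origin column of each cell's path
	for j := 0; j <= n; j++ {
		prevFrom[j] = j
	}
	for i := 1; i <= m; i++ {
		cur := make([]int, n+1)
		curFrom := make([]int, n+1)
		cur[0] = -i // gaps facing pattern[:i] are penalized
		for j := 1; j <= n; j++ {
			d := prev[j-1]
			if pattern[i-1] != sequence[j-1] {
				d-- // mismatch
			}
			best, src := d, prevFrom[j-1]
			if up := prev[j] - 1; up > best {
				best, src = up, prevFrom[j]
			}
			if left := cur[j-1] - 1; left > best {
				best, src = left, curFrom[j-1]
			}
			cur[j], curFrom[j] = best, src
		}
		prev, prevFrom = cur, curFrom
	}
	best := -(m + n + 1)
	for j := 0; j <= n; j++ { // free trailing gap: best cell, last row
		if prev[j] > best {
			best, start, end = prev[j], prevFrom[j], j
		}
	}
	return start, end, -best
}

func main() {
	fmt.Println(locate([]byte("acg"), []byte("ttacgtt"))) // 2 5 0
}
```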
@@ -0,0 +1,37 @@
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
## Core Functionalities
### 1. **Memory Arena Management**
- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
### 2. **Dynamic Programming Alignment Functions**
Implements three specialized global alignment variants using Needleman-Wunsch with affine gap penalties (scaled per mismatch):
- **`PELeftAlign`**: Free gaps at the *start* of `seqB` and end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
- **`PERightAlign`**: Free gaps at start of `seqA` and end of `seqB`. Suited when the second read extends beyond the first.
- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA ≥ seqB`. Designed for full overlap scenarios (e.g., merging paired-end reads).
All use column-major matrix storage and efficient index arithmetic via helper functions `_GetMatrix`, `_SetMatrices`, etc.
### 3. **Scoring & Quality Integration**
- Pairwise base/quality scores computed by `_PairingScorePeAlign`, combining:
- Nucleotide compatibility (via precomputed `_NucPartMatch`)
- Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
- A user-defined `scale` factor to modulate mismatch penalties.
### 4. **Fast Heuristic Pre-Alignment**
The main `PEAlign` function integrates a kmer-based fast pre-screening:
- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
- If the overlap is imperfect (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
### 5. **Backtracking & Path Output**
- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
- Paths encoded as alternating `(offset, length)` pairs for aligned segments (diagonal = 0), with gaps encoded as `-1`/`+1`.
### Use Case
Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., OBITOOLS4 ecosystem). Efficient, scalable for large batch processing via arena reuse.
@@ -0,0 +1,58 @@
# Semantic Description of `obialign.ReadAlign`
The `ReadAlign` function performs **paired-end read alignment** with quality-aware scoring, optimized for overlapping consensus construction in NGS data processing.
## Core Functionality
- **Input**: Two biological sequences (`seqA`, `seqB`) as `BioSequence` objects, plus alignment parameters:
- `gap`: gap penalty (linear)
- `scale`: scaling factor for quality scores
- `delta`: extension buffer around initial overlap estimate
- `fastScoreRel`: use relative vs absolute k-mer matching score
## Algorithm Overview
1. **Preprocessing & Initialization**
- Ensures DNA scoring matrix is initialized (`_InitDNAScoreMatrix`).
2. **Fast Overlap Estimation via 4-mer Indexing**
- Builds a k-mer index of `seqA` using `obikmer.Index4mer`.
- Computes optimal shift via `_FastShiftFourMer` in both forward and reverse-complement orientations.
- Selects orientation (direct or reversed) yielding highest k-mer match count (`fastCount`) and score (`fastScore`).
3. **Overlap Computation**
- Determines overlap length `over` based on shift:
```text
over = |seqA| - shift          if shift > 0
       |seqB| + shift          if shift < 0
       min(|seqA|, |seqB|)     otherwise
```
4. **Dynamic Programming Alignment**
- If overlap is *not* identical (`fastCount + 3 < over`):
- Extracts subregions with `delta`-buffered boundaries.
- Calls either `_FillMatrixPeLeftAlign` (left-aligned case) or `_FillMatrixPERightAlign`.
- Backtracks via `_Backtracking` to produce alignment path.
- Else (near-perfect overlap):
- Skips DP; computes score directly from quality scores using `_NucScorePartMatchMatch`.
- Returns trivial path `[extra5, partLen]`.
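The overlap rule in step 3 maps directly onto a small helper; this is a literal transcription of the quoted formula, with the function name assumed.

```go
package main

import "fmt"

// overlap returns the predicted overlap length for a given shift of
// seqB relative to seqA, transcribing the three-case rule above.
func overlap(lenA, lenB, shift int) int {
	switch {
	case shift > 0:
		return lenA - shift
	case shift < 0:
		return lenB + shift
	}
	if lenA < lenB {
		return lenA
	}
	return lenB
}

func main() {
	fmt.Println(overlap(150, 150, 40))  // 110
	fmt.Println(overlap(150, 150, -40)) // 110
	fmt.Println(overlap(120, 150, 0))   // 120
}
```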
## Output
Returns:
| Index | Type | Meaning |
|-------|----------|---------|
| 0️⃣ | `int` | Final alignment score (weighted by quality) |
| 1️⃣ | `[]int` | Alignment path (list of positions: `[startA, endA, startB, endB]` or similar) |
| 2️⃣ | `int` | K-mer match count (`fastCount`) |
| 3️⃣ | `int` | Overlap length (`over`) |
| 4️⃣ | `float64` | K-mer-based score (`fastScore`) |
| 5️⃣ | `bool` | Whether alignment was performed in direct orientation (`true`) or on reverse-complement of `seqB` |
## Key Design Highlights
- **Efficient pre-filtering** using 4-mers avoids full DP for nearly identical reads.
- **Quality-aware scoring**, leveraging Phred scores via `_NucScorePartMatchMatch`.
- Supports **asymmetric overlaps** (left/right alignment) with boundary padding (`delta`).
- Uses preallocated memory arenas to minimize GC pressure in high-throughput pipelines.
@@ -0,0 +1,25 @@
# Apat Package: Pattern Matching for Biological Sequences
The `obiapat` Go package provides high-performance pattern matching over biological sequences using the **Apat algorithm**, a C-based implementation wrapped in Go. It supports fuzzy matching (with mismatches and indels), reverse-complement patterns, memory-safe resource management via finalizers, and efficient filtering of non-overlapping matches.
## Core Types
- `ApatPattern`: Represents a compiled pattern (up to 64 bp), supporting IUPAC ambiguity codes (`W`, `[AT]`), negated bases (`!A`), and fixed positions (`#`).
- `ApatSequence`: Wraps a biological sequence (from `obiseq.BioSequence`) for fast matching, with optional circular topology support and memory recycling.
## Key Functions & Methods
- `MakeApatPattern(pattern string, errormax int, allowsIndel bool)`: Compiles a pattern with max error tolerance and optional indels.
- `ReverseComplement()`: Returns the reverse-complemented pattern (useful for DNA strand symmetry).
- `FindAllIndex(...)`: Returns all matches as `[start, end, errors]`, supporting partial sequence searches.
- `IsMatching(...)`: Boolean check for presence of at least one match in a region.
- `BestMatch(...)`: Finds the *best* (lowest-error) match, with local realignment for indel-containing patterns.
- `FilterBestMatch(...)`: Returns *non-overlapping* matches, prioritizing lower-error occurrences.
- `AllMatches(...)`: Filters and refines all valid matches (including indel-aware alignment).
- `Free()`, `Len()`: Explicit memory cleanup and length queries.
## Implementation Notes
Internally, the package uses `cgo` to interface with C structures (`Pattern`, `Seq`) allocated via custom memory management. Finalizers ensure safe deallocation, while unsafe pointer arithmetic avoids data copying during search (e.g., `unsafe.SliceData`). Logging is integrated via Logrus.
This package enables scalable, low-level pattern mining in NGS data preprocessing pipelines (e.g., primer detection, adapter trimming).
@@ -0,0 +1,32 @@
# Semantic Description of `obiapat` Package Functionality
The `obiapat` package provides utilities for constructing and representing **approximate sequence patterns**—flexible biological or symbolic string templates supporting mismatches, insertions, and deletions.
## Core Functionality
- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool)`**
Parses a pattern specification (e.g., `"A[T]C!GT"`) and returns an internal representation (`*ApatPattern`) suitable for approximate matching.
- `pattern`: A string where:
- Standard characters (e.g., `'A'`, `'C'`) denote exact matches.
- Brackets `[XY]` define a character class (e.g., `[AT]` matches A or T), complementing IUPAC ambiguity codes.
- Exclamation `!` negates a base (e.g., `!A` matches any base except A).
- `errormax`: Maximum number of allowed errors (mismatches or indels, depending on flags).
- `allowsIndel`: Boolean flag enabling/disabling insertion/deletion operations.
## Behavior & Semantics
- Returns a compiled pattern object (non-nil) on success; errors may arise from malformed input or invalid parameters.
- Supports three modes:
- **Exact matching** (`errormax = 0`, `allowsIndel = false`).
- **Substitution-only approximation** (`errormax > 0`, `allowsIndel = false`).
- **Full approximate matching with indels** (`errormax > 0`, `allowsIndel = true`).
## Testing Coverage
The provided test suite validates:
- Valid pattern parsing across different configurations.
- Correct handling of `nil` vs. non-nil output pointers.
- Robustness against error conditions (e.g., invalid inputs would trigger expected errors).
In summary, `obiapat` enables efficient definition and handling of *approximate regular expressions* tailored for sequence analysis in bioinformatics or pattern recognition contexts.
@@ -0,0 +1,27 @@
# PCR Simulation Module (`obiapat`)
This Go package implements a **PCR (Polymerase Chain Reaction) simulation algorithm** for biological sequence analysis. It supports flexible primer matching, amplicon extraction with optional flanking extensions, and handles both linear and circular DNA topologies.
## Key Functionalities
- **Primer Matching**: Accepts forward/reverse primers with configurable mismatch tolerance (`OptionForwardPrimer`, `OptionReversePrimer`). Internally builds pattern objects and their reverse complements.
- **Amplicon Extraction**: Identifies valid amplicons bounded by primer pairs, respecting user-defined length constraints (`OptionMinLength`, `OptionMaxLength`).
- **Extension Support**: Optionally adds fixed-length flanking regions (`OptionWithExtension`) — either strict full-extension only or partial trimming allowed.
- **Topology Handling**: Supports linear (`Circular: false`) and circular DNA sequences via `OptionCircular`.
- **Batch & Parallel Processing**: Configurable batch size (`OptionBatchSize`) and parallel workers count (`OptionParallelWorkers`), enabling efficient processing of large datasets.
- **Annotation-Rich Output**: Each amplicon includes detailed annotations (primer sequences, match positions, errors, direction), preserving original sequence metadata.
## Core API
- `PCRSim(sequence, options...)`: Simulates PCR on a single sequence.
- `PCRSlice(sequencesSlice, options...)`: Applies simulation across multiple sequences in a slice.
- `PCRSliceWorker(options...)`: Returns a reusable worker function for parallel execution via `obiseq.MakeISliceWorker`.
## Implementation Details
- Uses pattern-matching (`ApatPattern`) with fuzzy search to locate primers.
- Handles circular topology by wrapping indices around sequence boundaries.
- Reuses internal memory via `MakeApatSequence`/`Free`, supporting efficient GC and large-scale processing.
- Logs critical errors with `logrus`; debug-level details for amplicon generation.
Designed to integrate within the OBITools4 ecosystem, this module enables high-fidelity *in silico* PCR for metabarcoding and NGS data validation workflows.
@@ -0,0 +1,23 @@
## Semantic Description of `IsPatternMatchSequence`
The function `IsPatternMatchSequence` defines a **sequence predicate** for pattern-based matching in biological sequences (e.g., DNA/RNA), supporting fuzzy and strand-aware search.
### Core Functionality:
- **Input Parameters**
- `pattern`: A regular expression-like string describing the target pattern.
- `errormax`: Maximum allowed mismatches (substitutions only by default).
- `bothStrand`: If true, also search on the reverse-complement strand.
- `allowIndels`: Enables insertion/deletion errors (beyond mismatches) when set to true.
- **Internal Workflow**
- Parses the pattern into an automaton (`apat`) via `MakeApatPattern`.
- Computes its reverse complement for dual-strand matching.
- Returns a closure (`SequencePredicate`) that tests whether a given `BioSequence` matches the pattern (or its RC), within error tolerance.
- **Matching Logic**
- Converts input sequence to `apat` format.
- Checks match on forward strand first; if failed and `bothStrand=true`, tries reverse complement.
- Uses automaton-based matching (`IsMatching`) for efficient fuzzy search.
### Semantic Use Case:
Enables flexible, error-tolerant detection of sequence motifs (e.g., primers, barcodes) in high-throughput sequencing data—supporting both *in silico* primer design validation and read filtering in metagenomic pipelines.
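The closure shape can be sketched without the library: exact substring search stands in for the fuzzy automaton, and `matchPredicate` plays the role of the returned `SequencePredicate`. All names here are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// seqPredicate mirrors the closure shape described above.
type seqPredicate func(seq string) bool

// revComp returns the reverse complement of a DNA string
// (ACGT only, enough for this sketch).
func revComp(s string) string {
	comp := map[byte]byte{'a': 't', 't': 'a', 'c': 'g', 'g': 'c'}
	b := []byte(strings.ToLower(s))
	out := make([]byte, len(b))
	for i := range b {
		out[len(b)-1-i] = comp[b[i]]
	}
	return string(out)
}

// matchPredicate "compiles" both orientations once, then returns a
// closure testing the forward strand first and, only on failure and
// when bothStrand is set, the reverse complement.
func matchPredicate(pattern string, bothStrand bool) seqPredicate {
	fwd := strings.ToLower(pattern)
	rc := revComp(fwd)
	return func(seq string) bool {
		seq = strings.ToLower(seq)
		if strings.Contains(seq, fwd) {
			return true
		}
		return bothStrand && strings.Contains(seq, rc)
	}
}

func main() {
	p := matchPredicate("AAC", true)
	fmt.Println(p("ttgttcc"), matchPredicate("AAC", false)("ttgttcc")) // true false
}
```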
@@ -0,0 +1,15 @@
# `ISequenceChunk` Function — Semantic Description
The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.
- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
- Optional parameters allow fine-tuning:
- `dereplicate`: enables deduplication of identical sequences.
- `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
- `statsOn`: configures what metadata (e.g., description fields) are tracked for statistics.
- `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.
The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.
This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
@@ -0,0 +1,18 @@
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences
The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
Key capabilities include:
- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
Internally, `ISequenceChunkOnDisk` uses:
- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
@@ -0,0 +1,21 @@
# `ISequenceChunkOnMemory` Function — Semantic Description
The function `ISequenceChunkOnMemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.
It consumes an iterator over `BioSequence` objects and distributes them into **per-class batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.
Key features:
- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to shared `chunks` and `sources` maps.
- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.
Input:
- An iterator (`obiiter.IBioSequence`) of raw sequences.
- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.
Output:
- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.
Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
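The aggregation pattern described above — concurrent workers filling mutex-protected per-class slices, with emission deferred until all groups are complete — can be sketched in plain Go. This is a minimal stand-in, not the real `ISequenceChunkOnMemory`: `Seq`, `chunkInMemory`, and the string-keyed class map are hypothetical simplifications of `BioSequence`, the classifier, and the numeric order indices.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Seq is a stand-in for obiseq.BioSequence: just an ID and a class label.
type Seq struct {
	ID    string
	Class string
}

// chunkInMemory groups sequences by class, mimicking the aggregation step of
// ISequenceChunkOnMemory: concurrent workers append to shared per-class slices
// under a mutex, then batches are emitted only after all workers finish.
func chunkInMemory(seqs []Seq, nworkers int) [][]Seq {
	chunks := map[string][]Seq{}
	var mu sync.Mutex
	var wg sync.WaitGroup

	// Split the input across workers (the real code splits an iterator).
	for w := 0; w < nworkers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := w; i < len(seqs); i += nworkers {
				s := seqs[i]
				mu.Lock()
				chunks[s.Class] = append(chunks[s.Class], s)
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait() // lazy emission: only after all groups are complete

	// Deterministic ordering: emit classes in sorted order (the real code
	// assigns increasing numeric order indices instead).
	classes := make([]string, 0, len(chunks))
	for c := range chunks {
		classes = append(classes, c)
	}
	sort.Strings(classes)

	out := make([][]Seq, 0, len(classes))
	for _, c := range classes {
		out = append(out, chunks[c])
	}
	return out
}

func main() {
	seqs := []Seq{{"s1", "A"}, {"s2", "B"}, {"s3", "A"}, {"s4", "B"}}
	batches := chunkInMemory(seqs, 2)
	fmt.Println(len(batches)) // two classes → two batches
}
```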
@@ -0,0 +1,26 @@
# Semantic Description of `obichunk` Package
The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.
## Core Concepts
- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.
## Supported Functionalities
- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
- **Missing Value Handling**: `OptionNAValue(na string)` customizes placeholder for missing data (default: `"NA"`).
- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
- **Batch Processing Control**:
- `OptionBatchCount(number)` sets the number of batches.
- `OptionsBatchSize(size)` defines how many items per batch (default from `obidefault`).
- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures concurrency level (default from environment).
- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).
## Design Highlights
- Functional options pattern for extensibility and readability.
- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
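The functional options pattern described above can be sketched in a few lines. This is an illustrative miniature, not the actual `obichunk` API: the field names and the default values (`5000`, `"NA"`) are assumptions standing in for the hidden `__options__` struct and `obidefault`-derived defaults.

```go
package main

import "fmt"

// options mirrors the spirit of obichunk's hidden __options__ struct.
type options struct {
	batchSize int
	naValue   string
	onDisk    bool
}

// WithOption is a functional setter, as in MakeOptions([]WithOption).
type WithOption func(*options)

func OptionBatchSize(n int) WithOption   { return func(o *options) { o.batchSize = n } }
func OptionNAValue(na string) WithOption { return func(o *options) { o.naValue = na } }
func OptionSortOnDisk() WithOption       { return func(o *options) { o.onDisk = true } }

// MakeOptions applies setters over defaults (hypothetical default values).
func MakeOptions(setters []WithOption) *options {
	o := &options{batchSize: 5000, naValue: "NA"} // stand-ins for obidefault
	for _, set := range setters {
		set(o)
	}
	return o
}

func main() {
	o := MakeOptions([]WithOption{OptionBatchSize(100), OptionSortOnDisk()})
	fmt.Println(o.batchSize, o.naValue, o.onDisk)
}
```

The pattern keeps the zero-configuration call site trivial (`MakeOptions(nil)`) while letting callers compose any subset of settings without combinatorial constructors.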
@@ -0,0 +1,29 @@
# Semantic Description of `obichunk.ISequenceSubChunk`
The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.
## Core Functionality
- **Input**:
- An iterator over `BioSequence` batches (`obiiter.IBioSequence`)
- A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code
- A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism
- **Processing**:
- Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.
- For each incoming `BioSequenceBatch`:
- If the batch has >1 sequence: sequences are extracted, classified into `code`, and sorted *in-place* by class code.
- Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon code change.
- If the batch has ≤1 sequence, it's passed through unchanged (but reordered with a new order ID).
- **Ordering Mechanism**:
- Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.
- Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).
- **Output**:
- Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.
- Workers are coordinated via `newIter.Done()` / `Wait()` / `Close()`, ensuring clean termination.
## Semantic Purpose
Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires sorted/class-ordered input — e.g., consensus building, alignment, or read merging per group.
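The sort-then-split loop with atomically assigned order IDs can be sketched as follows. This is a simplified, hypothetical rendering of the per-worker logic, not the real `ISequenceSubChunk`: `classified` stands in for a `BioSequence` plus its classifier-assigned code, and batches are returned in a map keyed by order ID instead of being pushed to an iterator.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// classified pairs a sequence with its class code, as assigned by a
// BioSequenceClassifier in the real ISequenceSubChunk.
type classified struct {
	seq  string
	code int
}

var nextOrder int32 // shared across workers; atomic increments give global batch IDs

// subChunk sorts one incoming batch by class code, then cuts it into
// consecutive same-code groups, assigning each group a strictly increasing
// order ID (a sketch of the per-worker loop).
func subChunk(batch []classified) map[int32][]classified {
	sort.SliceStable(batch, func(i, j int) bool { return batch[i].code < batch[j].code })
	out := map[int32][]classified{}
	start := 0
	for i := 1; i <= len(batch); i++ {
		if i == len(batch) || batch[i].code != batch[start].code {
			id := atomic.AddInt32(&nextOrder, 1) - 1 // deterministic inter-batch ordering
			out[id] = batch[start:i]
			start = i
		}
	}
	return out
}

func main() {
	groups := subChunk([]classified{{"a", 2}, {"b", 1}, {"c", 2}, {"d", 1}})
	fmt.Println(len(groups)) // two class codes → two emitted batches
}
```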
@@ -0,0 +1,45 @@
# Semantic Description of `IUniqueSequence` Functionality
The `IUniqueSequence` function performs **dereplication** of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.
## Core Workflow
1. **Input Processing**
Accepts an input sequence iterator and optional configuration via `WithOption`.
2. **Parallelization Strategy**
Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
3. **Data Splitting Phase**
- Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
- Ensures deterministic chunking for reproducibility.
4. **Storage Choice**
- *In-memory*: via `ISequenceChunkOnMemory`.
- *Disk-based*: via `ISequenceSubChunk` + external sorting (requires single worker).
5. **Uniqueness Classification**
- Builds a composite classifier combining:
- Sequence identity (`SequenceClassifier`)
- Optional annotation categories (e.g., sample, primer), with NA handling.
- If no annotations are specified, only raw sequence identity is used.
6. **Singleton Filtering**
Optionally excludes singleton reads (count = 1) via `NoSingleton()` option.
7. **Parallel Dereplication**
- Spawns worker goroutines to process chunks.
- Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
8. **Output Merging**
- Aggregates results using `IMergeSequenceBatch`, preserving:
- Sequence counts
- Statistics (if enabled)
- NA handling and ordering
## Key Features
- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- **Configurable**: Via functional options (`Options`).
- **Thread-safe**: Uses `sync.Mutex` for deterministic ordering.
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).
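The composite-key deduplication at the heart of steps 5–7 can be illustrated without the iterator machinery. This is a hedged sketch, not the actual `IUniqueSequence` code: `read` and `dereplicate` are hypothetical, with one category (`sample`) standing in for the configurable annotation categories, and count summation standing in for the full statistics aggregation.

```go
package main

import "fmt"

// read is a stand-in for an annotated BioSequence: sequence plus one
// classification category (e.g., sample) and a count.
type read struct {
	seq, sample string
	count       int
}

// dereplicate merges reads sharing the same composite key
// (sequence + category), summing counts — the per-batch deduplication
// step of IUniqueSequence, sketched without iterators.
func dereplicate(reads []read, noSingleton bool) []read {
	merged := map[string]*read{}
	order := []string{}
	for _, r := range reads {
		key := r.seq + "\x00" + r.sample // composite key, NUL-separated
		if m, ok := merged[key]; ok {
			m.count += r.count
		} else {
			cp := r
			merged[key] = &cp
			order = append(order, key)
		}
	}
	out := []read{}
	for _, k := range order {
		if noSingleton && merged[k].count == 1 {
			continue // NoSingleton(): drop reads seen only once
		}
		out = append(out, *merged[k])
	}
	return out
}

func main() {
	rs := []read{{"ACGT", "s1", 1}, {"ACGT", "s1", 2}, {"ACGT", "s2", 1}}
	fmt.Println(len(dereplicate(rs, false)), len(dereplicate(rs, true)))
}
```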
@@ -0,0 +1,28 @@
# Aho-Corasick-Based Sequence Analysis in `obicorazick`
This Go package provides efficient pattern-matching utilities for biological sequence data, leveraging the Aho-Corasick algorithm.
## Core Components
- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**
Builds *multiple* Aho-Corasick matchers in parallel (batched to manage memory), then returns a `SeqWorker` function.
- Scans each sequence *forward* and its reverse complement.
- Counts total matches (`slot`), forward-only (`_Fwd`) and reverse-complement-specific (`_Rev`) matches.
- Attaches match counts as sequence attributes.
- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**
Compiles a *single* matcher and returns a predicate function.
- Returns `true` if the number of matches ≥ `minMatches`.
- Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
## Technical Highlights
- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns/batch) to avoid memory overload.
- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
## Use Cases
- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
- Filtering or annotating sequences based on pattern presence/abundance.
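The predicate contract described above can be shown with a naive scan standing in for the compiled automaton. This is deliberately *not* Aho-Corasick (the real `AhoCorazickPredicate` compiles a single matcher); it only demonstrates the `minMatches` threshold and the `SequencePredicate` shape, both of which are simplified here.

```go
package main

import (
	"fmt"
	"strings"
)

// SequencePredicate mirrors the shape of obiseq.SequencePredicate: seq → bool.
type SequencePredicate func(seq string) bool

// matchCountPredicate returns true once the total number of pattern hits in
// the sequence reaches minMatches. A naive strings.Count scan stands in for
// the compiled Aho-Corasick automaton.
func matchCountPredicate(minMatches int, patterns []string) SequencePredicate {
	return func(seq string) bool {
		n := 0
		for _, p := range patterns {
			n += strings.Count(seq, p)
			if n >= minMatches {
				return true // early exit once the threshold is reached
			}
		}
		return false
	}
}

func main() {
	pred := matchCountPredicate(2, []string{"ACGT", "TTT"})
	fmt.Println(pred("ACGTTTTACGT")) // "ACGT" occurs twice → true
}
```

The automaton matters only for performance: scanning once over the sequence against thousands of patterns is what the naive loop cannot do efficiently.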
@@ -0,0 +1,34 @@
# ObiDefault Package: Batch Configuration Module
This Go module provides centralized configuration for sequence batching in Obitools, supporting both **count-based** and **memory-aware** batch processing.
## Core Features
- `_BatchSize` / `SetBatchSize()`
Defines and configures the *minimum* number of sequences per batch (default: `1`).
Used internally as `minSeqs` in `RebatchBySize`.
- `_BatchSizeMax` / `SetBatchSizeMax()`
Sets the *maximum* sequences per batch (default: `2000`). Batches are flushed upon reaching this limit, regardless of memory.
- **CLI & Environment Integration**
Batch size is determined by `--batch-size` CLI flag and/or the `OBIBATCHSIZE` environment variable (via parsing logic not shown here but implied by comments).
- `_BatchMem` / `SetBatchMem(n int)`
Configures the *maximum memory per batch* (default: `128 MB`). A value of `0` disables memory-based batching, falling back to pure count-based logic.
- `_BatchMemStr`
Stores the *raw CLI string* passed to `--batch-mem` (e.g., `"256M"`, `"1G"`), enabling human-readable input parsing elsewhere.
## Utility Functions
- `BatchSizePtr()`, `BatchMemPtr()`
Expose pointers to internal variables for direct modification or inter-process sharing.
- `BatchSizeMaxPtr()`, `BatchMemStrPtr()`
Provide read/write access to max-size and raw memory string values.
## Design Intent
- Separates **configuration** (defaults, CLI/env parsing) from **processing logic**, enabling modular and testable batch handling.
- Supports both scalable, large-scale processing (via count limits) and memory-constrained environments (via soft RAM caps).
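The combined count/memory flushing rule implied by the configuration above can be sketched as a single decision function. The rule itself is an interpretation of the defaults described here, not code from `obidefault`; the constant values mirror the documented defaults.

```go
package main

import "fmt"

// hypothetical defaults mirroring _BatchSize, _BatchSizeMax, and _BatchMem
const (
	batchSizeMin = 1
	batchSizeMax = 2000
	batchMemCap  = 128 << 20 // 128 MB; 0 would disable memory-based flushing
)

// shouldFlush sketches the combined batching rule: a batch is emitted once it
// reaches the maximum count, or once it holds at least the minimum count and
// exceeds the memory cap (when the cap is enabled).
func shouldFlush(count, bytes int) bool {
	if count >= batchSizeMax {
		return true
	}
	if batchMemCap > 0 && count >= batchSizeMin && bytes >= batchMemCap {
		return true
	}
	return false
}

func main() {
	fmt.Println(shouldFlush(2000, 0))     // count limit reached
	fmt.Println(shouldFlush(10, 256<<20)) // memory cap exceeded
	fmt.Println(shouldFlush(10, 1<<20))   // neither → keep accumulating
}
```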
@@ -0,0 +1,35 @@
# Output Compression Control Module
This Go package (`obidefault`) provides a simple, global configuration mechanism for toggling output compression behavior across an application.
## Core Features
- **Global Compression Flag**: A package-level boolean variable `__compress__` (default: `false`) controls whether output should be compressed.
- **Read Access**:
- `CompressOutput()` returns the current compression setting as a boolean.
- **Write Access**:
- `SetCompressOutput(b bool)` updates the compression flag to a new value.
- **Pointer Access**:
- `CompressOutputPtr()` returns a pointer to the internal flag, enabling indirect modification (e.g., for UI bindings or reflection-based updates).
## Design Intent
- Minimal, side-effect-free API.
- Thread-safety *not* guaranteed — intended for use in single-threaded initialization or controlled environments.
- Encapsulation via unexported variable `__compress__`, enforced through accessor functions.
## Typical Usage
```go
// Enable compression globally:
obidefault.SetCompressOutput(true)
if obidefault.CompressOutput() {
// Apply compression logic (e.g., gzip, brotli)
}
```
## Notes
- The double underscore prefix (`__compress__`) signals internal/private status (convention, not enforced).
- Designed for runtime configurability without recompilation.
@@ -0,0 +1,38 @@
# `obidefault` Package — Semantic Overview
This minimal Go package provides a centralized, mutable global flag for controlling warning verbosity across an application.
## Core Functionality
- **`__silent_warning__`**:
A package-level boolean variable (unexported) that determines whether warnings should be suppressed.
- **`SilentWarning() bool`**:
A read-only accessor returning the current state of `__silent_warning__`. Enables safe, non-mutating checks elsewhere in the codebase.
- **`SilentWarningPtr() *bool`**:
Returns a pointer to `__silent_warning__`, allowing external code (e.g., CLI parsers, config loaders) to directly mutate the flag — e.g., `*SilentWarningPtr() = true`.
## Design Intent
- **Simplicity & Centralization**:
Avoids scattering warning-control logic; provides a single source of truth.
- **Flexibility**:
Supports both *read-only* inspection (via `SilentWarning()`) and *global mutation* (via pointer), useful for early initialization phases.
- **Explicit Semantics**:
When `SilentWarning()` returns `true`, all warning-generating code *should* suppress output (implementation responsibility lies outside this package).
## Usage Example
```go
// Suppress warnings globally:
*obidefault.SilentWarningPtr() = true
if !obidefault.SilentWarning() {
log.Println("⚠️ Warning: something happened")
}
```
> **Note**: The double underscore prefix on `__silent_warning__` signals internal/private status, discouraging direct access.
@@ -0,0 +1,33 @@
# Progress Bar Control Module (`obidefault`)
This Go package provides a simple, global mechanism to enable or disable progress bar display across an application.
## Core Functionality
- **`ProgressBar()`**: Returns `true` if progress bars are *enabled* (i.e., when `__no_progress_bar__` is `false`).
- **`NoProgressBar()`**: Returns the current state of `__no_progress_bar__`, i.e., whether progress bars are *disabled*.
- **`SetNoProgressBar(b bool)`**: Sets the global flag `__no_progress_bar__`. Passing `true` disables progress bars; passing `false` enables them.
- **`NoProgressBarPtr()`**: Returns a pointer to the internal `__no_progress_bar__` variable, allowing direct read/write access (e.g., for reflection or UI binding).
## Design Intent
- Centralizes progress bar visibility control in one place.
- Supports both boolean query/set and pointer-based manipulation for flexibility (e.g., CLI flags, config binding).
- Uses a *negative* flag name (`__no_progress_bar__`) internally to default progress bars **on** (i.e., `false` → enabled).
## Usage Example
```go
// Disable progress bars globally:
obidefault.SetNoProgressBar(true)
// Check status:
if !obidefault.ProgressBar() {
log.Println("Progress bars are disabled.")
}
```
## Notes
- Thread-safety is *not* guaranteed; concurrent access should be externally synchronized.
- The double underscore prefix (`__no_progress_bar__`) signals internal/private usage per Go convention (though not enforced).
@@ -0,0 +1,26 @@
# Quality Shift and Read/Write Control Module
This Go package (`obidefault`) provides configurable controls over quality score handling in sequence data processing (e.g., FASTQ files). It defines three global variables and corresponding accessor/mutator functions:
- `_Quality_Shift_Input`: Input quality score offset (default: `33`, i.e., Phred+33/Sanger format).
- `_Quality_Shift_Output`: Output quality score offset (default: `33`), allowing format conversion.
- `_Read_Qualities`: Boolean flag indicating whether quality scores should be parsed/processed (`true` by default).
## Public API
| Function | Purpose |
|---------|--------|
| `SetReadQualitiesShift(shift byte)` | Sets the quality score offset for *input* data (e.g., when reading FASTQ). |
| `ReadQualitiesShift() byte` | Returns the current input quality offset. |
| `SetWriteQualitiesShift(shift byte)` | Sets the quality score offset for *output* data (e.g., when writing FASTQ). |
| `WriteQualitiesShift() byte` | Returns the current output quality offset. |
| `SetReadQualities(read bool)` | Enables/disables reading/processing of quality scores. |
| `ReadQualities() bool` | Returns whether qualities are currently being read/used. |
## Semantic Use Cases
- **Format Interoperability**: Allows seamless conversion between Phred+33 (Sanger), Phred+64, or other quality encodings.
- **Performance Optimization**: Disabling `ReadQualities` skips parsing of quality strings, useful when only sequences are needed.
- **Centralized Configuration**: Global state enables consistent behavior across modules without passing parameters.
All functions are thread-unsafe by design—intended for initialization before concurrent processing begins.
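The format conversion enabled by separate input and output shifts amounts to re-encoding each quality character. A minimal sketch (the function name is hypothetical; the shifts are what `ReadQualitiesShift()` and `WriteQualitiesShift()` would supply):

```go
package main

import "fmt"

// convertQualities re-encodes a quality string from one ASCII offset to
// another, e.g. Phred+64 → Phred+33 (Sanger).
func convertQualities(q string, inShift, outShift byte) string {
	out := make([]byte, len(q))
	for i := 0; i < len(q); i++ {
		score := q[i] - inShift   // decode to the raw Phred score
		out[i] = score + outShift // re-encode with the output offset
	}
	return string(out)
}

func main() {
	// 'h' in Phred+64 encodes score 40, which is 'I' in Phred+33.
	fmt.Println(convertQualities("hhhh", 64, 33)) // → IIII
}
```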
@@ -0,0 +1,21 @@
# `obidefault` Package: Configuration State Management
This Go package provides a centralized, thread-safe(ish) configuration layer for taxonomy-related settings in the OBIDMS (Open Biological and Biomedical Data Management System) framework. It exposes simple getters, setters, and pointer accessors for four core boolean/string flags that control how taxonomic identifiers (taxids) are handled during data processing.
## Core Configuration Flags
- `__taxonomy__`: Stores the currently selected taxonomy (e.g., `"NCBI"`, `"UNIPROT"`).
- `__alternative_name__`: Enables/disables use of alternative taxonomic names (e.g., synonyms).
- `__fail_on_taxonomy__`: If true, processing halts on taxonomy mismatches/errors.
- `__update_taxid__`: If true, taxids are auto-updated to current NCBI/DB versions.
- `__raw_taxid__`: If true, raw (unprocessed) taxids are preserved instead of normalized.
## Public API
- **Getters**: `UseRawTaxids()`, `SelectedTaxonomy()`, `HasSelectedTaxonomy()`, etc., return current values.
- **Pointer Accessors**: e.g., `SelectedTaxonomyPtr()` returns a pointer for direct mutation (advanced use).
- **Setters**: `SetSelectedTaxonomy()`, `SetAlternativeNamesSelected()`, etc., update state.
## Use Case
Typically used at application startup to configure global behavior (e.g., `SetSelectedTaxonomy("NCBI")`, `SetUpdateTaxid(true)`), then referenced by downstream modules during data import, validation, or mapping. Minimalist and explicit—no external dependencies.
@@ -0,0 +1,35 @@
# Obidefault: Parallelism Configuration Module
This Go package (`obidefault`) provides a centralized, configurable interface for managing parallel execution parameters—particularly useful in I/O- and CPU-bound workloads.
## Core Concepts
- **CPU-aware defaults**: Automatically detects available cores via `runtime.NumCPU()`.
- **Configurable workers per core**:
- General: `_WorkerPerCore` (default `1.0`)
- Read-specific: `_ReadWorkerPerCore` (`0.25`, i.e., ~1 reader per 4 cores)
- Write-specific: `_WriteWorkerPerCore` (`0.25`)
- **Strict overrides**: Allow hardcoding worker counts via `SetStrictReadWorker()`/`Write...`, bypassing per-core scaling.
## Public API
| Function | Purpose |
|---------|--------|
| `ParallelWorkers()` | Total workers = `MaxCPU() × WorkerPerCore` |
| `Read/WriteParallelWorkers()` | Resolves to strict count if set, else per-core calculation (min 1) |
| `ParallelFilesRead()` | Files read in parallel: defaults to `ReadParallelWorkers()`, overridable |
| Getters (`MaxCPU`, `WorkerPerCore`, etc.) | Expose current settings safely |
| Setters (`Set*`) | Dynamically adjust behavior at runtime |
## Configuration Sources
- **Command-line flags**: e.g., `--max-cpu` or `-m`
- **Environment variable**: `OBIMAXCPU`
## Design Highlights
✅ Decouples resource discovery from policy
✅ Supports both *proportional* (per-core) and *absolute* (strict) worker definitions
✅ Ensures non-zero defaults for critical paths (`ReadParallelWorkers` ≥ 1)
⚠️ **Note**: `WriteParallelWorkers()` contains a likely bug—returns `_StrictReadWorker` in the else branch instead of `StrictWriteWorker`.
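The resolution order in the table — strict override first, otherwise a per-core proportion floored at 1 — can be sketched as one function. This is an interpretation of the documented behavior, not the package's code; passing the matching strict value explicitly sidesteps the read/write mix-up noted above.

```go
package main

import (
	"fmt"
	"math"
)

// resolveWorkers sketches Read/WriteParallelWorkers(): a strict override
// (>0) wins; otherwise the count is proportional to the core count,
// floored at 1 so critical paths never get zero workers.
func resolveWorkers(cores int, perCore float64, strict int) int {
	if strict > 0 {
		return strict
	}
	n := int(math.Floor(float64(cores) * perCore))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(resolveWorkers(8, 0.25, 0)) // 8 cores × 0.25 → 2 readers
	fmt.Println(resolveWorkers(2, 0.25, 0)) // would be 0, floored to 1
	fmt.Println(resolveWorkers(8, 0.25, 5)) // strict override wins
}
```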
@@ -0,0 +1,28 @@
# `obidist` Package: Efficient Symmetric Distance/Similarity Matrix Management
The `*DistMatrix` type provides a memory-efficient, symmetric matrix implementation for distance or similarity data.
- **Storage Strategy**: Only the upper triangle (i < j) is stored, reducing memory usage from *O(n²)* to *n(n−1)/2* entries.
- **Diagonal Handling**: Diagonal entries are fixed (0.0 for distances, 1.0 for similarities); assignments to diagonal indices are silently ignored.
- **Symmetry Guarantee**: `Get(i, j)` and `Set(i, j, v)` automatically handle both (i,j) and (j,i), ensuring consistency.
## Constructors
| Function | Description |
|---------|-------------|
| `NewDistMatrix(n)` / `WithLabels(labels)` | Creates *n×n* distance matrix (diag = 0). |
| `NewSimilarityMatrix(n)` / `WithLabels(labels)` | Creates *n×n* similarity matrix (diag = 1). |
## Core Operations
- `Get(i, j)` / `Set(i, j, v)`: Access/update symmetric entries.
- `Size() int`, `GetLabel(i)` / `SetLabel(i, label)`: Query/mutate element labels.
- `Labels() []string`, `GetRow(i)` / `GetColumn(j)`: Retrieve full rows/columns (as copies).
## Analysis Helpers
- `MinDistance()` / `MaxDistance()`: return the `(value, i, j)` triple of the extremal off-diagonal entry.
- `Copy() *DistMatrix`: Deep copy for immutability-safe operations.
- `ToFullMatrix() [][]float64`: Converts to a dense representation (use sparingly).
Designed for clustering, phylogenetics, or any domain requiring fast symmetric matrix access with minimal footprint.
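The upper-triangle storage scheme rests on a closed-form index mapping. A sketch of that mapping (the function name is hypothetical; the formula is the standard row-major upper-triangle layout the description implies):

```go
package main

import "fmt"

// upperIndex maps a symmetric pair (i, j), i != j, into the flat
// upper-triangle storage of length n(n-1)/2 used by DistMatrix-style layouts.
func upperIndex(n, i, j int) int {
	if i > j {
		i, j = j, i // symmetry: (i, j) and (j, i) share a slot
	}
	// rows 0..i-1 contribute (n-1) + (n-2) + ... entries; closed form below
	return i*n - i*(i+1)/2 + (j - i - 1)
}

func main() {
	n := 4 // a 4×4 matrix stores 4·3/2 = 6 off-diagonal entries
	fmt.Println(upperIndex(n, 0, 1), upperIndex(n, 2, 3), upperIndex(n, 3, 2))
}
```

For n = 4 the pairs (0,1)…(2,3) map to indices 0…5, and the swapped pair (3,2) lands on the same slot as (2,3), which is how `Get`/`Set` keep the matrix symmetric for free.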
@@ -0,0 +1,28 @@
# `obidist` Package: Semantic Feature Overview
The `obidist` Go package provides two core data structures for managing **distance** and **similarity matrices**, with built-in guarantees suitable for scientific computing (e.g., clustering, phylogenetics). Key features include:
- **`DistMatrix`**: A symmetric `n×n` matrix representing pairwise distances, where:
- Diagonal entries are *always* `0.0` (self-distance).
- Off-diagonals obey symmetry: `dist(i, j) == dist(j, i)`.
- Automatic enforcement via dedicated `Set()`/`Get()` methods.
- **`SimilarityMatrix`**: A symmetric matrix where:
- Diagonal entries are *always* `1.0`.
- Off-diagonals represent similarity scores (e.g., between `0` and `1`, though not enforced).
- Symmetry is similarly guaranteed.
Both matrix types support:
- **Optional labels**: Associate human-readable identifiers (e.g., sample names) with rows/columns.
- **Safe bounds checking**: Panics on out-of-range access (tested via `defer/recover`).
- **Deep copy support**: Ensures isolation between original and copied instances.
- **Utility methods**:
- `MinDistance()` / `MaxDistance()`: Return extremal values and their indices.
- `GetRow(i)`: Retrieve a full row as a slice (symmetric copy).
- `ToFullMatrix()`: Export the matrix as an immutable 2D slice.
Edge cases are rigorously handled:
- Empty (`n=0`) and singleton (`n=1`) matrices return `(0.0, -1, -1)` for min/max.
- Label mutations do not affect internal state via defensive copying.
All behaviors are validated through comprehensive unit tests, emphasizing correctness and robustness.
@@ -0,0 +1,43 @@
# Semantic Description of `ReadSequencesBatchFromFiles`
This function implements **concurrent, batched streaming** of biological sequences from multiple input files.
## Core Functionality
- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.
## Concurrency Model
- Launches `concurrent_readers` goroutines to process files in parallel.
- Files are distributed via a shared channel (`filenameChan`) — ensuring fair load balancing.
## Streaming Interface
- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.
## Error Handling & Logging
- Panics on file-open failure (via `log.Panicf`).
- Logs start/end of reading per file using structured logging (`log.Printf`, `log.Println`).
## Resource Management
- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.
## Design Intent
- Enables scalable, memory-efficient ingestion of large NGS datasets.
- Decouples *reading logic* (via `IBatchReader`) from orchestration — supporting pluggable formats.
- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.
## Key Abstractions
| Type/Interface | Role |
|----------------|------|
| `IBatchReader` | Reader factory: `(filename, options...) → SequenceIterator` |
| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
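The fan-out pattern from the table — a shared filename channel feeding reader goroutines, an atomic counter for batch IDs, and a barrier on completion — can be sketched as follows. This is a schematic stand-in, not `ReadSequencesBatchFromFiles`: real readers would open each file and push sequence batches; here each file simply yields one counted batch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// readAll fans file names out to nreaders goroutines through a shared channel:
// each worker pulls the next free file (fair load balancing), and an atomic
// counter hands out globally unique batch IDs.
func readAll(files []string, nreaders int) int32 {
	filenameChan := make(chan string)
	var nextCounter int32
	var wg sync.WaitGroup

	for r := 0; r < nreaders; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // barrier: the real code calls batchiter.Done()
			for range filenameChan {
				// pretend each file yields one batch; tag it with an ordered ID
				atomic.AddInt32(&nextCounter, 1)
			}
		}()
	}
	for _, f := range files {
		filenameChan <- f
	}
	close(filenameChan)
	wg.Wait() // WaitAndClose equivalent: all readers finished
	return nextCounter
}

func main() {
	fmt.Println(readAll([]string{"a.fastq", "b.fastq", "c.fastq"}, 2))
}
```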
@@ -0,0 +1,36 @@
# `obiformats` Package — Semantic Overview
The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.
## Core Abstraction
- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:
```go
func(string, ...WithOption) (obiiter.IBioSequence, error)
```
- It accepts:
- A file path (`string`)
- Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
- Returns:
- An iterator over biological sequences (`obiiter.IBioSequence`)
- Or an error if the file cannot be opened/parsed
## Semantic Intent
- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, SAM/BAM, etc., via a single entry point.
- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
- **Enables lazy, streaming access**: Sequences are yielded on-demand via the iterator—memory-efficient for large datasets.
## Typical Usage Pattern
1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
2. Call it with a file path and optional options.
3. Iterate over the returned `IBioSequence` to process sequences one-by-one.
## Design Principles
- **Functional, minimal API**: Single responsibility—reading and iteration.
- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.
> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
@@ -0,0 +1,30 @@
# CSV Import Module for Biological Sequences (`obiformats`)
This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.
## Core Features
- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
- **Metadata Handling**:
- Special handling for taxonomic IDs (`taxid`, `*_taxid`).
- Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
- **Multiple Entry Points**:
- `ReadCSV`: From any `io.Reader`.
- `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
- `ReadCSVFromStdin`: Reads from standard input.
- **Error & Edge Handling**:
- Gracefully handles empty files/streams via `ReadEmptyFile`.
- Uses structured logging (Logrus) for fatal and informational messages.
## Integration
Designed to integrate with OBITools4's core types:
- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
## Use Case
Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
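The "JSON when possible, raw string otherwise" rule for generic attributes can be shown directly. The function name is hypothetical, but the fallback logic is exactly what the description states:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseAttribute mirrors the CSV importer's generic-attribute rule: try to
// decode the cell as JSON; if that fails, keep the raw string.
func parseAttribute(cell string) interface{} {
	var v interface{}
	if err := json.Unmarshal([]byte(cell), &v); err == nil {
		return v // number, bool, array, or object
	}
	return cell // not valid JSON → keep verbatim
}

func main() {
	fmt.Println(parseAttribute("42"))        // JSON number → float64(42)
	fmt.Println(parseAttribute(`["a","b"]`)) // JSON array
	fmt.Println(parseAttribute("sample_A"))  // not JSON → raw string
}
```

Note that `encoding/json` decodes bare numbers into `float64` by default, so downstream code comparing attribute types needs to account for that.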
@@ -0,0 +1,22 @@
# CSVSequenceRecord Function Description
The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.
## Core Features
- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
- **Taxonomic Information**: Adds both NCBI taxid and scientific name (retrieved from attributes or fallback via `opt.CSVNAValue()`).
- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends corresponding attribute values (or NA if missing).
- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
- **Quality Scores**: Converts Phred-quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.
## Design Highlights
- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
- Handles missing data consistently via `opt.CSVNAValue()`.
- Supports both standard and user-defined metadata fields.
- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.
This function enables interoperable, configurable export of sequence data to tabular formats.
@@ -0,0 +1,24 @@
# `CSVTaxaIterator` Function — Semantic Description
The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.
### Core Functionality:
- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.
### Configurable Output Fields (via options):
- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
- `taxid`: Either raw node ID (e.g., string pointer) or formatted taxon path (`WithRawTaxid` toggle).
- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
- `scientific_name`: Full scientific name of the taxon.
- Custom metadata fields: Specified via `WithMetadata`, extracted from taxon metadata store.
- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).
### Implementation Highlights:
- Uses **goroutines** for non-blocking push of batches and clean shutdown (`WaitAndClose`, `Done`).
- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
- Dynamically builds CSV headers based on selected options before processing begins.
### Use Case:
Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
@@ -0,0 +1,27 @@
## CSV Taxonomy Loader for OBITools4
This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.
### Key Features:
- **Robust CSV Parsing**: Uses Go's `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
- **Column Mapping**: Dynamically identifies required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
- **Error Handling**: Validates presence of all required columns; fails early with descriptive errors.
- **Taxonomy Construction**:
- Builds a hierarchical taxonomy using `obitax.Taxon` objects.
- Ensures existence of a root node; returns error otherwise.
- **Metadata Extraction**:
- Derives taxonomy name and short code (e.g., prefix before `:` in first taxid).
- Logs key metadata for traceability.
- **Scalable Design**:
- Processes records line-by-line (memory-efficient).
- Supports large datasets via streaming CSV reading.
### Input Format:
CSV must contain exactly four columns (case-sensitive headers):
- `taxid`: Unique taxon identifier.
- `parent`: Parent taxonomic node ID (empty for root).
- `scientific_name`: Binomial or descriptive name.
- `taxonomic_rank`: e.g., *species*, *genus*.
### Output:
Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
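A minimal sketch of the column-mapping and streaming steps, using only `encoding/csv` from the standard library. The `columnIndex` helper is hypothetical; the real loader builds `obitax.Taxon` objects instead of printing:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// columnIndex maps the required header names to their positions,
// failing early with a descriptive error naming the first missing
// column, as LoadCSVTaxonomy's validation does.
func columnIndex(header []string) (map[string]int, error) {
	idx := make(map[string]int)
	for i, name := range header {
		idx[name] = i
	}
	for _, required := range []string{"taxid", "parent", "scientific_name", "taxonomic_rank"} {
		if _, ok := idx[required]; !ok {
			return nil, fmt.Errorf("missing required column %q", required)
		}
	}
	return idx, nil
}

func main() {
	data := "taxid,parent,scientific_name,taxonomic_rank\n" +
		"tax:1,,root,no rank\n" +
		"tax:2,tax:1,Bacteria,superkingdom\n"
	r := csv.NewReader(strings.NewReader(data))
	r.TrimLeadingSpace = true
	header, _ := r.Read()
	idx, err := columnIndex(header)
	if err != nil {
		panic(err)
	}
	// Stream records line by line, memory-efficiently.
	for {
		rec, err := r.Read()
		if err != nil {
			break
		}
		fmt.Println(rec[idx["taxid"]], rec[idx["taxonomic_rank"]])
	}
}
```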
@@ -0,0 +1,14 @@
# Semantic Description of `obiformats.WriterDispatcher`
The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.
- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations like sample + region), parsing JSON-encoded classifier values when needed.
- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.
In summary: `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
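The orchestration pattern — one goroutine per classification key, joined by a `sync.WaitGroup` — can be sketched as below; all types and the `{class}` template placeholder are illustrative assumptions, not the real `IDistribute` API:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// dispatch starts one goroutine per classification key, each handling
// its own batches, and waits for all of them before returning — the
// same shape as WriterDispatcher (types here are simplified).
func dispatch(prototype string, batches map[string][]string) []string {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		files []string
	)
	for key, seqs := range batches {
		wg.Add(1)
		go func(key string, seqs []string) {
			defer wg.Done()
			// Substitute the key into the filename template.
			name := strings.ReplaceAll(prototype, "{class}", key)
			mu.Lock()
			files = append(files, fmt.Sprintf("%s: %d sequences", name, len(seqs)))
			mu.Unlock()
		}(key, seqs)
	}
	wg.Wait() // all goroutines complete before returning
	return files
}

func main() {
	out := dispatch("sample_{class}.fasta", map[string][]string{
		"A": {"acgt", "ttga"},
		"B": {"ggcc"},
	})
	fmt.Println(len(out))
}
```

In the real dispatcher, each goroutine opens its own output file (appending `.gz` when compression is enabled) and writes batches as they arrive.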
@@ -0,0 +1,29 @@
# EcoPCR File Parser for Biological Sequences
This Go package (`obiformats`) provides functionality to parse EcoPCR output files—tab-delimited CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.
## Key Features
- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
- Name (with deduplication support)
- Nucleotide/protein sequence
- Comment field
- **Structured Annotation**: Populates rich annotations including:
- Taxonomic hierarchy (taxid, rank, species/genus/family names)
- Primer matching info (`forward_match`, `reverse_mismatch`)
- Melting temperatures (if present in v2)
- Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
## Implementation Highlights
- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent goroutine-based streaming to decouple I/O and processing.
This module integrates with the broader *OBItools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
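A hedged sketch of the header-detection step: the `#@ecopcr-v2` marker matches the description above, but the `# forward:`/`# reverse:` comment layout is a simplified, hypothetical stand-in for the actual primer lines:

```go
package main

import (
	"fmt"
	"strings"
)

// parseEcoPCRHeader scans leading comment lines for the v2 marker and
// primer declarations. The "# forward:"/"# reverse:" layout is a
// hypothetical simplification of the real EcoPCR header comments.
func parseEcoPCRHeader(lines []string) (version int, forward, reverse string) {
	version = 1 // default to v1 when no version marker is present
	for _, line := range lines {
		switch {
		case strings.HasPrefix(line, "#@ecopcr-v2"):
			version = 2
		case strings.HasPrefix(line, "# forward:"):
			forward = strings.TrimSpace(strings.SplitN(line, ":", 2)[1])
		case strings.HasPrefix(line, "# reverse:"):
			reverse = strings.TrimSpace(strings.SplitN(line, ":", 2)[1])
		}
	}
	return version, forward, reverse
}

func main() {
	header := []string{"#@ecopcr-v2", "# forward: ACGTACGT", "# reverse: TTAGGC"}
	v, f, r := parseEcoPCRHeader(header)
	fmt.Println(v, f, r)
}
```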
@@ -0,0 +1,17 @@
# EMBL Format Parser for OBITools4
This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:
- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
- **Two Parsing Modes**:
- `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
- `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
- **Configurable Options**:
- `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
- `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
- **Sequence Handling**: Parses multi-line EMBL sequences (10-bases-per-group, with position numbers), skipping digits and whitespace.
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
- **Integration**: Outputs are compatible with OBITools4's iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.
Designed for scalability, the module handles large EMBL files efficiently—ideal for metagenomic or biodiversity data pipelines.
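The entry-boundary idea can be illustrated in a few lines of Go. This simplified `endOfLastEntry` only handles LF line endings and an interior terminator, whereas the real `EndOfLastFlatFileEntry()` also tolerates CR/LF:

```go
package main

import (
	"bytes"
	"fmt"
)

// endOfLastEntry returns the index just past the last complete EMBL
// entry (past a "//" terminator line), or -1 when none is present.
func endOfLastEntry(buffer []byte) int {
	i := bytes.LastIndex(buffer, []byte("\n//\n"))
	if i < 0 {
		return -1
	}
	return i + len("\n//\n")
}

func main() {
	buf := []byte("ID   X;\nacgt\n//\nID   Y;\n")
	cut := endOfLastEntry(buf)
	// Everything before cut is a complete chunk; the rest waits for
	// more data.
	fmt.Printf("%q\n", buf[:cut])
}
```

This is what makes chunked processing safe: a chunk is only cut where a whole entry has been seen.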
@@ -0,0 +1,22 @@
## `ReadEmptyFile` Function — Semantic Description
- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
- **Signature**:
`func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
- **Behavior**:
- Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
- Immediately closes the stream using `.Close()` — indicating no data will be yielded.
- **Output**:
- Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
- Error return is always `nil`, since no I/O occurs and the operation is deterministic.
### Semantic Role & Use Cases
- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.
### Design Notes
- Reflects a *pure-functional* and *fail-safe* pattern: no side effects, deterministic behavior.
- Aligns with iterator-based I/O design principles in OBITools4 (lazy, composable streams).
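In plain Go, the same fail-safe contract can be modeled with a closed channel standing in for the `IBioSequence` iterator:

```go
package main

import "fmt"

// emptyIterator returns an already-closed channel: ranging over it
// yields nothing, but callers never need a nil check — the contract
// ReadEmptyFile offers (a channel stands in for IBioSequence here).
func emptyIterator() <-chan string {
	ch := make(chan string)
	close(ch)
	return ch
}

func main() {
	n := 0
	for range emptyIterator() { // safe to range; zero elements
		n++
	}
	fmt.Println(n)
}
```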
@@ -0,0 +1,34 @@
# FASTA Parser Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
## Core Functionalities
- **`FastaChunkParser(UtoT bool)`**
Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
- **`FastaChunkParserRope(...)`**
Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
- **`ReadFasta(reader io.Reader, ...)`**
High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
- **`EndOfLastFastaEntry(...)`**
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
## Key Features
- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
## Design Highlights
- Minimal allocations via rope-based parsing (`extractFastaSeq`).
- Graceful error reporting with context (source, identifier, invalid char position).
- Extensible via `WithOption` pattern for header parsing and batching behavior.
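The boundary helper can be sketched as a backward scan for a `>` at the start of a line; this is a simplified take on `EndOfLastFastaEntry(...)`, not its actual implementation:

```go
package main

import "fmt"

// endOfLastFastaEntry returns the byte offset of the '>' opening the
// last (possibly incomplete) record, so a chunk can be cut just
// before it; -1 if the buffer holds no header.
func endOfLastFastaEntry(buffer []byte) int {
	for i := len(buffer) - 1; i >= 0; i-- {
		if buffer[i] == '>' && (i == 0 || buffer[i-1] == '\n') {
			return i
		}
	}
	return -1
}

func main() {
	buf := []byte(">a\nacgt\n>b\ntt")
	fmt.Println(endOfLastFastaEntry(buf)) // offset of the second '>'
}
```

The `i == 0 || buffer[i-1] == '\n'` guard is what keeps a literal `>` inside a description line from being mistaken for a record start.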
@@ -0,0 +1,41 @@
# FASTQ Parsing Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTQ files — a standard format for storing nucleotide sequences along with quality scores.
## Core Functionalities
- **`EndOfLastFastqEntry(buffer []byte) int`**
Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry is found.
- **`FastqChunkParser(...)`**
Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
- Header parsing (`@id [definition]`)
- Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
- Quality score shifting (`quality_shift`)
- Strict validation (e.g., `+` line, matching sequence/length)
- **`FastqChunkParserRope(...)`**
Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.
- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
Enables concurrent, chunked parsing of large files:
- Splits input into chunks using `ReadFileChunk`
- Uses configurable parallel workers (`nworker`)
- Pushes parsed batches to an iterator interface
- **Convenience I/O Wrappers**
- `ReadFastqFromFile(filename, ...)`: Parses a file by name.
- `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.
## Key Options & Features
- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation
## Design Highlights
- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid chars, length mismatch)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
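The quality-shift step is simple enough to show directly; 33 is the standard Sanger/Illumina 1.8+ offset, though the configurable `quality_shift` option above may differ:

```go
package main

import "fmt"

// decodeQualities converts an ASCII quality line to Phred scores by
// subtracting the shift from each character's code point.
func decodeQualities(line string, shift int) []int {
	scores := make([]int, len(line))
	for i := 0; i < len(line); i++ {
		scores[i] = int(line[i]) - shift
	}
	return scores
}

func main() {
	fmt.Println(decodeQualities("II?!", 33)) // [40 40 30 0]
}
```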
@@ -0,0 +1,11 @@
## Semantic Description of `obiformats` Package
The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:
- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.
- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.
Two main constructor functions enable flexible formatting:
- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.
- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.
The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors—e.g., unsupported formats trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, making it ideal for pipeline integration in NGS data processing tools.
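The composition idea — lifting a per-sequence formatter to a batch formatter — can be sketched with simplified stand-in types (`seq`, `seqFormatter`) in place of `obiseq.BioSequence` and `BioSequenceFormater`:

```go
package main

import (
	"fmt"
	"strings"
)

type seq struct{ id, bases string }

// seqFormatter mirrors BioSequenceFormater: one sequence in, its
// string representation out.
type seqFormatter func(s seq) string

// buildBatchFormatter composes a per-sequence formatter over a batch,
// joining results with newline separators — the shape of
// BuildFastxFormater.
func buildBatchFormatter(f seqFormatter) func([]seq) []byte {
	return func(batch []seq) []byte {
		parts := make([]string, len(batch))
		for i, s := range batch {
			parts[i] = f(s)
		}
		return []byte(strings.Join(parts, "\n") + "\n")
	}
}

func main() {
	fasta := func(s seq) string { return ">" + s.id + "\n" + s.bases }
	format := buildBatchFormatter(fasta)
	fmt.Print(string(format([]seq{{"s1", "acgt"}, {"s2", "ttga"}})))
}
```

Swapping `fasta` for a FASTQ-style closure changes the output format without touching the batch logic — the separation the package is built around.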
@@ -0,0 +1,27 @@
# Semantic Description of `obiformats` Package
The `obiformats` package provides utilities for parsing sequence headers in the OBItools4 framework, supporting two distinct formats:
- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
- **Legacy OBI format** (plain text, e.g., `>seq1 description`): Used when no JSON prefix is present.
## Core Functions
- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**
Dynamically routes header parsing based on the first character of the sequence definition:
- Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
- Otherwise invokes `ParseFastSeqOBIHeader`.
- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**
Applies header parsing to a *batch* of sequences:
- Takes an iterator over `BioSequence`s.
- Uses optional configuration (e.g., parallelism, parsing behavior).
- Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.
## Design Principles
- **Format agnosticism**: Automatically detects header type.
- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
- **Extensibility**: Options pattern (`WithOption`) supports runtime customization.
This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
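The routing rule is a one-character dispatch; the sketch below returns a label instead of calling the real parsers:

```go
package main

import "fmt"

// parseGuessedHeader routes on the first character of the definition,
// as ParseGuessedFastSeqHeader does; the real function calls
// ParseFastSeqJsonHeader or ParseFastSeqOBIHeader instead of
// returning a label.
func parseGuessedHeader(definition string) string {
	if len(definition) > 0 && definition[0] == '{' {
		return "json"
	}
	return "obi"
}

func main() {
	fmt.Println(parseGuessedHeader(`{"id":"seq1"}`))                  // json
	fmt.Println(parseGuessedHeader("count=2; merged_sample={'a':1};")) // obi
}
```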
@@ -0,0 +1,28 @@
# `FormatHeader` Function Type in `obiformats`
The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.
- **Package**: `obiformats`
Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).
- **Type Definition**:
```go
type FormatHeader func(sequence *obiseq.BioSequence) string
```
- A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.
- **Semantic Role**:
Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.
Decouples header formatting from core data structures (`BioSequence`), enabling modular and reusable format adapters.
- **Usage Context**:
- Used by writers/formatters to produce standardized headers when exporting sequences.
- Allows custom header generation (e.g., for MIxS-compliant metadata, user-defined tags).
- Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.
- **Dependencies**:
- Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).
- **Design Intent**:
Promotes clean separation of concerns: data (sequence) ↔ formatting logic.
Facilitates extensibility for new output formats without modifying core types.
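A minimal illustration of the function-type polymorphism, with a simplified `bioSequence` struct standing in for `obiseq.BioSequence`:

```go
package main

import "fmt"

type bioSequence struct {
	id, definition string
}

// formatHeader mirrors the FormatHeader type: sequence in, header
// string out.
type formatHeader func(s *bioSequence) string

// Two interchangeable implementations, swapped per output format.
func plainHeader(s *bioSequence) string { return s.id + " " + s.definition }
func bareHeader(s *bioSequence) string  { return s.id }

// emit shows a writer using whichever header formatter it was given.
func emit(s *bioSequence, f formatHeader) string { return ">" + f(s) + "\n" }

func main() {
	s := &bioSequence{id: "seq1", definition: "test read"}
	fmt.Print(emit(s, plainHeader))
	fmt.Print(emit(s, bareHeader))
}
```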
@@ -0,0 +1,21 @@
This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.
- **JSON Parsing Helpers**:
It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.
- **Header Interpretation**:
`_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
- Core fields (`id`, `definition`, `count`)
- Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
- Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.
- **Sequence Annotation Enrichment**:
`ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing non-JSON text as the new definition.
- **Serialization Support**:
`WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON format, appending them to a buffer or returning as string — enabling round-trip compatibility for annotated sequences.
- **Error Handling**:
Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.
In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
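The header split can be approximated with `encoding/json` for a self-contained example (the real code streams with `jsonparser`, and this sketch assumes the trailing free text contains no `}`):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// splitJSONHeader separates the embedded JSON object from the trailing
// free text of a header and decodes it into an annotation map.
// Numbers come back as float64 here — a difference from the real
// parser, which preserves numeric precision where possible.
func splitJSONHeader(header string) (map[string]interface{}, string, error) {
	if !strings.HasPrefix(header, "{") {
		return nil, header, fmt.Errorf("no JSON prefix")
	}
	end := strings.LastIndex(header, "}") // assumes no '}' in the free text
	var annotations map[string]interface{}
	if err := json.Unmarshal([]byte(header[:end+1]), &annotations); err != nil {
		return nil, header, err
	}
	return annotations, strings.TrimSpace(header[end+1:]), nil
}

func main() {
	ann, def, err := splitJSONHeader(`{"count":3,"taxid":"9606"} Homo sapiens read`)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann["taxid"], def)
}
```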
@@ -0,0 +1,31 @@
# OBIFormats Package: Semantic Description
The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
- Three core parsing functions detect value types:
- `__match__key__`: Identifies assignment patterns (`Key = ...`).
- `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).
- `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).
- `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
- Boolean detection (`__is_true__/__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`,**
iteratively extracts key-value pairs from a header string and populates an `Annotation` map.
- Numeric values are stored as integers if they have no fractional part.
- Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:
- `*_count``map[string]int`,
- `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`).
- `*_status`/`*_mutation``map[string]string`.
- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence's definition line, moving annotations into its metadata map and preserving leftover text.
- **`WriteFastSeqOBIHeader(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:
- Strings and booleans use `key=value;`.
- Maps/dicts are JSON-encoded, then single-quoted for compatibility.
- Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
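A much-reduced sketch of the key-value extraction: it handles only bare `key=value;` pairs (no quoted strings or nested dictionaries) and keeps non-integer values as strings, whereas `ParseOBIFeatures` also types floats, booleans, and dicts:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// kvPattern matches simple "key=value;" assignments; the real parser
// uses several patterns to also recognize quoted strings and {...}
// dictionary blocks.
var kvPattern = regexp.MustCompile(`(\w+)=([^;]+);`)

// parseOBIFeatures extracts key=value; pairs into a map, storing
// fraction-free numbers as int, mirroring the numeric rule above.
func parseOBIFeatures(text string) map[string]interface{} {
	annotations := make(map[string]interface{})
	for _, m := range kvPattern.FindAllStringSubmatch(text, -1) {
		key, value := m[1], m[2]
		if n, err := strconv.Atoi(value); err == nil {
			annotations[key] = n // integer when no fractional part
		} else {
			annotations[key] = value
		}
	}
	return annotations
}

func main() {
	ann := parseOBIFeatures("count=4; taxonomic_rank=species; score=0.92;")
	fmt.Println(ann["count"], ann["taxonomic_rank"], ann["score"])
}
```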
@@ -0,0 +1,26 @@
# FastSeq Reader Module — Semantic Description
This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.
## Core Features
- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
- **Stdin & File I/O**: Two entry points:
- `ReadFastSeqFromFile(filename, ...)` for regular files.
- `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.
## Integration
Built on top of `obitools4`'s core abstractions:
- `obiiter.IBioSequence`: Iterator interface for biological sequences.
- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
- `obiutils`, `obidefault`: Utilities for path handling and defaults.
Designed for scalability in high-throughput metabarcoding pipelines.
@@ -0,0 +1,35 @@
# `obiformats` Package Overview
The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.
## Core Formatting Functions
- **`FormatFasta(seq, formater)`**
Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.
- **`FormatFastaBatch(batch, formater, skipEmpty)`**
Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.
## File Writing Functions
- **`WriteFasta(iterator, file, options...)`**
Writes a stream of sequences to any `io.WriteCloser`. Supports:
- Parallel workers (`ParallelWorkers`)
- Chunked writing via `WriteFileChunk`
- Optional compression (e.g., gzip)
Returns a new iterator mirroring the input for pipeline chaining.
- **`WriteFastaToStdout(iterator, options...)`**
Convenience wrapper to output FASTA directly to `stdout`, with file-closing behavior configurable.
- **`WriteFastaToFile(iterator, filename, options...)`**
Writes to a named file with:
- Truncation or append mode (`AppendFile`)
- Automatic paired-end output if `HaveToSavePaired()` is enabled
(writes reverse reads to a secondary file specified via `PairedFileName`)
## Key Design Highlights
- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
@@ -0,0 +1,35 @@
# FASTQ Output Module (`obiformats`)
This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.
## Core Functionality
- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into FASTQ string.
- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.
## Header Customization
- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.
## Writing to Streams/Files
- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
- Append/truncate modes
- Paired-end output (splits iterator and writes to two files)
- Automatic compression via `obiutils.CompressStream`
## Parallelization & Robustness
- Uses goroutines to parallelize formatting/writing across multiple workers.
- Handles empty sequences gracefully: logs warning or fatal error based on `skipEmpty` option.
- Ensures ordered output via batch tracking (`Order()`) and chunked writing.
## Integration
Designed to work seamlessly with the `obitools4` ecosystem:
- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
- Extensible through functional options (`WithOption`) for configuration.
> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
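A minimal sketch of one four-line FASTQ record with Phred encoding (the header-formatter hook and batch buffering described above are omitted):

```go
package main

import "fmt"

// formatFastq renders one four-line FASTQ record, encoding Phred
// scores with the given shift (33 is the conventional offset).
func formatFastq(id, sequence string, qualities []int, shift int) string {
	encoded := make([]byte, len(qualities))
	for i, q := range qualities {
		encoded[i] = byte(q + shift)
	}
	return "@" + id + "\n" + sequence + "\n+\n" + string(encoded) + "\n"
}

func main() {
	fmt.Print(formatFastq("read1", "acgt", []int{40, 40, 38, 30}, 33))
}
```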
@@ -0,0 +1,19 @@
# `obiformats` Package Overview
The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:
- **Format Abstraction**: Defines common interfaces and base classes for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.
The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
@@ -0,0 +1,25 @@
# Semantic Description of `obiformats` Package Functionalities
The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.
- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries.
- **`ReadFileChunk()`**: Core function that:
- Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
- Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
- Extends chunks incrementally (e.g., +1MB) until a full record boundary is found via `splitter`;
- Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
- Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
- **Key semantics**:
- *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.
- *Lazy evaluation*: only reads ahead when needed to find record boundaries.
- *Streaming-first design* — supports large files without full loading into memory.
This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
@@ -0,0 +1,26 @@
# `WriteFileChunk` Function — Semantic Description
The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.
- **Input**:
- `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.
- `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.
- **Core Behavior**:
- Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).
- Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).
- If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.
- Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.
- **Buffer Management**:
- After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., if order 2 arrives, it triggers writing of buffered orders 3,4,... as available).
- **Error Handling**:
- Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.
- **Cleanup & Lifecycle**:
- Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.
- Returns the input channel, enabling external producers to stream `FileChunk` structs.
- **Use Case**:
Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
@@ -0,0 +1,34 @@
# GenBank Parser Module (`obiformats`)
This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.
## Core Functionalities
- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
- **Parallel streaming I/O**:
- `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
- Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.
## Key Design Decisions
- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.
## Output
Returns a batched iterator of `BioSequence` objects, each containing:
- Identifier (`id`)
- Compact nucleotide sequence
- Definition line (as description)
- Source file origin
- Optional feature table bytes
- Annotations: `scientific_name`, `taxid`
Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
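Stripped to its essentials, the state-machine approach can be sketched as a self-contained toy parser. The states, the LOCUS/ORIGIN handling, and the u-to-t option mirror the description above; the real parser handles many more states and sections:

```go
package main

import (
	"fmt"
	"strings"
)

type parseState int

const (
	inHeader parseState = iota
	inEntry
	inSequence
)

// parseMiniGenbank extracts the LOCUS identifier and the compacted
// ORIGIN sequence from one GenBank-like record, optionally converting
// 'u' to 't' as the RNA-to-DNA normalization option does.
func parseMiniGenbank(record string, uToT bool) (id, seq string) {
	state := inHeader
	var bases strings.Builder
	for _, line := range strings.Split(record, "\n") {
		switch state {
		case inHeader:
			if strings.HasPrefix(line, "LOCUS") {
				if fields := strings.Fields(line); len(fields) > 1 {
					id = fields[1]
				}
				state = inEntry
			}
		case inEntry:
			if strings.HasPrefix(line, "ORIGIN") {
				state = inSequence
			}
		case inSequence:
			if strings.HasPrefix(line, "//") {
				return id, bases.String()
			}
			// Compact the line: keep base letters, drop coordinates and spaces.
			for _, c := range strings.ToLower(line) {
				if c >= 'a' && c <= 'z' {
					if uToT && c == 'u' {
						c = 't'
					}
					bases.WriteRune(c)
				}
			}
		}
	}
	return id, bases.String()
}

func main() {
	rec := "LOCUS       AB000001  12 bp\nORIGIN\n        1 acgtu acgtu ac\n//\n"
	id, seq := parseMiniGenbank(rec, true)
	fmt.Println(id, seq) // AB000001 acgttacgttac
}
```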
# JSON Output Module for Biological Sequences (`obiformats`)
This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.
- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
- `"id"`: Sequence identifier.
- `"sequence"` (optional): Nucleotide/protein sequence string if present.
- `"qualities"` (optional): Quality scores as a string if available.
- `"annotations"` (optional): Metadata annotations map.
- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
- Parallel workers (configurable via options).
- Automatic compression (`gzip`/`bgzip`) if enabled.
- Proper JSON array wrapping: `[`, chunked batches, and final `]`.
- Atomic ordering to preserve sequence integrity across parallel writes.
- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
- Outputs to stdout or a file (with append/truncate control).
- Supports paired-end data: writes both forward and reverse reads to separate files when configured.
- **Internal helpers**:
- `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
- Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.
Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
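The double-escaping problem the unescape helper addresses can be illustrated with a minimal stand-in built on `strconv.Unquote`; this is a sketch of the idea, not the helper's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unescapeUnicode converts literal `\uXXXX` escape sequences embedded in a
// string back into their UTF-8 characters, the kind of repair a helper such
// as _UnescapeUnicodeCharactersInJSON performs on double-escaped output.
func unescapeUnicode(s string) (string, error) {
	// Wrap the input in quotes so strconv.Unquote interprets the \uXXXX
	// sequences, escaping any embedded quotes first.
	quoted := `"` + strings.ReplaceAll(s, `"`, `\"`) + `"`
	return strconv.Unquote(quoted)
}

func main() {
	out, err := unescapeUnicode(`caf\u00E9`)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // café
}
```

Note that `strconv.Unquote` also interprets other backslash escapes, so this shortcut assumes the input contains only `\uXXXX` sequences and ordinary text.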
# NCBI Taxonomy Loader Module (`obiformats`)
This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:
- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.
Key features:
- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
- Efficient buffered reading (`bufio.Reader`) for large files.
- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).
The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
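The `|`-delimited record handling can be sketched as follows; this is a simplified stand-in for the custom CSV parser, with whitespace trimming and the trailing delimiter as the key details:

```go
package main

import (
	"fmt"
	"strings"
)

// splitDmpLine splits an NCBI taxonomy dump record on the '|' delimiter,
// trimming the tabs and spaces that surround each field, and drops the
// empty field produced by the trailing delimiter.
func splitDmpLine(line string) []string {
	fields := strings.Split(line, "|")
	out := make([]string, 0, len(fields))
	for _, f := range fields {
		out = append(out, strings.TrimSpace(f))
	}
	if len(out) > 0 && out[len(out)-1] == "" {
		out = out[:len(out)-1]
	}
	return out
}

func main() {
	// A nodes.dmp record: taxid | parent_taxid | rank |
	fields := splitDmpLine("2\t|\t131567\t|\tsuperkingdom\t|")
	fmt.Println(fields) // [2 131567 superkingdom]
}
```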
## NCBI Taxonomy Archive Support in `obiformats`
This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.
### Core Functionalities
1. **Archive Validation (`IsNCBITarTaxDump`)**
- Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
- Returns a boolean indicating if the archive is a complete NCBI tax dump.
2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
- Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
- Steps include:
- **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
- **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
- **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
- Sets the root taxon to NCBI's default (`taxid = 1`, i.e., *root*).
3. **Integration with Other Modules**
- Uses `obiutils.Ropen`, `TarFileReader` for robust file handling.
- Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.
### Key Parameters
- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
- `seqAsTaxa`: Reserved for future use; currently unused.
### Logging & Error Handling
- Uses `logrus` to log loading progress and counts.
- Returns descriptive errors if required files or the root taxon are missing.
> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
# Newick Format Export Functionality in `obiformats`
This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.
## Core Components
- `Tree`: A struct modeling a node in a Newick tree, containing:
- `Children`: list of child nodes (nested trees),
- `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
- `Length`: optional branch length (evolutionary distance).
- **`Newick()` methods**:
- `Tree.Newick(...)`: Recursively generates a Newick string for the subtree.
Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
- Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.
- **Writing Functions**:
- `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
- Accepts an iterator over taxa (`*obitax.ITaxon`).
- Validates single-taxonomy input.
- Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
- `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
- `WriteNewickToStdout(...)`: Outputs Newick tree to standard output.
## Configuration Options
Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).
## Semantic Summary
The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
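The recursive `Newick()` generation can be sketched with an illustrative `Tree` carrying only a label and branch length (the real struct references a full `obitax.TaxNode` and supports richer annotations):

```go
package main

import (
	"fmt"
	"strings"
)

// Tree is a minimal stand-in: children, a node label, and an optional
// branch length.
type Tree struct {
	Children []*Tree
	Label    string
	Length   float64
}

// Newick renders the subtree recursively: children in parentheses,
// then the node label, then ":length" when a branch length is set.
func (t *Tree) Newick() string {
	var sb strings.Builder
	if len(t.Children) > 0 {
		parts := make([]string, len(t.Children))
		for i, c := range t.Children {
			parts[i] = c.Newick()
		}
		sb.WriteString("(" + strings.Join(parts, ",") + ")")
	}
	sb.WriteString(t.Label)
	if t.Length > 0 {
		sb.WriteString(fmt.Sprintf(":%g", t.Length))
	}
	return sb.String()
}

func main() {
	root := &Tree{Label: "root", Children: []*Tree{
		{Label: "A", Length: 0.1},
		{Label: "B", Length: 0.2},
	}}
	fmt.Println(root.Newick() + ";") // (A:0.1,B:0.2)root;
}
```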
# NGSFilter Configuration Parser — Semantic Overview
This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats: a legacy line-based text format (`ReadOldNGSFilter`) and a modern CSV-based format with parameter headers.
## Core Functionality
- **Format Detection**:
`OBIMimeNGSFilterTypeGuesser` detects MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text.
A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).
- **Dual Input Parsing**:
- `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
- Primer pairs (`forward`, `reverse`)
- Tag pairs (with optional `-` for untagged direction)
- Experiment/sample metadata
- OBIFeatures annotations (via `ParseOBIFeatures`)
- `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:
`"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`
Additional columns are stored as annotations.
- **Parameter Configuration**:
  A rich set of `@param` lines (in CSV or legacy format) configures global or primer-specific settings:
- `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
- `tag_delimiter` / directional variants: Symbol separating tags in sequences
- `matching`: Tag matching algorithm (e.g., exact, fuzzy)
- Error tolerance:
`primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches)
`tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
- Indel handling:
`indels` / directional variants (`true/false`) to enable/disable indels in primer matching
- **Validation & Integrity Checks**:
- `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
- Duplicate tag-pair detection per marker (error on reuse).
- Strict column/field validation with informative error messages.
- **Logging & Observability**:
Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).
## Design Highlights
- **Extensibility**: New parameters can be added via `library_parameter` map.
- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.
> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
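The parameter handling can be sketched as a toy collector of `@param value` lines; the line syntax and parameter names here are illustrative assumptions, not the exact file grammar:

```go
package main

import (
	"fmt"
	"strings"
)

// applyParams gathers "@name value" configuration lines into a map,
// skipping unknown parameters with a notice, in the spirit of the
// library_parameter dispatch described above.
func applyParams(lines []string, known map[string]bool) map[string]string {
	params := map[string]string{}
	for _, line := range lines {
		if !strings.HasPrefix(line, "@") {
			continue // a data line (sample/tag/primer), not a parameter
		}
		fields := strings.SplitN(strings.TrimPrefix(line, "@"), " ", 2)
		if len(fields) != 2 || !known[fields[0]] {
			fmt.Println("skipping unknown parameter:", fields[0])
			continue
		}
		params[fields[0]] = strings.TrimSpace(fields[1])
	}
	return params
}

func main() {
	known := map[string]bool{"spacer": true, "matching": true}
	p := applyParams([]string{"@spacer 4", "@matching strict", "@bogus 1"}, known)
	fmt.Println(p["spacer"], p["matching"]) // 4 strict
}
```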
# Semantic Description of `obiformats` Package Functionalities
The `go` package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).
Key capabilities include:
- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).
- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).
- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).
- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).
- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion for ambiguous bases.
- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.
- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).
All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
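The functional-setter pattern described above can be sketched as follows; field names and defaults are illustrative, not the package's actual ones:

```go
package main

import "fmt"

// Options mirrors the configuration pattern: a settings struct mutated
// only through functional setters.
type Options struct {
	batchSize       int
	parallelWorkers int
	compressed      bool
}

// WithOption is the signature shared by all functional setters.
type WithOption func(*Options)

func OptionsBatchSize(n int) WithOption {
	return func(o *Options) { o.batchSize = n }
}

func OptionsCompressed(c bool) WithOption {
	return func(o *Options) { o.compressed = c }
}

// MakeOptions applies sensible defaults first, then each setter in order.
func MakeOptions(setters []WithOption) Options {
	o := Options{batchSize: 5000, parallelWorkers: 4}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	o := MakeOptions([]WithOption{OptionsBatchSize(100), OptionsCompressed(true)})
	fmt.Println(o.batchSize, o.parallelWorkers, o.compressed) // 100 4 true
}
```

The pattern keeps the `Options` struct private while letting callers compose arbitrary subsets of settings declaratively.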
# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure
The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *Rope* (a tree-based immutable string representation, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.
## Core Functionality
- **`newRopeScanner(rope *PieceOfChunk)`**
Constructs a new scanner starting at the root of the rope.
- **`ReadLine() []byte`**
Returns the next line (without trailing `\n`, or `\r\n`) as a byte slice.
- Returns `nil` when the end of the rope is reached.
- Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
- The returned slice aliases rope data and is only valid until the next call.
- **`skipToNewline()`**
Advances internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.
## Implementation Highlights
- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope's underlying data.
## Use Case
Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
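The carry-over strategy can be illustrated with a self-contained scanner over a slice of byte chunks standing in for rope nodes (the zero-copy fast path is omitted here; every line goes through the carry buffer):

```go
package main

import (
	"bytes"
	"fmt"
)

// chunkScanner assembles lines that span chunk boundaries in a carry
// buffer, as the ropeScanner's carry mechanism does.
type chunkScanner struct {
	chunks [][]byte // stand-in for the rope's nodes
	i, pos int
	carry  []byte
}

// ReadLine returns the next line without its trailing "\n" or "\r\n",
// or nil at end of input. The returned slice is only valid until the
// next call, mirroring the aliasing caveat above.
func (s *chunkScanner) ReadLine() []byte {
	s.carry = s.carry[:0]
	for s.i < len(s.chunks) {
		chunk := s.chunks[s.i][s.pos:]
		if j := bytes.IndexByte(chunk, '\n'); j >= 0 {
			s.carry = append(s.carry, chunk[:j]...)
			s.pos += j + 1
			return bytes.TrimSuffix(s.carry, []byte("\r"))
		}
		// No newline in this chunk: carry it over and advance.
		s.carry = append(s.carry, chunk...)
		s.i++
		s.pos = 0
	}
	if len(s.carry) > 0 {
		return s.carry // final line without trailing newline
	}
	return nil
}

func main() {
	s := &chunkScanner{chunks: [][]byte{[]byte("hel"), []byte("lo\r\nwor"), []byte("ld\n")}}
	for line := s.ReadLine(); line != nil; line = s.ReadLine() {
		fmt.Println(string(line))
	}
}
```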
# Taxonomy Loading Module (`obiformats`)
This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.
## Core Features
1. **Format Detection**
- `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME-type), filename patterns, or structure.
- Supports:
• NCBI Taxdump (both directory and `.tar` archive)
• CSV files (`text/csv`)
• FASTA/FASTQ sequences (via `mimetype` detection)
2. **Modular Loaders**
- Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).
- Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).
3. **Sequence-Based Taxonomy Extraction**
- For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.
4. **Integration with OBITools Ecosystem**
- Leverages `obitax.Taxonomy` as the canonical output structure.
- Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.
5. **Error Handling & Logging**
- Graceful failure with descriptive errors; informative logging via `logrus`.
## Usage Flow
```go
tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn=true, seqAsTaxa=false
```
The module enables interoperability across taxonomic data sources in metabarcoding workflows.
# OBIFORMATS Package: Semantic Description
The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.
It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:
- **FASTA** (`text/fasta`): identified by lines starting with `>`.
- **FASTQ** (`text/fastq`): detected via leading `@` characters.
- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
- **EMBL** (`text/embl`): detected by lines starting with `ID `.
- **GenBank** (`text/genbank`): identified by either `LOCUS ` or legacy `"Genetic Sequence Data Bank"` headers.
- **CSV** (`text/csv`): generic tabular support.
Core functionality is exposed through:
- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
- `ReadSequencesFromFile()`: reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as filename and auto-closing the stream.
Internally leverages:
- `obiutils.Ropen()` for unified file opening (including stdin handling).
- Path extension stripping and source tagging via `OptionsSource()`.
- Logging (`logrus`) for format diagnostics.
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.
The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.
Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
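The first-bytes dispatch can be sketched as a toy sniffer; the real implementation delegates to the `mimetype` library and inspects up to ~1 MiB of the stream:

```go
package main

import (
	"fmt"
	"strings"
)

// guessSeqFormat inspects the first non-empty line of a stream to pick
// a MIME type, following the per-format signatures listed above.
func guessSeqFormat(head string) string {
	line := strings.TrimLeft(head, "\r\n")
	switch {
	case strings.HasPrefix(line, ">"):
		return "text/fasta"
	case strings.HasPrefix(line, "@"):
		return "text/fastq"
	case strings.HasPrefix(line, "#@ecopcr-v2"):
		return "text/ecopcr2"
	case strings.HasPrefix(line, "ID "):
		return "text/embl"
	case strings.HasPrefix(line, "LOCUS "):
		return "text/genbank"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(guessSeqFormat(">seq1\nACGT\n"))       // text/fasta
	fmt.Println(guessSeqFormat("LOCUS       AB000001")) // text/genbank
}
```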
# `obiformats` Package: Sequence Writing Utilities
This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.
## Core Functionality
- **`WriteSequence()`**:
Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities present) or FASTA.
- Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.
- Preserves iterator state via `PushBack()` to allow chaining.
- **`WriteSequencesToStdout()`**:
Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.
- **`WriteSequencesToFile()`**:
Writes sequences to a specified file. Supports:
- File creation/truncation or append mode (`OptionAppendFile()`).
- Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.
## Design Highlights
- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on presence of quality scores (`HasQualities()`).
- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
- **Configurable Options**: Extensible via `WithOption` pattern (e.g., append mode, paired-end handling).
## Integration
Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
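The quality-based dispatch can be illustrated with a toy record type; the real dispatcher operates on whole `IBioSequence` iterators rather than single records:

```go
package main

import "fmt"

// record is a toy sequence with optional quality scores.
type record struct {
	id, seq string
	quals   string // empty means no qualities
}

func (r record) HasQualities() bool { return r.quals != "" }

// formatRecord picks FASTQ when qualities are present, FASTA otherwise,
// the same rule WriteSequence applies to an iterator.
func formatRecord(r record) string {
	if r.HasQualities() {
		return fmt.Sprintf("@%s\n%s\n+\n%s\n", r.id, r.seq, r.quals)
	}
	return fmt.Sprintf(">%s\n%s\n", r.id, r.seq)
}

func main() {
	fmt.Print(formatRecord(record{id: "s1", seq: "ACGT"}))
	fmt.Print(formatRecord(record{id: "s2", seq: "ACGT", quals: "IIII"}))
}
```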
## Uint128 Type in `obifp`: Semantic Overview
This Go package defines a custom 128-bit unsigned integer type (`Uint128`) composed of two `uint64` limbs (high and low). It provides comprehensive arithmetic, comparison, bitwise operations, and type conversions.
- **Basic Constructors**: `Zero()`, `MaxValue()` initialize the smallest/largest possible values.
- **State Checks**: `IsZero()` and equality/comparison methods (`Equals`, `Cmp`, `<`, `>`, etc.) enable conditional logic.
- **Type Casting**: Safe conversions to/from smaller (`Uint64`, `uint64`) and larger (`Uint256`) integer types, with overflow warnings where applicable.
- **Arithmetic**: Full support for addition (`Add`, `Add64`), subtraction (`Sub`), multiplication (`Mul`, `Mul64`) — with panic on overflow.
- **Division & Modulo**: Integer division (`Div`, `Div64`) and remainder (`Mod`, `Mod64`), implemented via optimized quotient-remainder pairs (`QuoRem`, `QuoRem64`) using hardware-assisted 64-bit operations.
- **Bit Manipulation**: Left/right shifts (`LeftShift`, `RightShift`), and bitwise logic: AND, OR, XOR, NOT.
- **Utility**: Direct access to low limb via `AsUint64()`.
All operations preserve 128-bit precision, with strict overflow checking for correctness in high-precision contexts (e.g., bioinformatics counting).
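The two-limb carry propagation behind `Add` can be sketched with `math/bits`; limb names follow the overview above, and the exact method signature is an assumption:

```go
package main

import (
	"fmt"
	"math/bits"
)

// Uint128 mirrors the two-limb layout: w1 holds the high 64 bits,
// w0 the low 64 bits.
type Uint128 struct {
	w1, w0 uint64
}

// Add returns u+v, panicking on 128-bit overflow, matching the strict
// overflow checking described above.
func (u Uint128) Add(v Uint128) Uint128 {
	lo, carry := bits.Add64(u.w0, v.w0, 0)
	hi, carry := bits.Add64(u.w1, v.w1, carry)
	if carry != 0 {
		panic("Uint128 overflow")
	}
	return Uint128{w1: hi, w0: lo}
}

func main() {
	// Adding 1 to 2^64-1 must carry into the high limb.
	u := Uint128{w0: ^uint64(0)}
	sum := u.Add(Uint128{w0: 1})
	fmt.Println(sum.w1, sum.w0) // 1 0
}
```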
# `obifp.Uint128` Package — Semantic Feature Overview
This Go package provides a 128-bit unsigned integer type (`Uint128`) with comprehensive arithmetic, comparison, and bitwise operations. Internally represented as two `uint64` limbs (`w1`: high, `w0`: low), it supports:
- **Arithmetic Operations**
- `Add`, `Sub`, `Mul` (128×128), and `Mul64` (scalar multiplication)
- Division: `Div`, `Mod`, and combined quotient/remainder via `QuoRem` (and their 64-bit variants)
- **Comparison & Equality**
- `Cmp`, `Equals`, `LessThan`/`GreaterThan`, and their inclusive variants (`≤`, `≥`)
- Support for comparing against both `Uint128` and native `uint64` values
- **Bitwise Operations**
- Logical AND (`And`), OR (`Or`), XOR (`Xor`) between two `Uint128`s
- Bitwise NOT (`Not`) — inverts all bits of the value
- **Conversion & Utility**
- `AsUint64()` safely truncates to lower 64 bits (assumes upper limb is zero)
All operations handle overflow/underflow correctly, including carry propagation in addition and borrow handling in subtraction. Tests cover edge cases: zero values, max `uint64` boundaries (e.g., wrapping in addition/subtraction), and large multiplications. Designed for cryptographic or high-precision numeric use where native integer types are insufficient.
# Uint256 Type and Operations — Semantic Overview
The `obifp` package provides a custom 256-bit unsigned integer type (`Uint256`) implemented in Go, composed of four 64-bit limbs (`w0` to `w3`). It supports arithmetic, comparison, bitwise operations, and safe casting with overflow detection.
- **Core Representation**: `Uint256` stores values as four 64-bit words, enabling arbitrary-precision unsigned integers up to $2^{256} - 1$.
- **Utility Methods**:
- `Zero()` / `MaxValue()`: Return the neutral and maximum values.
- `IsZero()`, `Equals(v)`, comparison methods (`LessThan`, etc.): Enable logical and ordering checks.
- **Casting & Conversion**:
- `Uint64()`, `Uint128()` downcast with warnings on overflow.
- `Set64(v)`: Initializes from a standard `uint64`.
- `AsUint64()`: Direct access to least-significant limb.
- **Bitwise Operations**:
- `And`, `Or`, `Xor`, `Not`: Standard bitwise logic per limb.
- **Shifts**:
- `LeftShift(n)` / `RightShift(n)`: Multi-limb shifts with carry propagation.
- **Arithmetic**:
  - `Add(v)`, `Sub(v)` / `Mul(v)`: Use Go's `math/bits` for carry-aware operations; panic on overflow.
- `Div(v)`: Implements long division via repeated subtraction of shifted multiples; panics on zero divisor.
- **Safety & Logging**:
- Warnings via `obilog.Warnf` for silent overflows during narrowing casts.
- Panics on arithmetic overflow or division-by-zero using `log.Panicf`.
This type is suitable for cryptographic, genomic (OBITools), or high-precision counting use cases requiring precise control over large unsigned integers.
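Multi-limb shifting with carry propagation can be sketched for the sub-64-bit case; the real `LeftShift` also handles `n == 0` and whole-limb moves:

```go
package main

import "fmt"

// Uint256 mirrors the four-limb layout (w0 is least significant).
type Uint256 struct {
	w3, w2, w1, w0 uint64
}

// LeftShift shifts by n bits for 1 <= n <= 63. Each limb receives the
// bits carried out of the limb below it.
func (u Uint256) LeftShift(n uint) Uint256 {
	return Uint256{
		w3: u.w3<<n | u.w2>>(64-n),
		w2: u.w2<<n | u.w1>>(64-n),
		w1: u.w1<<n | u.w0>>(64-n),
		w0: u.w0 << n,
	}
}

func main() {
	v := Uint256{w0: 1}.LeftShift(4)
	fmt.Println(v.w0) // 16
	// Shifting the top bit of w0 carries into w1.
	w := Uint256{w0: 1 << 63}.LeftShift(1)
	fmt.Println(w.w1, w.w0) // 1 0
}
```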
# Uint64 Type Functionalities Overview
The `obifp` package provides a custom `Uint64` type wrapping Go's native 64-bit unsigned integer (`uint64`) to support arithmetic, bitwise operations, and type conversions in a structured way.
## Core Operations
- **`Zero()` / `MaxValue()`**: Returns the zero and maximum representable values, respectively.
- **`IsZero()` / `Equals(v)`**: Checks if the value is zero or equal to another.
- **`Cmp(v)`, `LessThan(v)`**, etc.: Standard comparison operations returning `-1/0/+1` or boolean results.
## Arithmetic with Overflow Detection
- **Add/Sub/Mul**: Performs 64-bit addition, subtraction, and multiplication.
- Uses `math/bits` for low-level operations (`bits.Add64`, etc.).
- Panics on overflow (carry ≠ 0), enforcing strict safety.
## Bitwise Operations
- **`And`, `Or`, `Xor`, `Not()`**: Standard bitwise logic operations.
- **`LeftShift(n)` / `RightShift(n)`**:
- Shifts bits left/right by *n* positions.
- Uses internal `LeftShift64`/`RightShift64`, supporting *carry-in* for multi-word arithmetic.
## Extended Precision Conversions
- **`Uint128()` / `Uint256()`**: Casts the 64-bit value into larger unsigned integer types (zero-extended).
- **`Set64(v)`**: Reassigns the internal value from a raw `uint64`.
## Utility & Logging
- **`AsUint64()`**: Extracts the underlying `uint64`.
- **Warning on overflow in shift operations** (e.g., shifts ≥ 128 bits) via `obilog.Warnf`.
> Designed for use in high-precision or cryptographic contexts where explicit overflow handling and type safety are critical.
# Obifp Package: Generic Fixed-Point Unsigned Integer Operations
This Go package (`obifp`) provides a generic, type-safe interface for fixed-point unsigned integer arithmetic over three size variants: `Uint64`, `Uint128`, and `Uint256`.
## Core Interface: `FPUint[T]`
The interface defines a unified API for unsigned integer types, supporting:
- **Initialization & Conversion**:
- `Zero()`, `Set64(v)`: Create zero or set from a `uint64`.
- `AsUint64()`: Downcast to standard `uint64`.
- **Logical Operations**:
- Bitwise: `And`, `Or`, `Xor`, `Not`.
- Shifts: `LeftShift(n)`, `RightShift(n)`.
- **Arithmetic**:
- Addition (`Add`), subtraction (`Sub`), multiplication (`Mul`). Division is commented out—likely reserved for future implementation.
- **Comparison**:
- Full ordering: `<`, `<=`, `>`, `>=`.
- **Utility Predicates**:
- `IsZero()` for zero-checking.
## Helper Functions
- `ZeroUint[T]`: Returns the neutral element (zero) for type `T`.
- `OneUint[T]`: Constructs value 1 via `Set64(1)`.
- `From64[T]`: Converts a standard Go `uint64` into the generic type.
All operations are **method-chaining friendly** (return `T`, not pointers), enabling fluent syntax. The design promotes correctness and performance in cryptographic or financial contexts where large, fixed-size integers are required.
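The generic interface pattern can be sketched with Go type parameters: a cut-down `FPUint` with a single illustrative implementation, showing how helpers like `From64` work for every variant:

```go
package main

import "fmt"

// FPUint is a cut-down version of the generic interface: only the
// methods needed for this sketch.
type FPUint[T any] interface {
	Set64(v uint64) T
	Add(v T) T
	AsUint64() uint64
}

// Uint64 is the smallest variant: a thin wrapper around a native uint64.
type Uint64 struct{ w0 uint64 }

func (u Uint64) Set64(v uint64) Uint64 { return Uint64{w0: v} }
func (u Uint64) Add(v Uint64) Uint64   { return Uint64{w0: u.w0 + v.w0} }
func (u Uint64) AsUint64() uint64      { return u.w0 }

// From64 converts a native uint64 into any FPUint implementation,
// matching the helper described above.
func From64[T FPUint[T]](v uint64) T {
	var zero T
	return zero.Set64(v)
}

// Sum works for every FPUint variant thanks to the unified API, and the
// value-returning methods allow fluent chaining.
func Sum[T FPUint[T]](values ...uint64) T {
	var acc T
	for _, v := range values {
		acc = acc.Add(From64[T](v))
	}
	return acc
}

func main() {
	fmt.Println(Sum[Uint64](1, 2, 3).AsUint64()) // 6
}
```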
# `obigraph` Package: Semantic Overview
The `obigraph` package provides a generic, type-safe undirected/directed graph implementation in Go. Its core features include:
- **Generic Graph Structure**: Parametrized over vertex type `V` and edge data type `T`, enabling flexible use with arbitrary user-defined types.
- **Bidirectional Edge Tracking**: Maintains both forward (`Edges`) and reverse (`ReverseEdges`) adjacency maps for efficient neighbor/parent queries.
- **Edge Management**:
- `AddEdge`: Adds an *undirected* edge (inserted in both directions).
- `AddDirectedEdge`: Adds a *directed* edge (only one direction).
- `SetAsDirectedEdge`: Converts an existing undirected edge into a directed one by removing the reverse link.
- **Graph Queries**:
- `Neighbors(v)`: Returns all adjacent vertices (outgoing in directed case).
- `Parents(v)`: Returns incoming neighbors via reverse adjacency.
- `Degree(v)` / `ParentDegree(v)`: Compute vertex degrees (total or incoming).
- **Customizable Vertex/Edge Properties**:
- `VertexWeight`, `EdgeWeight`: Funcs to assign weights (default: constant weight = 1.0).
- `VertexId`: Custom vertex label generator (default: `"V%d"`).
- **GML Export**:
- `Gml(...)` / `WriteGml(...)`: Generates or writes a Graph Modelling Language (GML) representation.
- Supports directed/undirected modes, degree-based filtering (`min_degree`), and visual styling:
- Vertex shape: `circle` if weight ≥ threshold, else `rectangle`.
- Size scaled by square root of vertex weight.
  - Uses Go's `text/template` for rendering.
- **File I/O**: Directly writes GML to file via `WriteGmlFile(...)`.
- **Logging & Safety**: Uses Logrus for bounds-checking errors; panics on template parsing/writing failures.
The package is designed for lightweight, high-performance graph modeling and visualization-ready export.
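The bidirectional adjacency layout can be sketched as follows, with vertex payloads and weights omitted for brevity:

```go
package main

import "fmt"

// Graph keeps forward and reverse edge maps keyed by vertex index, so
// both Neighbors and Parents are direct lookups.
type Graph[T any] struct {
	Edges        map[int]map[int]T
	ReverseEdges map[int]map[int]T
}

func NewGraph[T any]() *Graph[T] {
	return &Graph[T]{
		Edges:        map[int]map[int]T{},
		ReverseEdges: map[int]map[int]T{},
	}
}

func (g *Graph[T]) addOneWay(m map[int]map[int]T, from, to int, data T) {
	if m[from] == nil {
		m[from] = map[int]T{}
	}
	m[from][to] = data
}

// AddDirectedEdge records the edge forward and mirrors it in the
// reverse map for parent queries.
func (g *Graph[T]) AddDirectedEdge(from, to int, data T) {
	g.addOneWay(g.Edges, from, to, data)
	g.addOneWay(g.ReverseEdges, to, from, data)
}

// AddEdge inserts an undirected edge: both directions are recorded.
func (g *Graph[T]) AddEdge(a, b int, data T) {
	g.AddDirectedEdge(a, b, data)
	g.AddDirectedEdge(b, a, data)
}

func (g *Graph[T]) Degree(v int) int       { return len(g.Edges[v]) }
func (g *Graph[T]) ParentDegree(v int) int { return len(g.ReverseEdges[v]) }

func main() {
	g := NewGraph[float64]()
	g.AddDirectedEdge(1, 2, 0.5)
	g.AddDirectedEdge(3, 2, 0.7)
	fmt.Println(g.Degree(1), g.ParentDegree(2)) // 1 2
}
```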
# `obigraph.GraphBuffer` Feature Overview
The `GraphBuffer[V, T]` type provides a **thread-safe graph construction interface** using buffered edge insertion via Go channels.
- **Asynchronous Edge Addition**: Edges are enqueued through a `chan Edge[T]`, processed in the background by a goroutine that updates an underlying static graph (`Graph[V, T]`).
- **Non-blocking API**: `AddEdge` and `AddDirectedEdge` are asynchronous: they send to the channel without waiting for the graph mutation, enabling high-throughput edge ingestion.
- **Graph Initialization**: `NewGraphBuffer` initializes both the graph and a dedicated worker goroutine to consume edges.
- **GML Export Support**: Full support for exporting the final graph in [Graph Modelling Language (GML)](https://en.wikipedia.org/wiki/Graph_Modelling_Language), with optional filtering (`min_degree`) and layout parameters (`threshold`, `scale`).
- **File & Stream Output**: Methods `WriteGml` and `WriteGmlFile` allow writing GML to any `io.Writer`, including files.
- **Resource Cleanup**: The explicit `Close()` method terminates the worker goroutine by closing the channel, ensuring clean shutdown.
- **Generic Design**: Fully generic over vertex (`V`) and edge data types (`T`), supporting arbitrary value semantics.
> ⚠️ **Note**: Concurrent `AddEdge` calls are safe by channel semantics, but `Close()` must only be called after all producers have finished, since sending on a closed channel panics.
> ✅ Ideal for producer-consumer patterns where edges are streamed from multiple goroutines into a single graph.
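The buffered-channel construction pattern can be sketched in a few lines; edge data and generics are omitted here, whereas the real type is generic over `V` and `T`:

```go
package main

import (
	"fmt"
	"sync"
)

// edge is a minimal edge message.
type edge struct{ from, to int }

// graphBuffer sends edges through a channel to a single worker goroutine
// that mutates the underlying adjacency map, so the map needs no lock.
type graphBuffer struct {
	ch    chan edge
	done  sync.WaitGroup
	edges map[int][]int
}

func newGraphBuffer() *graphBuffer {
	b := &graphBuffer{ch: make(chan edge, 100), edges: map[int][]int{}}
	b.done.Add(1)
	go func() {
		defer b.done.Done()
		for e := range b.ch { // single consumer
			b.edges[e.from] = append(b.edges[e.from], e.to)
		}
	}()
	return b
}

func (b *graphBuffer) AddEdge(from, to int) { b.ch <- edge{from, to} }

// Close shuts the channel and waits for the worker to drain it.
func (b *graphBuffer) Close() {
	close(b.ch)
	b.done.Wait()
}

func main() {
	b := newGraphBuffer()
	for i := 1; i <= 3; i++ {
		b.AddEdge(0, i)
	}
	b.Close()
	fmt.Println(len(b.edges[0])) // 3
}
```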
# BioSequenceBatch: A Container for Ordered Biological Sequences
`BioSequenceBatch` is a structured data type encapsulating an ordered collection of biological sequences (`obiseq.BioSequenceSlice`) along with metadata: a `source` identifier and an integer `order`. It serves as a lightweight, immutable-friendly container for batch processing in bioinformatics pipelines.
## Core Properties
- **`source`**: String identifying the origin (e.g., file, pipeline stage).
- **`order`**: Integer defining processing sequence or priority.
- **`slice`**: Holds the actual sequences via `obiseq.BioSequenceSlice`.
## Key Functionalities
- **Construction**:
`MakeBioSequenceBatch(source, order, sequences)` creates a new batch.
- **Accessors**:
`Source()`, `Order()` return metadata; `Slice()` exposes the sequence slice.
- **Mutation (via copy)**:
`Reorder(newOrder)` returns a new batch with updated order.
- **Size & emptiness**:
`Len()` gives sequence count; `NotEmpty()` checks non-emptiness.
- **Consumption**:
`Pop0()` removes and returns the first sequence (FIFO behavior).
- **Safety**:
`IsNil()` detects uninitialized batches; a global `NilBioSequenceBatch` sentinel exists.
## Design Notes
- Instances are value types (struct), enabling safe copying.
- Operations follow Go idioms: methods return updated values rather than mutating in place (except internal slice mutation via `Pop0`).
- Designed for interoperability with the OBITools4 ecosystem (`obiseq` package).
This abstraction supports modular, traceable sequence processing workflows—ideal for pipeline stages where ordering and provenance matter.
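A toy version of the container makes the value-copy semantics of `Reorder` versus the in-place mutation of `Pop0` concrete (sequences are plain strings here):

```go
package main

import "fmt"

// batch mirrors the container: source, order, and a slice of sequences.
type batch struct {
	source string
	order  int
	slice  []string
}

// Reorder returns an updated copy; the value receiver means the
// caller's batch is left untouched.
func (b batch) Reorder(newOrder int) batch {
	b.order = newOrder
	return b
}

// Pop0 removes and returns the first sequence (FIFO), mutating the
// receiver's slice through the pointer receiver.
func (b *batch) Pop0() string {
	s := b.slice[0]
	b.slice = b.slice[1:]
	return s
}

func main() {
	b := batch{source: "file.fasta", order: 0, slice: []string{"s1", "s2"}}
	b2 := b.Reorder(7)
	fmt.Println(b.order, b2.order)      // 0 7
	fmt.Println(b.Pop0(), len(b.slice)) // s1 1
}
```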
# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, RWMutex, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
- `Add(n)`, `Done()`: Track active workers (like goroutines).
- `Lock/RLock` and `Unlock/RUnlock`: Explicit synchronization.
- `Wait()` / `Close()`, `WaitAndClose()`: Graceful shutdown.
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
- `Concat(...)`: Sequentially merges multiple iterators.
- `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering preserved for downstream reproducibility.
- Integrates with OBITools4's `obidefault` and `obiutils` for configuration and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.
+32
View File
@@ -0,0 +1,32 @@
# `IDistribute`: Semantic Description of Biosequence Distribution Functionality
The `IDistribute` type implements a thread-safe mechanism for distributing biosequences into classified, batched outputs.
- **Core Purpose**: Enables concurrent processing of sequences by routing them to dedicated output channels based on classification keys.
- **Key Fields**:
- `outputs`: A map from integer class codes to output streams (`IBioSequence`).
- `news`: An unbuffered channel emitting class codes when new output streams are created.
- `classifier`: A pointer to a sequence classifier used to assign sequences to keys during distribution.
- **Thread Safety**: All access to shared state (`outputs`, `slices`) is synchronized via a mutex.
- **Batching Strategy**:
- Sequences are accumulated per class key until either `BatchSizeMax()` sequences or `BatchMem()` bytes (per key) are reached.
- Batches are flushed automatically and on finalization.
- **Asynchronous Processing**:
- The `Distribute()` method launches a goroutine that consumes the input iterator, classifies each sequence, and feeds batches to per-key outputs.
- Outputs are closed only after all sequences have been processed.
- **Notifications**:
- The `News()` channel allows consumers to be notified of newly created output streams (i.e., when a new class key appears).
- **Error Handling**:
- `Outputs(key)` returns an error if the requested key has no associated output.
- **Integration**:
- Leverages `obidefault.BatchSizeMax()` and `BatchMem()` for configurable batch limits.
- Uses `SortBatches()` on the input iterator to ensure ordered processing.
In summary, `IDistribute` provides a scalable, concurrent pipeline for classifying and batching biosequences based on user-defined classification logic.
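The routing logic can be sketched synchronously with simplified types (the real `IDistribute` does this concurrently, with per-key batching and channel-based outputs — the names below are illustrative only):

```go
package main

import "fmt"

// Distribute routes each item to a per-class bucket, creating buckets lazily
// and recording newly seen class codes (the synchronous analogue of News()).
func Distribute(items []string, classify func(string) int) (map[int][]string, []int) {
	outputs := map[int][]string{}
	var news []int
	for _, it := range items {
		k := classify(it)
		if _, ok := outputs[k]; !ok {
			news = append(news, k) // a new class key appeared
		}
		outputs[k] = append(outputs[k], it)
	}
	return outputs, news
}

func main() {
	byLen := func(s string) int { return len(s) }
	out, news := Distribute([]string{"A", "CC", "G"}, byLen)
	fmt.Println(len(out[1]), len(out[2]), news)
}
```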
@@ -0,0 +1,24 @@
# `ExtractTaxonomy` Function — Semantic Description
The `ExtractTaxonomy` method is a core utility in the `obiiter` package, designed to aggregate taxonomic information across biological sequences processed by an iterator.
- **Input**:
- A pointer to `IBioSequence`, representing a sequence iterator over biological data.
- A boolean flag `seqAsTaxa`: if true, each full sequence is treated as a single taxonomic unit; otherwise, individual elements within slices are processed separately.
- **Process**:
- Iterates through all sequences via `iterator.Next()` and retrieves each current slice using `Get().Slice()`.
- For every slice, it calls the underlying `.ExtractTaxonomy()` method (from `obitax`), progressively building or updating a shared `*obitax.Taxonomy` object.
- Stops and returns immediately upon encountering the first error during taxonomy extraction.
- **Output**:
- Returns a fully populated `*obitax.Taxonomy` object (or partial result if early failure occurs).
- Returns `nil` error on success; otherwise, returns the first encountered error.
- **Semantic Role**:
Enables scalable taxonomic profiling of high-throughput sequencing data by delegating per-slice extraction logic to the `obitax` module, while ensuring robust iteration and error handling.
- **Dependencies**:
Relies on `obitax.Taxonomy` for structured taxonomic representation and assumes slices implement the `.ExtractTaxonomy()` interface.
This function exemplifies a *map-reduce*-style pattern: mapping taxonomy extraction over slices, and reducing results into a unified taxonomic summary.
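The fold pattern — iterate, extract into a shared accumulator, stop at the first error while keeping the partial result — can be sketched with hypothetical types (`Taxonomy` and `extractInto` below are illustrations, not the `obitax` API):

```go
package main

import (
	"errors"
	"fmt"
)

// Taxonomy is a stand-in for *obitax.Taxonomy.
type Taxonomy struct{ Taxa map[string]bool }

// extractInto lazily creates the accumulator and merges one item into it,
// failing on an empty name to illustrate error propagation.
func extractInto(tax *Taxonomy, name string) (*Taxonomy, error) {
	if name == "" {
		return tax, errors.New("empty taxon name")
	}
	if tax == nil {
		tax = &Taxonomy{Taxa: map[string]bool{}}
	}
	tax.Taxa[name] = true
	return tax, nil
}

// ExtractAll folds extraction over the input, returning the partial taxonomy
// together with the first error encountered.
func ExtractAll(names []string) (*Taxonomy, error) {
	var tax *Taxonomy
	for _, n := range names {
		var err error
		if tax, err = extractInto(tax, n); err != nil {
			return tax, err // stop immediately, keep partial result
		}
	}
	return tax, nil
}

func main() {
	tax, err := ExtractAll([]string{"Homo", "Pan", ""})
	fmt.Println(len(tax.Taxa), err != nil)
}
```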
+28
View File
@@ -0,0 +1,28 @@
# `IFragments` Functionality Overview
The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.
## Core Parameters
- `minsize`: Minimum sequence length to skip fragmentation.
- `length`: Desired fragment size (in bases/amino acids).
- `overlap`: Number of overlapping residues between consecutive fragments.
- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).
## Workflow
1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
2. **Parallel Fragmentation**:
- Each worker processes a subset of batches independently using goroutines.
- For each sequence longer than `minsize`, it is split into overlapping fragments of length `length` with step size = `length - overlap`.
- The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces.
3. **Resource Management**:
- Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
- Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
## Key Features
- **Overlap handling**: Ensures contiguous coverage without gaps.
- **Memory efficiency**: Uses recycling and batched output.
- **Scalability**: Leverages Go concurrency via `nworkers`.
- **Error safety**: Panics on subsequence errors (e.g., invalid indices).
## Use Case
Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).
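The fragment-coordinate arithmetic — step size `length − overlap`, with the remainder fused into the final fragment — can be sketched as follows (an illustration of the coordinate logic only, assuming `overlap < length`; not the actual implementation):

```go
package main

import "fmt"

// FragmentBounds returns half-open [start, end) intervals covering a sequence
// of seqLen with overlapping fragments; the final fragment is extended to the
// end of the sequence so no tiny trailing piece is emitted (fusion mode).
// Assumes overlap < length, so the step is strictly positive.
func FragmentBounds(seqLen, length, overlap int) [][2]int {
	step := length - overlap
	var out [][2]int
	for start := 0; start+length <= seqLen; start += step {
		out = append(out, [2]int{start, start + length})
	}
	if len(out) > 0 {
		out[len(out)-1][1] = seqLen // fuse the remainder into the final fragment
	}
	return out
}

func main() {
	// 11 bases, fragments of 4 with overlap 1: the last fragment absorbs the
	// single leftover base instead of emitting a 1-base tail.
	fmt.Println(FragmentBounds(11, 4, 1))
}
```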
+29
View File
@@ -0,0 +1,29 @@
# Memory-Limited Biosequence Iterator
This Go function extends an `IBioSequence` iterator with memory-aware throttling to prevent excessive heap allocation during data processing.
## Core Functionality
- **`LimitMemory(fraction float64)`**
Returns a new iterator that respects an upper bound on heap usage relative to total system memory.
- **Memory Monitoring**
Uses `runtime.ReadMemStats()` and `github.com/pbnjay/memory.TotalMemory()` to compute the current heap fraction (`Alloc / TotalMemory`) dynamically.
- **Backpressure Mechanism**
While the memory fraction exceeds `fraction`, the producer goroutine yields control (`runtime.Gosched()`) until sufficient memory becomes available.
- **Logging**
Warns via `obilog.Warnf` when:
- Memory pressure persists (every ~1000 yields),
- Or wait duration becomes unusually long (>10,000 yielding cycles).
- **Concurrency Model**
- A producer goroutine consumes from the original iterator and pushes items to `newIter`, pausing as needed.
- A dedicated consumer goroutine calls `WaitAndClose()` to ensure graceful termination and resource cleanup.
## Semantic Behavior
- **Non-blocking consumer**: Downstream consumers are not stalled; they read from an internal buffered channel (`newIter`).
- **Adaptive rate control**: The iterator automatically slows down when memory pressure rises, avoiding OOM conditions.
- **Predictable resource use**: Ensures heap usage stays below the specified `fraction` (e.g., 0.5 → ≤ 50% of total RAM).
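The backpressure loop can be sketched without the external dependency (the real code reads total system memory via `github.com/pbnjay/memory.TotalMemory()`; here it is passed as a parameter, and the logging/threshold details are omitted):

```go
package main

import (
	"fmt"
	"runtime"
)

// waitForHeadroom yields the processor until the heap fraction of totalMem
// drops to or below maxFraction, returning how many times it yielded.
func waitForHeadroom(maxFraction float64, totalMem uint64) int {
	var m runtime.MemStats
	yields := 0
	for {
		runtime.ReadMemStats(&m)
		if float64(m.Alloc)/float64(totalMem) <= maxFraction {
			return yields
		}
		yields++
		runtime.Gosched() // let consumers run and free memory
	}
}

func main() {
	// With a generous budget the check passes immediately (zero yields).
	fmt.Println(waitForHeadroom(0.99, 1<<40))
}
```

Yielding with `runtime.Gosched()` rather than sleeping keeps latency low when memory frees up quickly, at the cost of busy-waiting under sustained pressure — which is exactly why the real implementation logs a warning after many consecutive yields.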
+19
View File
@@ -0,0 +1,19 @@
# Semantic Description of `IMergeSequenceBatch` and `MergePipe`
This code defines two related functions in the `obiiter` package for batch-wise merging of biological sequences during iteration.
- **`IMergeSequenceBatch(na, statsOn, sizes...) IBioSequence → IBioSequence`**
- Consumes an input sequence iterator (`IBioSequence`) and returns a new one.
- Groups incoming sequences into batches (default size: `100`, configurable via variadic argument).
- For each batch:
- Collects up to `batchsize` sequences via the input iterator.
- Applies `.Merge(na, statsOn)` on each sequence group (presumably merging reads based on `na`, e.g., nucleotide alignment or overlap).
- Wraps merged results into a `BioSequenceBatch` with ordering metadata.
- Emits batches asynchronously via goroutines; the output iterator is closed when input finishes.
- **`MergePipe(na, statsOn, sizes...) Pipeable → func(IBioSequence) IBioSequence`**
- A *pipeline combinator* (higher-order function), enabling functional composition.
- Returns a `Pipeable` — i.e., a transformation function compatible with iterator pipelines.
**Semantic Purpose**:
Enables efficient, memory-smoothed merging of biological sequence reads (e.g., paired-end merges) in streaming fashion, with optional statistics tracking (`statsOn`) and configurable batching.
+35
View File
@@ -0,0 +1,35 @@
# `NumberSequences` Function — Semantic Description
The `NumberSequences` method assigns a unique sequential identifier (`seq_number`) to each biological sequence in an `IBioSequence` iterator, preserving consistency for paired-end reads.
## Core Functionality
- **Sequential numbering**: Assigns integers (starting from `start`, defaulting to 0 or user-defined) incrementally across sequences.
- **Thread-safe**: Uses `sync.Mutex` and `atomic.Int64` to safely manage the global counter during concurrent processing.
- **Paired-read support**: When input is paired (`IsPaired()`), both reads in a pair receive the *same* `seq_number`, ensuring alignment between mates.
## Parallelization Strategy
- **Default mode**: Uses multiple workers (`ParallelWorkers()`) for performance; batches are processed concurrently.
- **Reordering mode**: If `forceReordering` is true:
- Input iterator is batch-sorted (`SortBatches()`).
- Parallelism disabled (1 worker) to ensure deterministic numbering order.
## Implementation Details
- Each goroutine processes its own split of the input iterator.
- A shared `next_first` counter tracks the next available sequence number globally.
- Locking ensures atomic increment and assignment, preventing race conditions.
## Output
Returns a new `IBioSequence` iterator:
- Contains the same sequence batches (possibly reordered if sorted).
- Each `BioSequence` object now carries a `"seq_number"` attribute.
- Paired sequences are co-numbered and marked accordingly.
## Use Cases
- Preparing data for downstream tools requiring unique sequence IDs.
- Maintaining cross-read identity in paired-end workflows (e.g., assembly, mapping).
- Reproducible numbering across pipeline stages or restarts.
+17
View File
@@ -0,0 +1,17 @@
# Paired-End Sequence Handling in `obiiter`
This Go package provides semantic functionality for managing **paired-end biological sequences** within batched iterators.
- `BioSequenceBatch` methods:
- **`IsPaired()`**: Checks whether the batch contains paired reads.
- **`PairedWith()`**: Returns a new batch containing only the mate (partner) of each read in the current batch.
- **`PairTo(*BioSequenceBatch)`**: Synchronizes and pairs reads between two batches *of identical order*; fails if orders differ.
- **`UnPair()`**: Removes pairing metadata, treating reads as unpaired.
- `IBioSequence` (iterator) methods:
- **`MarkAsPaired()`**: Marks the iterator as producing paired-end data.
- **`PairTo(IBioSequence)`**: Combines two iterators into a new paired-end iterator by aligning corresponding batches and calling `PairTo` on each pair.
- **`PairedWith()`**: Generates a new iterator yielding only the mate reads (i.e., second ends) from an existing paired-end stream.
- **`IsPaired()`**: Returns whether the iterator was explicitly marked as paired.
All operations preserve batched processing and concurrency via goroutines, ensuring efficient handling of large NGS datasets while maintaining semantic correctness for paired-end workflows.
+17
View File
@@ -0,0 +1,17 @@
# Semantic Description of `obiiter` Package Features
This Go package provides functional-style utilities for processing biological sequence data (e.g., FASTQ/FASTA), modeled via the `IBioSequence` interface.
- **`Pipeable`**: A function type representing a unary transformation on an `IBioSequence`.
- **`Pipeline(start, parts...)`**: Composes a sequence of `Pipeable` operations into a single executable pipeline. It applies transformations sequentially: input → start → part₁ → … → output.
- **`(IBioSequence).Pipe(start, parts...)`**: A convenience method enabling fluent chaining of transformations directly on a sequence object.
- **`Teeable`**: A function type for operations that split input into two independent output streams (e.g., filtering + logging).
- **`(IBioSequence).CopyTee()`**: A high-level tee operation that duplicates the input stream into two identical, concurrently readable `IBioSequence` instances.
- Uses goroutines to ensure non-blocking parallel consumption.
- Ensures proper lifecycle management: closing the second stream when the first is closed.
- Preserves paired-end status (`MarkAsPaired`) if applicable.
Together, these features support modular, composable, and concurrent biosequence processing pipelines—ideal for scalable NGS data workflows.
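The `Pipeable` composition pattern can be sketched with simplified types (a `[]string` stands in for an `IBioSequence` stream; the channel plumbing and concurrency are omitted):

```go
package main

import (
	"fmt"
	"strings"
)

// Stream is a stand-in for IBioSequence; Pipeable mirrors the unary
// stream-to-stream transformation type.
type Stream []string
type Pipeable func(Stream) Stream

// Pipeline composes transformations left to right: input → p₁ → p₂ → …
func Pipeline(parts ...Pipeable) Pipeable {
	return func(s Stream) Stream {
		for _, p := range parts {
			s = p(s)
		}
		return s
	}
}

func main() {
	upper := Pipeable(func(s Stream) Stream {
		out := make(Stream, len(s))
		for i, v := range s {
			out[i] = strings.ToUpper(v)
		}
		return out
	})
	keepLong := Pipeable(func(s Stream) Stream {
		var out Stream
		for _, v := range s {
			if len(v) >= 4 {
				out = append(out, v)
			}
		}
		return out
	})
	fmt.Println(Pipeline(upper, keepLong)(Stream{"acgt", "ac"}))
}
```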
@@ -0,0 +1,28 @@
# `MakeSetAttributeWorker` Functionality Overview
The function `MakeSetAttributeWorker(rank string) obiiter.SeqWorker` constructs a reusable sequence-processing worker for taxonomic annotation.
- **Input validation**: It first verifies that the provided `rank` is part of a predefined taxonomic hierarchy (`taxonomy.RankList()`). If invalid, it terminates execution with an informative error.
- **Worker construction**: It returns a closure (`obiiter.SeqWorker`) — essentially a function that transforms biological sequences.
- **Core behavior**: For each input `*obiseq.BioSequence`, it calls `taxonomy.SetTaxonAtRank(sequence, rank)`. This likely assigns or updates the taxonomic label (e.g., species, genus) at the specified rank in the sequence's metadata.
- **Purpose**: Enables modular, pipeline-friendly taxonomic annotation — e.g., in bioinformatics workflows where sequences must be annotated hierarchically (e.g., from phylum down to species).
- **Design pattern**: Follows the *functional factory* and *worker interface* patterns, promoting composability in sequence processing pipelines.
- **Side effects**: Modifies the input `BioSequence` *in-place* (via mutation of its taxonomic metadata), then returns it.
- **Use case example**:
```go
worker := MakeSetAttributeWorker("species")
seq = worker(seq) // annotates `seq` with species-level taxon
```
- **Assumptions**:
- `taxonomy.SetTaxonAtRank` exists and handles rank-specific taxon assignment.
- Taxonomic ranks are ordered, finite, and validated (e.g., `["domain", "phylum", ..., "species"]`).
- Sequences carry mutable taxonomic metadata.
- **Error handling**: Fails fast on invalid rank input, preventing silent misannotation.
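The functional-factory pattern can be sketched with illustrative names (the types and rank list below are assumptions for the sketch, not the obitools API):

```go
package main

import "fmt"

// Seq stands in for *obiseq.BioSequence; SeqWorker for obiiter.SeqWorker.
type Seq struct{ Attrs map[string]string }
type SeqWorker func(*Seq) *Seq

// A finite, ordered rank hierarchy, as the real factory validates against.
var rankList = []string{"domain", "phylum", "class", "order", "family", "genus", "species"}

// MakeSetRankWorker validates the rank once, then returns a closure that
// annotates each sequence in place — fail fast, before any sequence is touched.
func MakeSetRankWorker(rank string) SeqWorker {
	valid := false
	for _, r := range rankList {
		if r == rank {
			valid = true
			break
		}
	}
	if !valid {
		panic("invalid rank: " + rank)
	}
	return func(s *Seq) *Seq {
		s.Attrs["taxon_rank"] = rank // in-place mutation, then return
		return s
	}
}

func main() {
	worker := MakeSetRankWorker("species")
	s := worker(&Seq{Attrs: map[string]string{}})
	fmt.Println(s.Attrs["taxon_rank"])
}
```

Paying the validation cost once in the factory, rather than per sequence in the closure, is what makes this pattern cheap inside high-throughput pipelines.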
+31
View File
@@ -0,0 +1,31 @@
# `Speed` Functionality Description
The provided Go code defines a method and helper function to add **real-time progress tracking** to biosequence iterators in the OBITools4 framework.
## Core Features
- **Non-intrusive progress bar**:
The `Speed()` method wraps an existing iterator and displays a visual progress indicator on stderr, using the [`progressbar`](https://github.com/schollz/progressbar) library.
- **Conditional rendering**:
The progress bar is only shown when:
- `--no-progressbar` flag is *not* set (via `obidefault.ProgressBar()`),
- stderr is connected to a terminal (`os.ModeCharDevice`),
- stdout is *not* piped (to avoid interfering with file output).
- **Batch-aware counting**:
Progress is updated per batch (`batch.Len()`), not item-by-item, for efficiency and smoother UI updates (throttled to ≥100ms).
- **Paired-end support**:
If the input iterator is paired (`IsPaired()`), this property is preserved in the returned iterator.
- **Pipeable wrapper**:
`SpeedPipe()` enables integration into functional pipelines (e.g., `.Map(...).Filter(...)`) by returning a `Pipeable` function.
## Implementation Highlights
- Uses goroutines to decouple iteration and progress updates.
- Automatically closes the output iterator when input ends (`WaitAndClose()`).
- Prints a final newline to stderr upon completion.
This utility enhances user experience during long-running sequence processing (e.g., FASTQ parsing, alignment), without affecting correctness or performance in non-interactive contexts.
+20
View File
@@ -0,0 +1,20 @@
# Semantic Description of `obiiter` Package Functionalities
This Go package (`obiiter`) provides utilities for applying functional transformations to biological sequence iterators, supporting parallel execution and modular piping.
- **`MakeIWorker(worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` (sequence-to-sequence transformation) to each sequence in the iterator. Supports configurable parallelism (`nworkers`) and optional channel buffering via `sizes`. Uses internal conversion to slice-based workers.
- **`MakeIConditionalWorker(predicate, worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` only to sequences satisfying a given boolean `predicate`. Enables conditional, parallelized processing while preserving iterator semantics.
- **`MakeISliceWorker(worker, breakOnError bool, sizes ...int)`**:
Core method applying a `SeqSliceWorker` (batch-level transformation) across slices of sequences. Implements multi-goroutine parallelism using `nworkers`. Handles errors optionally via fatal logging (`breakOnError`). Preserves paired-end metadata.
- **`WorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Returns a `Pipeable` closure wrapping `MakeIWorker`, enabling composition in pipeline chains (e.g., for CLI or DSL-style workflows).
- **`SliceWorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Similar to `WorkerPipe`, but for slice-level workers (`SeqSliceWorker`). Facilitates modular, reusable pipeline stages.
All methods support optional size arguments to override default parallelism (from `obidefault`). Internally, they rely on Go concurrency primitives (`go`, channels) and structured batch processing via `IBioSequence` interface.
+33
View File
@@ -0,0 +1,33 @@
# `obiitercsv`: CSV Record Iterator for Streaming and Batch Processing
This Go package provides a thread-safe, channel-based iterator (`ICSVRecord`) for streaming and processing CSV records in batches. It supports ordered batch handling, concurrent access via mutexes, and dynamic header management.
## Core Types
- **`CSVHeader`**: A slice of strings representing column names.
- **`CSVRecord`**: A map from field name to value (`map[string]interface{}`).
- **`CSVRecordBatch`**: A batch of records with metadata: `source`, `order`, and the actual data slice.
## Key Features
- **Streaming via Channels**: Records are consumed as `CSVRecordBatch` items through a channel, enabling asynchronous producers/consumers.
- **Ordered Processing**: Batches include an `order` field, used by `SortBatches()` to reconstruct sequential order even when received out-of-order.
- **Thread Safety**: Uses `sync.RWMutex`, atomic operations (`batch_size`), and `abool.AtomicBool` for flags like `finished`.
- **Iterator Protocol**: Implements standard methods:
- `Next()` to advance,
- `Get()` to retrieve current batch,
- `PushBack()` for re-queuing the last record.
- **Batch Management**:
- `SetHeader()` / `AppendField()`: dynamic header updates.
- `Split()`: creates a new iterator sharing the same channel but with independent locking.
- **Lifecycle Control**:
- `Add()` / `Done()`: track active goroutines (via `sync.WaitGroup`).
- `WaitAndClose()` ensures all data is flushed before closing the channel.
## Utility Methods
- **`NotEmpty()`, `IsNil()`**: Check batch validity.
- **`Consume()`**: Drains the iterator (e.g., for side-effect processing).
- **`SortBatches()`**: Reorders batches by `order`, buffering out-of-sequence ones.
Designed for bioinformatics pipelines (e.g., OBITools4), it enables scalable, memory-efficient CSV processing with strict ordering guarantees.
+36
View File
@@ -0,0 +1,36 @@
# Semantic Description of `obikmer` Package
This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.
## Core Functionalities
1. **`Count4Mer(seq, buffer, counts)`**
Counts occurrences of all 256 possible 4-nucleotide subsequences (4-mers) in a `BioSequence`.
- Encodes each 4-mer into an integer (0–255) using `Encode4mer`.
- Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.
- Reuses or allocates the `counts` buffer as needed.
2. **`Common4Mer(count1, count2)`**
Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.
Used to measure shared content between sequences.
3. **`Sum4Mer(count)`**
Returns the total number of 4-mers in a profile (i.e., sum over all entries).
## Distance & Similarity Bounds
4. **`LCS4MerBounds(count1, count2)`**
Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:
- **Lower bound**: `common_kmers + (3 if common > 0 else 0)`
- **Upper bound**: `min(total1, total2) + 3 − ceil((min_total − common)/4)`
Leverages the fact that overlapping k-mers constrain possible alignments.
5. **`Error4MerBounds(count1, count2)`**
Estimates bounds for *alignment errors* (e.g., mismatches + indels):
- **Upper bound**: `max_total − common_kmers + 2 * floor((common_kmers + 5)/8)`
- **Lower bound**: `ceil(upper_bound / 4)`
Provides fast, approximate error estimates without full alignment.
## Use Case
Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.
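The profile intersection and total described above can be sketched directly on the fixed 256-entry count tables (a minimal illustration of the arithmetic; names mirror the description):

```go
package main

import "fmt"

// Table4mer holds one count per possible 4-mer code (0–255).
type Table4mer [256]uint16

// Common4Mer sums the per-code minima of two profiles — the number of 4-mers
// shared between the two sequences.
func Common4Mer(a, b *Table4mer) int {
	total := 0
	for i := range a {
		m := a[i]
		if b[i] < m {
			m = b[i]
		}
		total += int(m)
	}
	return total
}

// Sum4Mer returns the total number of 4-mers in a profile.
func Sum4Mer(t *Table4mer) int {
	total := 0
	for _, c := range t {
		total += int(c)
	}
	return total
}

func main() {
	var a, b Table4mer
	a[0], a[5] = 3, 2
	b[0], b[5] = 1, 4
	fmt.Println(Common4Mer(&a, &b), Sum4Mer(&a)) // shared minima: min(3,1)+min(2,4)
}
```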
+44
View File
@@ -0,0 +1,44 @@
# Semantic Description of the `obikmer` Package
This Go package implements a **De Bruijn graph** for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).
### Core Functionalities
- **K-mer Encoding**: K-mers are encoded as `uint64` using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the `iupac` map.
- **Reverse Complement Handling**: The `revcompnuc` table enables nucleotide-wise reverse complementation.
- **Graph Construction**: The `DeBruijnGraph` struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (`kmermask`, `prevc/g/t`).
### Graph Operations
- **Node Queries**:
- `Previouses()` / `Nexts()`: Return predecessor/successor k-mers in the graph.
- `MaxNext()` / `MaxHead()`: Find neighbors or heads (sources) with maximum weight.
- **Path Exploration**:
- `MaxPath()`: Greedily traces the highest-weight path from a head.
- `LongestPath()`: Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).
- `HaviestPath()`: Uses a Dijkstra-like priority queue to find the *heaviest* (maximum summed-weight) path, with cycle detection via DFS (`HasCycle()`).
### Consensus & Filtering
- **Consensus Generation**:
- `BestConsensus()` returns a sequence from the greedy max-weight path.
- `LongestConsensus(id, min_cov)` trims low-coverage ends using a coverage threshold (mode-based).
- **Weight Statistics**:
- `MaxWeight()`, `WeightMean()`, `WeightMode()` provide distribution summaries.
- `FilterMinWeight(min)` removes low-count nodes.
- **Decoding**:
- `DecodeNode()` converts a k-mer index to its DNA string.
- `DecodePath()` reconstructs the full consensus from a path.
### I/O & Diagnostics
- **GML Export**: `WriteGml()` outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
- **Hamming Distance**: `HammingDistance()` computes edit distance between two encoded k-mers using bit operations.
- **Sequence Insertion**: `Push()` adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.
### Dependencies & Design
- Leverages `obiseq` for sequence representation, `logrus` for logging, and `slices`/`container/heap` from Go's standard library.
- Designed for scalability and speed, using bit-level operations to minimize memory footprint.
Overall: a robust k-mer graph engine for *de novo* assembly, error correction, and consensus recovery in high-throughput sequencing data.
@@ -0,0 +1,35 @@
# Semantic Description of `obikmer` Package
The `obikmer` package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.
## Core Functionalities
1. **Nucleotide Encoding**
- `EncodeNucleotide(b byte)`: Maps DNA bases (A, C, G, T/U) to 2-bit values:
`A→0`, `C→1`, `G→2`, `T/U→3`.
Ambiguous or non-standard characters (e.g., N, R, Y) default to `A` (`0`).
Uses a lookup table for O(1) performance.
2. **4-mer Encoding**
- `Encode4mer(seq, buffer)`: Converts a biological sequence into overlapping 4-mers.
Each k-mer is encoded as an unsigned byte (0–255), where each nucleotide contributes 2 bits.
Supports optional buffer reuse for memory efficiency.
3. **4-mer Indexing**
- `Index4mer(seq, index, buffer)`: Builds an inverted index mapping each 4-mer code (0–255) to all its occurrence positions in the sequence.
Returns `[][]int`, where rows correspond to k-mer codes and columns list positions.
4. **Fast Sequence Comparison**
- `FastShiftFourMer(...)`: Compares two sequences using a FASTA-like shift-scoring algorithm.
- Uses precomputed 4-mer index of a reference sequence and encodes the query.
- Counts co-occurring 4-mers across all possible shifts (`refpos − queryPos`).
- Computes raw and relative scores (normalized by alignment length).
- Returns optimal shift, count of matching 4-mers, and maximum score (raw or relative).
## Design Highlights
- **Memory-aware**: Supports buffer reuse to minimize allocations.
- **Robustness**: Non-canonical bases handled gracefully (defaulting to A).
- **Performance-oriented**: O(n) encoding and indexing; efficient hash-based shift counting.
Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.
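The 2-bit encoding with graceful fallback can be sketched as follows (a minimal illustration: a zero-valued lookup table makes every non-ACGT/U byte default to A, and a sliding byte-wide window yields successive 4-mer codes; names are illustrative):

```go
package main

import "fmt"

// nucCode maps bytes to 2-bit codes; the zero value means every unknown or
// ambiguous base (N, R, Y, …) defaults to A (0), as described above.
var nucCode [256]byte

func init() {
	nucCode['C'], nucCode['c'] = 1, 1
	nucCode['G'], nucCode['g'] = 2, 2
	nucCode['T'], nucCode['t'] = 3, 3
	nucCode['U'], nucCode['u'] = 3, 3
}

// Encode4mers returns one byte per overlapping 4-mer: shifting the running
// code left by 2 bits and relying on byte overflow keeps exactly the last
// four bases in the window.
func Encode4mers(seq []byte) []byte {
	if len(seq) < 4 {
		return nil
	}
	out := make([]byte, 0, len(seq)-3)
	var code byte
	for i, b := range seq {
		code = code<<2 | nucCode[b]
		if i >= 3 {
			out = append(out, code)
		}
	}
	return out
}

func main() {
	// ACGT → 00 01 10 11 = 27; the next window CGTA → 01 10 11 00 = 108.
	fmt.Println(Encode4mers([]byte("ACGTA")))
}
```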
+39
View File
@@ -0,0 +1,39 @@
# Semantic Description of `obikmer` Package
The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
## Core Encoding & Decoding
- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact 62-bit uint64 representations (2 bits/base), preserving top 2 bits for optional error markers.
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
## Iterators (Memory-Efficient Streaming)
- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
- **`IterCanonicalKmersWithErrors`**: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for **odd k ≤ 31**.
## Error Handling & Markers
- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
## Reverse Complement & Circular Normalization
- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
## Counting & Math Utilities
- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau's necklace formula**, with Euler's totient function and divisor enumeration.
## Performance & Safety
- All functions avoid heap allocations where possible (reusing buffers).
- Panics on invalid `k` or length mismatches for correctness.
- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
## Use Cases
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
- Error-aware k-mer filtering in sequencing pipelines
- Low-complexity region detection via circular entropy normalization
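Reverse complement and canonicalization on 2-bit-packed k-mers can be sketched with a simple per-base loop (the real package likely uses faster bit tricks; this illustrates the semantics — complementing a 2-bit base is XOR with 3, since A(00)↔T(11) and C(01)↔G(10)):

```go
package main

import "fmt"

// ReverseComplement walks the k bases from the low end, complements each
// (base ^ 3), and appends them in reverse order.
func ReverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		base := kmer & 3
		rc = (rc << 2) | (base ^ 3) // complement: A↔T, C↔G
		kmer >>= 2
	}
	return rc
}

// CanonicalKmer returns the smaller of a k-mer and its reverse complement,
// giving a strand-agnostic representative.
func CanonicalKmer(kmer uint64, k int) uint64 {
	rc := ReverseComplement(kmer, k)
	if rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	// ACGT (00 01 10 11 = 27) is its own reverse complement.
	fmt.Println(ReverseComplement(27, 4), CanonicalKmer(27, 4))
}
```

The involution property `RC(RC(x)) = x` and the symmetry `Canonical(x) = Canonical(RC(x))` follow directly from this construction.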
@@ -0,0 +1,36 @@
# Obikmer: Efficient K-mer Encoding and Manipulation in Go
This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
## Core Functionalities
### K-mer Encoding (`EncodeKmers`, `IterKmers`)
Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Handles sequences up to 31-mers (62 bits), with validation for invalid `k` values.
### Reverse Complement (`ReverseComplement`)
Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: `RC(RC(x)) = x`.
### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
### Super *k*-mers Extraction (`ExtractSuperKmers`)
Groups overlapping *k*-mers sharing the same minimizer (minimal *m*-mer in sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and minimizer values, all canonicalized.
### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
## Key Features
- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
## Use Cases
- Genome assembly & de Bruijn graph (DBG) construction
- Minimizer-based sketching (e.g., *Mash*, *Sourmash*)
- Error-aware k-mer counting & filtering
- Strand-unbiased sequence comparison
All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.
+31
View File
@@ -0,0 +1,31 @@
# Semantic Description of `obikmer` Entropy Functions
The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
## Core Functionality
- **`KmerEntropy(kmer, k, levelMax)`**:
Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
- Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
- For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
- Normalized entropy = `(log(N) − Σ(nᵢ·log(nᵢ))/N) / emax`, where `emax` is the theoretical maximum entropy given sequence length and alphabet constraints.
- Returns min entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
- **`KmerEntropyFilter`**:
A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
- Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
- Avoids repeated allocations — critical for performance in pipelines (e.g., read filtering).
- **Not goroutine-safe** — each thread must instantiate its own filter.
- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.
- **`Accept(kmer)` / `Entropy(kmer)`**:
- `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).
- `Entropy()` computes entropy using precomputed tables — ~10× faster than standalone calls.
## Design Highlights
- **Circular canonical normalization** ensures symmetry (e.g., `AT` ≡ `TA`).
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
# K-Way Merge for Sorted k-mer Streams
This Go package implements a **k-way merge** over multiple sorted streams of *k*-mer values (`uint64`). It leverages a **min-heap** to efficiently produce the globally sorted sequence while aggregating duplicate counts across input streams.
## Core Components
- **`mergeItem`**: Stores a value and its source reader index for heap operations.
- **`mergeHeap`** & `heap.Interface`: Implements a min-heap for efficient retrieval of smallest values.
- **`KWayMerge`**: Main struct managing the heap and input readers.
## Key Functionality
- **Initialization (`NewKWayMerge`)**:
- Takes a slice of `*KdiReader`, each expected to yield sorted values.
- Preloads the heap with one value from each reader.
- **Streaming Output (`Next`)**:
- Returns the next smallest *k*-mer, its frequency across readers (i.e., how many input streams contained it), and a success flag.
- Handles duplicates: pops *all* items equal to the current minimum before advancing readers.
- **Cleanup (`Close`)**:
- Closes all underlying `KdiReader`s and returns the first encountered error.
## Use Case
Ideal for merging sorted *k*-mer databases (e.g., from multiple files or processes), enabling:
- Efficient deduplication with multiplicity tracking.
- Scalable union/intersection operations on large *k*-mer sets.
## Complexity
| Operation | Time |
|-----------|------------|
| `Next()` | *O(log k)* (heap ops per unique value) |
| Init | *O(k)* |
Where `k` = number of input readers.
# K-Way Merge Functionality in `obikmer`
This Go package provides utilities for merging sorted k-mer streams stored in `.kdi` files. Its core component is the `KWayMerge`, which performs a k-way merge of multiple sorted input streams, aggregating duplicate k-mers by counting their occurrences.
## Key Features
- **Sorted K-Mer Input**: Reads k-mers from `.kdi` files via `KdiReader`, assuming each file contains *sorted* 64-bit unsigned integers (`uint64`).
- **K-Way Merge**: Merges multiple sorted streams into a single globally sorted stream using an efficient priority queue (min-heap) internally.
- **Count Aggregation**: When identical k-mers appear across multiple streams, the merge counts how many times each unique k-mer occurs.
- **Memory-Efficient Streaming**: Processes data incrementally, avoiding full loading of all streams into memory.
- **Robust Test Coverage**: Includes unit tests for:
- Basic merging with overlapping and non-overlapping values.
- Single-stream input (degenerate case).
- Empty streams handling.
- All identical k-mers across inputs.
## API Highlights
- `NewKdiReader(path)` — opens a `.kdi` file for reading.
- `writeKdi(...)` (test helper) — writes sorted k-mers to a `.kdi` file.
- `NewKWayMerge([]*KdiReader)` — constructs the merger from multiple readers.
- `.Next()` → `(kmer uint64, count int, ok bool)` — yields the next merged k-mer and its frequency; `ok=false` signals end-of-stream.
- `.Close()` — cleans up resources.
## Use Case
Ideal for aggregating k-mer counts across multiple sequencing samples (e.g., in bioinformatics), where each sample's k-mers are pre-sorted and persisted, enabling scalable distributed counting without full in-memory deduplication.
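Wiring the highlighted API together, a merge loop might look like the sketch below. Return shapes are taken from the listing above; error handling is elided and the `.kdi` inputs are assumed to already exist.

```go
// Sketch only — assumes sample1.kdi and sample2.kdi hold sorted k-mers.
r1, _ := obikmer.NewKdiReader("sample1.kdi")
r2, _ := obikmer.NewKdiReader("sample2.kdi")
m := obikmer.NewKWayMerge([]*obikmer.KdiReader{r1, r2})
defer m.Close()
for {
	kmer, count, ok := m.Next()
	if !ok {
		break // end of all streams
	}
	fmt.Printf("%d seen in %d sample(s)\n", kmer, count)
}
```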
# KDI Reader: Streaming Delta-Varint Decoding for k-mers
The `obikmer` package provides a high-performance, streaming reader for `.kdi` files—binary containers storing *sorted* k-mers (typically DNA substrings encoded as 64-bit integers). It supports both sequential and indexed access.
## Core Features
- **Streaming decoding**: K-mers are read incrementally using delta-varint compression to minimize I/O and memory footprint.
- **Delta encoding**: After the first absolute `uint64`, subsequent values are stored as *deltas* (difference from previous), encoded via custom `DecodeVarint`.
- **Magic & format validation**: A 4-byte magic header ensures file integrity; Little Endian `uint64` stores total count.
- **Sparse indexing**: When paired with a `.kdx` index, `SeekTo(target)` enables fast forward-only jumps to positions ≥ target k-mer.
- **Graceful fallback**: If `.kdx` is missing or invalid, the reader automatically degrades to sequential mode.
## Key API
- `NewKdiReader(path)` → opens `.kdi` for streaming (no index).
- `NewKdiIndexedReader(path)` → opens with optional `.kdx` for random access.
- `Next()` → returns `(nextKmer, true)` or `(0, false)` when exhausted.
- `SeekTo(target uint64) error` → jumps to first k-mer ≥ target using index (no backward seek).
- `Count()` / `Remaining()` → total and unread k-mers.
- `Close()` → releases file handle.
## Design Highlights
- Uses 64KB buffer for efficient I/O.
- Index entries record `(kmer, byteOffset)` at fixed strides (e.g., every 1024 k-mers).
- `SeekTo` is idempotent and safe: no-op if target ≤ current position or index unavailable.
- Designed for large-scale genomic k-mer catalogs (e.g., from minimizers or de Bruijn graphs).
# KDI File Format and API
The `obikmer` package implements a compact, sorted k-mer storage format (`.kdi`) with delta compression for efficient disk persistence and retrieval.
## Core Features
- **Sorted k-mer serialization**: K-mers (as `uint64`) are written in ascending order.
- **Delta encoding**: Consecutive differences (deltas) between k-mers are stored using variable-length integers (`varint`), drastically reducing size for dense sequences.
- **Round-trip integrity**: Full write/read cycles preserve exact k-mer values and counts.
## File Structure
A `.kdi` file contains:
1. **Magic header** (4 bytes): Identifies the format.
2. **Count field** (8 bytes, `uint64`): Number of stored k-mers.
3. **First value** (8 bytes, `uint64`): Initial k-mer.
4. **Delta-encoded tail**: `(n−1)` deltas, each encoded as a `varint`.
## API
- **`NewKdiWriter(path string)`**: Creates a writer; `Write(v uint64)` appends k-mers.
- **`Writer.Count()`**: Returns the number of written items before closing.
- **`NewKdiReader(path string)`**: Opens a reader; `Next() (uint64, bool)` yields k-mers in order.
- **`Reader.Count()`**: Returns total stored count.
## Tests Validate
1. Basic round-trip with diverse values (including large `uint64`s).
2. Empty and single-k-mer files.
3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.
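The sizes the tests validate follow directly from the layout above. The sketch below uses `KDI\x01` as the magic (per the companion format description later in this diff) and stdlib varints in place of the package's encoder; `encodeKdi` is an illustrative name:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeKdi builds the documented layout in memory: magic, count,
// first absolute value, then (n−1) varint deltas.
func encodeKdi(kmers []uint64) []byte {
	buf := []byte("KDI\x01")
	buf = binary.LittleEndian.AppendUint64(buf, uint64(len(kmers)))
	if len(kmers) == 0 {
		return buf
	}
	buf = binary.LittleEndian.AppendUint64(buf, kmers[0])
	for i := 1; i < len(kmers); i++ {
		buf = binary.AppendUvarint(buf, kmers[i]-kmers[i-1])
	}
	return buf
}

func main() {
	// One k-mer: 4 (magic) + 8 (count) + 8 (first value) = 20 bytes.
	fmt.Println(len(encodeKdi([]uint64{42}))) // 20
	// Dense even numbers: each delta of 2 costs a single varint byte.
	fmt.Println(len(encodeKdi([]uint64{0, 2, 4, 6}))) // 23
}
```

This is why a run of 10k even numbers adds only ~9,999 bytes beyond the 20-byte minimum: each delta of 2 fits in one varint byte.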
# KDI File Format and Writer
The `obikmer` package implements a compact, sorted sequence storage format for 64-bit k-mers using delta encoding and sparse indexing.
## Core Format (`.kdi`)
- **Magic header**: `KDI\x01` (`4 bytes`) identifies the file type.
- **Count field**: `uint64 LE`, total number of k-mers (patched at close).
- **First value**: `uint64 LE`, the initial k-mer stored as an absolute integer.
- **Deltas**: Subsequent values encoded via *delta-varint* (difference from previous k-mer), enabling high compression for sorted sequences.
## Writer (`KdiWriter`)
- **Strict ordering**: K-mers must be written in *strictly increasing order*.
- Efficient buffering via `bufio.Writer` (64 KB buffer).
- Internally tracks:
- Current k-mer count,
- Previous value (for delta computation),
- Bytes written in data section.
- **Sparse indexing**: Every `defaultKdxStride` k-mers, an entry is recorded in memory for random access.
## Companion Index (`.kdx`)
- Written automatically on `Close()` if indexing entries exist.
- Stores `(kmer, file_offset)` pairs for fast seek-to-position lookups (e.g., binary search on k-mer range).
- Enables efficient random access without full file scan.
## Usage Pattern
```go
w, _ := obikmer.NewKdiWriter("data.kdi")
for _, kmer := range sortedKMers {
w.Write(kmer)
}
w.Close() // finalizes header, writes .kdx index
```
The format is optimized for memory-efficient storage and fast retrieval of sorted uint64 k-mers in genomic or sequence analysis pipelines.
# KDX Index Format and Functionality
The `obikmer` package provides a sparse indexing mechanism for `.kdi` files (likely storing sorted k-mers with delta encoding). The **`.kdx` file** serves as a fast lookup table to accelerate k-mer searches.
## Core Concepts
- **Magic bytes**: `KDX\x01` validates file integrity.
- **Stride-based sparsity**: One index entry every *N* k-mers (default: 4096), balancing memory vs. search speed.
- **Entry structure**: Each entry stores:
- `kmer`: the k-mer value at that index position.
- `offset`: absolute byte offset in the corresponding `.kdi` file.
## Key Operations
- **Loading**: `LoadKdxIndex()` reads and validates a `.kdx` file; returns `(nil, nil)` if missing (graceful degradation).
- **Searching**: `FindOffset(target uint64)` performs binary search over index entries to find the *best jump point*:
- Returns `offset`, `skipCount` (k-mer count already passed), and a boolean success flag.
- Enables efficient seeking: after `offset`, only up to *stride* k-mers need linear scanning.
- **Writing**: `WriteKdxIndex()` serializes an in-memory index to disk (for building indexes).
- **Helper**: `KdxPathForKdi()` derives the `.kdx` path from a given `.kdi` file.
## Performance
- Search complexity: **O(log M)** for the binary search (where *M* = #index entries), plus ≤ stride linear steps.
- Memory footprint: Linear in index size (16 bytes per entry), highly scalable for large k-mer sets.
## Design Philosophy
Minimalist, binary-safe format with explicit endianness (little-endian), no external dependencies beyond `encoding/binary`, and robust error handling.
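The binary search behind `FindOffset` can be sketched with `sort.Search`. Field names and the stride-based `skipCount` arithmetic are illustrative, not the package's exact code:

```go
package main

import (
	"fmt"
	"sort"
)

// entry mirrors the (kmer, offset) pairs described above.
type entry struct {
	kmer   uint64
	offset int64
}

// findOffset returns the best jump point for target: the last index
// entry whose kmer is <= target, plus how many k-mers precede it
// (assuming exactly one index entry per stride k-mers).
func findOffset(idx []entry, target uint64, stride int) (offset int64, skipCount int, ok bool) {
	// Locate the first entry strictly greater than target.
	i := sort.Search(len(idx), func(i int) bool { return idx[i].kmer > target })
	if i == 0 {
		return 0, 0, false // target precedes the first indexed k-mer
	}
	return idx[i-1].offset, (i - 1) * stride, true
}

func main() {
	idx := []entry{{100, 0}, {500, 4096}, {900, 8210}}
	off, skip, ok := findOffset(idx, 600, 4096)
	fmt.Println(off, skip, ok) // 4096 4096 true
}
```

After the jump, at most one stride's worth of k-mers remains to scan linearly, which is the O(log M) + stride bound stated above.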
# Semantic Description of `obikmer` Package
The `obikmer` package implements efficient k-mer matching between query sequences and an indexed reference using **canonical k-mers** partitioned by minimizer-based hashing.
- `QueryEntry` represents a single canonical kmer with its origin: sequence index and 1-based position.
- `PreparedQueries` groups queries into sorted buckets per partition, enabling batched and parallelized matching.
- `PrepareQueries` scans input sequences using *super-kmers* (with window size `m`) to compute minimizers, assigns each kmer to a partition via modulo hashing, and sorts buckets by kmer value.
- `MergeQueries` combines two sets of prepared queries across batches using a merge-sort strategy, correctly offsetting sequence indices to preserve global ordering.
- `MatchBatch` performs parallel matching per partition: each goroutine runs a **merge-scan** between sorted queries and the corresponding KDI (K-mer Disk Index) stream.
- Efficient seeking is used only when beneficial, avoiding costly syscalls for small skips.
- Matches are recorded with thread-safe per-sequence mutexes; final positions within each sequence are sorted post-match.
- `matchPartition` implements the core merge-scan: it opens a KDI reader, seeks to relevant regions of the index, and walks both query list and kmer stream in lockstep.
The design supports **large-scale batch processing**, incremental query accumulation, and high-performance parallel lookup—ideal for metagenomic or biodiversity sequencing workflows.
# `obikmer` K-mer Set Group Builder — Functional Overview
The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).
## Core Features
- **K-mer & Minimizer Configuration**:
Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).
- **Functional Options for Filtering**:
- `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).
- `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.
- `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).
- `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.
- **Concurrent & Pipeline-Aware Processing**:
Uses a two-stage pipeline: *I/O-bound readers* (24 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.
- **Partitioned I/O & Thread Safety**:
Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
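The parameter heuristic above (`m ≈ k/2.5`, partitions bounded by `4^m` and capped at 4096) can be sketched as simple arithmetic. The rounding rule and any lower bound on `m` are assumptions of this sketch, not documented behavior:

```go
package main

import "fmt"

// deriveParams sketches the heuristic: m ≈ k/2.5, with the partition
// count bounded by 4^m and capped at 4096.
func deriveParams(k int) (m, partitions int) {
	m = int(float64(k)/2.5 + 0.5) // round to nearest; exact rule assumed
	if m < 1 {
		m = 1
	}
	partitions = 1
	for i := 0; i < m && partitions < 4096; i++ {
		partitions *= 4 // 4^i grows until the cap is reached
	}
	return
}

func main() {
	m, p := deriveParams(31)
	fmt.Println(m, p) // with these assumptions: 12 4096
}
```

For the supported range `k ∈ [2,31]`, the cap of 4096 (`4^6`) is hit as soon as `m ≥ 6`, keeping the number of `.skm` partition files bounded.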
## Workflow
1. **Build Phase**:
- Input sequences → super-kmers extracted via minimizer-based partitioning.
- Super-kmers written to `.build/set_*/part_*.skm`.
2. **Finalization (`Close()`)**:
- `.skm` files loaded → canonical k-mers extracted.
- K-mers sorted, counted (frequency spectrum), and filtered per config.
- Final `.kdi` files written; `spectrum.bin`, and optionally `top_kmers.csv`.
- Metadata (`metadata.toml`) generated; `.build/` cleaned.
3. **Append Mode**:
`AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
## Output Artifacts
- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.
- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).
- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.
- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).
## Design Highlights
- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.
- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.
- **Robust error handling**: Early termination on first failure; cleanup of partial state.