mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,22 @@
|
||||
# `obialign` Package: Sequence Alignment Utilities
|
||||
|
||||
The `obialign` package provides core functions for pairwise biological sequence alignment in Go, designed to work with `obiseq.BioSequence` objects.
|
||||
|
||||
- **Core Alignment Construction**: `_BuildAlignment()` and `BuildAlignment()` reconstruct aligned sequences from a precomputed alignment path (e.g., the output of dynamic programming). They support gap characters and reuse buffers for efficiency.
|
||||
|
||||
- **Quality-Aware Consensus Building**: `BuildQualityConsensus()` generates a consensus sequence from an alignment and per-base quality scores:
|
||||
- At mismatches, it retains the higher-quality base.
|
||||
- When qualities are equal and bases differ, an IUPAC ambiguity code is used (via `_FourBitsBaseCode`/`_FourBitsBaseDecode`).
|
||||
- Quality values are combined and adjusted for mismatches using a Phred-like error probability model.
|
||||
- Optionally records mismatch statistics in sequence attributes.
|
||||
|
||||
- **Performance & Memory Efficiency**: Uses preallocated buffers (via `PEAlignArena`) or fallback allocation, with slice recycling to minimize GC pressure.
|
||||
|
||||
- **Metadata Handling**: Preserves sequence IDs and definitions in output; supports optional mismatch reporting for downstream analysis.
|
||||
|
||||
- **Alignment Path Format**: The path is a sequence of signed integers encoding:
|
||||
- Negative steps → deletions in seqB (insertion in A),
|
||||
- Positive steps → insertions in B,
|
||||
- Consecutive pairs encode match/mismatch runs.
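The path format above can be illustrated with a minimal sketch. It assumes the path is a flat list of `(gapStep, diagLen)` pairs, where a negative `gapStep` consumes bases of A against gaps in B, a positive one consumes bases of B against gaps in A, and `diagLen` is the length of the following match/mismatch run. The helper name `expandPath` and this exact pair ordering are assumptions made for illustration, not the package's confirmed layout.

```go
package main

import (
	"bytes"
	"fmt"
)

// expandPath is a hypothetical helper: it rebuilds two gapped sequences
// from a path made of (gapStep, diagLen) pairs.
func expandPath(seqA, seqB []byte, path []int, gap byte) ([]byte, []byte) {
	var outA, outB []byte
	posA, posB := 0, 0
	for i := 0; i+1 < len(path); i += 2 {
		step, diag := path[i], path[i+1]
		switch {
		case step < 0: // bases of A aligned to gaps in B
			n := -step
			outA = append(outA, seqA[posA:posA+n]...)
			outB = append(outB, bytes.Repeat([]byte{gap}, n)...)
			posA += n
		case step > 0: // bases of B aligned to gaps in A
			outB = append(outB, seqB[posB:posB+step]...)
			outA = append(outA, bytes.Repeat([]byte{gap}, step)...)
			posB += step
		}
		// Diagonal run: both sequences advance together.
		outA = append(outA, seqA[posA:posA+diag]...)
		outB = append(outB, seqB[posB:posB+diag]...)
		posA += diag
		posB += diag
	}
	return outA, outB
}

func main() {
	a, b := expandPath([]byte("ACGTAC"), []byte("ACTACG"), []int{0, 2, -1, 3, 1, 0}, '-')
	fmt.Printf("%s\n%s\n", a, b)
}
```

The sketch allocates fresh slices on every call; the package description mentions buffer reuse, which is deliberately left out here to keep the example short.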
|
||||
|
||||
This package is part of the OBITools4 ecosystem, targeting high-throughput amplicon or metagenomic data processing.
|
||||
@@ -0,0 +1,30 @@
|
||||
# Semantic Description of `obialign` Backtracking Module
|
||||
|
||||
The `_Backtracking` function implements a **traceback algorithm** for sequence alignment, reconstructing the optimal path through an alignment matrix.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Input**:
|
||||
- `pathMatrix`: Encodes alignment decisions (match/mismatch/gap) as integers.
|
||||
- `lseqA`, `lseqB`: Lengths of sequences A and B.
|
||||
- `path`: Pre-allocated slice to store the traceback path.
|
||||
|
||||
- **Output**: A compact representation of alignment steps, alternating between:
|
||||
- Diagonal moves (`ldiag`): Matches/mismatches (one step in both sequences).
|
||||
- Horizontal/vertical moves (`lleft` or `lup`): Gaps in sequence B (horizontal) or A (vertical).
|
||||
|
||||
## Algorithm Highlights
|
||||
|
||||
- **Reverse traversal** from `(lseqA−1, lseqB−1)` to origin.
|
||||
- **Batching logic**: Consecutive gaps in the same direction are aggregated (e.g., `lleft += step`), so the path is stored as a run-length encoding.
|
||||
- **Path reconstruction**: Steps are pushed *backwards* into the `path` slice using a moving pointer `p`.
|
||||
- **Memory efficiency**: Uses `slices.Grow()` to preallocate space and logs resizing for debugging.
|
||||
|
||||
## Encoded Path Semantics
|
||||
|
||||
Each pair in the returned slice encodes:
|
||||
- `[diag_count, move_type]`, where `move_type` is either a gap length (`lleft > 0`: horizontal, or `lup < 0`: vertical) or zero (end of diagonal run).
|
||||
|
||||
## Use Case
|
||||
|
||||
Enables efficient reconstruction and serialization of alignment paths—ideal for tools requiring low-level control over dynamic programming backtracking (e.g., pairwise aligners, edit-distance decompositions).
|
||||
@@ -0,0 +1,26 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
This Go package provides core utilities for **DNA sequence alignment scoring**, leveraging probabilistic models and log-space computations to ensure numerical stability.
|
||||
|
||||
## Key Functionalities
|
||||
|
||||
- **Four-bit nucleotide encoding**: Uses `_FourBitsBaseCode` (implied but not shown) to encode DNA bases as 4-bit values, enabling bitwise operations for fast comparison.
|
||||
|
||||
- **Bitwise match ratio (`_MatchRatio`)**: Computes a normalized overlap score between two encoded bases by counting shared bits, adjusting for presence/absence in each operand.
|
||||
|
||||
- **Log-space arithmetic helpers**:
|
||||
- `_Logaddexp`: Stable computation of `log(exp(a) + exp(b))`.
|
||||
- `_Log1mexp`, `_Logdiffexp`: Accurate log-domain operations for `log(1 − exp(a))` and `log(exp(a) − exp(b))`, critical for probability transformations.
|
||||
|
||||
- **Match/mismatch scoring (`_MatchScoreRatio`)**:
|
||||
- Derives log-probability-based scores for observed matches/mismatches using Phred-quality inputs (`QF`, `QR`).
|
||||
- Incorporates base composition priors (e.g., a uniform distribution over the four nucleotides, which introduces the `log(3)` and `log(4)` terms).
|
||||
|
||||
- **Precomputed scoring matrices**:
|
||||
- `_NucPartMatch`: Precomputes match ratios for all base-pair combinations.
|
||||
- `_NucScorePartMatch{Match,Mismatch}`: Stores integer-scaled alignment scores (×10) for all Phred-quality pairs, enabling fast lookup during dynamic programming.
|
||||
|
||||
- **Thread-safe initialization**:
|
||||
- `_InitDNAScoreMatrix` ensures one-time setup of all matrices using a mutex guard, preventing race conditions.
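The log-space helpers listed above follow well-known numerical identities. The sketch below shows one common way to write them with the standard `math` package; the function names `logAddExp`, `log1mExp`, and `logDiffExp` are illustrative and the package's own `_Logaddexp`, `_Log1mexp`, and `_Logdiffexp` may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// logAddExp returns log(exp(a) + exp(b)) without overflowing exp().
func logAddExp(a, b float64) float64 {
	if a < b {
		a, b = b, a
	}
	if math.IsInf(b, -1) {
		return a // exp(b) is zero, nothing to add
	}
	return a + math.Log1p(math.Exp(b-a))
}

// log1mExp returns log(1 - exp(a)) for a < 0, switching formula
// around -ln(2) to keep precision on both sides.
func log1mExp(a float64) float64 {
	if a > -math.Ln2 {
		return math.Log(-math.Expm1(a))
	}
	return math.Log1p(-math.Exp(a))
}

// logDiffExp returns log(exp(a) - exp(b)) for a > b.
func logDiffExp(a, b float64) float64 {
	return a + log1mExp(b-a)
}

func main() {
	p, q := math.Log(0.25), math.Log(0.5)
	fmt.Println(math.Exp(logAddExp(p, q)))  // close to 0.75
	fmt.Println(math.Exp(log1mExp(p)))      // close to 0.75
	fmt.Println(math.Exp(logDiffExp(q, p))) // close to 0.25
}
```

Working in log space matters here because Phred qualities translate into very small error probabilities, whose products and complements lose precision when computed directly.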
|
||||
|
||||
All computations are designed for high performance and numerical robustness in large-scale sequence alignment tasks.
|
||||
@@ -0,0 +1,23 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides low-level utilities for efficiently encoding, decoding, and manipulating alignment-related metrics—specifically **score**, **path length**, and an **out-flag**—within compact 64-bit integers. This design supports high-performance operations in sequence alignment pipelines (e.g., OBITools4).
|
||||
|
||||
- **Core Encoding Strategy**:
|
||||
A `uint64` encodes three fields: a *score* (upper bits), an inverted path *length*, and a single-bit flag indicating whether the value represents an "out" (i.e., terminal/invalid) state.
|
||||
|
||||
- **`encodeValues(score, length int, out bool)`**:
|
||||
Packs `score`, `-length-1` (to preserve ordering via unsigned comparison), and the `out` flag into one integer. A dedicated flag bit (bit 32) marks out-values.
|
||||
|
||||
- **`decodeValues(value uint64)`**:
|
||||
Reverses encoding: extracts score, reconstructs original length via `((value + 1) ^ mask)`, and checks the out-flag.
|
||||
|
||||
- **Utility Bitwise Helpers**:
|
||||
- `_incpath(value)`: decrements stored length (since it's negated, subtraction increases actual path).
|
||||
- `_incscore(value)`: increments score by `1 << wsize`.
|
||||
- `_setout(value)`: clears the out-flag, marking value as *not* terminal.
|
||||
|
||||
- **Predefined Constants**:
|
||||
- `_empty`: neutral state (score=0, length=0).
|
||||
- `_out`/`_notavail`: sentinel values for invalid or unavailable paths (high length, score=0).
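The packing idea can be sketched as follows. The layout is an assumption chosen for illustration: two 16-bit fields (inverted length in the low bits, score above it) and a flag at bit 32. The names `pack` and `unpack` are hypothetical, and the real `encodeValues`/`decodeValues` may use different widths, offsets, or flag polarity.

```go
package main

import "fmt"

const (
	fieldBits = 16                           // assumed width of the score and length fields
	fieldMask = uint64(1)<<fieldBits - 1     // low 16 bits
	outBit    = uint64(1) << (2 * fieldBits) // bit 32, used here as the out flag
)

// pack stores the score above an inverted length, so that a plain
// unsigned comparison prefers higher scores, then shorter paths.
func pack(score, length int, out bool) uint64 {
	v := uint64(score)<<fieldBits | (fieldMask - uint64(length))
	if out {
		v |= outBit
	}
	return v
}

// unpack reverses pack.
func unpack(v uint64) (score, length int, out bool) {
	score = int((v >> fieldBits) & fieldMask)
	length = int(fieldMask - (v & fieldMask))
	out = v&outBit != 0
	return
}

func main() {
	better := pack(8, 20, false)
	worse := pack(7, 5, false)
	fmt.Println(better > worse) // the higher score wins despite the longer path
	fmt.Println(unpack(pack(7, 12, false)))
}
```

With such a layout, taking the maximum of three candidate cells during dynamic programming is a single integer comparison per candidate, which is the benefit the description points to.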
|
||||
|
||||
This compact representation enables fast comparisons and updates during dynamic programming or alignment graph traversal—critical for scalability in large-scale metabarcoding analyses.
|
||||
@@ -0,0 +1,42 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
|
||||
|
||||
## Core Algorithm
|
||||
|
||||
- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
|
||||
- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
|
||||
- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
|
||||
|
||||
## Scoring Scheme
|
||||
|
||||
- **Match**: +1 point
|
||||
- **Mismatch or gap (indel)**: 0 points
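As a point of reference for this scoring scheme, here is a textbook LCS scorer: +1 for a match, 0 otherwise. It is a plain quadratic sketch using exact byte equality; it has none of the banding, bit packing, or IUPAC handling described for the package, and `lcsScore` is a name chosen for this example.

```go
package main

import "fmt"

// lcsScore returns the length of the longest common subsequence of a and b,
// keeping only two rows of the dynamic programming matrix.
func lcsScore(a, b []byte) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				cur[j] = prev[j-1] + 1 // match: +1
			case prev[j] >= cur[j-1]:
				cur[j] = prev[j] // gap or mismatch: 0
			default:
				cur[j] = cur[j-1]
			}
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(lcsScore([]byte("ACGTAC"), []byte("ACTACG")))
}
```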
|
||||
|
||||
## Key Functions
|
||||
|
||||
1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`
|
||||
- Computes LCS score and alignment length between raw byte sequences.
|
||||
- If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).
|
||||
- Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.
|
||||
- Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
|
||||
|
||||
2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
|
||||
- Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.
|
||||
- Designed for standard biosequence inputs.
|
||||
|
||||
3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
|
||||
- Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
|
||||
|
||||
## Features
|
||||
|
||||
- **Error-bounded**: Supports `maxError = -1` (unlimited) or a fixed max number of mismatches + gaps.
|
||||
- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
|
||||
- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
|
||||
- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
|
||||
|
||||
## Use Cases
|
||||
|
||||
- Molecular barcode/UMI clustering
|
||||
- Read-to-reference alignment in amplicon sequencing
|
||||
- Similarity filtering of biological sequences
|
||||
@@ -0,0 +1,15 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
|
||||
|
||||
- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
|
||||
- **Binary encoding scheme**: Each bit in a byte corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).
|
||||
- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
|
||||
- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
|
||||
- **Memory efficiency**: The encoding avoids allocations via optional buffer reuse.
|
||||
- **Lookup tables**:
|
||||
- `_FourBitsBaseCode`: Maps ASCII nucleotide characters (indexed case-insensitively via `nuc & 31`) to their binary code.
|
||||
- `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
|
||||
- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
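The bit layout described above (A = bit 0, C = bit 1, G = bit 2, T = bit 3) can be sketched with a small lookup map. This is an illustrative stand-in rather than the package's table: `encode4bits` and `compatible` are hypothetical names, and a map is used instead of the `nuc & 31` array index for readability.

```go
package main

import "fmt"

// iupac maps uppercase IUPAC symbols to 4-bit codes: A=1, C=2, G=4, T=8.
var iupac = map[byte]byte{
	'A': 1, 'C': 2, 'G': 4, 'T': 8, 'U': 8,
	'M': 3, 'R': 5, 'W': 9, 'S': 6, 'Y': 10, 'K': 12,
	'V': 7, 'H': 11, 'D': 13, 'B': 14, 'N': 15,
}

// encode4bits converts a sequence to 4-bit codes. Gaps ('.', '-') and any
// other non-nucleotide byte map to 0 because they are absent from the map.
func encode4bits(seq []byte) []byte {
	out := make([]byte, len(seq))
	for i, c := range seq {
		out[i] = iupac[c&^0x20] // clear bit 5 to fold ASCII lowercase to uppercase
	}
	return out
}

// compatible reports whether two codes share at least one nucleotide.
func compatible(a, b byte) bool { return a&b != 0 }

func main() {
	codes := encode4bits([]byte("ARn-"))
	fmt.Println(codes)
	fmt.Println(compatible(codes[1], codes[0])) // R vs A
}
```

Because each base is a bit, testing whether two possibly ambiguous symbols can denote the same nucleotide reduces to a single bitwise AND.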
|
||||
|
||||
The `Encode4bits` function enables fast, space-efficient sequence processing—ideal for high-throughput sequencing data where alignment speed and memory usage are critical.
|
||||
@@ -0,0 +1,19 @@
|
||||
## `obialign` Package: Semantic Overview
|
||||
|
||||
The `obialign` package provides a lightweight, high-performance utility for **detecting single-edit-distance relationships** between biological sequences (`obiseq.BioSequence`). Its core function, `D1Or0`, determines whether two sequences are either **identical** or differ by exactly **one substitution, insertion, or deletion (indel)**.
|
||||
|
||||
- `abs[k]`: A generic helper computing absolute values for integers or floats (via Go generics).
|
||||
- `D1Or0(...)`: Returns a 4-tuple:
|
||||
- **`int` (first)**: `0` if identical, `1` if differing by one edit, `-1` otherwise.
|
||||
- **`int` (second)**: Position of the differing site (`-1` if identical).
|
||||
- **`byte`, `byte`**: Mismatched characters (or `'-'` for gaps indicating indels).
|
||||
|
||||
**Algorithmic strategy:**
|
||||
1. Early rejection if length difference exceeds 1.
|
||||
2. Forward scan until first mismatch → identifies left boundary of divergence.
|
||||
3. Backward scan from ends to find rightmost match boundary.
|
||||
4. Validates whether the mismatch region allows exactly one edit:
|
||||
- Single substitution: equal lengths, single divergent position.
|
||||
- Insertion/deletion: length differs by 1 and only one non-overlapping character remains.
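The four steps above can be sketched on raw byte slices. This simplified version only returns the first element of the tuple (0, 1, or -1); the position and the differing characters reported by `D1Or0` are omitted, and `withinOneEdit` is a name introduced for this example.

```go
package main

import "fmt"

// withinOneEdit returns 0 if a and b are identical, 1 if they differ by
// exactly one substitution, insertion, or deletion, and -1 otherwise.
func withinOneEdit(a, b []byte) int {
	la, lb := len(a), len(b)
	if la-lb > 1 || lb-la > 1 {
		return -1 // step 1: early rejection on length
	}
	// Step 2: forward scan to the first mismatch.
	i := 0
	for i < la && i < lb && a[i] == b[i] {
		i++
	}
	if i == la && i == lb {
		return 0
	}
	// Step 3: backward scan, never crossing the left boundary i.
	ea, eb := la, lb
	for ea > i && eb > i && a[ea-1] == b[eb-1] {
		ea--
		eb--
	}
	// Step 4: what remains between the boundaries must be a single edit.
	ra, rb := ea-i, eb-i
	if (ra == 1 && rb == 1) || (ra == 1 && rb == 0) || (ra == 0 && rb == 1) {
		return 1
	}
	return -1
}

func main() {
	fmt.Println(withinOneEdit([]byte("ACGT"), []byte("AGGT")))
	fmt.Println(withinOneEdit([]byte("ACGT"), []byte("ACT")))
	fmt.Println(withinOneEdit([]byte("ACGT"), []byte("TGCA")))
}
```

Both scans are linear, so the check costs O(n) time and no extra memory, which is why no alignment matrix is needed for this decision.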
|
||||
|
||||
Designed for speed in **OTU/ASV dereplication or error correction** pipelines (e.g., metabarcoding), where rapid filtering of near-identical sequences is critical. Does *not* compute full alignments; optimized for binary decision-making under strict edit constraints.
|
||||
@@ -0,0 +1,29 @@
|
||||
# `LocatePattern` Functionality Overview
|
||||
|
||||
The `obialign.LocatePattern` function implements a **local alignment algorithm** to find the best approximate match of a short DNA pattern (e.g., primer) within a longer biological sequence, using **dynamic programming**.
|
||||
|
||||
- **Input**:
|
||||
- `id`: identifier for logging/error reporting.
|
||||
- `pattern []byte`: the query sequence (e.g., primer).
|
||||
- `sequence []byte`: the target read/contig.
|
||||
|
||||
- **Constraints**:
|
||||
- Pattern must be strictly shorter than the sequence (`len(pattern) < len(sequence)`).
|
||||
|
||||
- **Scoring Scheme**:
|
||||
- Match: `+0` (using IUPAC compatibility via `obiseq.SameIUPACNuc`).
|
||||
- Mismatch/Gap: `-1`.
|
||||
|
||||
- **Algorithm Features**:
|
||||
- End-gap free alignment (no penalty for gaps at sequence ends), enabling flexible primer positioning.
|
||||
- Uses a flattened buffer (`buffIndex`) for memory-efficient matrix storage (width × height).
|
||||
- Tracks alignment path via `path` array: diagonal (`0`, match/mismatch), up (`+1`, deletion in pattern/left gap), left (`-1`, insertion/deletion).
|
||||
- Backtracks from the bottom-right to find optimal local alignment start/end coordinates.
|
||||
|
||||
- **Output**:
|
||||
- `start`: starting index in `sequence`.
|
||||
- `end+1`: ending index (exclusive) of best match.
|
||||
- Error count: `-score`, i.e., number of mismatches/gaps in alignment.
|
||||
|
||||
- **Use Case**:
|
||||
Designed for high-throughput amplicon processing (e.g., primer trimming in metabarcoding pipelines like OBITools4).
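The end-gap-free search described above can be sketched with a classic approximate-matching recurrence: the first row of the matrix is all zeros, so the pattern may start anywhere in the sequence, and the best cell of the last row gives the end of the match. This sketch departs from the description in a few labelled ways: it uses exact byte equality instead of `obiseq.SameIUPACNuc`, it counts errors as positive numbers instead of a negative score, and it propagates start positions through the matrix rather than backtracking. The name `locate` is hypothetical.

```go
package main

import "fmt"

// locate returns the half-open interval [start, end) of the best approximate
// occurrence of pattern in sequence, together with its number of errors.
func locate(pattern, sequence []byte) (start, end, errs int) {
	m, n := len(pattern), len(sequence)
	prevD, curD := make([]int, n+1), make([]int, n+1) // error counts
	prevS, curS := make([]int, n+1), make([]int, n+1) // start positions
	for j := 0; j <= n; j++ {
		prevS[j] = j // row 0 costs nothing: leading end gaps are free
	}
	for i := 1; i <= m; i++ {
		curD[0], curS[0] = i, 0
		for j := 1; j <= n; j++ {
			cost := 1
			if pattern[i-1] == sequence[j-1] {
				cost = 0
			}
			best, from := prevD[j-1]+cost, prevS[j-1] // diagonal move
			if d := prevD[j] + 1; d < best {          // pattern base facing a gap
				best, from = d, prevS[j]
			}
			if d := curD[j-1] + 1; d < best { // sequence base facing a gap
				best, from = d, curS[j-1]
			}
			curD[j], curS[j] = best, from
		}
		prevD, curD = curD, prevD
		prevS, curS = curS, prevS
	}
	// Trailing end gaps are free too: take the best cell of the last row.
	start, end, errs = prevS[0], 0, prevD[0]
	for j := 1; j <= n; j++ {
		if prevD[j] < errs {
			start, end, errs = prevS[j], j, prevD[j]
		}
	}
	return
}

func main() {
	fmt.Println(locate([]byte("ACGT"), []byte("TTACGTTT")))
}
```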
|
||||
@@ -0,0 +1,37 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
### 1. **Memory Arena Management**
|
||||
- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
|
||||
- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
|
||||
|
||||
### 2. **Dynamic Programming Alignment Functions**
|
||||
Implements three specialized global alignment variants using Needleman–Wunsch with affine gap penalties (scaled per mismatch):
|
||||
|
||||
- **`PELeftAlign`**: Free gaps at the *start* of `seqB` and end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
|
||||
- **`PERightAlign`**: Free gaps at start of `seqA` and end of `seqB`. Suited when the second read extends beyond the first.
|
||||
- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA` to be at least as long as `seqB`. Designed for full overlap scenarios (e.g., merging paired-end reads).
|
||||
|
||||
All use column-major matrix storage and efficient index arithmetic via helper functions `_GetMatrix`, `_SetMatrices`, etc.
|
||||
|
||||
### 3. **Scoring & Quality Integration**
|
||||
- Pairwise base/quality scores computed by `_PairingScorePeAlign`, combining:
|
||||
- Nucleotide compatibility (via precomputed `_NucPartMatch`)
|
||||
- Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
|
||||
- A user-defined `scale` factor to modulate mismatch penalties.
|
||||
|
||||
### 4. **Fast Heuristic Pre-Alignment**
|
||||
The main `PEAlign` function integrates a kmer-based fast pre-screening:
|
||||
- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
|
||||
- If the k-mer count shows that the predicted overlap is not a perfect match (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
|
||||
- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
|
||||
|
||||
### 5. **Backtracking & Path Output**
|
||||
- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
|
||||
- Paths encoded as alternating `(offset, length)` pairs for aligned segments (diagonal = 0), with gaps encoded as `-1`/`+1`.
|
||||
|
||||
### Use Case
|
||||
Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., the OBITools4 ecosystem). Efficient and scalable for large batch processing via arena reuse.
|
||||
@@ -0,0 +1,58 @@
|
||||
# Semantic Description of `obialign.ReadAlign`
|
||||
|
||||
The `ReadAlign` function performs **paired-end read alignment** with quality-aware scoring, optimized for overlapping consensus construction in NGS data processing.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Input**: Two biological sequences (`seqA`, `seqB`) as `BioSequence` objects, plus alignment parameters:
|
||||
- `gap`: gap penalty (linear)
|
||||
- `scale`: scaling factor for quality scores
|
||||
- `delta`: extension buffer around initial overlap estimate
|
||||
- `fastScoreRel`: use relative vs absolute k-mer matching score
|
||||
|
||||
## Algorithm Overview
|
||||
|
||||
1. **Preprocessing & Initialization**
|
||||
- Ensures DNA scoring matrix is initialized (`_InitDNAScoreMatrix`).
|
||||
|
||||
2. **Fast Overlap Estimation via 4-mer Indexing**
|
||||
- Builds a k-mer index of `seqA` using `obikmer.Index4mer`.
|
||||
- Computes optimal shift via `_FastShiftFourMer` in both forward and reverse-complement orientations.
|
||||
- Selects orientation (direct or reversed) yielding highest k-mer match count (`fastCount`) and score (`fastScore`).
|
||||
|
||||
3. **Overlap Computation**
|
||||
- Determines overlap length `over` based on shift:
|
||||
```text
|
||||
over = |seqA| - shift if shift > 0
|
||||
|seqB| + shift if shift < 0
|
||||
min(|seqA|, |seqB|) otherwise
|
||||
```
|
||||
|
||||
4. **Dynamic Programming Alignment**
|
||||
- If overlap is *not* identical (`fastCount + 3 < over`):
|
||||
- Extracts subregions with `delta`-buffered boundaries.
|
||||
- Calls either `_FillMatrixPeLeftAlign` (left-aligned case) or `_FillMatrixPERightAlign`.
|
||||
- Backtracks via `_Backtracking` to produce alignment path.
|
||||
- Else (near-perfect overlap):
|
||||
- Skips DP; computes score directly from quality scores using `_NucScorePartMatchMatch`.
|
||||
- Returns trivial path `[extra5, partLen]`.
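Steps 3 and 4 reduce to two small decisions, sketched below. The function names are hypothetical; the formulas are the ones given in the text (overlap length from the shift, then the `fastCount + 3 < over` test). The `+ 3` reflects that a perfect overlap of length `over` contains `over - 3` overlapping 4-mers.

```go
package main

import "fmt"

// overlapLength derives the overlap length from the estimated shift.
func overlapLength(lenA, lenB, shift int) int {
	switch {
	case shift > 0:
		return lenA - shift
	case shift < 0:
		return lenB + shift
	default:
		if lenA < lenB {
			return lenA
		}
		return lenB
	}
}

// needsDP reports whether the shared 4-mer count is too low for the
// overlap to be treated as a perfect match, so that DP is required.
func needsDP(fastCount, over int) bool {
	return fastCount+3 < over
}

func main() {
	over := overlapLength(100, 90, 30)
	fmt.Println(over, needsDP(60, over), needsDP(67, over))
}
```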
|
||||
|
||||
## Output
|
||||
|
||||
Returns:
|
||||
|
||||
| Index | Type | Meaning |
|
||||
|-------|----------|---------|
|
||||
| 0️⃣ | `int` | Final alignment score (weighted by quality) |
|
||||
| 1️⃣ | `[]int` | Alignment path, as produced by `_Backtracking`, or the trivial path `[extra5, partLen]` when DP is skipped |
|
||||
| 2️⃣ | `int` | K-mer match count (`fastCount`) |
|
||||
| 3️⃣ | `int` | Overlap length (`over`) |
|
||||
| 4️⃣ | `float64` | K-mer-based score (`fastScore`) |
|
||||
| 5️⃣ | `bool` | Whether alignment was performed in direct orientation (`true`) or on reverse-complement of `seqB` |
|
||||
|
||||
## Key Design Highlights
|
||||
|
||||
- **Efficient pre-filtering** using 4-mers avoids full DP for nearly identical reads.
|
||||
- **Quality-aware scoring**, leveraging Phred scores via `_NucScorePartMatchMatch`.
|
||||
- Supports **asymmetric overlaps** (left/right alignment) with boundary padding (`delta`).
|
||||
- Uses preallocated memory arenas to minimize GC pressure in high-throughput pipelines.
|
||||
@@ -0,0 +1,25 @@
|
||||
# Apat Package: Pattern Matching for Biological Sequences
|
||||
|
||||
The `obiapat` Go package provides high-performance pattern matching over biological sequences using the **Apat algorithm**, a C-based implementation wrapped in Go. It supports fuzzy matching (with mismatches and indels), reverse-complement patterns, memory-safe resource management via finalizers, and efficient filtering of non-overlapping matches.
|
||||
|
||||
## Core Types
|
||||
|
||||
- `ApatPattern`: Represents a compiled pattern (up to 64 bp), supporting IUPAC ambiguity codes (`W`, `[AT]`), negated bases (`!A`), and fixed positions (`#`).
|
||||
- `ApatSequence`: Wraps a biological sequence (from `obiseq.BioSequence`) for fast matching, with optional circular topology support and memory recycling.
|
||||
|
||||
## Key Functions & Methods
|
||||
|
||||
- `MakeApatPattern(pattern string, errormax int, allowsIndel bool)`: Compiles a pattern with max error tolerance and optional indels.
|
||||
- `ReverseComplement()`: Returns the reverse-complemented pattern (useful for DNA strand symmetry).
|
||||
- `FindAllIndex(...)`: Returns all matches as `[start, end, errors]`, supporting partial sequence searches.
|
||||
- `IsMatching(...)`: Boolean check for presence of at least one match in a region.
|
||||
- `BestMatch(...)`: Finds the *best* (lowest-error) match, with local realignment for indel-containing patterns.
|
||||
- `FilterBestMatch(...)`: Returns *non-overlapping* matches, prioritizing lower-error occurrences.
|
||||
- `AllMatches(...)`: Filters and refines all valid matches (including indel-aware alignment).
|
||||
- `Free()`, `Len()`: Explicit memory cleanup and length queries.
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
Internally, the package uses `cgo` to interface with C structures (`Pattern`, `Seq`) allocated via custom memory management. Finalizers ensure safe deallocation, while unsafe pointer arithmetic avoids data copying during search (e.g., `unsafe.SliceData`). Logging is integrated via Logrus.
|
||||
|
||||
This package enables scalable, low-level pattern mining in NGS data preprocessing pipelines (e.g., primer detection, adapter trimming).
|
||||
@@ -0,0 +1,32 @@
|
||||
# Semantic Description of `obiapat` Package Functionality
|
||||
|
||||
The `obiapat` package provides utilities for constructing and representing **approximate sequence patterns**—flexible biological or symbolic string templates supporting mismatches, insertions, and deletions.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool)`**
|
||||
Parses a pattern specification (e.g., `"A[T]C!GT"`) and returns an internal representation (`*ApatPattern`) suitable for approximate matching.
|
||||
|
||||
- `pattern`: A string where:
|
||||
- Standard characters (e.g., `'A'`, `'C'`) denote exact matches.
|
||||
- Brackets enclose the set of bases accepted at one position, e.g., `[AT]` is equivalent to the IUPAC code `W`.
|
||||
- Exclamation `!` negates the base that follows it (e.g., `!A` accepts any base except `A`), and `#` marks a fixed position.
|
||||
- `errormax`: Maximum number of allowed errors (mismatches or indels, depending on flags).
|
||||
- `allowsIndel`: Boolean flag enabling/disabling insertion/deletion operations.
|
||||
|
||||
## Behavior & Semantics
|
||||
|
||||
- Returns a compiled pattern object (non-nil) on success; errors may arise from malformed input or invalid parameters.
|
||||
- Supports three modes:
|
||||
- **Exact matching** (`errormax = 0`, `allowsIndel = false`).
|
||||
- **Substitution-only approximation** (`errormax > 0`, `allowsIndel = false`).
|
||||
- **Full approximate matching with indels** (`errormax > 0`, `allowsIndel = true`).
|
||||
|
||||
## Testing Coverage
|
||||
|
||||
The provided test suite validates:
|
||||
- Valid pattern parsing across different configurations.
|
||||
- Correct handling of `nil` vs. non-nil output pointers.
|
||||
- Robustness against error conditions (e.g., invalid inputs would trigger expected errors).
|
||||
|
||||
In summary, `obiapat` enables efficient definition and handling of *approximate regular expressions* tailored for sequence analysis in bioinformatics or pattern recognition contexts.
|
||||
@@ -0,0 +1,27 @@
|
||||
# PCR Simulation Module (`obiapat`)
|
||||
|
||||
This Go package implements a **PCR (Polymerase Chain Reaction) simulation algorithm** for biological sequence analysis. It supports flexible primer matching, amplicon extraction with optional flanking extensions, and handles both linear and circular DNA topologies.
|
||||
|
||||
## Key Functionalities
|
||||
|
||||
- **Primer Matching**: Accepts forward/reverse primers with configurable mismatch tolerance (`OptionForwardPrimer`, `OptionReversePrimer`). Internally builds pattern objects and their reverse complements.
|
||||
- **Amplicon Extraction**: Identifies valid amplicons bounded by primer pairs, respecting user-defined length constraints (`OptionMinLength`, `OptionMaxLength`).
|
||||
- **Extension Support**: Optionally adds fixed-length flanking regions (`OptionWithExtension`) — either strict full-extension only or partial trimming allowed.
|
||||
- **Topology Handling**: Supports linear (`Circular: false`) and circular DNA sequences via `OptionCircular`.
|
||||
- **Batch & Parallel Processing**: Configurable batch size (`OptionBatchSize`) and parallel workers count (`OptionParallelWorkers`), enabling efficient processing of large datasets.
|
||||
- **Annotation-Rich Output**: Each amplicon includes detailed annotations (primer sequences, match positions, errors, direction), preserving original sequence metadata.
|
||||
|
||||
## Core API
|
||||
|
||||
- `PCRSim(sequence, options...)`: Simulates PCR on a single sequence.
|
||||
- `PCRSlice(sequencesSlice, options...)`: Applies simulation across multiple sequences in a slice.
|
||||
- `PCRSliceWorker(options...)`: Returns a reusable worker function for parallel execution via `obiseq.MakeISliceWorker`.
|
||||
|
||||
## Implementation Details
|
||||
|
||||
- Uses pattern-matching (`ApatPattern`) with fuzzy search to locate primers.
|
||||
- Handles circular topology by wrapping indices around sequence boundaries.
|
||||
- Reuses internal memory via `MakeApatSequence`/`Free`, supporting efficient GC and large-scale processing.
|
||||
- Logs critical errors with `logrus`; debug-level details for amplicon generation.
|
||||
|
||||
Designed to integrate within the OBITools4 ecosystem, this module enables high-fidelity *in silico* PCR for metabarcoding and NGS data validation workflows.
|
||||
@@ -0,0 +1,23 @@
|
||||
## Semantic Description of `IsPatternMatchSequence`
|
||||
|
||||
The function `IsPatternMatchSequence` defines a **sequence predicate** for pattern-based matching in biological sequences (e.g., DNA/RNA), supporting fuzzy and strand-aware search.
|
||||
|
||||
### Core Functionality:
|
||||
- **Input Parameters**
|
||||
- `pattern`: A regular expression-like string describing the target pattern.
|
||||
- `errormax`: Maximum allowed mismatches (substitutions only by default).
|
||||
- `bothStrand`: If true, also search on the reverse-complement strand.
|
||||
- `allowIndels`: Enables insertion/deletion errors (beyond mismatches) when set to true.
|
||||
|
||||
- **Internal Workflow**
|
||||
- Parses the pattern into an automaton (`apat`) via `MakeApatPattern`.
|
||||
- Computes its reverse complement for dual-strand matching.
|
||||
- Returns a closure (`SequencePredicate`) that tests whether a given `BioSequence` matches the pattern (or its RC), within error tolerance.
|
||||
|
||||
- **Matching Logic**
|
||||
- Converts input sequence to `apat` format.
|
||||
- Checks match on forward strand first; if failed and `bothStrand=true`, tries reverse complement.
|
||||
- Uses automaton-based matching (`IsMatching`) for efficient fuzzy search.
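The closure pattern described above can be sketched independently of the Apat engine. In this simplified stand-in, exact substring search replaces fuzzy matching, sequences are plain byte slices, and `SequencePredicate`, `matchPredicate`, and `reverseComplement` are local illustrative definitions rather than the package's API.

```go
package main

import (
	"bytes"
	"fmt"
)

// SequencePredicate is a simplified local type: it tests one raw sequence.
type SequencePredicate func(seq []byte) bool

// reverseComplement handles A, C, G, T; any other symbol becomes N.
func reverseComplement(s []byte) []byte {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i, c := range s {
		r, ok := comp[c]
		if !ok {
			r = 'N'
		}
		out[len(s)-1-i] = r
	}
	return out
}

// matchPredicate compiles the pattern once and returns a closure that
// checks the forward strand first, then the reverse complement if asked.
func matchPredicate(pattern string, bothStrand bool) SequencePredicate {
	fwd := bytes.ToUpper([]byte(pattern))
	rev := reverseComplement(fwd)
	return func(seq []byte) bool {
		up := bytes.ToUpper(seq)
		if bytes.Contains(up, fwd) {
			return true
		}
		return bothStrand && bytes.Contains(up, rev)
	}
}

func main() {
	both := matchPredicate("GATTC", true)
	single := matchPredicate("GATTC", false)
	read := []byte("aagaatcaa") // contains GAATC, the reverse complement
	fmt.Println(both(read), single(read))
}
```

Building the reverse complement once, outside the closure, keeps the per-sequence cost limited to the search itself.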
|
||||
|
||||
### Semantic Use Case:
|
||||
Enables flexible, error-tolerant detection of sequence motifs (e.g., primers, barcodes) in high-throughput sequencing data—supporting both *in silico* primer design validation and read filtering in metagenomic pipelines.
|
||||
@@ -0,0 +1,15 @@
|
||||
# `ISequenceChunk` Function — Semantic Description
|
||||
|
||||
The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.
|
||||
|
||||
- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
|
||||
- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
|
||||
- Optional parameters allow fine-tuning:
|
||||
- `dereplicate`: enables deduplication of identical sequences.
|
||||
- `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
|
||||
- `statsOn`: configures what metadata (e.g., description fields) are tracked for statistics.
|
||||
- `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.
|
||||
|
||||
The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.
|
||||
|
||||
This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
|
||||
@@ -0,0 +1,18 @@
|
||||
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences
|
||||
|
||||
The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
|
||||
|
||||
Key capabilities include:
|
||||
|
||||
- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
|
||||
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
|
||||
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
|
||||
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
|
||||
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
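Per-batch dereplication with a composite key can be sketched as below. The `record` type, its fields, and the `dereplicate` function are hypothetical simplifications: a single `Sample` string stands in for the classification categories, and a `Count` field stands in for the aggregated statistics.

```go
package main

import "fmt"

// record is a hypothetical, simplified sequence record.
type record struct {
	Seq    string
	Sample string
	Count  int
}

// dereplicate merges records sharing the same sequence and category,
// summing their counts and keeping first-seen order.
func dereplicate(batch []record) []record {
	index := make(map[string]int) // composite key -> position in out
	var out []record
	for _, r := range batch {
		key := r.Seq + "\x00" + r.Sample
		if i, ok := index[key]; ok {
			out[i].Count += r.Count
			continue
		}
		index[key] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	batch := []record{
		{"ACGT", "s1", 1}, {"ACGT", "s1", 2}, {"ACGT", "s2", 1}, {"TTTT", "s1", 1},
	}
	for _, r := range dereplicate(batch) {
		fmt.Println(r.Seq, r.Sample, r.Count)
	}
}
```

Since all copies of a given key land in the same chunk file, deduplicating each batch separately is sufficient, and only one batch has to sit in memory at a time.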
|
||||
|
||||
Internally, `ISequenceChunkOnDisk` uses:
|
||||
- `obiiter.MakeIBioSequence()` to build the output iterator,
|
||||
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
|
||||
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
|
||||
|
||||
Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
|
||||
@@ -0,0 +1,21 @@
|
||||
# `ISequenceChunkOnMemory` Function — Semantic Description
|
||||
|
||||
The function `ISequenceChunkOnMemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.
|
||||
|
||||
It consumes an iterator over `BioSequence` objects and distributes them into **heterogeneous batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.
|
||||
|
||||
Key features:
|
||||
- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
|
||||
- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to shared `chunks` and `sources` maps.
|
||||
- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
|
||||
- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
|
||||
- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.
|
||||
|
||||
Input:
|
||||
- An iterator (`obiiter.IBioSequence`) of raw sequences.
|
||||
- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.
|
||||
|
||||
Output:
|
||||
- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.
|
||||
|
||||
Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
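
The mutex-guarded grouping and the "emit only after everything is classified" behaviour can be sketched as follows. The `Seq` and `Batch` types and the `chunkInMemory` function are hypothetical stand-ins; for brevity the sketch starts one goroutine per input slice, whereas the description above speaks of one goroutine per flux.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Seq is a hypothetical stand-in for a biological sequence.
type Seq struct {
	ID     string
	Sample string
}

// Batch carries an order index, a source identifier, and the grouped sequences.
type Batch struct {
	Order  int
	Source string
	Seqs   []Seq
}

// chunkInMemory groups sequences by the key returned by classify.
// Workers fill a shared map under a mutex; batches are built only
// once every worker has finished.
func chunkInMemory(input [][]Seq, classify func(Seq) string) []Batch {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		chunks = map[string][]Seq{}
	)
	for _, part := range input {
		wg.Add(1)
		go func(part []Seq) {
			defer wg.Done()
			for _, s := range part {
				key := classify(s)
				mu.Lock()
				chunks[key] = append(chunks[key], s)
				mu.Unlock()
			}
		}(part)
	}
	wg.Wait() // lazy emission: nothing is emitted before all groups are complete

	// Sort the keys so that order indices are reproducible across runs.
	keys := make([]string, 0, len(chunks))
	for k := range chunks {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	batches := make([]Batch, 0, len(keys))
	for order, k := range keys {
		batches = append(batches, Batch{Order: order, Source: k, Seqs: chunks[k]})
	}
	return batches
}

func main() {
	input := [][]Seq{
		{{ID: "s1", Sample: "A"}, {ID: "s2", Sample: "B"}},
		{{ID: "s3", Sample: "A"}, {ID: "s4", Sample: "C"}},
	}
	for _, b := range chunkInMemory(input, func(s Seq) string { return s.Sample }) {
		fmt.Println(b.Order, b.Source, len(b.Seqs))
	}
}
```

The order of sequences inside a group depends on goroutine scheduling; only the group membership and the batch order indices are deterministic in this sketch.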
|
||||
@@ -0,0 +1,26 @@
|
||||
# Semantic Description of `obichunk` Package
|
||||
|
||||
The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
|
||||
- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.
|
||||
|
||||
## Supported Functionalities
|
||||
|
||||
- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
|
||||
- **Missing Value Handling**: `OptionNAValue(na string)` customizes placeholder for missing data (default: `"NA"`).
|
||||
- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
|
||||
- **Batch Processing Control**:
|
||||
- `OptionBatchCount(number)` sets the number of batches.
|
||||
- `OptionsBatchSize(size)` defines how many items per batch (default from `obidefault`).
|
||||
- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures concurrency level (default from environment).
|
||||
- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
|
||||
- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- Functional options pattern for extensibility and readability.
|
||||
- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
|
||||
- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
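
A reduced sketch of the functional options pattern may help here. The exported names follow the list above, but the struct fields, the default values, and the function bodies are assumptions written for illustration only.

```go
package main

import "fmt"

// options holds the actual settings; it stays unexported.
type options struct {
	batchCount int
	naValue    string
	sortOnDisk bool
	categories []string
}

// Options wraps a pointer so that copies of the value share the same settings.
type Options struct{ pointer *options }

// WithOption is a functional setter applied by MakeOptions.
type WithOption func(Options)

// MakeOptions starts from defaults (arbitrary in this sketch) and applies each setter in order.
func MakeOptions(setters []WithOption) Options {
	opt := Options{&options{batchCount: 100, naValue: "NA"}}
	for _, set := range setters {
		set(opt)
	}
	return opt
}

func OptionBatchCount(n int) WithOption {
	return func(o Options) { o.pointer.batchCount = n }
}

func OptionNAValue(na string) WithOption {
	return func(o Options) { o.pointer.naValue = na }
}

func OptionSortOnDisk() WithOption {
	return func(o Options) { o.pointer.sortOnDisk = true }
}

func OptionSubCategory(keys ...string) WithOption {
	return func(o Options) { o.pointer.categories = append(o.pointer.categories, keys...) }
}

// PopCategories returns the first category and removes it from the list.
func (o Options) PopCategories() string {
	if len(o.pointer.categories) == 0 {
		return ""
	}
	first := o.pointer.categories[0]
	o.pointer.categories = o.pointer.categories[1:]
	return first
}

func main() {
	opt := MakeOptions([]WithOption{
		OptionBatchCount(16),
		OptionSubCategory("sample", "primer"),
		OptionSortOnDisk(),
	})
	fmt.Println(opt.pointer.batchCount, opt.pointer.naValue, opt.pointer.sortOnDisk)
	fmt.Println(opt.PopCategories(), opt.PopCategories())
}
```

New settings can be added as further `Option...` functions without changing the signature of any function that accepts `...WithOption`.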
|
||||
@@ -0,0 +1,29 @@
|
||||
# Semantic Description of `obichunk.ISequenceSubChunk`
|
||||
|
||||
The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Input**:
|
||||
- An iterator over `BioSequence` batches (`obiiter.IBioSequence`)
|
||||
- A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code
|
||||
- A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism
|
||||
|
||||
- **Processing**:
|
||||
- Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.
|
||||
- For each incoming `BioSequenceBatch`:
|
||||
- If the batch has >1 sequence: sequences are extracted, classified into `code`, and sorted *in-place* by class code.
|
||||
- Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon code change.
|
||||
- If the batch has ≤1 sequence, its content is passed through unchanged, but the batch is assigned a new order ID.
|
||||
|
||||
- **Ordering Mechanism**:
|
||||
- Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.
|
||||
- Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).
|
||||
|
||||
- **Output**:
|
||||
- Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.
|
||||
- Workers are coordinated via `newIter.Done()`, `Wait()`, and `Close()`, ensuring clean termination.
|
||||
|
||||
## Semantic Purpose
|
||||
|
||||
Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires sorted/class-ordered input — e.g., consensus building, alignment, or read merging per group.
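
The sort-then-split step can be sketched as below. This is an illustration under assumptions: `Seq`, `Batch`, and `subChunk` are hypothetical, the class code is stored directly on the sequence instead of coming from a classifier, and `sort.SliceStable` on a copy replaces the in-place custom `sort.Interface` mentioned above.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// Seq is a hypothetical sequence carrying its already computed class code.
type Seq struct {
	ID   string
	Code int
}

// Batch is a group of sequences sharing one class code.
type Batch struct {
	Order int32
	Seqs  []Seq
}

// subChunk sorts one batch by class code and splits it into runs of equal code.
// nextOrder is shared by all workers; the atomic add hands out increasing IDs.
func subChunk(seqs []Seq, nextOrder *int32) []Batch {
	if len(seqs) <= 1 {
		// Nothing to sort: pass through, but still take a fresh order ID.
		return []Batch{{Order: atomic.AddInt32(nextOrder, 1) - 1, Seqs: seqs}}
	}
	sorted := append([]Seq(nil), seqs...)
	// A stable sort keeps the input order among sequences of the same class.
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Code < sorted[j].Code })

	var out []Batch
	start := 0
	for i := 1; i <= len(sorted); i++ {
		// Emit a batch whenever the code changes or the input ends.
		if i == len(sorted) || sorted[i].Code != sorted[start].Code {
			out = append(out, Batch{Order: atomic.AddInt32(nextOrder, 1) - 1, Seqs: sorted[start:i]})
			start = i
		}
	}
	return out
}

func main() {
	var nextOrder int32
	input := []Seq{{ID: "a", Code: 2}, {ID: "b", Code: 1}, {ID: "c", Code: 2}, {ID: "d", Code: 1}}
	for _, b := range subChunk(input, &nextOrder) {
		fmt.Println(b.Order, b.Seqs)
	}
}
```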
|
||||
@@ -0,0 +1,45 @@
|
||||
# Semantic Description of `IUniqueSequence` Functionality
|
||||
|
||||
The `IUniqueSequence` function performs **dereplication** of biological sequence data, i.e., it merges identical sequences (optionally only within the same annotation categories) while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.
|
||||
|
||||
## Core Workflow
|
||||
|
||||
1. **Input Processing**
|
||||
Accepts an input sequence iterator and optional configuration via `WithOption`.
|
||||
|
||||
2. **Parallelization Strategy**
|
||||
Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
|
||||
|
||||
3. **Data Splitting Phase**
|
||||
- Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
|
||||
- Ensures deterministic chunking for reproducibility.
|
||||
|
||||
4. **Storage Choice**
|
||||
- *In-memory*: via `ISequenceChunkOnMemory`.
|
||||
- *Disk-based*: via `ISequenceChunkOnDisk` (requires a single worker).
|
||||
|
||||
5. **Uniqueness Classification**
|
||||
- Builds a composite classifier combining:
|
||||
- Sequence identity (`SequenceClassifier`)
|
||||
- Optional annotation categories (e.g., sample, primer), with NA handling.
|
||||
- If no annotations are specified, only raw sequence identity is used.
|
||||
|
||||
6. **Singleton Filtering**
|
||||
Optionally excludes singleton reads (count = 1) via `NoSingleton()` option.
|
||||
|
||||
7. **Parallel Dereplication**
|
||||
- Spawns worker goroutines to process chunks.
|
||||
- Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
|
||||
|
||||
8. **Output Merging**
|
||||
- Aggregates results using `IMergeSequenceBatch`, preserving:
|
||||
- Sequence counts
|
||||
- Statistics (if enabled)
|
||||
- NA handling and ordering
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
|
||||
- **Configurable**: Via functional options (`Options`).
|
||||
- **Thread-safe**: Uses a `sync.Mutex` to guard the shared counter that assigns output batch order IDs.
|
||||
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).
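
The composite-key idea behind steps 5 and 6 can be sketched in a few lines. `Seq` and `dereplicate` are hypothetical; a single `Sample` field stands in for the optional annotation categories, and the NUL byte used as key separator is an arbitrary choice for this example.

```go
package main

import "fmt"

// Seq is a hypothetical read with one annotation and an abundance count.
type Seq struct {
	Sequence string
	Sample   string // empty means "not annotated"
	Count    int
}

// dereplicate merges entries that share the same key and sums their counts.
// The key is the sequence alone, or sequence plus sample when bySample is set;
// a missing sample is replaced by the NA value.
func dereplicate(seqs []Seq, bySample, noSingleton bool, na string) []Seq {
	index := map[string]int{}
	var uniques []Seq
	for _, s := range seqs {
		key := s.Sequence
		if bySample {
			sample := s.Sample
			if sample == "" {
				sample = na
			}
			key += "\x00" + sample
		}
		if i, ok := index[key]; ok {
			uniques[i].Count += s.Count
			continue
		}
		index[key] = len(uniques)
		uniques = append(uniques, s)
	}
	if !noSingleton {
		return uniques
	}
	// Singleton filtering: keep only entries seen more than once.
	kept := uniques[:0]
	for _, u := range uniques {
		if u.Count > 1 {
			kept = append(kept, u)
		}
	}
	return kept
}

func main() {
	reads := []Seq{
		{Sequence: "ACGT", Sample: "s1", Count: 1},
		{Sequence: "ACGT", Sample: "s1", Count: 1},
		{Sequence: "ACGT", Sample: "s2", Count: 1},
		{Sequence: "TTTT", Sample: "s1", Count: 1},
	}
	fmt.Println(dereplicate(reads, false, false, "NA"))
	fmt.Println(dereplicate(reads, true, true, "NA"))
}
```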
|
||||
@@ -0,0 +1,28 @@
|
||||
# Aho-Corasick-Based Sequence Analysis in `obicorazick`
|
||||
|
||||
This Go package provides efficient pattern-matching utilities for biological sequence data, leveraging the Aho-Corasick algorithm.
|
||||
|
||||
## Core Components
|
||||
|
||||
- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**
|
||||
Builds *multiple* Aho-Corasick matchers in parallel (batched to manage memory), then returns a `SeqWorker` function.
|
||||
- Scans each sequence *forward* and its reverse complement.
|
||||
- Counts total matches (`slot`), forward-only (`_Fwd`) and reverse-complement-specific (`_Rev`) matches.
|
||||
- Attaches match counts as sequence attributes.
|
||||
|
||||
- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**
|
||||
Compiles a *single* matcher and returns a predicate function.
|
||||
- Returns `true` if the number of matches ≥ `minMatches`.
|
||||
- Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
|
||||
|
||||
## Technical Highlights
|
||||
|
||||
- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns/batch) to avoid memory overload.
|
||||
- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
|
||||
- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
|
||||
- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
|
||||
|
||||
## Use Cases
|
||||
|
||||
- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
|
||||
- Filtering or annotating sequences based on pattern presence/abundance.
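
The annotation produced by the worker (total, forward, and reverse-complement counts) can be illustrated as follows. Note the substitution: this sketch uses naive substring search in place of an Aho-Corasick automaton, so it shows the bookkeeping only, not the algorithm's performance. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// reverseComplement returns the reverse complement of an upper-case DNA string;
// any symbol other than A, C, G, T becomes N.
func reverseComplement(seq string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		c, ok := comp[seq[i]]
		if !ok {
			c = 'N'
		}
		out[len(seq)-1-i] = c
	}
	return string(out)
}

// countMatches counts all (possibly overlapping) occurrences of every pattern.
// Naive search is used here instead of an Aho-Corasick matcher.
func countMatches(seq string, patterns []string) int {
	total := 0
	for _, p := range patterns {
		if p == "" {
			continue
		}
		for i := 0; ; {
			j := strings.Index(seq[i:], p)
			if j < 0 {
				break
			}
			total++
			i += j + 1 // step one base forward to allow overlapping hits
		}
	}
	return total
}

// annotate returns the three counters stored under slot, slot_Fwd and slot_Rev.
func annotate(seq, slot string, patterns []string) map[string]int {
	fwd := countMatches(seq, patterns)
	rev := countMatches(reverseComplement(seq), patterns)
	return map[string]int{slot: fwd + rev, slot + "_Fwd": fwd, slot + "_Rev": rev}
}

func main() {
	fmt.Println(annotate("AAACCC", "hits", []string{"AAC", "GGT"}))
}
```

A predicate in the spirit of `AhoCorazickPredicate` would compare such a count against `minMatches` and return a boolean.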
|
||||
@@ -0,0 +1,34 @@
|
||||
# ObiDefault Package: Batch Configuration Module
|
||||
|
||||
This Go module provides centralized configuration for sequence batching in Obitools, supporting both **count-based** and **memory-aware** batch processing.
|
||||
|
||||
## Core Features
|
||||
|
||||
- `_BatchSize` / `SetBatchSize()`
|
||||
Defines and configures the *minimum* number of sequences per batch (default: `1`).
|
||||
Used internally as `minSeqs` in `RebatchBySize`.
|
||||
|
||||
- `_BatchSizeMax` / `SetBatchSizeMax()`
|
||||
Sets the *maximum* sequences per batch (default: `2000`). Batches are flushed upon reaching this limit, regardless of memory.
|
||||
|
||||
- **CLI & Environment Integration**
|
||||
Batch size is set by the `--batch-size` CLI flag and/or the `OBIBATCHSIZE` environment variable; the parsing itself happens outside this file and is only referenced in its comments.
|
||||
|
||||
- `_BatchMem` / `SetBatchMem(n int)`
|
||||
Configures the *maximum memory per batch* (default: `128 MB`). A value of `0` disables memory-based batching, falling back to pure count-based logic.
|
||||
|
||||
- `_BatchMemStr`
|
||||
Stores the *raw CLI string* passed to `--batch-mem` (e.g., `"256M"`, `"1G"`), enabling human-readable input parsing elsewhere.
|
||||
|
||||
## Utility Functions
|
||||
|
||||
- `BatchSizePtr()`, `BatchMemPtr()`
|
||||
Expose pointers to internal variables for direct modification, e.g. when binding command-line options.
|
||||
|
||||
- `BatchSizeMaxPtr()`, `BatchMemStrPtr()`
|
||||
Provide read/write access to max-size and raw memory string values.
|
||||
|
||||
## Design Intent
|
||||
|
||||
- Separates **configuration** (defaults, CLI/env parsing) from **processing logic**, enabling modular and testable batch handling.
|
||||
- Supports both scalable, large-scale processing (via count limits) and memory-constrained environments (via soft RAM caps).
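
How the three limits might interact can be sketched as below. This is an assumption-laden illustration, not the logic of `RebatchBySize`: the `rebatch` function, its flush rule (flush at the count limit, or at the memory limit once the minimum count is reached), and the use of plain integers as per-sequence memory sizes are all hypothetical.

```go
package main

import "fmt"

// rebatch groups items (represented by their memory size) into batches.
// A batch is flushed when it holds maxSeqs items, or when its memory reaches
// maxMem and it already holds at least minSeqs items. maxMem == 0 disables
// the memory rule, leaving pure count-based batching.
func rebatch(sizes []int, minSeqs, maxSeqs, maxMem int) [][]int {
	var batches [][]int
	var current []int
	mem := 0
	for _, size := range sizes {
		current = append(current, size)
		mem += size
		full := len(current) >= maxSeqs
		heavy := maxMem > 0 && mem >= maxMem && len(current) >= minSeqs
		if full || heavy {
			batches = append(batches, current)
			current, mem = nil, 0
		}
	}
	if len(current) > 0 {
		batches = append(batches, current) // flush the remainder
	}
	return batches
}

func main() {
	sizes := []int{60, 60, 60, 60, 60}
	fmt.Println(rebatch(sizes, 1, 2000, 100)) // memory-driven flushes
	fmt.Println(rebatch(sizes, 1, 3, 0))      // count-driven flushes only
}
```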
|
||||
@@ -0,0 +1,35 @@
|
||||
# Output Compression Control Module
|
||||
|
||||
This Go package (`obidefault`) provides a simple, global configuration mechanism for toggling output compression behavior across an application.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Global Compression Flag**: A package-level boolean variable `__compress__` (default: `false`) controls whether output should be compressed.
|
||||
- **Read Access**:
|
||||
- `CompressOutput()` returns the current compression setting as a boolean.
|
||||
- **Write Access**:
|
||||
- `SetCompressOutput(b bool)` updates the compression flag to a new value.
|
||||
- **Pointer Access**:
|
||||
- `CompressOutputPtr()` returns a pointer to the internal flag, enabling indirect modification (e.g., for UI bindings or reflection-based updates).
|
||||
|
||||
## Design Intent
|
||||
|
||||
- Minimal, side-effect-free API.
|
||||
- Thread-safety *not* guaranteed — intended for use in single-threaded initialization or controlled environments.
|
||||
- Encapsulation via unexported variable `__compress__`, enforced through accessor functions.
|
||||
|
||||
## Typical Usage
|
||||
|
||||
```go
|
||||
// Enable compression globally:
|
||||
obidefault.SetCompressOutput(true)
|
||||
|
||||
if obidefault.CompressOutput() {
|
||||
// Apply compression logic (e.g., gzip, brotli)
|
||||
}
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- The variable `__compress__` is unexported because its name does not start with an upper-case letter, which the Go compiler enforces; the double underscore prefix is only a stylistic marker.
|
||||
- Designed for runtime configurability without recompilation.
|
||||
@@ -0,0 +1,38 @@
|
||||
# `obidefault` Package — Semantic Overview
|
||||
|
||||
This minimal Go package provides a centralized, mutable global flag for controlling warning verbosity across an application.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`__silent_warning__`**:
|
||||
A package-level boolean variable (unexported) that determines whether warnings should be suppressed.
|
||||
|
||||
- **`SilentWarning() bool`**:
|
||||
A read-only accessor returning the current state of `__silent_warning__`. Enables safe, non-mutating checks elsewhere in the codebase.
|
||||
|
||||
- **`SilentWarningPtr() *bool`**:
|
||||
Returns a pointer to `__silent_warning__`, allowing external code (e.g., CLI parsers, config loaders) to directly mutate the flag — e.g., `*SilentWarningPtr() = true`.
|
||||
|
||||
## Design Intent
|
||||
|
||||
- **Simplicity & Centralization**:
|
||||
Avoids scattering warning-control logic; provides a single source of truth.
|
||||
|
||||
- **Flexibility**:
|
||||
Supports both *read-only* inspection (via `SilentWarning()`) and *global mutation* (via pointer), useful for early initialization phases.
|
||||
|
||||
- **Explicit Semantics**:
|
||||
When `SilentWarning()` returns `true`, all warning-generating code *should* suppress output (implementation responsibility lies outside this package).
|
||||
|
||||
## Usage Example
|
||||
|
||||
```go
|
||||
// Suppress warnings globally:
|
||||
*obidefault.SilentWarningPtr() = true
|
||||
|
||||
if !obidefault.SilentWarning() {
|
||||
log.Println("⚠️ Warning: something happened")
|
||||
}
|
||||
```
|
||||
|
||||
> **Note**: `__silent_warning__` is unexported (its name does not start with an upper-case letter), so code outside the package cannot access it directly; the double underscore prefix is only a stylistic marker.
|
||||
@@ -0,0 +1,33 @@
|
||||
# Progress Bar Control Module (`obidefault`)
|
||||
|
||||
This Go package provides a simple, global mechanism to enable or disable progress bar display across an application.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`ProgressBar()`**: Returns `true` if progress bars are *enabled* (i.e., when `__no_progress_bar__` is `false`).
|
||||
- **`NoProgressBar()`**: Returns the current state of `__no_progress_bar__`, i.e., whether progress bars are *disabled*.
|
||||
- **`SetNoProgressBar(b bool)`**: Sets the global flag `__no_progress_bar__`. Passing `true` disables progress bars; passing `false` enables them.
|
||||
- **`NoProgressBarPtr()`**: Returns a pointer to the internal `__no_progress_bar__` variable, allowing direct read/write access (e.g., for reflection or UI binding).
|
||||
|
||||
## Design Intent
|
||||
|
||||
- Centralizes progress bar visibility control in one place.
|
||||
- Supports both boolean query/set and pointer-based manipulation for flexibility (e.g., CLI flags, config binding).
|
||||
- Uses a *negative* flag name (`__no_progress_bar__`) internally to default progress bars **on** (i.e., `false` → enabled).
|
||||
|
||||
## Usage Example
|
||||
|
||||
```go
|
||||
// Disable progress bars globally:
|
||||
obidefault.SetNoProgressBar(true)
|
||||
|
||||
// Check status:
|
||||
if !obidefault.ProgressBar() {
|
||||
log.Println("Progress bars are disabled.")
|
||||
}
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Thread-safety is *not* guaranteed; concurrent access should be externally synchronized.
|
||||
- The variable `__no_progress_bar__` is unexported because its name does not start with an upper-case letter, which the Go compiler enforces; the double underscore prefix is only a stylistic marker.
|
||||
@@ -0,0 +1,26 @@
|
||||
# Quality Shift and Read/Write Control Module
|
||||
|
||||
This Go package (`obidefault`) provides configurable controls over quality score handling in sequence data processing (e.g., FASTQ files). It defines three global variables and corresponding accessor/mutator functions:
|
||||
|
||||
- `_Quality_Shift_Input`: Input quality score offset (default: `33`, i.e., Phred+33/Sanger format).
|
||||
- `_Quality_Shift_Output`: Output quality score offset (default: `33`), allowing format conversion.
|
||||
- `_Read_Qualities`: Boolean flag indicating whether quality scores should be parsed/processed (`true` by default).
|
||||
|
||||
## Public API
|
||||
|
||||
| Function | Purpose |
|
||||
|---------|--------|
|
||||
| `SetReadQualitiesShift(shift byte)` | Sets the quality score offset for *input* data (e.g., when reading FASTQ). |
|
||||
| `ReadQualitiesShift() byte` | Returns the current input quality offset. |
|
||||
| `SetWriteQualitiesShift(shift byte)` | Sets the quality score offset for *output* data (e.g., when writing FASTQ). |
|
||||
| `WriteQualitiesShift() byte` | Returns the current output quality offset. |
|
||||
| `SetReadQualities(read bool)` | Enables/disables reading/processing of quality scores. |
|
||||
| `ReadQualities() bool` | Returns whether qualities are currently being read/used. |
|
||||
|
||||
## Semantic Use Cases
|
||||
|
||||
- **Format Interoperability**: Allows seamless conversion between Phred+33 (Sanger), Phred+64, or other quality encodings.
|
||||
- **Performance Optimization**: Disabling `ReadQualities` skips parsing of quality strings, useful when only sequences are needed.
|
||||
- **Centralized Configuration**: Global state enables consistent behavior across modules without passing parameters.
|
||||
|
||||
None of these functions is synchronized; they are intended to be called during initialization, before concurrent processing begins.
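
The role of the two offsets can be shown with a small conversion routine. `convertQualities` is a hypothetical helper written for this example; it only illustrates the arithmetic of removing the input shift and applying the output shift.

```go
package main

import "fmt"

// convertQualities re-encodes an ASCII quality string from one offset to another:
// each character minus inShift gives the Phred score, plus outShift gives the new character.
func convertQualities(ascii string, inShift, outShift byte) string {
	out := make([]byte, len(ascii))
	for i := 0; i < len(ascii); i++ {
		out[i] = ascii[i] - inShift + outShift
	}
	return string(out)
}

func main() {
	// 'h' is ASCII 104, i.e. Phred 40 with a +64 offset; with +33 it becomes 'I' (73).
	fmt.Println(convertQualities("hhh", 64, 33))
}
```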
|
||||
@@ -0,0 +1,21 @@
|
||||
# `obidefault` Package: Configuration State Management
|
||||
|
||||
This Go package provides a centralized configuration layer for taxonomy-related settings in OBITools4. It exposes simple getters, setters, and pointer accessors for five core boolean/string flags that control how taxonomic identifiers (taxids) are handled during data processing.
|
||||
|
||||
## Core Configuration Flags
|
||||
|
||||
- `__taxonomy__`: Stores the currently selected taxonomy (e.g., `"NCBI"`, `"UNIPROT"`).
|
||||
- `__alternative_name__`: Enables/disables use of alternative taxonomic names (e.g., synonyms).
|
||||
- `__fail_on_taxonomy__`: If true, processing halts on taxonomy mismatches/errors.
|
||||
- `__update_taxid__`: If true, taxids are auto-updated to current NCBI/DB versions.
|
||||
- `__raw_taxid__`: If true, raw (unprocessed) taxids are preserved instead of normalized.
|
||||
|
||||
## Public API
|
||||
|
||||
- **Getters**: `UseRawTaxids()`, `SelectedTaxonomy()`, `HasSelectedTaxonomy()`, etc., return current values.
|
||||
- **Pointer Accessors**: e.g., `SelectedTaxonomyPtr()` returns a pointer for direct mutation (advanced use).
|
||||
- **Setters**: `SetSelectedTaxonomy()`, `SetAlternativeNamesSelected()`, etc., update state.
|
||||
|
||||
## Use Case
|
||||
|
||||
Typically used at application startup to configure global behavior (e.g., `SetSelectedTaxonomy("NCBI")`, `SetUpdateTaxid(true)`), then referenced by downstream modules during data import, validation, or mapping. Minimalist and explicit—no external dependencies.
|
||||
@@ -0,0 +1,35 @@
|
||||
# Obidefault: Parallelism Configuration Module
|
||||
|
||||
This Go package (`obidefault`) provides a centralized, configurable interface for managing parallel execution parameters, which is particularly useful in I/O- and CPU-bound workloads.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
- **CPU-aware defaults**: Automatically detects available cores via `runtime.NumCPU()`.
|
||||
- **Configurable workers per core**:
|
||||
- General: `_WorkerPerCore` (default `1.0`)
|
||||
- Read-specific: `_ReadWorkerPerCore` (`0.25`, i.e., ~1 reader per 4 cores)
|
||||
- Write-specific: `_WriteWorkerPerCore` (`0.25`)
|
||||
- **Strict overrides**: Allow hardcoding worker counts via `SetStrictReadWorker()`/`Write...`, bypassing per-core scaling.
|
||||
|
||||
## Public API
|
||||
|
||||
| Function | Purpose |
|
||||
|---------|--------|
|
||||
| `ParallelWorkers()` | Total workers = `MaxCPU() × WorkerPerCore` |
|
||||
| `Read/WriteParallelWorkers()` | Resolves to strict count if set, else per-core calculation (min 1) |
|
||||
| `ParallelFilesRead()` | Files read in parallel: defaults to `ReadParallelWorkers()`, overridable |
|
||||
| Getters (`MaxCPU`, `WorkerPerCore`, etc.) | Expose current settings safely |
|
||||
| Setters (`Set*`) | Dynamically adjust behavior at runtime |
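
The resolution rule in the table (strict count if set, otherwise cores times workers-per-core with a minimum of 1) can be sketched as follows. The `workers` function is hypothetical, and truncating the product toward zero is an assumption; the package may round differently.

```go
package main

import (
	"fmt"
	"runtime"
)

// workers resolves a worker count: a positive strict value wins,
// otherwise the count scales with the number of CPUs, never dropping below 1.
func workers(maxCPU int, perCore float64, strict int) int {
	if strict > 0 {
		return strict
	}
	n := int(float64(maxCPU) * perCore) // truncation is an assumption of this sketch
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	cpus := runtime.NumCPU()
	fmt.Println("general:", workers(cpus, 1.0, 0))
	fmt.Println("readers:", workers(cpus, 0.25, 0))
	fmt.Println("strict readers:", workers(cpus, 0.25, 5))
}
```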
|
||||
|
||||
## Configuration Sources
|
||||
|
||||
- **Command-line flags**: e.g., `--max-cpu` or `-m`
|
||||
- **Environment variable**: `OBIMAXCPU`
|
||||
|
||||
## Design Highlights
|
||||
|
||||
✅ Decouples resource discovery from policy
|
||||
✅ Supports both *proportional* (per-core) and *absolute* (strict) worker definitions
|
||||
✅ Ensures non-zero defaults for critical paths (`ReadParallelWorkers` ≥ 1)
|
||||
|
||||
⚠️ **Note**: `WriteParallelWorkers()` contains a likely bug: it returns `_StrictReadWorker` in the else branch instead of `_StrictWriteWorker`.
|
||||
@@ -0,0 +1,28 @@
|
||||
# `obidist` Package: Efficient Symmetric Distance/Similarity Matrix Management
|
||||
|
||||
The `*DistMatrix` type provides a memory-efficient, symmetric matrix implementation for distance or similarity data.
|
||||
|
||||
- **Storage Strategy**: Only the upper triangle (i < j) is stored, reducing the number of stored values from *n²* to *n(n−1)/2*.
|
||||
- **Diagonal Handling**: Diagonal entries are fixed (0.0 for distances, 1.0 for similarities); assignments to diagonal indices are silently ignored.
|
||||
- **Symmetry Guarantee**: `Get(i, j)` and `Set(i, j, v)` automatically handle both (i,j) and (j,i), ensuring consistency.
|
||||
|
||||
## Constructors
|
||||
|
||||
| Function | Description |
|
||||
|---------|-------------|
|
||||
| `NewDistMatrix(n)` / `WithLabels(labels)` | Creates *n×n* distance matrix (diag = 0). |
|
||||
| `NewSimilarityMatrix(n)` / `WithLabels(labels)` | Creates *n×n* similarity matrix (diag = 1). |
|
||||
|
||||
## Core Operations
|
||||
|
||||
- `Get(i, j)` / `Set(i, j, v)`: Access/update symmetric entries.
|
||||
- `Size() int`: Returns the matrix dimension *n*; `GetLabel(i)` / `SetLabel(i, label)`: Query/mutate element labels.
|
||||
- `Labels() []string`: Returns all labels; `GetRow(i)` / `GetColumn(j)`: Retrieve full rows/columns (as copies).
|
||||
|
||||
## Analysis Helpers
|
||||
|
||||
- `MinDistance()`, `MaxDistance()` → `(value, i, j)` of the extremal off-diagonal entry.
|
||||
- `Copy() *DistMatrix`: Deep copy, so that changes to the copy do not affect the original.
|
||||
- `ToFullMatrix()` → `[][]float64`: Converts to dense representation (use sparingly).
|
||||
|
||||
Designed for clustering, phylogenetics, or any domain requiring fast symmetric matrix access with minimal footprint.
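
The upper-triangle storage can be sketched with a flat slice and an index formula. `triMatrix` is a hypothetical type written for this example, not the package's `DistMatrix`; for `i < j` the entry lives at `i*n - i*(i+1)/2 + (j-i-1)`, which enumerates the pairs row by row.

```go
package main

import "fmt"

// triMatrix stores only the n(n-1)/2 entries above the diagonal.
type triMatrix struct {
	n    int
	diag float64 // fixed diagonal value: 0 for distances, 1 for similarities
	data []float64
}

func newTriMatrix(n int, diag float64) *triMatrix {
	return &triMatrix{n: n, diag: diag, data: make([]float64, n*(n-1)/2)}
}

// index maps an off-diagonal pair to its slot, swapping so that i < j.
func (m *triMatrix) index(i, j int) int {
	if i > j {
		i, j = j, i
	}
	return i*m.n - i*(i+1)/2 + (j - i - 1)
}

func (m *triMatrix) check(i, j int) {
	if i < 0 || j < 0 || i >= m.n || j >= m.n {
		panic(fmt.Sprintf("index (%d, %d) out of range for size %d", i, j, m.n))
	}
}

// Get returns the fixed diagonal value for i == j, the stored value otherwise.
func (m *triMatrix) Get(i, j int) float64 {
	m.check(i, j)
	if i == j {
		return m.diag
	}
	return m.data[m.index(i, j)]
}

// Set stores v for both (i, j) and (j, i); writes to the diagonal are ignored.
func (m *triMatrix) Set(i, j int, v float64) {
	m.check(i, j)
	if i == j {
		return
	}
	m.data[m.index(i, j)] = v
}

func main() {
	m := newTriMatrix(4, 0)
	m.Set(3, 1, 2.5)
	m.Set(2, 2, 9) // ignored
	fmt.Println(m.Get(1, 3), m.Get(2, 2), len(m.data))
}
```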
|
||||
@@ -0,0 +1,28 @@
|
||||
# `obidist` Package: Semantic Feature Overview
|
||||
|
||||
The `obidist` Go package provides two core data structures for managing **distance** and **similarity matrices**, with built-in guarantees suitable for scientific computing (e.g., clustering, phylogenetics). Key features include:
|
||||
|
||||
- **`DistMatrix`**: A symmetric `n×n` matrix representing pairwise distances, where:
|
||||
- Diagonal entries are *always* `0.0` (self-distance).
|
||||
- Off-diagonals obey symmetry: `dist(i, j) == dist(j, i)`.
|
||||
- Automatic enforcement via dedicated `Set()`/`Get()` methods.
|
||||
|
||||
- **`SimilarityMatrix`**: A symmetric matrix where:
|
||||
- Diagonal entries are *always* `1.0`.
|
||||
- Off-diagonals represent similarity scores (e.g., between `0` and `1`, though not enforced).
|
||||
- Symmetry is similarly guaranteed.
|
||||
|
||||
Both matrix types support:
|
||||
- **Optional labels**: Associate human-readable identifiers (e.g., sample names) with rows/columns.
|
||||
- **Safe bounds checking**: Panics on out-of-range access (tested via `defer/recover`).
|
||||
- **Deep copy support**: Ensures isolation between original and copied instances.
|
||||
- **Utility methods**:
|
||||
- `MinDistance()` / `MaxDistance()`: Return extremal values and their indices.
|
||||
- `GetRow(i)`: Retrieve a full row as a slice (symmetric copy).
|
||||
- `ToFullMatrix()`: Export the matrix as an independent dense 2D slice (`[][]float64`).
|
||||
|
||||
Edge cases are rigorously handled:
|
||||
- Empty (`n=0`) and singleton (`n=1`) matrices return `(0.0, -1, -1)` for min/max.
|
||||
- Defensive copying ensures that mutating a label slice held by the caller does not affect internal state.
|
||||
|
||||
All behaviors are validated through comprehensive unit tests, emphasizing correctness and robustness.
|
||||
@@ -0,0 +1,43 @@
|
||||
# Semantic Description of `ReadSequencesBatchFromFiles`
|
||||
|
||||
This function implements **concurrent, batched streaming** of biological sequences from multiple input files.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
|
||||
- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.
|
||||
|
||||
## Concurrency Model
|
||||
|
||||
- Launches `concurrent_readers` goroutines to process files in parallel.
|
||||
- Files are distributed via a shared channel (`filenameChan`) — ensuring fair load balancing.
|
||||
|
||||
## Streaming Interface
|
||||
|
||||
- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
|
||||
- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.
|
||||
|
||||
## Error Handling & Logging
|
||||
|
||||
- Panics on file-open failure (via `log.Panicf`).
|
||||
- Logs the start and end of reading for each file (`log.Printf`, `log.Println`).
|
||||
|
||||
## Resource Management
|
||||
|
||||
- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
|
||||
- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.
|
||||
|
||||
## Design Intent
|
||||
|
||||
- Enables scalable, memory-efficient ingestion of large NGS datasets.
|
||||
- Decouples *reading logic* (via `IBatchReader`) from orchestration — supporting pluggable formats.
|
||||
- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.
|
||||
|
||||
## Key Abstractions
|
||||
|
||||
| Type/Interface | Role |
|
||||
|----------------|------|
|
||||
| `IBatchReader` | Reader factory: `(filename, options...) → (obiiter.IBioSequence, error)` |
|
||||
| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
|
||||
| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
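
The concurrency model described in this section (a shared filename channel, a pool of readers, an atomic batch counter, and a finalizer goroutine) can be sketched with channels only. `Batch`, `reader`, and `readAll` are hypothetical, and the reader is simulated by a function returning in-memory batches instead of parsing files.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Batch is a hypothetical batch of sequences tagged with a global order ID.
type Batch struct {
	Order  int64
	Source string
	Seqs   []string
}

// reader stands in for a batch reader: it returns the batches found in one file.
type reader func(filename string) [][]string

// readAll reads files with nreaders goroutines and streams batches on a channel.
func readAll(files []string, read reader, nreaders int) <-chan Batch {
	out := make(chan Batch)
	names := make(chan string)
	var next int64 // shared batch counter, updated atomically
	var wg sync.WaitGroup

	// Feed filenames through a shared channel; idle readers pick the next file.
	go func() {
		for _, f := range files {
			names <- f
		}
		close(names)
	}()

	for w := 0; w < nreaders; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				for _, seqs := range read(name) {
					order := atomic.AddInt64(&next, 1) - 1
					out <- Batch{Order: order, Source: name, Seqs: seqs}
				}
			}
		}()
	}

	// Finalizer: close the output once every reader is done.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	fake := func(name string) [][]string {
		return [][]string{{name + ":seq1"}, {name + ":seq2"}}
	}
	for b := range readAll([]string{"a.fastq", "b.fastq", "c.fastq"}, fake, 2) {
		fmt.Println(b.Order, b.Source, b.Seqs)
	}
}
```

Batches may arrive on the channel out of order; the order IDs are what a downstream reordering step would use to restore a single global sequence.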
|
||||
|
||||
@@ -0,0 +1,36 @@
|
||||
# `obiformats` Package — Semantic Overview
|
||||
|
||||
The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.
|
||||
|
||||
## Core Abstraction
|
||||
|
||||
- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:
|
||||
```go
|
||||
func(string, ...WithOption) (obiiter.IBioSequence, error)
|
||||
```
|
||||
- It accepts:
|
||||
- A file path (`string`)
|
||||
- Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
|
||||
- Returns:
|
||||
- An iterator over biological sequences (`obiiter.IBioSequence`)
|
||||
- Or an error if the file cannot be opened/parsed
|
||||
|
||||
## Semantic Intent
|
||||
|
||||
- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, CSV, etc., via a single entry point.
|
||||
- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
|
||||
- **Enables lazy, streaming access**: Sequences are yielded on-demand via the iterator—memory-efficient for large datasets.
|
||||
|
||||
## Typical Usage Pattern
|
||||
|
||||
1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
|
||||
2. Call it with a file path and optional options.
|
||||
3. Iterate over the returned `IBioSequence` to process the sequences batch by batch.
|
||||
|
||||
## Design Principles
|
||||
|
||||
- **Functional, minimal API**: Single responsibility—reading and iteration.
|
||||
- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
|
||||
- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.
|
||||
|
||||
> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
|
||||
@@ -0,0 +1,30 @@
|
||||
# CSV Import Module for Biological Sequences (`obiformats`)
|
||||
|
||||
This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBITools4 framework.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
|
||||
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
|
||||
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
|
||||
- **Metadata Handling**:
|
||||
- Special handling for taxonomic IDs (`taxid`, `*_taxid`).
|
||||
- Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
|
||||
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
|
||||
- **Multiple Entry Points**:
|
||||
- `ReadCSV`: From any `io.Reader`.
|
||||
- `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
|
||||
- `ReadCSVFromStdin`: Reads from standard input.
|
||||
- **Error & Edge Handling**:
|
||||
- Gracefully handles empty files/streams via `ReadEmptyFile`.
|
||||
- Uses structured logging (Logrus) for fatal and informational messages.
|
||||
|
||||
## Integration
|
||||
|
||||
Designed to integrate with the core types of OBITools4:
|
||||
- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
|
||||
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
|
||||
|
||||
## Use Case
|
||||
|
||||
Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
|
||||
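The parsing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `record`, `parseCSV` and `parseCSVString` are hypothetical names, and the reader settings (comment character `#`, variable field counts, leading-space trimming, Phred shift of 33) are assumptions taken from the feature list.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// record is a hypothetical stand-in for obiseq.BioSequence.
type record struct {
	ID, Sequence string
	Qualities    []byte         // Phred scores with the shift removed
	Attributes   map[string]any // every other column
}

// parseCSV maps columns to fields by header name (assumed behaviour).
func parseCSV(r io.Reader, shift byte) ([]record, error) {
	cr := csv.NewReader(r)
	cr.Comment = '#'           // skip comment lines
	cr.FieldsPerRecord = -1    // allow rows of differing length
	cr.TrimLeadingSpace = true // drop leading blanks in each field

	header, err := cr.Read()
	if err != nil {
		return nil, err
	}
	var out []record
	for {
		row, err := cr.Read()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		rec := record{Attributes: map[string]any{}}
		for i, field := range row {
			if i >= len(header) {
				break // cell without a header: ignored in this sketch
			}
			switch header[i] {
			case "id":
				rec.ID = field
			case "sequence":
				rec.Sequence = field
			case "qualities":
				rec.Qualities = []byte(field)
				for j := range rec.Qualities {
					rec.Qualities[j] -= shift
				}
			default:
				var v any
				if json.Unmarshal([]byte(field), &v) != nil {
					v = field // not valid JSON: keep the raw string
				}
				rec.Attributes[header[i]] = v
			}
		}
		out = append(out, rec)
	}
}

// parseCSVString is a convenience wrapper for in-memory data.
func parseCSVString(data string, shift byte) ([]record, error) {
	return parseCSV(strings.NewReader(data), shift)
}

func main() {
	recs, err := parseCSVString("# demo\nid,sequence,qualities,sample\ns1,acgt,IIII,A\n", 33)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", recs[0])
}
```

The sketch collects all records in a slice for brevity; the module described above streams them in batches instead.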
@@ -0,0 +1,22 @@
|
||||
# CSVSequenceRecord Function Description
|
||||
|
||||
The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
|
||||
- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
|
||||
- **Taxonomic Information**: Adds both the NCBI taxid and the scientific name, read from the sequence attributes and replaced by `opt.CSVNAValue()` when missing.
|
||||
- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
|
||||
- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends corresponding attribute values (or NA if missing).
|
||||
- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
|
||||
- **Quality Scores**: Converts Phred-quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
|
||||
- Handles missing data consistently via `opt.CSVNAValue()`.
|
||||
- Supports both standard and user-defined metadata fields.
|
||||
- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.
|
||||
|
||||
This function enables interoperable, configurable export of sequence data to tabular formats.
|
||||
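The column-selection logic described above can be illustrated as follows. This is a sketch under stated assumptions: `seq`, `csvOptions` and `csvRecord` are hypothetical stand-ins for `obiseq.BioSequence`, `Options` and `CSVSequenceRecord`, and the column order simply follows the feature list.

```go
package main

import "fmt"

// seq is a hypothetical stand-in for obiseq.BioSequence.
type seq struct {
	ID, Sequence string
	Count        int
	Qualities    []byte // raw Phred scores; nil when absent
	Attrs        map[string]any
}

// csvOptions is a hypothetical stand-in for the Options accessors.
type csvOptions struct {
	ID, Count, Sequence, Quality bool
	Keys                         []string // custom attribute columns
	NA                           string   // placeholder for missing values
	Shift                        byte     // ASCII offset for qualities
}

// csvRecord builds one CSV row according to the selected options.
func csvRecord(s seq, opt csvOptions) []string {
	var rec []string
	if opt.ID {
		rec = append(rec, s.ID)
	}
	if opt.Count {
		rec = append(rec, fmt.Sprint(s.Count))
	}
	for _, k := range opt.Keys {
		if v, ok := s.Attrs[k]; ok {
			rec = append(rec, fmt.Sprint(v))
		} else {
			rec = append(rec, opt.NA)
		}
	}
	if opt.Sequence {
		rec = append(rec, s.Sequence)
	}
	if opt.Quality {
		if s.Qualities == nil {
			rec = append(rec, opt.NA)
		} else {
			ascii := make([]byte, len(s.Qualities))
			for i, q := range s.Qualities {
				ascii[i] = q + opt.Shift // Phred score to ASCII
			}
			rec = append(rec, string(ascii))
		}
	}
	return rec
}

func main() {
	s := seq{ID: "s1", Sequence: "acgt", Count: 2, Attrs: map[string]any{"sample": "A"}}
	opt := csvOptions{ID: true, Count: true, Sequence: true, Quality: true,
		Keys: []string{"sample", "missing"}, NA: "NA", Shift: 33}
	fmt.Println(csvRecord(s, opt))
}
```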
@@ -0,0 +1,24 @@
|
||||
# `CSVTaxaIterator` Function — Semantic Description
|
||||
|
||||
The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.
|
||||
|
||||
### Core Functionality:
|
||||
- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
|
||||
- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.
|
||||
|
||||
### Configurable Output Fields (via options):
|
||||
- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
|
||||
- `taxid`: Either raw node ID (e.g., string pointer) or formatted taxon path (`WithRawTaxid` toggle).
|
||||
- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
|
||||
- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
|
||||
- `scientific_name`: Full scientific name of the taxon.
|
||||
- Custom metadata fields: Specified via `WithMetadata`, extracted from taxon metadata store.
|
||||
- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).
|
||||
|
||||
### Implementation Highlights:
|
||||
- Uses **goroutines** for non-blocking push of batches and clean shutdown (`WaitAndClose`, `Done`).
|
||||
- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
|
||||
- Dynamically builds CSV headers based on selected options before processing begins.
|
||||
|
||||
### Use Case:
|
||||
Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
|
||||
@@ -0,0 +1,27 @@
|
||||
## CSV Taxonomy Loader for OBITools4
|
||||
|
||||
This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.
|
||||
|
||||
### Key Features:
|
||||
- **Robust CSV Parsing**: Uses Go’s `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
|
||||
- **Column Mapping**: Dynamically identifies required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
|
||||
- **Error Handling**: Validates presence of all required columns; fails early with descriptive errors.
|
||||
- **Taxonomy Construction**:
|
||||
- Builds a hierarchical taxonomy using `obitax.Taxon` objects.
|
||||
- Ensures existence of a root node; returns error otherwise.
|
||||
- **Metadata Extraction**:
|
||||
- Derives taxonomy name and short code (e.g., prefix before `:` in first taxid).
|
||||
- Logs key metadata for traceability.
|
||||
- **Scalable Design**:
|
||||
- Processes records line-by-line (memory-efficient).
|
||||
- Supports large datasets via streaming CSV reading.
|
||||
|
||||
### Input Format:
|
||||
The CSV must contain the following four columns (case-sensitive headers):
|
||||
- `taxid`: Unique taxon identifier.
|
||||
- `parent`: Parent taxonomic node ID (empty for root).
|
||||
- `scientific_name`: Binomial or descriptive name.
|
||||
- `taxonomic_rank`: e.g., *species*, *genus*.
|
||||
|
||||
### Output:
|
||||
Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
|
||||
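The column mapping, early validation and root check described above might look like the following. This is a minimal sketch using only `encoding/csv`; `taxon`, `loadTaxonomy` and `loadTaxonomyString` are hypothetical names, and a plain map stands in for `obitax.Taxonomy`.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"strings"
)

// taxon is a hypothetical, flattened stand-in for obitax.Taxon.
type taxon struct{ ID, Parent, Name, Rank string }

// loadTaxonomy returns the taxa indexed by taxid and the taxid of the root.
func loadTaxonomy(r io.Reader) (map[string]taxon, string, error) {
	cr := csv.NewReader(r)
	cr.Comment = '#'
	cr.LazyQuotes = true
	cr.TrimLeadingSpace = true

	header, err := cr.Read()
	if err != nil {
		return nil, "", err
	}
	// Locate every required column by name; -1 means "not found".
	col := map[string]int{"taxid": -1, "parent": -1, "scientific_name": -1, "taxonomic_rank": -1}
	for i, h := range header {
		if _, ok := col[h]; ok {
			col[h] = i
		}
	}
	for name, i := range col {
		if i < 0 {
			return nil, "", fmt.Errorf("missing required column %q", name)
		}
	}

	taxa := map[string]taxon{}
	root := ""
	for {
		row, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, "", err
		}
		t := taxon{
			ID:     row[col["taxid"]],
			Parent: row[col["parent"]],
			Name:   row[col["scientific_name"]],
			Rank:   row[col["taxonomic_rank"]],
		}
		taxa[t.ID] = t
		if t.Parent == "" {
			root = t.ID // an empty parent marks the root
		}
	}
	if root == "" {
		return nil, "", errors.New("taxonomy has no root node")
	}
	return taxa, root, nil
}

// loadTaxonomyString is a convenience wrapper for in-memory data.
func loadTaxonomyString(data string) (map[string]taxon, string, error) {
	return loadTaxonomy(strings.NewReader(data))
}

func main() {
	data := "taxid,parent,scientific_name,taxonomic_rank\n1,,root,no rank\n2,1,Bacteria,superkingdom\n"
	taxa, root, err := loadTaxonomyString(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(taxa), "taxa, root =", root)
}
```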
@@ -0,0 +1,14 @@
|
||||
# Semantic Description of `obiformats.WriterDispatcher`
|
||||
|
||||
The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.
|
||||
|
||||
- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
|
||||
- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
|
||||
- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
|
||||
- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations like sample + region), parsing JSON-encoded classifier values when needed.
|
||||
- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
|
||||
- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
|
||||
- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
|
||||
- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.
|
||||
|
||||
In summary: `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
|
||||
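The one-goroutine-per-key pattern can be sketched as below. Everything here is an assumption made for illustration: `batchWriter` mimics the role of `SequenceBatchWriterToFile`, the set of keys is fixed up front (the real dispatcher discovers new keys dynamically through `dispatcher.News()`), the `%s` placeholder is an invented template convention, and errors are returned rather than passed to `log.Fatalf`.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// batchWriter is a hypothetical analogue of SequenceBatchWriterToFile.
type batchWriter func(filename string, batch []string) error

// dispatch starts one goroutine per classification key and waits for all of
// them. The "%s" placeholder in prototype is an assumed template convention.
func dispatch(prototype string, streams map[string]<-chan []string, write batchWriter, gz bool) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(streams)) // at most one error per goroutine
	for key, ch := range streams {
		name := strings.ReplaceAll(prototype, "%s", key)
		if gz {
			name += ".gz"
		}
		wg.Add(1)
		go func(name string, ch <-chan []string) {
			defer wg.Done()
			for batch := range ch {
				if err := write(name, batch); err != nil {
					errs <- err
					return // a real implementation would also drain ch
				}
			}
		}(name, ch)
	}
	wg.Wait()
	close(errs)
	return <-errs // nil when no goroutine reported an error
}

func main() {
	a, b := make(chan []string, 1), make(chan []string, 1)
	a <- []string{"seq1"}
	b <- []string{"seq2", "seq3"}
	close(a)
	close(b)
	err := dispatch("out_%s.fasta", map[string]<-chan []string{"sampleA": a, "sampleB": b},
		func(name string, batch []string) error {
			fmt.Println(name, len(batch))
			return nil
		}, true)
	if err != nil {
		panic(err)
	}
}
```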
@@ -0,0 +1,29 @@
|
||||
# EcoPCR File Parser for Biological Sequences
|
||||
|
||||
This Go package (`obiformats`) provides functionality to parse EcoPCR output files: `|`-delimited, CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
|
||||
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
|
||||
- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
|
||||
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
|
||||
- Name (with deduplication support)
|
||||
- Nucleotide/protein sequence
|
||||
- Comment field
|
||||
- **Structured Annotation**: Populates rich annotations including:
|
||||
- Taxonomic hierarchy (taxid, rank, species/genus/family names)
|
||||
- Primer matching info (`forward_match`, `reverse_mismatch`)
|
||||
- Melting temperatures (if present in v2)
|
||||
- Amplicon length and strand orientation
|
||||
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
|
||||
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
|
||||
|
||||
## Implementation Highlights
|
||||
|
||||
- Custom line reader (`__readline__`) for robust header parsing.
|
||||
- CSV parser configured with `|` delimiter and comment support (`#`).
|
||||
- Deduplication of sequence names using a running count suffix.
|
||||
- Concurrent goroutine-based streaming to decouple I/O and processing.
|
||||
|
||||
This module integrates with the broader *OBITools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
|
||||
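Three of the steps above (version detection, splitting a `|`-delimited record, and name deduplication) can be sketched as follows. The helper names are hypothetical and the `_N` suffix format is an assumption; the document only states that a running count is used.

```go
package main

import (
	"fmt"
	"strings"
)

// isV2 reports whether the first header line announces the v2 format.
func isV2(firstLine string) bool {
	return strings.HasPrefix(firstLine, "#@ecopcr-v2")
}

// splitRecord cuts a record on '|' and trims the padding around each field.
func splitRecord(line string) []string {
	fields := strings.Split(line, "|")
	for i := range fields {
		fields[i] = strings.TrimSpace(fields[i])
	}
	return fields
}

// uniqueName appends a running count to names that were already seen.
// The "_N" suffix is an assumed format, not necessarily the real one.
func uniqueName(seen map[string]int, name string) string {
	n := seen[name]
	seen[name] = n + 1
	if n == 0 {
		return name
	}
	return fmt.Sprintf("%s_%d", name, n)
}

func main() {
	seen := map[string]int{}
	for _, line := range []string{"AB001 | 120 | acgt", "AB001 | 118 | acga"} {
		f := splitRecord(line)
		fmt.Println(uniqueName(seen, f[0]), f[1:])
	}
	fmt.Println(isV2("#@ecopcr-v2"))
}
```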
@@ -0,0 +1,17 @@
|
||||
# EMBL Format Parser for OBITools4
|
||||
|
||||
This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:
|
||||
|
||||
- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
|
||||
- **Two Parsing Modes**:
|
||||
- `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
|
||||
- `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
|
||||
- **Configurable Options**:
|
||||
- `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
|
||||
- `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
|
||||
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
|
||||
- **Sequence Handling**: Parses multi-line EMBL sequences (10-bases-per-group, with position numbers), skipping digits and whitespace.
|
||||
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
|
||||
- **Integration**: Outputs are compatible with OBITools4’s iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.
|
||||
|
||||
Designed for scalability, the module handles large EMBL files efficiently—ideal for metagenomic or biodiversity data pipelines.
|
||||
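Two of the mechanisms above lend themselves to a short illustration: locating the last `//` terminator in a buffer, and cleaning a sequence line. This is a sketch, not the package's code: `endOfLastEntry` and `cleanSeqLine` are hypothetical names, and the return convention (offset just past the terminator line, `-1` when none is complete) is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// endOfLastEntry returns the offset just past the last complete "//" line,
// or -1 when the buffer holds no complete terminator line.
func endOfLastEntry(buf []byte) int {
	end := len(buf)
	for {
		i := bytes.LastIndex(buf[:end], []byte("\n//"))
		if i < 0 {
			return -1
		}
		rest := buf[i+3:]
		switch {
		case bytes.HasPrefix(rest, []byte("\r\n")):
			return i + 5
		case bytes.HasPrefix(rest, []byte("\n")):
			return i + 4
		}
		end = i + 2 // not a terminator line: keep searching before it
	}
}

// cleanSeqLine drops digits and blanks, lowercases, and optionally maps u to t.
func cleanSeqLine(line string, uToT bool) string {
	out := make([]byte, 0, len(line))
	for i := 0; i < len(line); i++ {
		c := line[i]
		switch {
		case c == ' ' || c == '\t' || (c >= '0' && c <= '9'):
			continue
		case c >= 'A' && c <= 'Z':
			c += 'a' - 'A'
		}
		if uToT && c == 'u' {
			c = 't'
		}
		out = append(out, c)
	}
	return string(out)
}

func main() {
	buf := []byte("ID A\n//\nID B\nSQ")
	cut := endOfLastEntry(buf)
	fmt.Printf("%q\n", buf[:cut])
	fmt.Println(cleanSeqLine("     acgUacgtac gtacgtacgt        20", true))
}
```

Cutting a chunk at the returned offset keeps every entry whole, so the remainder can be prepended to the next chunk before parsing.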
@@ -0,0 +1,22 @@
|
||||
## `ReadEmptyFile` Function — Semantic Description
|
||||
|
||||
- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
|
||||
- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
|
||||
- **Signature**:
|
||||
`func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
|
||||
- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
|
||||
- **Behavior**:
|
||||
- Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
|
||||
- Immediately closes the stream using `.Close()` — indicating no data will be yielded.
|
||||
- **Output**:
|
||||
- Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
|
||||
- Error return is always `nil`, since no I/O occurs and the operation is deterministic.
|
||||
|
||||
### Semantic Role & Use Cases
|
||||
- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
|
||||
- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
|
||||
- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.
|
||||
|
||||
### Design Notes
|
||||
- Reflects a *pure-functional* and *fail-safe* pattern: no side effects, deterministic behavior.
|
||||
- Aligns with iterator-based I/O design principles in OBITools4 (lazy, composable streams).
|
||||
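The idea of returning an already-closed iterator can be shown with a plain channel. The sketch below is an analogy only: `batchIterator` and `readEmpty` are hypothetical names and do not reproduce the `obiiter` API.

```go
package main

import "fmt"

// batchIterator is a hypothetical, channel-backed stand-in for an iterator.
type batchIterator struct{ batches chan []string }

// Next returns the next batch, or ok == false once the stream is closed.
func (it batchIterator) Next() (batch []string, ok bool) {
	batch, ok = <-it.batches
	return batch, ok
}

// readEmpty returns an iterator that is closed before any batch is pushed.
func readEmpty() (batchIterator, error) {
	it := batchIterator{batches: make(chan []string)}
	close(it.batches)
	return it, nil
}

func main() {
	it, _ := readEmpty()
	n := 0
	for {
		if _, ok := it.Next(); !ok {
			break
		}
		n++
	}
	fmt.Println("batches read:", n)
}
```

Because receiving from a closed channel returns immediately, callers can loop over the result exactly as they would over a populated stream, without a `nil` check.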
@@ -0,0 +1,34 @@
|
||||
# FASTA Parser Module (`obiformats`)
|
||||
|
||||
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
- **`FastaChunkParser(UtoT bool)`**
|
||||
Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
|
||||
|
||||
- **`FastaChunkParserRope(...)`**
|
||||
Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
|
||||
|
||||
- **`ReadFasta(reader io.Reader, ...)`**
|
||||
High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
|
||||
|
||||
- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
|
||||
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
|
||||
|
||||
- **`EndOfLastFastaEntry(...)`**
|
||||
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
|
||||
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
|
||||
- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
|
||||
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
|
||||
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- Minimal allocations via rope-based parsing (`extractFastaSeq`).
|
||||
- Graceful error reporting with context (source, identifier, invalid char position).
|
||||
- Extensible via `WithOption` pattern for header parsing and batching behavior.
|
||||
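The validation and normalization rules listed above can be sketched in a few lines. This is a simplified, string-based illustration, not the package's parser: `fastaEntry` and `parseFasta` are hypothetical names, and the set of accepted characters follows the description (`a-z`, `-`, `.`, `[`, `]`).

```go
package main

import (
	"fmt"
	"strings"
)

// fastaEntry is a hypothetical stand-in for obiseq.BioSequence.
type fastaEntry struct{ ID, Definition, Sequence string }

// parseFasta validates and normalizes every entry of an in-memory chunk.
func parseFasta(chunk string, uToT bool) ([]fastaEntry, error) {
	var out []fastaEntry
	var seq strings.Builder
	var cur *fastaEntry
	flush := func() {
		if cur != nil {
			cur.Sequence = seq.String()
			out = append(out, *cur)
			seq.Reset()
		}
	}
	for n, line := range strings.Split(chunk, "\n") {
		line = strings.TrimRight(line, "\r")
		if strings.HasPrefix(line, ">") {
			flush()
			id, def, _ := strings.Cut(strings.TrimSpace(line[1:]), " ")
			if id == "" {
				return nil, fmt.Errorf("line %d: empty identifier", n+1)
			}
			cur = &fastaEntry{ID: id, Definition: strings.TrimSpace(def)}
			continue
		}
		if cur == nil {
			if strings.TrimSpace(line) == "" {
				continue
			}
			return nil, fmt.Errorf("line %d: data before first '>'", n+1)
		}
		for _, c := range []byte(line) {
			switch {
			case c == ' ' || c == '\t':
				continue // blanks inside sequences are ignored
			case c >= 'A' && c <= 'Z':
				c += 'a' - 'A'
			}
			if uToT && c == 'u' {
				c = 't'
			}
			if !(c >= 'a' && c <= 'z') && !strings.ContainsRune("-.[]", rune(c)) {
				return nil, fmt.Errorf("line %d: invalid character %q", n+1, c)
			}
			seq.WriteByte(c)
		}
	}
	flush()
	return out, nil
}

func main() {
	entries, err := parseFasta(">s1 first seq\nACGU\nac gt\n>s2\nnn-n\n", true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", entries)
}
```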
@@ -0,0 +1,41 @@
|
||||
# FASTQ Parsing Module (`obiformats`)
|
||||
|
||||
This Go package provides robust, streaming-capable parsing of FASTQ files — a standard format for storing nucleotide sequences along with quality scores.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
- **`EndOfLastFastqEntry(buffer []byte) int`**
|
||||
Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry found.
|
||||
|
||||
- **`FastqChunkParser(...)`**
|
||||
Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
|
||||
- Header parsing (`@id [definition]`)
|
||||
- Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
|
||||
- Quality score shifting (`quality_shift`)
|
||||
- Strict validation (e.g., `+` separator line, matching sequence and quality lengths)
|
||||
|
||||
- **`FastqChunkParserRope(...)`**
|
||||
Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.
|
||||
|
||||
- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
|
||||
Enables concurrent, chunked parsing of large files:
|
||||
- Splits input into chunks using `ReadFileChunk`
|
||||
- Uses configurable parallel workers (`nworker`)
|
||||
- Pushes parsed batches to an iterator interface
|
||||
|
||||
- **Convenience I/O Wrappers**
|
||||
- `ReadFastqFromFile(filename, ...)`: Parses a file by name.
|
||||
- `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.
|
||||
|
||||
## Key Options & Features
|
||||
|
||||
- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
|
||||
- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
|
||||
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
|
||||
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
|
||||
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid chars, length mismatch)
|
||||
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
|
||||
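The per-record checks described above (header marker, `+` line, equal lengths, quality shift) can be illustrated on a single four-line record. This is a sketch under assumptions: `fastqEntry` and `parseFastqRecord` are hypothetical names, and multi-line sequences are not handled.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fastqEntry is a hypothetical stand-in for obiseq.BioSequence.
type fastqEntry struct {
	ID, Definition, Sequence string
	Qualities                []byte // Phred scores with the shift removed
}

// parseFastqRecord validates and decodes one four-line FASTQ record.
func parseFastqRecord(lines [4]string, shift byte, uToT bool) (fastqEntry, error) {
	if !strings.HasPrefix(lines[0], "@") {
		return fastqEntry{}, errors.New("header must start with '@'")
	}
	if !strings.HasPrefix(lines[2], "+") {
		return fastqEntry{}, errors.New("third line must start with '+'")
	}
	if len(lines[1]) != len(lines[3]) {
		return fastqEntry{}, fmt.Errorf("sequence length %d differs from quality length %d",
			len(lines[1]), len(lines[3]))
	}
	var e fastqEntry
	e.ID, e.Definition, _ = strings.Cut(lines[0][1:], " ")

	seq := []byte(strings.ToLower(lines[1]))
	for i, c := range seq {
		if uToT && c == 'u' {
			seq[i] = 't'
		}
	}
	e.Sequence = string(seq)

	e.Qualities = make([]byte, len(lines[3]))
	for i := 0; i < len(lines[3]); i++ {
		q := lines[3][i]
		if q < shift {
			return fastqEntry{}, fmt.Errorf("quality character %q is below shift %d", q, shift)
		}
		e.Qualities[i] = q - shift
	}
	return e, nil
}

func main() {
	e, err := parseFastqRecord([4]string{"@r1 sample=A", "ACGU", "+", "II5!"}, 33, true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", e)
}
```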
@@ -0,0 +1,11 @@
|
||||
## Semantic Description of `obiformats` Package
|
||||
|
||||
The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:
|
||||
- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.
|
||||
- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.
|
||||
|
||||
Two main constructor functions enable flexible formatting:
|
||||
- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.
|
||||
- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.
|
||||
|
||||
The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors—e.g., unsupported formats trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, making it ideal for pipeline integration in NGS data processing tools.
|
||||
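The composition of a batch formatter from a sequence-level formatter can be sketched as follows. The types and function names are hypothetical analogues of those named above; unlike the package, which logs a fatal error, this sketch returns an error for unsupported formats so that it stays testable.

```go
package main

import (
	"bytes"
	"fmt"
)

// seq is a hypothetical stand-in for obiseq.BioSequence.
type seq struct{ ID, Sequence, Qualities string }

// Hypothetical analogues of FormatHeader, BioSequenceFormater and
// BioSequenceBatchFormater.
type formatHeader func(*seq) string
type seqFormatter func(*seq) string
type batchFormatter func([]*seq) []byte

// buildSeqFormatter selects a per-sequence formatter by format name.
func buildSeqFormatter(format string, header formatHeader) (seqFormatter, error) {
	switch format {
	case "fasta":
		return func(s *seq) string {
			return ">" + s.ID + " " + header(s) + "\n" + s.Sequence
		}, nil
	case "fastq":
		return func(s *seq) string {
			return "@" + s.ID + " " + header(s) + "\n" + s.Sequence + "\n+\n" + s.Qualities
		}, nil
	}
	return nil, fmt.Errorf("unsupported format %q", format)
}

// buildBatchFormatter composes the per-sequence formatter over a batch,
// ending each formatted sequence with a newline.
func buildBatchFormatter(format string, header formatHeader) (batchFormatter, error) {
	one, err := buildSeqFormatter(format, header)
	if err != nil {
		return nil, err
	}
	return func(batch []*seq) []byte {
		var buf bytes.Buffer
		for _, s := range batch {
			buf.WriteString(one(s))
			buf.WriteByte('\n')
		}
		return buf.Bytes()
	}, nil
}

func main() {
	header := func(s *seq) string { return fmt.Sprintf("length=%d", len(s.Sequence)) }
	format, err := buildBatchFormatter("fastq", header)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(format([]*seq{{ID: "r1", Sequence: "acgt", Qualities: "IIII"}})))
}
```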
@@ -0,0 +1,27 @@
|
||||
# Semantic Description of `obiformats` Package
|
||||
|
||||
The `obiformats` package provides utilities for parsing sequence headers in the OBITools4 framework, supporting two distinct formats:
|
||||
|
||||
- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
|
||||
- **Legacy OBI format** (plain text with `key=value;` annotations): Used when no JSON prefix is present.
|
||||
|
||||
## Core Functions
|
||||
|
||||
- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**
|
||||
Dynamically routes header parsing based on the first character of the sequence definition:
|
||||
- Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
|
||||
- Otherwise invokes `ParseFastSeqOBIHeader`.
|
||||
|
||||
- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**
|
||||
Applies header parsing to a *batch* of sequences:
|
||||
- Takes an iterator over `BioSequence`s.
|
||||
- Uses optional configuration (e.g., parallelism, parsing behavior).
|
||||
- Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.
|
||||
|
||||
## Design Principles
|
||||
|
||||
- **Format agnosticism**: Automatically detects header type.
|
||||
- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
|
||||
- **Extensibility**: Options pattern (`WithOption`) supports runtime customization.
|
||||
|
||||
This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
|
||||
@@ -0,0 +1,28 @@
|
||||
# `FormatHeader` Function Type in `obiformats`
|
||||
|
||||
The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.
|
||||
|
||||
- **Package**: `obiformats`
|
||||
Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).
|
||||
|
||||
- **Type Definition**:
|
||||
```go
|
||||
type FormatHeader func(sequence *obiseq.BioSequence) string
|
||||
```
|
||||
- A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.
|
||||
|
||||
- **Semantic Role**:
|
||||
Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.
|
||||
Decouples header formatting from core data structures (`BioSequence`), enabling modular and reusable format adapters.
|
||||
|
||||
- **Usage Context**:
|
||||
- Used by writers/formatters to produce standardized headers when exporting sequences.
|
||||
- Allows custom header generation (e.g., for MIxS-compliant metadata, user-defined tags).
|
||||
- Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.
|
||||
|
||||
- **Dependencies**:
|
||||
- Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).
|
||||
|
||||
- **Design Intent**:
|
||||
Promotes clean separation of concerns: data (sequence) ↔ formatting logic.
|
||||
Facilitates extensibility for new output formats without modifying core types.
|
||||
@@ -0,0 +1,21 @@
|
||||
This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.
|
||||
|
||||
- **JSON Parsing Helpers**:
|
||||
It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.
|
||||
|
||||
- **Header Interpretation**:
|
||||
`_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
|
||||
- Core fields (`id`, `definition`, `count`)
|
||||
- Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
|
||||
- Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.
|
||||
|
||||
- **Sequence Annotation Enrichment**:
|
||||
`ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing non-JSON text as the new definition.
|
||||
|
||||
- **Serialization Support**:
|
||||
`WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON format, appending them to a buffer or returning them as a string, which enables round-trip compatibility for annotated sequences.
|
||||
|
||||
- **Error Handling**:
|
||||
Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.
|
||||
|
||||
In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
|
||||
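The split between JSON annotations and leftover definition text can be sketched with the standard library. Note the substitution: the package is described as using `jsonparser`, whereas this illustration uses `encoding/json`; `parseJSONHeader` is a hypothetical name, and the sketch returns an error instead of calling `log.Fatalf`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseJSONHeader separates a leading JSON object from the free text after it.
func parseJSONHeader(definition string) (map[string]any, string, error) {
	definition = strings.TrimSpace(definition)
	if !strings.HasPrefix(definition, "{") {
		return nil, definition, nil // no JSON block: everything is free text
	}
	dec := json.NewDecoder(strings.NewReader(definition))
	dec.UseNumber() // keep numbers as json.Number to preserve precision
	annotations := map[string]any{}
	if err := dec.Decode(&annotations); err != nil {
		return nil, "", fmt.Errorf("malformed JSON header: %w", err)
	}
	// InputOffset points just past the decoded object.
	rest := strings.TrimSpace(definition[int(dec.InputOffset()):])
	return annotations, rest, nil
}

func main() {
	ann, rest, err := parseJSONHeader(`{"count":3,"taxid":"taxon:9606"} Homo sapiens`)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann["count"], ann["taxid"], "|", rest)
}
```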
@@ -0,0 +1,31 @@
|
||||
# OBIFormats Package: Semantic Description
|
||||
|
||||
The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
|
||||
|
||||
- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
|
||||
- Four core parsing helpers detect keys and value types:
|
||||
- `__match__key__`: Identifies assignment patterns (`Key = ...`).
|
||||
- `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).
|
||||
- `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).
|
||||
- `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
|
||||
|
||||
- Boolean detection (`__is_true__/__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
|
||||
|
||||
- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`**,
|
||||
iteratively extracts key-value pairs from a header string and populates an `Annotation` map.
|
||||
- Numeric values are stored as integers if they have no fractional part.
|
||||
- Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:
|
||||
- `*_count` → `map[string]int`.
|
||||
- `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`).
|
||||
- `*_status`/`*_mutation` → `map[string]string`.
|
||||
|
||||
- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence’s definition line, moving annotations into its metadata map and preserving leftover text.
|
||||
|
||||
- **`WriteFastSeqOBIHeade(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:
|
||||
- Strings and booleans use `key=value;`.
|
||||
- Maps/dicts are JSON-encoded, then single-quoted for compatibility.
|
||||
- Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
|
||||
|
||||
- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
|
||||
|
||||
- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
|
||||
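The iterative extraction of `key=value;` pairs can be sketched with a regular expression. This is a deliberately reduced illustration: `parseOBIFeatures` is a hypothetical name, the patterns are assumptions rather than the package's own, dictionary values (`{...}`) and values containing `;` are not handled, and booleans are matched case-insensitively.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strconv"
	"strings"
)

var (
	// pair matches one leading "key = value;" assignment (assumed syntax).
	pair = regexp.MustCompile(`^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*([^;]*);`)
	// numeric matches plain integers and floats.
	numeric = regexp.MustCompile(`^[+-]?[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?$`)
)

// parseOBIFeatures fills annotations and returns the text left unparsed.
func parseOBIFeatures(text string, annotations map[string]any) string {
	for {
		m := pair.FindStringSubmatch(text)
		if m == nil {
			return strings.TrimSpace(text)
		}
		key, raw := m[1], strings.TrimSpace(m[2])
		switch {
		case strings.EqualFold(raw, "true"):
			annotations[key] = true
		case strings.EqualFold(raw, "false"):
			annotations[key] = false
		case numeric.MatchString(raw):
			f, _ := strconv.ParseFloat(raw, 64)
			if f == math.Trunc(f) && math.Abs(f) < 1<<53 {
				annotations[key] = int(f) // no fractional part: store an int
			} else {
				annotations[key] = f
			}
		default:
			annotations[key] = strings.Trim(raw, `'"`)
		}
		text = text[len(m[0]):]
	}
}

func main() {
	ann := map[string]any{}
	rest := parseOBIFeatures("count=3; score=1.5; ok=True; sample='A1'; free text", ann)
	fmt.Println(ann, "|", rest)
}
```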
@@ -0,0 +1,26 @@
|
||||
# FastSeq Reader Module — Semantic Description
|
||||
|
||||
This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
|
||||
- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
|
||||
- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
|
||||
- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
|
||||
- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
|
||||
- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
|
||||
- **Stdin & File I/O**: Two entry points:
|
||||
- `ReadFastSeqFromFile(filename, ...)` for regular files.
|
||||
- `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
|
||||
- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
|
||||
- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.
|
||||
|
||||
## Integration
|
||||
|
||||
Built on top of `obitools4`’s core abstractions:
|
||||
- `obiiter.IBioSequence`: Iterator interface for biological sequences.
|
||||
- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
|
||||
- `obiutils`, `obidefault`: Utilities for path handling and defaults.
|
||||
|
||||
Designed for scalability in high-throughput metabarcoding pipelines.
|
||||
@@ -0,0 +1,35 @@
|
||||
# `obiformats` Package Overview
|
||||
|
||||
The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.
|
||||
|
||||
## Core Formatting Functions
|
||||
|
||||
- **`FormatFasta(seq, formater)`**
|
||||
Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.
|
||||
|
||||
- **`FormatFastaBatch(batch, formater, skipEmpty)`**
|
||||
Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.
|
||||
|
||||
## File Writing Functions
|
||||
|
||||
- **`WriteFasta(iterator, file, options...)`**
|
||||
Writes a stream of sequences to any `io.WriteCloser`. Supports:
|
||||
- Parallel workers (`ParallelWorkers`)
|
||||
- Chunked writing via `WriteFileChunk`
|
||||
- Optional compression (e.g., gzip)
|
||||
Returns a new iterator mirroring the input for pipeline chaining.
|
||||
|
||||
- **`WriteFastaToStdout(iterator, options...)`**
|
||||
Convenience wrapper that writes FASTA directly to `stdout`, with configurable file-closing behavior.
|
||||
|
||||
- **`WriteFastaToFile(iterator, filename, options...)`**
|
||||
Writes to a named file with:
|
||||
- Truncation or append mode (`AppendFile`)
|
||||
- Automatic paired-end output if `HaveToSavePaired()` is enabled
|
||||
(writes reverse reads to a secondary file specified via `PairedFileName`)
|
||||
|
||||
## Key Design Highlights
|
||||
|
||||
- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
|
||||
- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
|
||||
- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
|
||||
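The header-plus-folded-lines layout described for `FormatFasta` can be sketched as follows. `formatFasta` is a hypothetical, simplified function: it takes plain strings instead of a `BioSequence`, and the trailing newline after the last line is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// formatFasta writes ">id definition" followed by lines of at most width bases.
func formatFasta(id, definition, sequence string, width int) string {
	if width <= 0 {
		width = 60 // line length mentioned in the description above
	}
	var b strings.Builder
	// Reserve room up front to avoid repeated reallocations.
	b.Grow(len(id) + len(definition) + len(sequence) + len(sequence)/width + 4)
	b.WriteByte('>')
	b.WriteString(id)
	if definition != "" {
		b.WriteByte(' ')
		b.WriteString(definition)
	}
	b.WriteByte('\n')
	for start := 0; start < len(sequence); start += width {
		end := start + width
		if end > len(sequence) {
			end = len(sequence)
		}
		b.WriteString(sequence[start:end])
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	fmt.Print(formatFasta("s1", "demo", "acgtacgtac", 4))
}
```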
@@ -0,0 +1,35 @@
|
||||
# FASTQ Output Module (`obiformats`)
|
||||
|
||||
This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into a FASTQ string.
|
||||
- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.
|
||||
|
||||
## Header Customization
|
||||
|
||||
- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.
|
||||
|
||||
## Writing to Streams/Files
|
||||
|
||||
- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
|
||||
- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
|
||||
- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
|
||||
- Append/truncate modes
|
||||
- Paired-end output (splits iterator and writes to two files)
|
||||
- Automatic compression via `obiutils.CompressStream`
|
||||
|
||||
## Parallelization & Robustness
|
||||
|
||||
- Uses goroutines to parallelize formatting/writing across multiple workers.
|
||||
- Handles empty sequences gracefully: logs warning or fatal error based on `skipEmpty` option.
|
||||
- Ensures ordered output via batch tracking (`Order()`) and chunked writing.
|
||||
|
||||
## Integration
|
||||
|
||||
Designed to work seamlessly with the `obitools4` ecosystem:
|
||||
- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
|
||||
- Extensible through functional options (`WithOption`) for configuration.
|
||||
|
||||
> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
|
||||
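Ordered output from parallel workers usually relies on a small reorder buffer: formatted chunks arrive in any order and are emitted only when their turn comes. The sketch below illustrates that general technique; `chunk` and `writeOrdered` are hypothetical names, and whether the package implements ordering exactly this way is an assumption.

```go
package main

import "fmt"

// chunk is a hypothetical formatted batch tagged with its original position.
type chunk struct {
	order int
	data  []byte
}

// writeOrdered emits chunks in increasing order, buffering early arrivals.
func writeOrdered(in <-chan chunk, emit func([]byte)) {
	pending := map[int][]byte{}
	next := 0
	for c := range in {
		pending[c.order] = c.data
		for {
			d, ok := pending[next]
			if !ok {
				break // the next chunk in sequence has not arrived yet
			}
			emit(d)
			delete(pending, next)
			next++
		}
	}
}

func main() {
	in := make(chan chunk, 3)
	in <- chunk{2, []byte("@r3\n")}
	in <- chunk{0, []byte("@r1\n")}
	in <- chunk{1, []byte("@r2\n")}
	close(in)
	writeOrdered(in, func(b []byte) { fmt.Print(string(b)) })
}
```

Memory use is bounded by how far ahead the fastest worker gets, which is why batch sizes and worker counts are usually tuned together.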
@@ -0,0 +1,19 @@
|
||||
# `obiformats` Package Overview
|
||||
|
||||
The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:
|
||||
|
||||
- **Format Abstraction**: Defines common interfaces and base classes for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
|
||||
|
||||
- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
|
||||
|
||||
- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
|
||||
|
||||
- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
|
||||
|
||||
- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
|
||||
|
||||
- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
|
||||
|
||||
- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.
|
||||
|
||||
The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
|
||||
@@ -0,0 +1,25 @@
|
||||
# Semantic Description of `obiformats` Package Functionalities
|
||||
|
||||
The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.
|
||||
|
||||
- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
|
||||
|
||||
- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
|
||||
|
||||
- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
|
||||
|
||||
- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries.
|
||||
|
||||
- **`ReadFileChunk()`**: Core function that:
|
||||
- Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
|
||||
- Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
|
||||
- Extends chunks incrementally (e.g., +1 MB) until a full record boundary is found via `splitter`;
|
||||
- Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
|
||||
- Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
|
||||
|
||||
- **Key semantics**:
|
||||
- *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.
|
||||
- *Lazy evaluation*: only reads ahead when needed to find record boundaries.
|
||||
- *Streaming-first design* — supports large files without full loading into memory.
|
||||
|
||||
This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
|
||||
@@ -0,0 +1,26 @@
|
||||
# `WriteFileChunk` Function — Semantic Description
|
||||
|
||||
The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.
|
||||
|
||||
- **Input**:
|
||||
- `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.
|
||||
- `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.
|
||||
|
||||
- **Core Behavior**:
|
||||
- Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).
|
||||
- Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).
|
||||
- If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.
|
||||
- Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.
|
||||
|
||||
- **Buffer Management**:
|
||||
- After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., if order 2 arrives, it triggers writing of buffered orders 3,4,... as available).
|
||||
|
||||
- **Error Handling**:
|
||||
- Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.
|
||||
|
||||
- **Cleanup & Lifecycle**:
|
||||
- Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.
|
||||
- Returns the input channel, enabling external producers to stream `FileChunk` structs.
|
||||
|
||||
- **Use Case**:
|
||||
Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
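The reordering logic described above can be sketched as follows. This is a minimal illustration under assumptions: the `chunk` type, the `writeInOrder` name, and the `write` callback are hypothetical stand-ins, and error handling, writer closing, and pipe registration are left out.

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for FileChunk: a payload plus its rank.
type chunk struct {
	Order int
	Data  []byte
}

// writeInOrder consumes chunks in arbitrary order and calls write in strict
// Order sequence, buffering early arrivals in a map until their turn comes.
func writeInOrder(in <-chan chunk, write func([]byte)) {
	pending := make(map[int][]byte)
	next := 0
	for c := range in {
		if c.Order != next {
			pending[c.Order] = c.Data
			continue
		}
		write(c.Data)
		next++
		// Flush every buffered chunk that has now become consecutive.
		for {
			data, ok := pending[next]
			if !ok {
				break
			}
			write(data)
			delete(pending, next)
			next++
		}
	}
}

func main() {
	in := make(chan chunk)
	done := make(chan struct{})
	go func() {
		writeInOrder(in, func(b []byte) { fmt.Print(string(b)) })
		close(done)
	}()
	for _, c := range []chunk{{2, []byte("C\n")}, {0, []byte("A\n")}, {1, []byte("B\n")}} {
		in <- c
	}
	close(in)
	<-done
}
```

The map only ever holds chunks that arrived ahead of their turn, so memory use stays bounded by how far producers run ahead of the slowest one.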
|
||||
@@ -0,0 +1,34 @@
|
||||
# GenBank Parser Module (`obiformats`)
|
||||
|
||||
This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
|
||||
- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
|
||||
- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
|
||||
- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
|
||||
- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
|
||||
- **Parallel streaming I/O**:
|
||||
- `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
|
||||
- Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.
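To make the metadata step concrete, here is a small sketch of pulling a taxonomic ID out of a feature-table line containing `/db_xref="taxon:..."`. The helper name `extractTaxid` is an assumption for this example; the actual parser works inside its state machine and may proceed differently.

```go
package main

import (
	"fmt"
	"strings"
)

// extractTaxid returns the value following /db_xref="taxon: on a GenBank
// feature line, or false when the qualifier is absent or unterminated.
func extractTaxid(line string) (string, bool) {
	const key = `/db_xref="taxon:`
	i := strings.Index(line, key)
	if i < 0 {
		return "", false
	}
	rest := line[i+len(key):]
	j := strings.IndexByte(rest, '"')
	if j < 0 {
		return "", false
	}
	return rest[:j], true
}

func main() {
	line := `                     /db_xref="taxon:9606"`
	if taxid, ok := extractTaxid(line); ok {
		fmt.Println("taxid:", taxid)
	}
}
```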
|
||||
|
||||
## Key Design Decisions
|
||||
|
||||
- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
|
||||
- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
|
||||
- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
|
||||
- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
|
||||
- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.
|
||||
|
||||
## Output
|
||||
|
||||
Returns a batched iterator of `BioSequence` objects, each containing:
|
||||
- Identifier (`id`)
|
||||
- Compact nucleotide sequence
|
||||
- Definition line (as description)
|
||||
- Source file origin
|
||||
- Optional feature table bytes
|
||||
- Annotations: `scientific_name`, `taxid`
|
||||
|
||||
Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
|
||||
@@ -0,0 +1,27 @@
|
||||
# JSON Output Module for Biological Sequences (`obiformats`)
|
||||
|
||||
This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.
|
||||
|
||||
- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
|
||||
- `"id"`: Sequence identifier.
|
||||
- `"sequence"` (optional): Nucleotide/protein sequence string if present.
|
||||
- `"qualities"` (optional): Quality scores as a string if available.
|
||||
- `"annotations"` (optional): Metadata annotations map.
|
||||
|
||||
- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
|
||||
|
||||
- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
|
||||
- Parallel workers (configurable via options).
|
||||
- Automatic compression (`gzip`/`bgzip`) if enabled.
|
||||
- Proper JSON array wrapping: `[`, chunked batches, and final `]`.
|
||||
- Atomic ordering to preserve sequence integrity across parallel writes.
|
||||
|
||||
- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
|
||||
- Outputs to stdout or a file (with append/truncate control).
|
||||
- Supports paired-end data: writes both forward and reverse reads to separate files when configured.
|
||||
|
||||
- **Internal helpers**:
|
||||
- `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
|
||||
- Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.
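The optional fields listed for `JSONRecord` map naturally onto `omitempty` struct tags. The sketch below shows that idea with a hypothetical `record` struct and `jsonRecord` function; the package itself may build the object differently (for instance from a map).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record is an illustrative shape for one sequence; empty optional fields
// are dropped from the output thanks to omitempty.
type record struct {
	ID          string                 `json:"id"`
	Sequence    string                 `json:"sequence,omitempty"`
	Qualities   string                 `json:"qualities,omitempty"`
	Annotations map[string]interface{} `json:"annotations,omitempty"`
}

// jsonRecord serializes one record as an indented JSON object.
func jsonRecord(r record) ([]byte, error) {
	return json.MarshalIndent(r, "", "  ")
}

func main() {
	b, err := jsonRecord(record{
		ID:          "seq1",
		Sequence:    "acgt",
		Annotations: map[string]interface{}{"count": 3},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

When several such objects are written in parallel, the enclosing `[` and `]` and the commas between batches have to be emitted in batch order, which is why the ordered chunk writer matters for JSON validity.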
|
||||
|
||||
Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
|
||||
@@ -0,0 +1,17 @@
|
||||
# NCBI Taxonomy Loader Module (`obiformats`)
|
||||
|
||||
This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:
|
||||
|
||||
- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
|
||||
- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
|
||||
- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.
|
||||
|
||||
Key features:
|
||||
- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
|
||||
- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
|
||||
- Efficient buffered reading (`bufio.Reader`) for large files.
|
||||
- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
|
||||
- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
|
||||
- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).
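The `|`-delimited parsing with whitespace trimming can be sketched on a single dump line. NCBI dump files separate fields with `\t|\t` and end each row with `\t|`; the helper name `parseDmpLine` is an assumption for this example and does not cover comment handling.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDmpLine splits one NCBI dump row on '|' and trims the tabs and
// spaces around every field. The row terminator "\t|" is removed first so
// it does not produce a spurious empty last field.
func parseDmpLine(line string) []string {
	line = strings.TrimSpace(line)
	line = strings.TrimSuffix(line, "|")
	fields := strings.Split(line, "|")
	for i, f := range fields {
		fields[i] = strings.TrimSpace(f)
	}
	return fields
}

func main() {
	fields := parseDmpLine("9606\t|\t9605\t|\tspecies\t|")
	fmt.Printf("taxid=%s parent=%s rank=%s\n", fields[0], fields[1], fields[2])
}
```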
|
||||
|
||||
The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
|
||||
@@ -0,0 +1,31 @@
|
||||
## NCBI Taxonomy Archive Support in `obiformats`
|
||||
|
||||
This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.
|
||||
|
||||
### Core Functionalities
|
||||
|
||||
1. **Archive Validation (`IsNCBITarTaxDump`)**
|
||||
- Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
|
||||
- Returns a boolean indicating if the archive is a complete NCBI tax dump.
|
||||
|
||||
2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
|
||||
- Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
|
||||
- Steps include:
|
||||
- **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
|
||||
- **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
|
||||
- **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
|
||||
- Sets the root taxon to NCBI’s default (`taxid = 1`, i.e., *root*).
|
||||
|
||||
3. **Integration with Other Modules**
|
||||
- Uses `obiutils.Ropen`, `TarFileReader` for robust file handling.
|
||||
- Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.
|
||||
|
||||
### Key Parameters
|
||||
- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
|
||||
- `seqAsTaxa`: Reserved for future use; currently unused.
|
||||
|
||||
### Logging & Error Handling
|
||||
- Uses `logrus` to log loading progress and counts.
|
||||
- Returns descriptive errors if required files or the root taxon are missing.
|
||||
|
||||
> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
|
||||
@@ -0,0 +1,31 @@
|
||||
# Newick Format Export Functionality in `obiformats`
|
||||
|
||||
This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.
|
||||
|
||||
## Core Components
|
||||
|
||||
- `Tree`: A struct modeling a node in a Newick tree, containing:
|
||||
- `Children`: list of child nodes (nested trees),
|
||||
- `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
|
||||
- `Length`: optional branch length (evolutionary distance).
|
||||
|
||||
- **`Newick()` methods**:
|
||||
- `Tree.Newick(...)`: Recursively generates a Newick string for the subtree.
|
||||
Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
|
||||
- Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.
|
||||
|
||||
- **Writing Functions**:
|
||||
- `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
|
||||
- Accepts an iterator over taxa (`*obitax.ITaxon`).
|
||||
- Validates single-taxonomy input.
|
||||
- Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
|
||||
- `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
|
||||
- `WriteNewickToStdout(...)`: Outputs Newick tree to standard output.
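The recursive generation can be illustrated with a stripped-down tree. The `node` type below is a hypothetical simplification of `Tree` (a plain label instead of a `TaxNode`, no branch lengths), and labels are assumed to contain no characters that would need quoting in Newick.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a simplified tree node: a label and its children.
type node struct {
	Name     string
	Children []*node
}

// newick renders the subtree rooted at n, children first, then the label.
func (n *node) newick() string {
	var sb strings.Builder
	if len(n.Children) > 0 {
		sb.WriteByte('(')
		for i, c := range n.Children {
			if i > 0 {
				sb.WriteByte(',')
			}
			sb.WriteString(c.newick())
		}
		sb.WriteByte(')')
	}
	sb.WriteString(n.Name)
	return sb.String()
}

// toNewick renders a whole tree and adds the terminating semicolon.
func toNewick(root *node) string {
	return root.newick() + ";"
}

func main() {
	d := &node{Name: "d", Children: []*node{{Name: "b"}, {Name: "c"}}}
	root := &node{Name: "root", Children: []*node{{Name: "a"}, d}}
	fmt.Println(toNewick(root))
}
```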
|
||||
|
||||
## Configuration Options
|
||||
|
||||
Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).
|
||||
|
||||
## Semantic Summary
|
||||
|
||||
The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
|
||||
@@ -0,0 +1,47 @@
|
||||
# NGSFilter Configuration Parser — Semantic Overview
|
||||
|
||||
This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats: a legacy line-based text format (`ReadOldNGSFilter`) and a newer CSV-based format with parameter headers (`ReadCSVNGSFilter`).
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Format Detection**:
|
||||
`OBIMimeNGSFilterTypeGuesser` detects MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text.
|
||||
A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).
|
||||
|
||||
- **Dual Input Parsing**:
|
||||
- `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
|
||||
- Primer pairs (`forward`, `reverse`)
|
||||
- Tag pairs (with optional `-` for untagged direction)
|
||||
- Experiment/sample metadata
|
||||
- OBIFeatures annotations (via `ParseOBIFeatures`)
|
||||
- `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:
|
||||
`"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`
|
||||
Additional columns are stored as annotations.
|
||||
|
||||
- **Parameter Configuration**:
|
||||
A set of `@param` lines (in CSV or legacy format) configures global or primer-specific settings:
|
||||
- `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
|
||||
- `tag_delimiter` / directional variants: Symbol separating tags in sequences
|
||||
- `matching`: Tag matching algorithm (e.g., exact, fuzzy)
|
||||
- Error tolerance:
|
||||
`primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches)
|
||||
`tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
|
||||
- Indel handling:
|
||||
`indels` / directional variants (`true/false`) to enable/disable indels in primer matching
|
||||
|
||||
- **Validation & Integrity Checks**:
|
||||
- `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
|
||||
- Duplicate tag-pair detection per marker (error on reuse).
|
||||
- Strict column/field validation with informative error messages.
|
||||
|
||||
- **Logging & Observability**:
|
||||
Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).
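The mandatory-column check for the CSV format can be sketched as below. The column names come from the list above; the `missingColumns` helper and the way the header is read are assumptions of this example, and `@param` lines are not handled here.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// mandatory lists the columns every CSV configuration must provide.
var mandatory = []string{
	"experiment", "sample", "sample_tag", "forward_primer", "reverse_primer",
}

// missingColumns returns the mandatory columns absent from header.
func missingColumns(header []string) []string {
	seen := make(map[string]bool, len(header))
	for _, h := range header {
		seen[strings.TrimSpace(h)] = true
	}
	var missing []string
	for _, m := range mandatory {
		if !seen[m] {
			missing = append(missing, m)
		}
	}
	return missing
}

func main() {
	input := "experiment,sample,forward_primer,reverse_primer,location\n"
	header, err := csv.NewReader(strings.NewReader(input)).Read()
	if err != nil {
		panic(err)
	}
	fmt.Println("missing columns:", missingColumns(header))
}
```

Any column that is not in the mandatory list (`location` in this input) would be kept as a per-sample annotation rather than rejected.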
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Extensibility**: New parameters can be added via `library_parameter` map.
|
||||
- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
|
||||
- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
|
||||
- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.
|
||||
|
||||
> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
|
||||
@@ -0,0 +1,14 @@
|
||||
# Semantic Description of `obiformats` Package Functionalities
|
||||
|
||||
The Go package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).
|
||||
|
||||
Key capabilities include:
|
||||
- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).
|
||||
- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).
|
||||
- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).
|
||||
- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).
|
||||
- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion for ambiguous bases.
|
||||
- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.
|
||||
- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).
|
||||
|
||||
All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
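For readers unfamiliar with the functional-options pattern, here is a minimal sketch. All names (`options`, `withOption`, `batchSize`, `compressed`, `makeOptions`) and the default values are invented for this example; they mirror the idea of `WithOption` setters and `MakeOptions`, not their actual signatures.

```go
package main

import "fmt"

// options holds the settings; callers never build it directly.
type options struct {
	batchSize  int
	workers    int
	compressed bool
}

// withOption is a setter applied to the options under construction.
type withOption func(*options)

func batchSize(n int) withOption  { return func(o *options) { o.batchSize = n } }
func compressed(b bool) withOption { return func(o *options) { o.compressed = b } }

// makeOptions starts from defaults (arbitrary values here) and applies the
// setters in order, so later setters override earlier ones.
func makeOptions(setters ...withOption) options {
	o := options{batchSize: 1000, workers: 4}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	o := makeOptions(batchSize(5000), compressed(true))
	fmt.Printf("%+v\n", o)
}
```

The benefit of this design is that new settings can be added without changing the signature of any reader or writer that accepts a list of setters.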
|
||||
@@ -0,0 +1,27 @@
|
||||
# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure
|
||||
|
||||
The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *rope* (a chain of linked buffers, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`newRopeScanner(rope *PieceOfChunk)`**
|
||||
Constructs a new scanner starting at the root of the rope.
|
||||
|
||||
- **`ReadLine() []byte`**
|
||||
Returns the next line (without trailing `\n`, or `\r\n`) as a byte slice.
|
||||
- Returns `nil` when the end of the rope is reached.
|
||||
- Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
|
||||
- The returned slice aliases rope data and is only valid until the next call.
|
||||
|
||||
- **`skipToNewline()`**
|
||||
Advances internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.
|
||||
|
||||
## Implementation Highlights
|
||||
|
||||
- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
|
||||
- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
|
||||
- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope’s underlying data.
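The carry-over mechanism can be sketched with a simplified linked buffer. The `piece` and `lineScanner` types below are hypothetical stand-ins for `PieceOfChunk` and `ropeScanner`; the sketch follows the behavior described above (zero-copy when a line sits in one node, `carry` otherwise) but is not the package's code.

```go
package main

import (
	"bytes"
	"fmt"
)

// piece is one buffer of the rope plus a link to the next one.
type piece struct {
	data []byte
	next *piece
}

type lineScanner struct {
	cur   *piece
	pos   int
	carry []byte // reused across calls for lines spanning several pieces
}

// readLine returns the next line without its line ending, or nil at the end.
// The result is only valid until the next call.
func (s *lineScanner) readLine() []byte {
	s.carry = s.carry[:0]
	for s.cur != nil {
		rest := s.cur.data[s.pos:]
		if i := bytes.IndexByte(rest, '\n'); i >= 0 {
			s.pos += i + 1
			line := rest[:i] // zero-copy slice into the piece
			if len(s.carry) > 0 {
				s.carry = append(s.carry, line...)
				line = s.carry
			}
			return bytes.TrimSuffix(line, []byte("\r"))
		}
		// No newline in this piece: stash the tail and move on.
		s.carry = append(s.carry, rest...)
		s.cur, s.pos = s.cur.next, 0
	}
	if len(s.carry) == 0 {
		return nil
	}
	return bytes.TrimSuffix(s.carry, []byte("\r"))
}

func main() {
	p3 := &piece{data: []byte("\nlast")}
	p2 := &piece{data: []byte("c\r\nde"), next: p3}
	p1 := &piece{data: []byte("ab"), next: p2}
	s := &lineScanner{cur: p1}
	for line := s.readLine(); line != nil; line = s.readLine() {
		fmt.Printf("%q\n", line)
	}
}
```

Because `carry` is truncated and reused on each call, callers that need to keep a line must copy it before reading the next one.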
|
||||
|
||||
## Use Case
|
||||
|
||||
Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
|
||||
@@ -0,0 +1,34 @@
|
||||
# Taxonomy Loading Module (`obiformats`)
|
||||
|
||||
This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.
|
||||
|
||||
## Core Features
|
||||
|
||||
1. **Format Detection**
|
||||
- `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME-type), filename patterns, or structure.
|
||||
- Supports:
|
||||
• NCBI Taxdump (both directory and `.tar` archive)
|
||||
• CSV files (`text/csv`)
|
||||
• FASTA/FASTQ sequences (via `mimetype` detection)
|
||||
|
||||
2. **Modular Loaders**
|
||||
- Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).
|
||||
- Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).
|
||||
|
||||
3. **Sequence-Based Taxonomy Extraction**
|
||||
- For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.
|
||||
|
||||
4. **Integration with OBITools Ecosystem**
|
||||
- Leverages `obitax.Taxonomy` as the canonical output structure.
|
||||
- Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.
|
||||
|
||||
5. **Error Handling & Logging**
|
||||
- Graceful failure with descriptive errors; informative logging via `logrus`.
|
||||
|
||||
## Usage Flow
|
||||
|
||||
```go
|
||||
tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn = true, seqAsTaxa = false
|
||||
```
|
||||
|
||||
The module enables interoperability across taxonomic data sources in metabarcoding workflows.
|
||||
@@ -0,0 +1,26 @@
|
||||
# OBIFORMATS Package: Semantic Description
|
||||
|
||||
The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.
|
||||
|
||||
It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:
|
||||
- **FASTA** (`text/fasta`): identified by lines starting with `>`.
|
||||
- **FASTQ** (`text/fastq`): detected via leading `@` characters.
|
||||
- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
|
||||
- **EMBL** (`text/embl`): detected by lines starting with `ID `.
|
||||
- **GenBank** (`text/genbank`): identified by either `LOCUS ` or legacy `"Genetic Sequence Data Bank"` headers.
|
||||
- **CSV** (`text/csv`): generic tabular support.
|
||||
|
||||
Core functionality is exposed through:
|
||||
- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
|
||||
- `ReadSequencesFromFile()`: reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
|
||||
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as filename and auto-closing the stream.
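The detection rules listed above boil down to looking at how the input begins. The sketch below applies those prefix rules directly to a byte slice; it deliberately does not use the `mimetype` library, and the `guessFormat` name and its return labels are assumptions made for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// guessFormat applies the prefix rules to the first bytes of an input.
// The returned labels are illustrative, not the registered MIME types.
func guessFormat(head []byte) string {
	switch {
	case bytes.HasPrefix(head, []byte("#@ecopcr-v2")):
		return "ecopcr2"
	case bytes.HasPrefix(head, []byte(">")):
		return "fasta"
	case bytes.HasPrefix(head, []byte("@")):
		return "fastq"
	case bytes.HasPrefix(head, []byte("ID ")):
		return "embl"
	case bytes.HasPrefix(head, []byte("LOCUS ")):
		return "genbank"
	}
	return "unknown"
}

func main() {
	for _, head := range []string{">seq1 desc\nacgt\n", "@read1\nacgt\n+\nIIII\n", "LOCUS       AB000001\n"} {
		fmt.Println(guessFormat([]byte(head)))
	}
}
```

A real detector also has to put the inspected bytes back in front of the stream so the chosen parser sees the input from its first byte.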
|
||||
|
||||
Internally leverages:
|
||||
- `obiutils.Ropen()` for unified file opening (including stdin handling).
|
||||
- Path extension stripping and source tagging via `OptionsSource()`.
|
||||
- Logging (`logrus`) for format diagnostics.
|
||||
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.
|
||||
|
||||
The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.
|
||||
|
||||
Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
|
||||
@@ -0,0 +1,29 @@
|
||||
# `obiformats` Package: Sequence Writing Utilities
|
||||
|
||||
This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`WriteSequence()`**:
|
||||
Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities present) or FASTA.
|
||||
- Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.
|
||||
- Preserves iterator state via `PushBack()` to allow chaining.
|
||||
|
||||
- **`WriteSequencesToStdout()`**:
|
||||
Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.
|
||||
|
||||
- **`WriteSequencesToFile()`**:
|
||||
Writes sequences to a specified file. Supports:
|
||||
- File creation/truncation or append mode (`OptionAppendFile()`).
|
||||
- Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on presence of quality scores (`HasQualities()`).
|
||||
- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
|
||||
- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
|
||||
- **Configurable Options**: Extensible via `WithOption` pattern (e.g., append mode, paired-end handling).
|
||||
|
||||
## Integration
|
||||
|
||||
Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
|
||||
@@ -0,0 +1,13 @@
|
||||
## Uint128 Type in `obifp`: Semantic Overview
|
||||
|
||||
This Go package defines a custom 128-bit unsigned integer type (`Uint128`) composed of two `uint64` limbs (high and low). It provides comprehensive arithmetic, comparison, bitwise operations, and type conversions.
|
||||
|
||||
- **Basic Constructors**: `Zero()`, `MaxValue()` initialize the smallest/largest possible values.
|
||||
- **State Checks**: `IsZero()`, and equality/comparison methods (`Equals`, `Cmp`, `LessThan`, `GreaterThan`, etc.) enable conditional logic.
|
||||
- **Type Casting**: Safe conversions to/from smaller (`Uint64`, `uint64`) and larger (`Uint256`) integer types, with overflow warnings where applicable.
|
||||
- **Arithmetic**: Full support for addition (`Add`, `Add64`), subtraction (`Sub`), multiplication (`Mul`, `Mul64`) — with panic on overflow.
|
||||
- **Division & Modulo**: Integer division (`Div`, `Div64`) and remainder (`Mod`, `Mod64`), implemented via optimized quotient-remainder pairs (`QuoRem`, `QuoRem64`) using hardware-assisted 64-bit operations.
|
||||
- **Bit Manipulation**: Left/right shifts (`LeftShift`, `RightShift`), and bitwise logic: AND, OR, XOR, NOT.
|
||||
- **Utility**: Direct access to low limb via `AsUint64()`.
|
||||
|
||||
All operations preserve 128-bit precision, with strict overflow checking for correctness in high-precision contexts (e.g., bioinformatics counting).
|
||||
@@ -0,0 +1,17 @@
|
||||
# `obifp.Uint128` Package — Semantic Feature Overview
|
||||
|
||||
This Go package provides a 128-bit unsigned integer type (`Uint128`) with comprehensive arithmetic, comparison, and bitwise operations. Internally represented as two `uint64` limbs (`w1`: high, `w0`: low), it supports:
|
||||
|
||||
- **Arithmetic Operations**
|
||||
- `Add`, `Sub`, `Mul` (128×128), and `Mul64` (scalar multiplication)
|
||||
- Division: `Div`, `Mod`, and combined quotient/remainder via `QuoRem` (and their 64-bit variants)
|
||||
- **Comparison & Equality**
|
||||
- `Cmp`, `Equals`, `LessThan`/`GreaterThan`, and their inclusive variants (`≤`, `≥`)
|
||||
- Support for comparing against both `Uint128` and native `uint64` values
|
||||
- **Bitwise Operations**
|
||||
- Logical AND (`And`), OR (`Or`), XOR (`Xor`) between two `Uint128`s
|
||||
- Bitwise NOT (`Not`) — inverts all bits of the value
|
||||
- **Conversion & Utility**
|
||||
- `AsUint64()` returns the lower 64 bits; the result is lossless only when the upper limb is zero
|
||||
|
||||
All operations handle overflow/underflow correctly, including carry propagation in addition and borrow handling in subtraction. Tests cover edge cases: zero values, max `uint64` boundaries (e.g., wrapping in addition/subtraction), and large multiplications. Designed for cryptographic or high-precision numeric use where native integer types are insufficient.
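Carry and borrow propagation across two limbs can be sketched with the standard `math/bits` helpers. The `u128` type and its methods are hypothetical and return the final carry or borrow instead of panicking, so this shows the mechanism rather than the package's `Uint128` API.

```go
package main

import (
	"fmt"
	"math/bits"
)

// u128 is a two-limb unsigned integer: hi holds the upper 64 bits.
type u128 struct{ hi, lo uint64 }

// add returns a+b and the carry out of the high limb (1 on overflow).
func (a u128) add(b u128) (u128, uint64) {
	lo, carry := bits.Add64(a.lo, b.lo, 0)
	hi, carry := bits.Add64(a.hi, b.hi, carry)
	return u128{hi, lo}, carry
}

// sub returns a-b and the borrow out of the high limb (1 on underflow).
func (a u128) sub(b u128) (u128, uint64) {
	lo, borrow := bits.Sub64(a.lo, b.lo, 0)
	hi, borrow := bits.Sub64(a.hi, b.hi, borrow)
	return u128{hi, lo}, borrow
}

func main() {
	maxLow := u128{0, ^uint64(0)}
	sum, carry := maxLow.add(u128{0, 1})
	fmt.Printf("sum=%+v carry=%d\n", sum, carry)
	diff, borrow := sum.sub(u128{0, 1})
	fmt.Printf("diff=%+v borrow=%d\n", diff, borrow)
}
```

A non-zero carry or borrow from the high limb is the condition an overflow-checking implementation would turn into a panic or warning.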
|
||||
@@ -0,0 +1,30 @@
|
||||
# Uint256 Type and Operations — Semantic Overview
|
||||
|
||||
The `obifp` package provides a custom 256-bit unsigned integer type (`Uint256`) implemented in Go, composed of four 64-bit limbs (`w0` to `w3`). It supports arithmetic, comparison, bitwise operations, and safe casting with overflow detection.
|
||||
|
||||
- **Core Representation**: `Uint256` stores values as four 64-bit words, enabling arbitrary-precision unsigned integers up to $2^{256} - 1$.
|
||||
|
||||
- **Utility Methods**:
|
||||
- `Zero()` / `MaxValue()`: Return the neutral and maximum values.
|
||||
- `IsZero()`, `Equals(v)`, comparison methods (`LessThan`, etc.): Enable logical and ordering checks.
|
||||
|
||||
- **Casting & Conversion**:
|
||||
- `Uint64()`, `Uint128()` downcast with warnings on overflow.
|
||||
- `Set64(v)`: Initializes from a standard `uint64`.
|
||||
- `AsUint64()`: Direct access to least-significant limb.
|
||||
|
||||
- **Bitwise Operations**:
|
||||
- `And`, `Or`, `Xor`, `Not`: Standard bitwise logic per limb.
|
||||
|
||||
- **Shifts**:
|
||||
- `LeftShift(n)` / `RightShift(n)`: Multi-limb shifts with carry propagation.
|
||||
|
||||
- **Arithmetic**:
|
||||
- `Add(v)`, `Sub(v)`, `Mul(v)`: Use Go’s `math/bits` for carry-aware operations; panic on overflow.
|
||||
- `Div(v)`: Implements long division via repeated subtraction of shifted multiples; panics on zero divisor.
|
||||
|
||||
- **Safety & Logging**:
|
||||
- Warnings via `obilog.Warnf` for silent overflows during narrowing casts.
|
||||
- Panics on arithmetic overflow or division-by-zero using `log.Panicf`.
|
||||
|
||||
This type is suitable for cryptographic, genomic (OBITools), or high-precision counting use cases requiring precise control over large unsigned integers.
|
||||
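
The carry-aware addition described above can be sketched with `bits.Add64`. This is a minimal illustration, not the package's code: the `u256` type and `addWithCarry` function are hypothetical names, and the sketch returns the final carry instead of panicking.

```go
package main

import (
	"fmt"
	"math/bits"
)

// u256 is a hypothetical stand-in for a four-limb 256-bit integer;
// w0 is the least-significant limb.
type u256 struct{ w0, w1, w2, w3 uint64 }

// addWithCarry adds limb by limb, feeding each carry into the next limb.
// The returned carry is 1 when the true sum does not fit in 256 bits.
func addWithCarry(a, b u256) (sum u256, carry uint64) {
	sum.w0, carry = bits.Add64(a.w0, b.w0, 0)
	sum.w1, carry = bits.Add64(a.w1, b.w1, carry)
	sum.w2, carry = bits.Add64(a.w2, b.w2, carry)
	sum.w3, carry = bits.Add64(a.w3, b.w3, carry)
	return sum, carry
}

func main() {
	max64 := ^uint64(0)
	// Adding 1 to a full low limb moves the carry into w1.
	sum, carry := addWithCarry(u256{w0: max64}, u256{w0: 1})
	fmt.Println(sum, carry)
}
```

A caller that wants the panic-on-overflow behaviour described above would check the returned carry and panic when it is non-zero.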
@@ -0,0 +1,34 @@
# Uint64 Type Functionalities Overview

The `obifp` package provides a custom `Uint64` type wrapping Go’s native 64-bit unsigned integer (`uint64`) to support arithmetic, bitwise operations, and type conversions in a structured way.

## Core Operations

- **`Zero()` / `MaxValue()`**: Return the zero and maximum representable values, respectively.
- **`IsZero()` / `Equals(v)`**: Check whether the value is zero or equal to another.
- **`Cmp(v)`, `LessThan(v)`**, etc.: Standard comparison operations returning `-1/0/+1` or boolean results.

## Arithmetic with Overflow Detection

- **Add/Sub/Mul**: Perform 64-bit addition, subtraction, and multiplication.
  - Uses `math/bits` for low-level operations (`bits.Add64`, etc.).
  - Panics on overflow (carry ≠ 0), enforcing strict safety.

## Bitwise Operations

- **`And`, `Or`, `Xor`, `Not()`**: Standard bitwise logic operations.
- **`LeftShift(n)` / `RightShift(n)`**:
  - Shifts bits left/right by *n* positions.
  - Uses internal `LeftShift64`/`RightShift64`, supporting *carry-in* for multi-word arithmetic.

## Extended Precision Conversions

- **`Uint128()` / `Uint256()`**: Casts the 64-bit value into larger unsigned integer types (zero-extended).
- **`Set64(v)`**: Reassigns the internal value from a raw `uint64`.

## Utility & Logging

- **`AsUint64()`**: Extracts the underlying `uint64`.
- **Warning on overflow in shift operations** (e.g., shifts ≥ 128 bits) via `obilog.Warnf`.

> Designed for use in high-precision or cryptographic contexts where explicit overflow handling and type safety are critical.
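
Overflow detection with `math/bits` can be sketched as follows. The function names are hypothetical, and the sketch reports overflow through a boolean rather than panicking as the text describes.

```go
package main

import (
	"fmt"
	"math/bits"
)

// checkedAdd reports ok=false when a+b does not fit in 64 bits.
func checkedAdd(a, b uint64) (uint64, bool) {
	sum, carry := bits.Add64(a, b, 0)
	return sum, carry == 0
}

// checkedSub reports ok=false when a-b would be negative (borrow set).
func checkedSub(a, b uint64) (uint64, bool) {
	diff, borrow := bits.Sub64(a, b, 0)
	return diff, borrow == 0
}

// checkedMul reports ok=false when the high word of the 128-bit product is non-zero.
func checkedMul(a, b uint64) (uint64, bool) {
	hi, lo := bits.Mul64(a, b)
	return lo, hi == 0
}

func main() {
	fmt.Println(checkedAdd(^uint64(0), 1))
	fmt.Println(checkedSub(3, 5))
	fmt.Println(checkedMul(1<<32, 1<<32))
}
```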
@@ -0,0 +1,32 @@
# Obifp Package: Generic Fixed-Point Unsigned Integer Operations

This Go package (`obifp`) provides a generic, type-safe interface for fixed-point unsigned integer arithmetic over three size variants: `Uint64`, `Uint128`, and `Uint256`.

## Core Interface: `FPUint[T]`

The interface defines a unified API for unsigned integer types, supporting:

- **Initialization & Conversion**:
  - `Zero()`, `Set64(v)`: Create zero or set from a `uint64`.
  - `AsUint64()`: Downcast to standard `uint64`.

- **Logical Operations**:
  - Bitwise: `And`, `Or`, `Xor`, `Not`.
  - Shifts: `LeftShift(n)`, `RightShift(n)`.

- **Arithmetic**:
  - Addition (`Add`), subtraction (`Sub`), multiplication (`Mul`). Division is commented out, likely reserved for future implementation.

- **Comparison**:
  - Full ordering: `<`, `<=`, `>`, `>=`.

- **Utility Predicates**:
  - `IsZero()` for zero-checking.

## Helper Functions

- `ZeroUint[T]`: Returns the neutral element (zero) for type `T`.
- `OneUint[T]`: Constructs value 1 via `Set64(1)`.
- `From64[T]`: Converts a standard Go `uint64` into the generic type.

All operations are **method-chaining friendly** (return `T`, not pointers), enabling fluent syntax. The design promotes correctness and performance in cryptographic or financial contexts where large, fixed-size integers are required.
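
A self-referential generic constraint of this kind can be sketched as below. This is a reduced, assumed version of the interface (only five methods), with a hypothetical `u64` implementation whose `Add` wraps silently; the real interface and its overflow behaviour may differ.

```go
package main

import "fmt"

// FPUint is a reduced, hypothetical version of the interface described above.
type FPUint[T any] interface {
	Zero() T
	Set64(v uint64) T
	Add(v T) T
	IsZero() bool
	AsUint64() uint64
}

// u64 is a toy implementation backed by a single machine word.
type u64 struct{ w uint64 }

func (u u64) Zero() u64          { return u64{} }
func (u u64) Set64(v uint64) u64 { return u64{w: v} }
func (u u64) Add(v u64) u64      { return u64{w: u.w + v.w} } // wraps on overflow in this sketch
func (u u64) IsZero() bool       { return u.w == 0 }
func (u u64) AsUint64() uint64   { return u.w }

// ZeroUint returns the zero value of any type satisfying FPUint.
func ZeroUint[T FPUint[T]]() T {
	var z T
	return z.Zero()
}

// OneUint builds the value 1 through Set64.
func OneUint[T FPUint[T]]() T { return ZeroUint[T]().Set64(1) }

// From64 converts a uint64 into the generic type.
func From64[T FPUint[T]](v uint64) T { return ZeroUint[T]().Set64(v) }

// sum is written once and works for every width that satisfies FPUint.
func sum[T FPUint[T]](values ...T) T {
	acc := ZeroUint[T]()
	for _, v := range values {
		acc = acc.Add(v)
	}
	return acc
}

func main() {
	// Methods return values, so calls chain without pointers.
	total := From64[u64](40).Add(OneUint[u64]()).Add(OneUint[u64]())
	fmt.Println(total.AsUint64(), sum(total, total).AsUint64())
}
```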
@@ -0,0 +1,30 @@
# `obigraph` Package: Semantic Overview

The `obigraph` package provides a generic, type-safe undirected/directed graph implementation in Go. Its core features include:

- **Generic Graph Structure**: Parametrized over vertex type `V` and edge data type `T`, enabling flexible use with arbitrary user-defined types.
- **Bidirectional Edge Tracking**: Maintains both forward (`Edges`) and reverse (`ReverseEdges`) adjacency maps for efficient neighbor/parent queries.
- **Edge Management**:
  - `AddEdge`: Adds an *undirected* edge (inserted in both directions).
  - `AddDirectedEdge`: Adds a *directed* edge (only one direction).
  - `SetAsDirectedEdge`: Converts an existing undirected edge into a directed one by removing the reverse link.
- **Graph Queries**:
  - `Neighbors(v)`: Returns all adjacent vertices (outgoing in the directed case).
  - `Parents(v)`: Returns incoming neighbors via reverse adjacency.
  - `Degree(v)` / `ParentDegree(v)`: Compute vertex degrees (total or incoming).
- **Customizable Vertex/Edge Properties**:
  - `VertexWeight`, `EdgeWeight`: Functions that assign weights (default: constant weight = 1.0).
  - `VertexId`: Custom vertex label generator (default: `"V%d"`).

- **GML Export**:
  - `Gml(...)` / `WriteGml(...)`: Generates or writes a Graph Modelling Language (GML) representation.
  - Supports directed/undirected modes, degree-based filtering (`min_degree`), and visual styling:
    - Vertex shape: `circle` if weight ≥ threshold, else `rectangle`.
    - Size scaled by the square root of the vertex weight.
  - Uses Go’s `text/template` for rendering.

- **File I/O**: Directly writes GML to file via `WriteGmlFile(...)`.

- **Logging & Safety**: Uses Logrus for bounds-checking errors; panics on template parsing/writing failures.

The package is designed for lightweight, high-performance graph modeling and visualization-ready export.
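
The forward/reverse adjacency bookkeeping can be illustrated with a small sketch. All names here are hypothetical, vertices are assumed to be addressed by integer index, and the real package's fields and signatures may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// graph keeps one forward and one reverse adjacency map per vertex index.
type graph[V, T any] struct {
	vertices     []V
	edges        []map[int]T
	reverseEdges []map[int]T
}

func newGraph[V, T any](vertices []V) *graph[V, T] {
	g := &graph[V, T]{
		vertices:     vertices,
		edges:        make([]map[int]T, len(vertices)),
		reverseEdges: make([]map[int]T, len(vertices)),
	}
	for i := range vertices {
		g.edges[i] = map[int]T{}
		g.reverseEdges[i] = map[int]T{}
	}
	return g
}

// addDirectedEdge records from→to and the matching reverse link.
func (g *graph[V, T]) addDirectedEdge(from, to int, data T) {
	g.edges[from][to] = data
	g.reverseEdges[to][from] = data
}

// addEdge inserts an undirected edge as two directed ones.
func (g *graph[V, T]) addEdge(a, b int, data T) {
	g.addDirectedEdge(a, b, data)
	g.addDirectedEdge(b, a, data)
}

func (g *graph[V, T]) neighbors(v int) []int { return sortedKeys(g.edges[v]) }
func (g *graph[V, T]) parents(v int) []int   { return sortedKeys(g.reverseEdges[v]) }

// sortedKeys returns map keys in ascending order for stable output.
func sortedKeys[T any](m map[int]T) []int {
	keys := make([]int, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	return keys
}

func main() {
	g := newGraph[string, float64]([]string{"a", "b", "c"})
	g.addEdge(0, 1, 1.0)         // a and b, both directions
	g.addDirectedEdge(1, 2, 2.0) // b to c only
	fmt.Println(g.neighbors(1), g.parents(2), g.neighbors(2))
}
```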
@@ -0,0 +1,14 @@
# `obigraph.GraphBuffer` Feature Overview

The `GraphBuffer[V, T]` type provides a **thread-safe graph construction interface** using buffered edge insertion via Go channels.

- **Asynchronous Edge Addition**: Edges are enqueued through a `chan Edge[T]`, processed in the background by a goroutine that updates an underlying static graph (`Graph[V, T]`).
- **Asynchronous API**: `AddEdge` and `AddDirectedEdge` send to the channel without waiting for the graph mutation, enabling high-throughput edge ingestion.
- **Graph Initialization**: `NewGraphBuffer` initializes both the graph and a dedicated worker goroutine to consume edges.
- **GML Export Support**: Full support for exporting the final graph in [Graph Modelling Language (GML)](https://en.wikipedia.org/wiki/Graph_Modelling_Language), with optional filtering (`min_degree`) and layout parameters (`threshold`, `scale`).
- **File & Stream Output**: Methods `WriteGml` and `WriteGmlFile` allow writing GML to any `io.Writer`, including files.
- **Resource Cleanup**: The explicit `Close()` method terminates the worker goroutine by closing the channel, ensuring clean shutdown.
- **Generic Design**: Fully generic over vertex (`V`) and edge data types (`T`), supporting arbitrary value semantics.

> ⚠️ **Note**: Safety for concurrent `AddEdge` calls comes from channel semantics alone; any other concurrent access to the buffer needs external synchronization.
> ✅ Ideal for producer-consumer patterns where edges are streamed from multiple goroutines into a single graph.
@@ -0,0 +1,29 @@
# BioSequenceBatch: A Container for Ordered Biological Sequences

`BioSequenceBatch` is a structured data type encapsulating an ordered collection of biological sequences (`obiseq.BioSequenceSlice`) along with metadata: a `source` identifier and an integer `order`. It serves as a lightweight, immutable-friendly container for batch processing in bioinformatics pipelines.

## Core Properties
- **`source`**: String identifying the origin (e.g., file, pipeline stage).
- **`order`**: Integer defining processing sequence or priority.
- **`slice`**: Holds the actual sequences via `obiseq.BioSequenceSlice`.

## Key Functionalities
- **Construction**:
  `MakeBioSequenceBatch(source, order, sequences)` creates a new batch.
- **Accessors**:
  `Source()`, `Order()` return metadata; `Slice()` exposes the sequence slice.
- **Mutation (via copy)**:
  `Reorder(newOrder)` returns a new batch with updated order.
- **Size & emptiness**:
  `Len()` gives sequence count; `NotEmpty()` checks non-emptiness.
- **Consumption**:
  `Pop0()` removes and returns the first sequence (FIFO behavior).
- **Safety**:
  `IsNil()` detects uninitialized batches; a global `NilBioSequenceBatch` sentinel exists.

## Design Notes
- Instances are value types (struct), enabling safe copying.
- Operations follow Go idioms: methods return updated values rather than mutating in place (except internal slice mutation via `Pop0`).
- Designed for interoperability with the OBITools4 ecosystem (`obiseq` package).

This abstraction supports modular, traceable sequence processing workflows, ideal for pipeline stages where ordering and provenance matter.
@@ -0,0 +1,47 @@
# `obiiter`: Stream-Based Biosequence Iterator Library

This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.

## Core Functionality

- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, RWMutex, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.

## Iterator Management

- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
  - `Add(n)`, `Done()`: Track active workers (like goroutines).
  - `Lock/RLock` and `Unlock/RUnlock`: Explicit synchronization.
  - `Wait()` / `Close()`, `WaitAndClose()`: Graceful shutdown.

## Batch Transformation & Reorganization

- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
  - `Concat(...)`: Sequentially merges multiple iterators.
  - `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).

## Filtering & Predicate-Based Processing

- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.

## Utility & Analysis

- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.

## Additional Features

- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering preserved for downstream reproducibility.
- Integrates with OBITools4’s `obidefault`, `obiutils` for config and resource management.

> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.
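
Re-batching a stream into fixed-size batches, as `Rebatch(size)` is described above, can be sketched over plain channels. The `rebatch` function and the use of `[]string` in place of sequence batches are assumptions made for illustration; the sketch assumes its input already arrives in order and that `size` is positive.

```go
package main

import "fmt"

// rebatch regroups incoming batches of arbitrary length into batches
// of exactly `size` items; only the last batch may be shorter.
func rebatch(in <-chan []string, size int) <-chan []string {
	out := make(chan []string)
	go func() {
		defer close(out)
		buffer := make([]string, 0, size)
		for batch := range in {
			for _, item := range batch {
				buffer = append(buffer, item)
				if len(buffer) == size {
					out <- buffer
					buffer = make([]string, 0, size)
				}
			}
		}
		if len(buffer) > 0 {
			out <- buffer // flush the remainder
		}
	}()
	return out
}

func main() {
	in := make(chan []string, 3)
	in <- []string{"s1", "s2", "s3"}
	in <- []string{"s4"}
	in <- []string{"s5", "s6", "s7"}
	close(in)
	for batch := range rebatch(in, 2) {
		fmt.Println(batch)
	}
}
```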
@@ -0,0 +1,32 @@
# `IDistribute`: Semantic Description of Biosequence Distribution Functionality

The `IDistribute` type implements a thread-safe mechanism for distributing biosequences into classified, batched outputs.

- **Core Purpose**: Enables concurrent processing of sequences by routing them to dedicated output channels based on classification keys.

- **Key Fields**:
  - `outputs`: A map from integer class codes to output streams (`IBioSequence`).
  - `news`: An unbuffered channel emitting class codes when new output streams are created.
  - `classifier`: A pointer to a sequence classifier used to assign sequences to keys during distribution.

- **Thread Safety**: All access to shared state (`outputs`, `slices`) is synchronized via a mutex.

- **Batching Strategy**:
  - Sequences are accumulated per class key until either `BatchSizeMax()` sequences or `BatchMem()` bytes (per key) are reached.
  - Batches are flushed automatically and on finalization.

- **Asynchronous Processing**:
  - The `Distribute()` method launches a goroutine that consumes the input iterator, classifies each sequence, and feeds batches to per-key outputs.
  - Outputs are closed only after all sequences have been processed.

- **Notifications**:
  - The `News()` channel allows consumers to be notified of newly created output streams (i.e., when a new class key appears).

- **Error Handling**:
  - `Outputs(key)` returns an error if the requested key has no associated output.

- **Integration**:
  - Leverages `obidefault.BatchSizeMax()` and `BatchMem()` for configurable batch limits.
  - Uses `SortBatches()` on the input iterator to ensure ordered processing.

In summary, `IDistribute` provides a scalable, concurrent pipeline for classifying and batching biosequences based on user-defined classification logic.
@@ -0,0 +1,24 @@
# `ExtractTaxonomy` Function: Semantic Description

The `ExtractTaxonomy` method is a core utility in the `obiiter` package, designed to aggregate taxonomic information across biological sequences processed by an iterator.

- **Input**:
  - A pointer to `IBioSequence`, representing a sequence iterator over biological data.
  - A boolean flag `seqAsTaxa`: if true, each full sequence is treated as a single taxonomic unit; otherwise, individual elements within slices are processed separately.

- **Process**:
  - Iterates through all sequences via `iterator.Next()` and retrieves each current slice using `Get().Slice()`.
  - For every slice, it calls the underlying `.ExtractTaxonomy()` method (from `obitax`), progressively building or updating a shared `*obitax.Taxonomy` object.
  - Stops and returns immediately upon encountering the first error during taxonomy extraction.

- **Output**:
  - Returns a fully populated `*obitax.Taxonomy` object (or a partial result if an early failure occurs).
  - Returns a `nil` error on success; otherwise, returns the first encountered error.

- **Semantic Role**:
  Enables scalable taxonomic profiling of high-throughput sequencing data by delegating per-slice extraction logic to the `obitax` module, while ensuring robust iteration and error handling.

- **Dependencies**:
  Relies on `obitax.Taxonomy` for structured taxonomic representation and assumes slices implement the `.ExtractTaxonomy()` interface.

This function exemplifies a *map-reduce*-style pattern: mapping taxonomy extraction over slices, and reducing results into a unified taxonomic summary.
@@ -0,0 +1,28 @@
# `IFragments` Functionality Overview

The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.

## Core Parameters
- `minsize`: Length threshold; only sequences longer than `minsize` are fragmented.
- `length`: Desired fragment size (in bases/amino acids).
- `overlap`: Number of overlapping residues between consecutive fragments.
- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).

## Workflow
1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
2. **Parallel Fragmentation**:
   - Each worker processes a subset of batches independently using goroutines.
   - Each sequence longer than `minsize` is split into overlapping fragments of length `length` with step size = `length - overlap`.
   - The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces.
3. **Resource Management**:
   - Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
   - Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.

## Key Features
- **Overlap handling**: Ensures contiguous coverage without gaps.
- **Memory efficiency**: Uses recycling and batched output.
- **Scalability**: Leverages Go concurrency via `nworkers`.
- **Error safety**: Panics on subsequence errors (e.g., invalid indices).

## Use Case
Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).
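
The fragment boundaries implied by the workflow above can be sketched as a pure function. The names `span` and `fragmentSpans` are hypothetical, and the exact fusion rule is an assumption of this sketch: the current fragment is extended to the end of the sequence as soon as one more full-length fragment would no longer fit.

```go
package main

import "fmt"

// span is a half-open interval [start, end) on the sequence.
type span struct{ start, end int }

// fragmentSpans returns overlapping windows of `length` residues taken every
// `length - overlap` positions. Sequences no longer than minsize (or than one
// fragment), and invalid overlaps, yield a single span covering everything.
func fragmentSpans(seqLen, minsize, length, overlap int) []span {
	step := length - overlap
	if seqLen <= minsize || seqLen <= length || step <= 0 {
		return []span{{0, seqLen}}
	}
	var spans []span
	for start := 0; ; start += step {
		// Assumed fusion rule: if the next full window would overrun the
		// sequence, extend this fragment to the end and stop.
		if start+step+length > seqLen {
			spans = append(spans, span{start, seqLen})
			return spans
		}
		spans = append(spans, span{start, start + length})
	}
}

func main() {
	// 25 residues, fragments of 10 overlapping by 2 (step 8).
	fmt.Println(fragmentSpans(25, 5, 10, 2))
	fmt.Println(fragmentSpans(26, 5, 10, 2))
}
```

Under this rule the last fragment is always between `length` and `length + step - 1` residues long, so no trailing piece shorter than a full fragment is produced.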
@@ -0,0 +1,29 @@
# Memory-Limited Biosequence Iterator

This Go function extends an `IBioSequence` iterator with memory-aware throttling to prevent excessive heap allocation during data processing.

## Core Functionality

- **`LimitMemory(fraction float64)`**
  Returns a new iterator that respects an upper bound on heap usage relative to total system memory.

- **Memory Monitoring**
  Uses `runtime.ReadMemStats()` and `github.com/pbnjay/memory.TotalMemory()` to compute the current heap fraction (`Alloc / TotalMemory`) dynamically.

- **Backpressure Mechanism**
  While the memory fraction exceeds `fraction`, the producer goroutine yields control (`runtime.Gosched()`) until sufficient memory becomes available.

- **Logging**
  Warns via `obilog.Warnf` when:
  - memory pressure persists (every ~1000 yields),
  - or the wait becomes unusually long (>10,000 yielding cycles).

- **Concurrency Model**
  - A producer goroutine consumes from the original iterator and pushes items to `newIter`, pausing as needed.
  - A second goroutine calls `WaitAndClose()` to ensure graceful termination and resource cleanup.

## Semantic Behavior

- **Non-blocking consumer**: Downstream consumers are not throttled directly; they read from the buffered channel of `newIter`, and only the producer pauses.
- **Adaptive rate control**: The iterator automatically slows down when memory pressure rises, reducing the risk of OOM conditions.
- **Predictable resource use**: Aims to keep heap usage below the specified `fraction` (e.g., 0.5 → ≤ 50% of total RAM).
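
The backpressure loop can be sketched with the standard library alone. Here `throttle` and `heapFraction` are hypothetical names; total system memory is passed in as a parameter (the text obtains it from a third-party package), and the usage probe is injected so the loop can be exercised without real memory pressure.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapFraction returns live heap bytes divided by totalMemory.
func heapFraction(totalMemory uint64) float64 {
	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)
	return float64(stats.Alloc) / float64(totalMemory)
}

// throttle forwards items from in, yielding the processor while the
// reported usage stays above fraction.
func throttle(in <-chan int, fraction float64, usage func() float64) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for item := range in {
			for usage() > fraction {
				runtime.Gosched() // let consumers run and free memory
			}
			out <- item
		}
	}()
	return out
}

func main() {
	in := make(chan int, 3)
	in <- 1
	in <- 2
	in <- 3
	close(in)
	// Assumed total of 8 GiB; a real caller would query the system.
	usage := func() float64 { return heapFraction(8 << 30) }
	for v := range throttle(in, 0.5, usage) {
		fmt.Println(v)
	}
}
```

Yielding with `runtime.Gosched()` keeps the sketch close to the described mechanism; it throttles the producer but cannot by itself guarantee a hard memory cap.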
@@ -0,0 +1,19 @@
# Semantic Description of `IMergeSequenceBatch` and `MergePipe`

This code defines two related functions in the `obiiter` package for batch-wise merging of biological sequences during iteration.

- **`IMergeSequenceBatch(na, statsOn, sizes...) IBioSequence → IBioSequence`**
  - Consumes an input sequence iterator (`IBioSequence`) and returns a new one.
  - Groups incoming sequences into batches (default size: `100`, configurable via variadic argument).
  - For each batch:
    - Collects up to `batchsize` sequences via the input iterator.
    - Applies `.Merge(na, statsOn)` on each sequence group (presumably merging reads based on `na`, e.g., nucleotide alignment or overlap).
    - Wraps merged results into a `BioSequenceBatch` with ordering metadata.
  - Emits batches asynchronously via goroutines; the output iterator is closed when input finishes.

- **`MergePipe(na, statsOn, sizes...) Pipeable → func(IBioSequence) IBioSequence`**
  - A *pipeline combinator* (higher-order function), enabling functional composition.
  - Returns a `Pipeable`, i.e., a transformation function compatible with iterator pipelines.

**Semantic Purpose**:
Enables efficient, batch-wise merging of biological sequence reads (e.g., paired-end merges) in streaming fashion, with optional statistics tracking (`statsOn`) and configurable batching.
@@ -0,0 +1,35 @@
# `NumberSequences` Function: Semantic Description

The `NumberSequences` method assigns a unique sequential identifier (`seq_number`) to each biological sequence in an `IBioSequence` iterator, preserving consistency for paired-end reads.

## Core Functionality

- **Sequential numbering**: Assigns consecutive integers, beginning at the `start` value, across sequences.
- **Thread-safe**: Uses `sync.Mutex` and `atomic.Int64` to safely manage the global counter during concurrent processing.
- **Paired-read support**: When input is paired (`IsPaired()`), both reads in a pair receive the *same* `seq_number`, ensuring alignment between mates.

## Parallelization Strategy

- **Default mode**: Uses multiple workers (`ParallelWorkers()`) for performance; batches are processed concurrently.
- **Reordering mode**: If `forceReordering` is true:
  - The input iterator is batch-sorted (`SortBatches()`).
  - Parallelism is disabled (1 worker) to ensure a deterministic numbering order.

## Implementation Details

- Each goroutine processes its own split of the input iterator.
- A shared `next_first` counter tracks the next available sequence number globally.
- Locking ensures atomic increment and assignment, preventing race conditions.

## Output

Returns a new `IBioSequence` iterator:
- Contains the same sequence batches (possibly reordered if sorted).
- Each `BioSequence` object now carries a `"seq_number"` attribute.
- Paired sequences are co-numbered and marked accordingly.

## Use Cases

- Preparing data for downstream tools requiring unique sequence IDs.
- Maintaining cross-read identity in paired-end workflows (e.g., assembly, mapping).
- Reproducible numbering across pipeline stages or restarts.
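
The shared-counter scheme can be sketched as follows: each worker reserves a contiguous block of numbers for its batch under a lock, then labels the batch. The names `counter`, `reserve`, and `numberBatches`, and the use of string identifiers in place of sequences, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// counter hands out contiguous blocks of sequence numbers.
type counter struct {
	mu   sync.Mutex
	next int
}

// reserve returns the first number of a block of n and advances the counter.
func (c *counter) reserve(n int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	first := c.next
	c.next += n
	return first
}

// numberBatches assigns a unique number to every identifier. With one worker
// the numbering follows batch order; with several it is unique but unordered.
func numberBatches(batches [][]string, start, workers int) map[string]int {
	c := &counter{next: start}
	jobs := make(chan []string)
	numbers := map[string]int{}
	var resultLock sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range jobs {
				first := c.reserve(len(batch))
				resultLock.Lock()
				for i, id := range batch {
					numbers[id] = first + i
				}
				resultLock.Unlock()
			}
		}()
	}
	for _, batch := range batches {
		jobs <- batch
	}
	close(jobs)
	wg.Wait()
	return numbers
}

func main() {
	batches := [][]string{{"a", "b"}, {"c"}}
	fmt.Println(numberBatches(batches, 10, 1))
}
```

This mirrors the trade-off described above: a single worker over sorted batches gives deterministic numbers, while several workers only guarantee uniqueness.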
@@ -0,0 +1,17 @@
# Paired-End Sequence Handling in `obiiter`

This Go package provides functionality for managing **paired-end biological sequences** within batched iterators.

- `BioSequenceBatch` methods:
  - **`IsPaired()`**: Checks whether the batch contains paired reads.
  - **`PairedWith()`**: Returns a new batch containing only the mate (partner) of each read in the current batch.
  - **`PairTo(*BioSequenceBatch)`**: Synchronizes and pairs reads between two batches *of identical order*; fails if orders differ.
  - **`UnPair()`**: Removes pairing metadata, treating reads as unpaired.

- `IBioSequence` (iterator) methods:
  - **`MarkAsPaired()`**: Marks the iterator as producing paired-end data.
  - **`PairTo(IBioSequence)`**: Combines two iterators into a new paired-end iterator by aligning corresponding batches and calling `PairTo` on each pair.
  - **`PairedWith()`**: Generates a new iterator yielding only the mate reads (i.e., second ends) from an existing paired-end stream.
  - **`IsPaired()`**: Returns whether the iterator was explicitly marked as paired.

All operations preserve batched processing and concurrency via goroutines, ensuring efficient handling of large NGS datasets while maintaining semantic correctness for paired-end workflows.
@@ -0,0 +1,17 @@
# Semantic Description of `obiiter` Package Features

This Go package provides functional-style utilities for processing biological sequence data (e.g., FASTQ/FASTA), modeled via the `IBioSequence` interface.

- **`Pipeable`**: A function type representing a unary transformation on an `IBioSequence`.
- **`Pipeline(start, parts...)`**: Composes a sequence of `Pipeable` operations into a single executable pipeline. It applies transformations sequentially: input → start → part₁ → … → output.

- **`(IBioSequence).Pipe(start, parts...)`**: A convenience method enabling fluent chaining of transformations directly on a sequence object.

- **`Teeable`**: A function type for operations that split input into two independent output streams (e.g., filtering + logging).

- **`(IBioSequence).CopyTee()`**: A high-level tee operation that duplicates the input stream into two identical, concurrently readable `IBioSequence` instances.
  - Uses goroutines to ensure non-blocking parallel consumption.
  - Ensures proper lifecycle management: closing the second stream when the first is closed.
  - Preserves paired-end status (`MarkAsPaired`) if applicable.

Together, these features support modular, composable, and concurrent biosequence processing pipelines, ideal for scalable NGS data workflows.
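
The composition performed by `Pipeline(start, parts...)` can be sketched over plain channels. In this illustration `<-chan int` stands in for `IBioSequence`, and `pipeable`, `pipeline`, `mapStage`, `filterStage`, `fromSlice`, and `collect` are hypothetical names.

```go
package main

import "fmt"

// pipeable is a unary transformation on a stream.
type pipeable func(<-chan int) <-chan int

// pipeline chains start and parts into a single pipeable:
// input → start → parts[0] → … → output.
func pipeline(start pipeable, parts ...pipeable) pipeable {
	return func(in <-chan int) <-chan int {
		out := start(in)
		for _, part := range parts {
			out = part(out)
		}
		return out
	}
}

// mapStage applies f to every item in its own goroutine.
func mapStage(f func(int) int) pipeable {
	return func(in <-chan int) <-chan int {
		out := make(chan int)
		go func() {
			defer close(out)
			for v := range in {
				out <- f(v)
			}
		}()
		return out
	}
}

// filterStage forwards only the items accepted by keep.
func filterStage(keep func(int) bool) pipeable {
	return func(in <-chan int) <-chan int {
		out := make(chan int)
		go func() {
			defer close(out)
			for v := range in {
				if keep(v) {
					out <- v
				}
			}
		}()
		return out
	}
}

func fromSlice(values []int) <-chan int {
	out := make(chan int, len(values))
	for _, v := range values {
		out <- v
	}
	close(out)
	return out
}

func collect(in <-chan int) []int {
	var result []int
	for v := range in {
		result = append(result, v)
	}
	return result
}

func main() {
	double := mapStage(func(v int) int { return v * 2 })
	large := filterStage(func(v int) bool { return v > 4 })
	p := pipeline(double, large)
	fmt.Println(collect(p(fromSlice([]int{1, 2, 3, 4}))))
}
```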
@@ -0,0 +1,28 @@
# `MakeSetAttributeWorker` Functionality Overview

The function `MakeSetAttributeWorker(rank string) obiiter.SeqWorker` constructs a reusable sequence-processing worker for taxonomic annotation.

- **Input validation**: It first verifies that the provided `rank` is part of a predefined taxonomic hierarchy (`taxonomy.RankList()`). If invalid, it terminates execution with an informative error.

- **Worker construction**: It returns a closure (`obiiter.SeqWorker`), essentially a function that transforms biological sequences.

- **Core behavior**: For each input `*obiseq.BioSequence`, it calls `taxonomy.SetTaxonAtRank(sequence, rank)`. This likely assigns or updates the taxonomic label (e.g., species, genus) at the specified rank in the sequence’s metadata.

- **Purpose**: Enables modular, pipeline-friendly taxonomic annotation, for instance in bioinformatics workflows where sequences must be annotated hierarchically (e.g., from phylum down to species).

- **Design pattern**: Follows the *functional factory* and *worker interface* patterns, promoting composability in sequence processing pipelines.

- **Side effects**: Modifies the input `BioSequence` *in-place* (via mutation of its taxonomic metadata), then returns it.

- **Use case example**:
  ```go
  worker := MakeSetAttributeWorker("species")
  seq = worker(seq) // annotates `seq` with species-level taxon
  ```

- **Assumptions**:
  - `taxonomy.SetTaxonAtRank` exists and handles rank-specific taxon assignment.
  - Taxonomic ranks are ordered, finite, and validated (e.g., `["domain", "phylum", ..., "species"]`).
  - Sequences carry mutable taxonomic metadata.

- **Error handling**: Fails fast on invalid rank input, preventing silent misannotation.
@@ -0,0 +1,31 @@
# `Speed` Functionality Description

The provided Go code defines a method and helper function to add **real-time progress tracking** to biosequence iterators in the OBITools4 framework.

## Core Features

- **Non-intrusive progress bar**:
  The `Speed()` method wraps an existing iterator and displays a visual progress indicator on stderr, using the [`progressbar`](https://github.com/schollz/progressbar) library.

- **Conditional rendering**:
  The progress bar is only shown when:
  - the `--no-progressbar` flag is *not* set (via `obidefault.ProgressBar()`),
  - stderr is connected to a terminal (`os.ModeCharDevice`),
  - stdout is *not* piped (to avoid interfering with file output).

- **Batch-aware counting**:
  Progress is updated per batch (`batch.Len()`), not item-by-item, for efficiency and smoother UI updates (throttled to ≥100ms).

- **Paired-end support**:
  If the input iterator is paired (`IsPaired()`), this property is preserved in the returned iterator.

- **Pipeable wrapper**:
  `SpeedPipe()` enables integration into functional pipelines (e.g., `.Map(...).Filter(...)`) by returning a `Pipeable` function.

## Implementation Highlights

- Uses goroutines to decouple iteration and progress updates.
- Automatically closes the output iterator when input ends (`WaitAndClose()`).
- Prints a final newline to stderr upon completion.

This utility enhances user experience during long-running sequence processing (e.g., FASTQ parsing, alignment), without affecting correctness or performance in non-interactive contexts.
@@ -0,0 +1,20 @@
# Semantic Description of `obiiter` Package Functionalities

This Go package (`obiiter`) provides utilities for applying functional transformations to biological sequence iterators, supporting parallel execution and modular piping.

- **`MakeIWorker(worker, breakOnError bool, sizes ...int)`**:
  Applies a `SeqWorker` (sequence-to-sequence transformation) to each sequence in the iterator. Supports configurable parallelism (`nworkers`) and optional channel buffering via `sizes`. Uses internal conversion to slice-based workers.

- **`MakeIConditionalWorker(predicate, worker, breakOnError bool, sizes ...int)`**:
  Applies a `SeqWorker` only to sequences satisfying a given boolean `predicate`. Enables conditional, parallelized processing while preserving iterator semantics.

- **`MakeISliceWorker(worker, breakOnError bool, sizes ...int)`**:
  Core method applying a `SeqSliceWorker` (batch-level transformation) across slices of sequences. Implements multi-goroutine parallelism using `nworkers`. Handles errors optionally via fatal logging (`breakOnError`). Preserves paired-end metadata.

- **`WorkerPipe(worker, breakOnError bool, sizes ...int)`**:
  Returns a `Pipeable` closure wrapping `MakeIWorker`, enabling composition in pipeline chains (e.g., for CLI or DSL-style workflows).

- **`SliceWorkerPipe(worker, breakOnError bool, sizes ...int)`**:
  Similar to `WorkerPipe`, but for slice-level workers (`SeqSliceWorker`). Facilitates modular, reusable pipeline stages.

All methods support optional size arguments to override default parallelism (from `obidefault`). Internally, they rely on Go concurrency primitives (`go`, channels) and structured batch processing via the `IBioSequence` interface.
@@ -0,0 +1,33 @@
|
||||
# `obiitercsv`: CSV Record Iterator for Streaming and Batch Processing
|
||||
|
||||
This Go package provides a thread-safe, channel-based iterator (`ICSVRecord`) for streaming and processing CSV records in batches. It supports ordered batch handling, concurrent access via mutexes, and dynamic header management.
|
||||
|
||||
## Core Types
|
||||
|
||||
- **`CSVHeader`**: A slice of strings representing column names.
|
||||
- **`CSVRecord`**: A map from field name to value (`map[string]interface{}`).
|
||||
- **`CSVRecordBatch`**: A batch of records with metadata: `source`, `order`, and the actual data slice.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Streaming via Channels**: Records are consumed as `CSVRecordBatch` items through a channel, enabling asynchronous producers/consumers.
|
||||
- **Ordered Processing**: Batches include an `order` field, used by `SortBatches()` to reconstruct sequential order even when received out-of-order.
|
||||
- **Thread Safety**: Uses `sync.RWMutex`, atomic operations (`batch_size`), and `abool.AtomicBool` for flags like `finished`.
|
||||
- **Iterator Protocol**: Implements standard methods:
|
||||
- `Next()` to advance,
|
||||
- `Get()` to retrieve current batch,
|
||||
- `PushBack()` for re-queuing the last batch.
|
||||
- **Batch Management**:
|
||||
- `SetHeader()` / `AppendField()`: dynamic header updates.
|
||||
- `Split()`: creates a new iterator sharing the same channel but with independent locking.
|
||||
- **Lifecycle Control**:
|
||||
- `Add()` / `Done()`: track active goroutines (via `sync.WaitGroup`).
|
||||
- `WaitAndClose()` ensures all data is flushed before closing the channel.
|
||||
|
||||
## Utility Methods
|
||||
|
||||
- **`NotEmpty()`, `IsNil()`**: Check batch validity.
|
||||
- **`Consume()`**: Drains the iterator (e.g., for side-effect processing).
|
||||
- **`SortBatches()`**: Reorders batches by `order`, buffering out-of-sequence ones.
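The reordering performed by `SortBatches()` can be illustrated with a minimal sketch. The `batch` type and the `sortBatches` function below are hypothetical stand-ins written for this example, not the package's own types; the sketch only assumes that `order` values start at 0 and are contiguous.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for CSVRecordBatch; only order matters here.
type batch struct {
	order int
	rows  []string
}

// sortBatches re-emits batches in increasing order, holding back any batch
// that arrives before its predecessors.
func sortBatches(in <-chan batch) <-chan batch {
	out := make(chan batch)
	go func() {
		defer close(out)
		pending := map[int]batch{} // batches received ahead of their turn
		next := 0                  // order value expected next
		for b := range in {
			pending[b.order] = b
			for {
				ready, ok := pending[next]
				if !ok {
					break
				}
				delete(pending, next)
				out <- ready
				next++
			}
		}
	}()
	return out
}

func main() {
	in := make(chan batch, 4)
	for _, o := range []int{2, 0, 3, 1} {
		in <- batch{order: o}
	}
	close(in)
	for b := range sortBatches(in) {
		fmt.Println(b.order)
	}
}
```

Buffering in a map keeps memory proportional to the number of batches that are actually out of sequence, rather than to the whole stream.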
|
||||
|
||||
Designed for bioinformatics pipelines (e.g., OBITools4), it enables scalable, memory-efficient CSV processing with strict ordering guarantees.
|
||||
@@ -0,0 +1,36 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
1. **`Count4Mer(seq, buffer, counts)`**
|
||||
Counts occurrences of all 256 possible 4-mers (4-nucleotide subsequences) in a `BioSequence`.
|
||||
- Encodes each 4-mer into an integer (0–255) using `Encode4mer`.
|
||||
- Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.
|
||||
- Reuses or allocates the `counts` buffer as needed.
|
||||
|
||||
2. **`Common4Mer(count1, count2)`**
|
||||
Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.
|
||||
Used to measure shared content between sequences.
|
||||
|
||||
3. **`Sum4Mer(count)`**
|
||||
Returns the total number of 4-mers in a profile (i.e., sum over all entries).
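As a rough illustration of the three functions above, the following sketch counts 4-mers in a plain string and intersects two profiles. The lowercase helper names are hypothetical, the input is a `string` rather than a `BioSequence`, and mapping unknown characters to `A` is an assumption of this sketch.

```go
package main

import "fmt"

// code maps a base to its 2-bit value; anything else is treated as A
// (an assumption made for this sketch).
func code(b byte) int {
	switch b {
	case 'c', 'C':
		return 1
	case 'g', 'G':
		return 2
	case 't', 'T', 'u', 'U':
		return 3
	}
	return 0
}

// count4mer fills a 256-entry table with the occurrences of each 4-mer.
func count4mer(seq string) [256]uint16 {
	var counts [256]uint16
	word := 0
	for i := 0; i < len(seq); i++ {
		word = (word<<2 | code(seq[i])) & 0xFF // keep the last four bases
		if i >= 3 {
			counts[word]++
		}
	}
	return counts
}

// common4mer sums min(a[k], b[k]) over all 4-mer codes k.
func common4mer(a, b [256]uint16) int {
	sum := 0
	for i := range a {
		if a[i] < b[i] {
			sum += int(a[i])
		} else {
			sum += int(b[i])
		}
	}
	return sum
}

// sum4mer returns the total number of 4-mers recorded in a profile.
func sum4mer(c [256]uint16) int {
	sum := 0
	for _, n := range c {
		sum += int(n)
	}
	return sum
}

func main() {
	a := count4mer("ACGTACGT")
	b := count4mer("ACGTA")
	fmt.Println(sum4mer(a), sum4mer(b), common4mer(a, b))
}
```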
|
||||
|
||||
## Distance & Similarity Bounds
|
||||
|
||||
4. **`LCS4MerBounds(count1, count2)`**
|
||||
Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:
|
||||
- **Lower bound**: `common_kmers + (3 if common_kmers > 0 else 0)`
- **Upper bound**: `min_total + 3 − ceil((min_total − common_kmers)/4)`, where `min_total = min(total1, total2)`
|
||||
Leverages the fact that overlapping k-mers constrain possible alignments.
|
||||
|
||||
5. **`Error4MerBounds(count1, count2)`**
|
||||
Estimates bounds for *alignment errors* (e.g., mismatches + indels):
|
||||
- **Upper bound**: `max_total − common_kmers + 2 * floor((common_kmers + 5)/8)`
|
||||
- **Lower bound**: `ceil(upper_bound / 4)`
|
||||
Provides fast, approximate error estimates without full alignment.
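The bound formulas listed above can be checked numerically with a small sketch. The functions below take the profile totals and the shared 4-mer count as plain integers; their names and signatures are assumptions for illustration and do not mirror the package's `Table4mer`-based API.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for non-negative a and positive b.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// lcsBounds applies the LCS bound formulas to two 4-mer totals and
// their shared 4-mer count.
func lcsBounds(total1, total2, common int) (lower, upper int) {
	minTotal := total1
	if total2 < minTotal {
		minTotal = total2
	}
	lower = common
	if common > 0 {
		lower += 3
	}
	upper = minTotal + 3 - ceilDiv(minTotal-common, 4)
	return lower, upper
}

// errorBounds applies the alignment-error bound formulas.
func errorBounds(total1, total2, common int) (lower, upper int) {
	maxTotal := total1
	if total2 > maxTotal {
		maxTotal = total2
	}
	upper = maxTotal - common + 2*((common+5)/8)
	lower = ceilDiv(upper, 4)
	return lower, upper
}

func main() {
	// Two 100-nt sequences (97 four-mers each) sharing 90 four-mers.
	lo, hi := lcsBounds(97, 97, 90)
	fmt.Println("LCS bounds:", lo, hi)
	lo, hi = errorBounds(97, 97, 90)
	fmt.Println("error bounds:", lo, hi)
}
```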
|
||||
|
||||
## Use Case
|
||||
|
||||
Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.
|
||||
@@ -0,0 +1,44 @@
|
||||
# Semantic Description of the `obikmer` Package
|
||||
|
||||
This Go package implements a **De Bruijn graph** for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).
|
||||
|
||||
### Core Functionalities
|
||||
|
||||
- **K-mer Encoding**: K-mers are encoded as `uint64` using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the `iupac` map.
|
||||
- **Reverse Complement Handling**: The `revcompnuc` table enables nucleotide-wise reverse complementation.
|
||||
- **Graph Construction**: The `DeBruijnGraph` struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (`kmermask`, `prevc/g/t`).
|
||||
|
||||
### Graph Operations
|
||||
|
||||
- **Node Queries**:
|
||||
- `Previouses()` / `Nexts()`: Return predecessor/successor k-mers in the graph.
|
||||
- `MaxNext()` / `MaxHead()`: Find neighbors or heads (sources) with maximum weight.
|
||||
- **Path Exploration**:
|
||||
- `MaxPath()`: Greedily traces the highest-weight path from a head.
|
||||
- `LongestPath()`: Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).
|
||||
- `HaviestPath()`: Uses Dijkstra-like priority queue to find the *heaviest* (sum-weight) path, with cycle detection via DFS (`HasCycle()`).
|
||||
|
||||
### Consensus & Filtering
|
||||
|
||||
- **Consensus Generation**:
|
||||
- `BestConsensus()` returns a sequence from the greedy max-weight path.
|
||||
- `LongestConsensus(id, min_cov)` trims low-coverage ends using a coverage threshold (mode-based).
|
||||
- **Weight Statistics**:
|
||||
- `MaxWeight()`, `WeightMean()`, `WeightMode()` provide distribution summaries.
|
||||
- `FilterMinWeight(min)` removes low-count nodes.
|
||||
- **Decoding**:
|
||||
- `DecodeNode()` converts a k-mer index to its DNA string.
|
||||
- `DecodePath()` reconstructs the full consensus from a path.
|
||||
|
||||
### I/O & Diagnostics
|
||||
|
||||
- **GML Export**: `WriteGml()` outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
|
||||
- **Hamming Distance**: `HammingDistance()` computes the Hamming distance (the number of positions at which the bases differ) between two encoded k-mers using bit operations.
|
||||
- **Sequence Insertion**: `Push()` adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.
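To make the bit-level Hamming distance mentioned above concrete, here is a minimal sketch for two k-mers of equal length encoded with 2 bits per base. The helper names are hypothetical, and the package's own implementation may use a different bit trick.

```go
package main

import (
	"fmt"
	"math/bits"
)

// base2bit maps A, C, G, T to 0..3; other characters map to 0 (sketch assumption).
func base2bit(b byte) uint64 {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0
}

// encode packs a short upper-case DNA word into a uint64, 2 bits per base.
func encode(seq string) uint64 {
	var k uint64
	for i := 0; i < len(seq); i++ {
		k = k<<2 | base2bit(seq[i])
	}
	return k
}

// hamming counts the positions at which two encoded k-mers differ.
func hamming(a, b uint64) int {
	x := a ^ b
	// Fold each 2-bit base onto its low bit: set if either bit differs.
	x = (x | x>>1) & 0x5555555555555555
	return bits.OnesCount64(x)
}

func main() {
	fmt.Println(hamming(encode("ACGT"), encode("ACCT")))
	fmt.Println(hamming(encode("AAAA"), encode("TTTT")))
}
```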
|
||||
|
||||
### Dependencies & Design
|
||||
|
||||
- Leverages `obiseq` for sequence representation, the third-party `logrus` library for logging, and `slices` and `container/heap` from Go's standard library.
|
||||
- Designed for scalability and speed, using bit-level operations to minimize memory footprint.
|
||||
|
||||
Overall: a robust k-mer graph engine for *de novo* assembly, error correction, and consensus recovery in high-throughput sequencing data.
|
||||
@@ -0,0 +1,35 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
The `obikmer` package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
1. **Nucleotide Encoding**
|
||||
- `EncodeNucleotide(b byte)`: Maps DNA bases (A, C, G, T/U) to 2-bit values:
|
||||
`0→A`, `1→C`, `2→G`, `3→T/U`.
|
||||
Ambiguous or non-standard characters (e.g., N, R, Y) default to `A` (`0`).
|
||||
Uses a lookup table for O(1) performance.
|
||||
|
||||
2. **4-mer Encoding**
|
||||
- `Encode4mer(seq, buffer)`: Converts a biological sequence into overlapping 4-mers.
|
||||
Each k-mer is encoded as an unsigned byte (0–255), where each nucleotide contributes 2 bits.
|
||||
Supports optional buffer reuse for memory efficiency.
|
||||
|
||||
3. **4-mer Indexing**
|
||||
- `Index4mer(seq, index, buffer)`: Builds an inverted index mapping each 4-mer code (0–255) to all its occurrence positions in the sequence.
|
||||
Returns `[][]int`, where rows correspond to k-mer codes and columns list positions.
|
||||
|
||||
4. **Fast Sequence Comparison**
|
||||
- `FastShiftFourMer(...)`: Compares two sequences using a FASTA-like shift-scoring algorithm.
|
||||
- Uses precomputed 4-mer index of a reference sequence and encodes the query.
|
||||
- Counts co-occurring 4-mers across all possible shifts (`refpos − queryPos`).
|
||||
- Computes raw and relative scores (normalized by alignment length).
|
||||
- Returns optimal shift, count of matching 4-mers, and maximum score (raw or relative).
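The shift-counting idea behind `FastShiftFourMer` can be sketched as follows. This is a simplified illustration under stated assumptions: it works on plain strings, uses hypothetical helper names, reports only the raw count of co-occurring 4-mers, and omits the relative (length-normalized) score.

```go
package main

import "fmt"

// code maps a base to 2 bits; unknown characters are treated as A.
func code(b byte) byte {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// encode4mers returns one byte per overlapping 4-mer of seq.
func encode4mers(seq string) []byte {
	if len(seq) < 4 {
		return nil
	}
	out := make([]byte, 0, len(seq)-3)
	var w byte
	for i := 0; i < len(seq); i++ {
		w = w<<2 | code(seq[i]) // the byte shift drops the oldest base
		if i >= 3 {
			out = append(out, w)
		}
	}
	return out
}

// bestShift returns the shift (refPos - queryPos) supported by the most
// shared 4-mers, and that number of 4-mers.
func bestShift(ref, query string) (shift, count int) {
	index := make([][]int, 256) // 4-mer code -> positions in ref
	for pos, c := range encode4mers(ref) {
		index[c] = append(index[c], pos)
	}
	shifts := map[int]int{}
	for qpos, c := range encode4mers(query) {
		for _, rpos := range index[c] {
			shifts[rpos-qpos]++
		}
	}
	for s, n := range shifts {
		if n > count || (n == count && s < shift) {
			shift, count = s, n
		}
	}
	return shift, count
}

func main() {
	fmt.Println(bestShift("TTTTACGTACGGA", "ACGTACGG"))
}
```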
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Memory-aware**: Supports buffer reuse to minimize allocations.
|
||||
- **Robustness**: Non-canonical bases handled gracefully (defaulting to A).
|
||||
- **Performance-oriented**: O(n) encoding and indexing; efficient hash-based shift counting.
|
||||
|
||||
Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.
|
||||
@@ -0,0 +1,39 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
|
||||
|
||||
## Core Encoding & Decoding
|
||||
|
||||
- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact `uint64` representations (2 bits per base, at most 62 bits), reserving the top 2 bits for optional error markers.
|
||||
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
|
||||
|
||||
## Iterators (Memory-Efficient Streaming)
|
||||
|
||||
- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
|
||||
- **`IterCanonicalKmersWithErrors`**: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for **odd k ≤ 31**.
|
||||
|
||||
## Error Handling & Markers
|
||||
|
||||
- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
|
||||
|
||||
## Reverse Complement & Circular Normalization
|
||||
|
||||
- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
|
||||
- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
|
||||
- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
|
||||
|
||||
## Counting & Math Utilities
|
||||
|
||||
- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau’s necklace formula**, with Euler's totient function and divisor enumeration.
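As an illustration of the necklace formula mentioned above, the number of rotation classes of words of length `n` over an alphabet of size `a` is `(1/n) Σ_{d|n} φ(d) · a^(n/d)`. The sketch below implements that formula directly; `countNecklaces` is a hypothetical name, and the sketch covers rotation equivalence only.

```go
package main

import "fmt"

// totient returns Euler's totient function φ(n).
func totient(n int) int {
	result := n
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 {
		result -= result / n
	}
	return result
}

// pow returns base^exp for small non-negative exponents.
func pow(base, exp int) int {
	r := 1
	for i := 0; i < exp; i++ {
		r *= base
	}
	return r
}

// countNecklaces counts words of length n over an alphabet of size a,
// up to rotation, by summing over the divisors of n.
func countNecklaces(n, a int) int {
	sum := 0
	for d := 1; d <= n; d++ {
		if n%d == 0 {
			sum += totient(d) * pow(a, n/d)
		}
	}
	return sum / n
}

func main() {
	for n := 1; n <= 6; n++ {
		fmt.Println(n, countNecklaces(n, 4))
	}
}
```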
|
||||
|
||||
## Performance & Safety
|
||||
|
||||
- All functions avoid heap allocations where possible (reusing buffers).
|
||||
- Panics on invalid `k` or length mismatches for correctness.
|
||||
- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
|
||||
|
||||
## Use Cases
|
||||
|
||||
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
|
||||
- Error-aware k-mer filtering in sequencing pipelines
|
||||
- Low-complexity region detection via circular entropy normalization
|
||||
@@ -0,0 +1,36 @@
|
||||
# Obikmer: Efficient K-mer Encoding and Manipulation in Go
|
||||
|
||||
This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
### K-mer Encoding (`EncodeKmers`, `IterKmers`)
|
||||
Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Handles *k*-mers up to `k = 31` (62 bits) and validates `k`, rejecting out-of-range values.
|
||||
|
||||
### Reverse Complement (`ReverseComplement`)
|
||||
Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: `RC(RC(x)) = x`.
|
||||
|
||||
### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
|
||||
Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
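The relationship between reverse complement and canonical form can be sketched as follows. This version loops over the `k` bases for readability rather than using the constant-time bit manipulation described above, ignores the error bits, and uses hypothetical lowercase names.

```go
package main

import "fmt"

// base2bit maps A, C, G, T to 0..3; other characters map to 0 (sketch assumption).
func base2bit(b byte) uint64 {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0
}

// encode packs an upper-case DNA word (k <= 31) into a uint64.
func encode(seq string) uint64 {
	var k uint64
	for i := 0; i < len(seq); i++ {
		k = k<<2 | base2bit(seq[i])
	}
	return k
}

// reverseComplement reads bases from the low end and appends their
// complements, which reverses and complements the k-mer in one pass.
func reverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - kmer&3) // with A=0, C=1, G=2, T=3 the complement is 3-x
		kmer >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer and its reverse complement.
// With this encoding, numeric order equals lexicographic order.
func canonical(kmer uint64, k int) uint64 {
	if rc := reverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	fwd, rev := encode("AAAC"), encode("GTTT")
	fmt.Println(canonical(fwd, 4) == canonical(rev, 4))
}
```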
|
||||
|
||||
### Super *k*-mers Extraction (`ExtractSuperKmers`)
|
||||
Groups overlapping *k*-mers sharing the same minimizer (minimal *m*-mer in sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and minimizer values, all canonicalized.
|
||||
|
||||
### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
|
||||
Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
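A minimal sketch of the tagging scheme, assuming the error code occupies bits 62 and 63 and the sequence the low 62 bits. The lowercase function names are hypothetical counterparts of `SetKmerError`, `GetKmerError`, and `ClearKmerError`.

```go
package main

import "fmt"

const (
	errorShift = 62
	kmerMask   = uint64(1)<<errorShift - 1 // low 62 bits hold the sequence
)

// setError stores a 2-bit code (0..3) in the top bits, keeping the sequence bits.
func setError(kmer, code uint64) uint64 {
	return kmer&kmerMask | (code&3)<<errorShift
}

// getError extracts the 2-bit code.
func getError(kmer uint64) uint64 { return kmer >> errorShift }

// clearError returns the k-mer with the top 2 bits zeroed.
func clearError(kmer uint64) uint64 { return kmer & kmerMask }

func main() {
	tagged := setError(0x1B, 2)
	fmt.Println(getError(tagged), clearError(tagged) == 0x1B)
}
```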
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
|
||||
- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
|
||||
- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
|
||||
- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
|
||||
|
||||
## Use Cases
|
||||
|
||||
- Genome assembly and de Bruijn graph (DBG) construction
|
||||
- Minimizer- or MinHash-style sketching (e.g., *Mash*, *Sourmash*)
|
||||
- Error-aware k-mer counting & filtering
|
||||
- Strand-unbiased sequence comparison
|
||||
|
||||
All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.
|
||||
@@ -0,0 +1,31 @@
|
||||
# Semantic Description of `obikmer` Entropy Functions
|
||||
|
||||
The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`KmerEntropy(kmer, k, levelMax)`**:
|
||||
Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
|
||||
- Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
|
||||
- For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
|
||||
- Normalized entropy = `(log(N) − Σ(nᵢ log nᵢ)/N) / emax`, where `emax` is the theoretical max entropy given sequence length and alphabet constraints.
|
||||
- Returns min entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
|
||||
|
||||
- **`KmerEntropyFilter`**:
|
||||
A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
|
||||
- Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
|
||||
- Avoids repeated allocations — critical for performance in pipelines (e.g., read filtering).
|
||||
- **Not goroutine-safe** — each thread must instantiate its own filter.
|
||||
|
||||
- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
|
||||
Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.
|
||||
|
||||
- **`Accept(kmer)` / `Entropy(kmer)`**:
|
||||
- `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).
|
||||
- `Entropy()` computes entropy using precomputed tables — ~10× faster than standalone calls.
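The sub-word entropy computed by `KmerEntropy` for one word size can be sketched as below. This is a simplified illustration: it works on a decoded string, returns the unnormalized value `log(N) − Σ(nᵢ log nᵢ)/N` in nats (the division by `emax` is omitted because its exact definition is not given here), and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// smallestRotation returns the lexicographically smallest rotation of w,
// used here as the circular canonical form.
func smallestRotation(w string) string {
	best := w
	for i := 1; i < len(w); i++ {
		if r := w[i:] + w[:i]; r < best {
			best = r
		}
	}
	return best
}

// wordEntropy returns log(N) - sum(n_i*log(n_i))/N over the N overlapping
// words of size ws, after circular normalization.
func wordEntropy(seq string, ws int) float64 {
	n := len(seq) - ws + 1
	if n <= 0 {
		return 0
	}
	counts := map[string]int{}
	for i := 0; i < n; i++ {
		counts[smallestRotation(seq[i:i+ws])]++
	}
	sum := 0.0
	for _, c := range counts {
		sum += float64(c) * math.Log(float64(c))
	}
	return math.Log(float64(n)) - sum/float64(n)
}

func main() {
	fmt.Printf("%.3f\n", wordEntropy("ATATATAT", 2)) // AT and TA collapse to one class
	fmt.Printf("%.3f\n", wordEntropy("ACGTTGCA", 2))
}
```

Taking the minimum of this value over several word sizes, as the description above states, lets a repeat of any period up to `levelMax` pull the score down.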
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Circular canonical normalization** ensures symmetry (e.g., `AT` ≡ `TA`).
|
||||
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
|
||||
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
|
||||
@@ -0,0 +1,37 @@
|
||||
# K-Way Merge for Sorted k-mer Streams
|
||||
|
||||
This Go package implements a **k-way merge** over multiple sorted streams of *k*-mer values (`uint64`). It leverages a **min-heap** to efficiently produce the globally sorted sequence while aggregating duplicate counts across input streams.
|
||||
|
||||
## Core Components
|
||||
|
||||
- **`mergeItem`**: Stores a value and its source reader index for heap operations.
|
||||
- **`mergeHeap`** & `heap.Interface`: Implements a min-heap for efficient retrieval of smallest values.
|
||||
- **`KWayMerge`**: Main struct managing the heap and input readers.
|
||||
|
||||
## Key Functionality
|
||||
|
||||
- **Initialization (`NewKWayMerge`)**:
|
||||
- Takes a slice of `*KdiReader`, each expected to yield sorted values.
|
||||
- Preloads the heap with one value from each reader.
|
||||
|
||||
- **Streaming Output (`Next`)**:
|
||||
- Returns the next smallest *k*-mer, its frequency across readers (i.e., how many input streams contained it), and a success flag.
|
||||
- Handles duplicates: pops *all* items equal to the current minimum before advancing readers.
|
||||
|
||||
- **Cleanup (`Close`)**:
|
||||
- Closes all underlying `KdiReader`s and returns the first encountered error.
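The merge logic described above can be sketched with `container/heap` over in-memory slices. The `merger` type is a hypothetical stand-in for `KWayMerge`, slices replace `KdiReader` streams, and each input is assumed to be strictly increasing.

```go
package main

import (
	"container/heap"
	"fmt"
)

type item struct {
	value uint64
	src   int // index of the stream the value came from
}

type minHeap []item

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].value < h[j].value }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() any {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

type merger struct {
	h       minHeap
	streams [][]uint64
	pos     []int // next unread index in each stream
}

// newMerger preloads the heap with the first value of every non-empty stream.
func newMerger(streams [][]uint64) *merger {
	m := &merger{streams: streams, pos: make([]int, len(streams))}
	for i, s := range streams {
		if len(s) > 0 {
			m.h = append(m.h, item{s[0], i})
			m.pos[i] = 1
		}
	}
	heap.Init(&m.h)
	return m
}

// next returns the smallest remaining value and the number of streams holding it.
func (m *merger) next() (value uint64, count int, ok bool) {
	if m.h.Len() == 0 {
		return 0, 0, false
	}
	smallest := m.h[0].value
	for m.h.Len() > 0 && m.h[0].value == smallest {
		it := heap.Pop(&m.h).(item)
		count++
		if p := m.pos[it.src]; p < len(m.streams[it.src]) {
			heap.Push(&m.h, item{m.streams[it.src][p], it.src})
			m.pos[it.src] = p + 1
		}
	}
	return smallest, count, true
}

func main() {
	m := newMerger([][]uint64{{1, 3, 5}, {3, 4}, {}, {1, 3}})
	for v, c, ok := m.next(); ok; v, c, ok = m.next() {
		fmt.Println(v, c)
	}
}
```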
|
||||
|
||||
## Use Case
|
||||
|
||||
Ideal for merging sorted *k*-mer databases (e.g., from multiple files or processes), enabling:
|
||||
- Efficient deduplication with multiplicity tracking.
|
||||
- Scalable union/intersection operations on large *k*-mer sets.
|
||||
|
||||
## Complexity
|
||||
|
||||
| Operation | Time |
|-----------|------|
| `Next()` | *O(log k)* heap operations per unique value |
| Init | *O(k)* |
|
||||
|
||||
Where `k` = number of input readers.
|
||||
@@ -0,0 +1,27 @@
|
||||
# K-Way Merge Functionality in `obikmer`
|
||||
|
||||
This Go package provides utilities for merging sorted k-mer streams stored in `.kdi` files. Its core component is the `KWayMerge`, which performs a k-way merge of multiple sorted input streams, aggregating duplicate k-mers by counting their occurrences.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Sorted K-Mer Input**: Reads k-mers from `.kdi` files via `KdiReader`, assuming each file contains *sorted* 64-bit unsigned integers (`uint64`).
|
||||
- **K-Way Merge**: Merges multiple sorted streams into a single globally sorted stream using an efficient priority queue (min-heap) internally.
|
||||
- **Count Aggregation**: When identical k-mers appear across multiple streams, the merge reports how many input streams contain each unique k-mer.
|
||||
- **Memory-Efficient Streaming**: Processes data incrementally, avoiding full loading of all streams into memory.
|
||||
- **Robust Test Coverage**: Includes unit tests for:
|
||||
- Basic merging with overlapping and non-overlapping values.
|
||||
- Single-stream input (degenerate case).
|
||||
- Empty streams handling.
|
||||
- All identical k-mers across inputs.
|
||||
|
||||
## API Highlights
|
||||
|
||||
- `NewKdiReader(path)` — opens a `.kdi` file for reading.
|
||||
- `writeKdi(...)` (test helper) — writes sorted k-mers to a `.kdi` file.
|
||||
- `NewKWayMerge([]*KdiReader)` — constructs the merger from multiple readers.
|
||||
- `.Next()` → `(kmer uint64, count int, ok bool)` — yields next merged k-mer and its frequency; `ok=false` signals end-of-stream.
|
||||
- `.Close()` — cleans up resources.
|
||||
|
||||
## Use Case
|
||||
|
||||
Ideal for aggregating k-mer counts across multiple sequencing samples (e.g., in bioinformatics), where each sample’s k-mers are pre-sorted and persisted, enabling scalable distributed counting without full in-memory deduplication.
|
||||
@@ -0,0 +1,27 @@
|
||||
# KDI Reader: Streaming Delta-Varint Decoding for k-mers
|
||||
|
||||
The `obikmer` package provides a high-performance, streaming reader for `.kdi` files—binary containers storing *sorted* k-mers (typically DNA substrings encoded as 64-bit integers). It supports both sequential and indexed access.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Streaming decoding**: K-mers are read incrementally using delta-varint compression to minimize I/O and memory footprint.
|
||||
- **Delta encoding**: After the first absolute `uint64`, subsequent values are stored as *deltas* (difference from previous), encoded via custom `DecodeVarint`.
|
||||
- **Magic & format validation**: A 4-byte magic header identifies the file format; a little-endian `uint64` stores the total count.
|
||||
- **Sparse indexing**: When paired with a `.kdx` index, `SeekTo(target)` enables fast forward-only jumps to positions ≥ target k-mer.
|
||||
- **Graceful fallback**: If `.kdx` is missing or invalid, the reader automatically degrades to sequential mode.
|
||||
|
||||
## Key API
|
||||
|
||||
- `NewKdiReader(path)` → opens `.kdi` for streaming (no index).
|
||||
- `NewKdiIndexedReader(path)` → opens with optional `.kdx` for random access.
|
||||
- `Next()` → returns `(nextKmer, true)` or `(0, false)` when exhausted.
|
||||
- `SeekTo(target uint64) error` → jumps to first k-mer ≥ target using index (no backward seek).
|
||||
- `Count()` / `Remaining()` → total and unread k-mers.
|
||||
- `Close()` → releases file handle.
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- Uses 64 KB buffer for efficient I/O.
|
||||
- Index entries record `(kmer, byteOffset)` at fixed strides (e.g., every 1024 k-mers).
|
||||
- `SeekTo` is idempotent and safe: no-op if target ≤ current position or index unavailable.
|
||||
- Designed for large-scale genomic k-mer catalogs (e.g., from minimizers or de Bruijn graphs).
|
||||
@@ -0,0 +1,34 @@
|
||||
# KDI File Format and API
|
||||
|
||||
The `obikmer` package implements a compact, sorted k-mer storage format (`.kdi`) with delta compression for efficient disk persistence and retrieval.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Sorted k-mer serialization**: K-mers (as `uint64`) are written in ascending order.
|
||||
- **Delta encoding**: Consecutive differences (deltas) between k-mers are stored using variable-length integers (`varint`), drastically reducing size for dense sequences.
|
||||
- **Round-trip integrity**: Full write/read cycles preserve exact k-mer values and counts.
|
||||
|
||||
## File Structure
|
||||
|
||||
A `.kdi` file contains:
|
||||
1. **Magic header** (4 bytes): Identifies the format.
|
||||
2. **Count field** (8 bytes, `uint64`): Number of stored k-mers.
|
||||
3. **First value** (8 bytes, `uint64`): Initial k-mer.
|
||||
4. **Delta-encoded tail**: `(n−1)` deltas, each encoded as a `varint`.
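The layout listed above can be reproduced in memory with a short sketch. Several details are assumptions made for illustration: the magic bytes `KDI\x01`, little-endian fixed-width fields, the standard library's unsigned varint from `encoding/binary` standing in for the package's own varint codec, and an empty file consisting of the header and count only.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

var magic = []byte{'K', 'D', 'I', 0x01} // assumed magic header

// encodeKdi serializes strictly increasing k-mers: magic, count, first
// value, then one unsigned varint per delta.
func encodeKdi(kmers []uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	buf := append([]byte(nil), magic...)
	binary.LittleEndian.PutUint64(tmp[:8], uint64(len(kmers)))
	buf = append(buf, tmp[:8]...)
	if len(kmers) == 0 {
		return buf
	}
	binary.LittleEndian.PutUint64(tmp[:8], kmers[0])
	buf = append(buf, tmp[:8]...)
	for i := 1; i < len(kmers); i++ {
		n := binary.PutUvarint(tmp[:], kmers[i]-kmers[i-1])
		buf = append(buf, tmp[:n]...)
	}
	return buf
}

// decodeKdi reverses encodeKdi by accumulating the deltas.
func decodeKdi(data []byte) ([]uint64, error) {
	if len(data) < 12 || !bytes.Equal(data[:4], magic) {
		return nil, errors.New("not a KDI stream")
	}
	n := binary.LittleEndian.Uint64(data[4:12])
	if n == 0 {
		return nil, nil
	}
	if len(data) < 20 {
		return nil, errors.New("truncated first value")
	}
	prev := binary.LittleEndian.Uint64(data[12:20])
	out := []uint64{prev}
	rest := data[20:]
	for uint64(len(out)) < n {
		delta, size := binary.Uvarint(rest)
		if size <= 0 {
			return nil, errors.New("truncated or corrupt delta")
		}
		prev += delta
		out = append(out, prev)
		rest = rest[size:]
	}
	return out, nil
}

func main() {
	enc := encodeKdi([]uint64{10, 12, 300, 301})
	dec, err := decodeKdi(enc)
	fmt.Println(len(enc), dec, err)
}
```

Because sorted k-mers tend to have small gaps, most deltas fit in one or two bytes, which is where the size reduction comes from.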
|
||||
|
||||
## API
|
||||
|
||||
- **`NewKdiWriter(path string)`**: Creates a writer; `Write(v uint64)` appends k-mers.
|
||||
- **`Writer.Count()`**: Returns the number of written items before closing.
|
||||
- **`NewKdiReader(path string)`**: Opens a reader; `Next() (uint64, bool)` yields k-mers in order.
|
||||
- **`Reader.Count()`**: Returns total stored count.
|
||||
|
||||
## Tests Validate
|
||||
|
||||
1. Basic round-trip with diverse values (including large `uint64`s).
|
||||
2. Empty and single-k-mer files.
|
||||
3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
|
||||
4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → 9,999 one-byte deltas after the 20-byte header and first value).
|
||||
5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
|
||||
|
||||
The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.
|
||||
@@ -0,0 +1,38 @@
|
||||
# KDI File Format and Writer
|
||||
|
||||
The `obikmer` package implements a compact, sorted sequence storage format for 64-bit k-mers using delta encoding and sparse indexing.
|
||||
|
||||
## Core Format (`.kdi`)
|
||||
|
||||
- **Magic header**: `KDI\x01` (`4 bytes`) identifies the file type.
|
||||
- **Count field**: `uint64 LE`, total number of k-mers (patched at close).
|
||||
- **First value**: `uint64 LE`, the initial k-mer stored as an absolute integer.
|
||||
- **Deltas**: Subsequent values encoded via *delta-varint* (difference from previous k-mer), enabling high compression for sorted sequences.
|
||||
|
||||
## Writer (`KdiWriter`)
|
||||
|
||||
- **Strict ordering**: K-mers must be written in *strictly increasing order*.
|
||||
- Efficient buffering via `bufio.Writer` (64 KB buffer).
|
||||
- Internally tracks:
|
||||
- Current k-mer count,
|
||||
- Previous value (for delta computation),
|
||||
- Bytes written in data section.
|
||||
- **Sparse indexing**: Every `defaultKdxStride` k-mers, an entry is recorded in memory for random access.
|
||||
|
||||
## Companion Index (`.kdx`)
|
||||
|
||||
- Written automatically on `Close()` if indexing entries exist.
|
||||
- Stores `(kmer, file_offset)` pairs for fast seek-to-position lookups (e.g., binary search on k-mer range).
|
||||
- Enables efficient random access without full file scan.
|
||||
|
||||
## Usage Pattern
|
||||
|
||||
```go
w, err := obikmer.NewKdiWriter("data.kdi")
if err != nil {
	log.Fatal(err) // assumes the standard "log" package is imported
}
for _, kmer := range sortedKMers { // values must be strictly increasing
	w.Write(kmer)
}
w.Close() // finalizes header, writes .kdx index
```
|
||||
|
||||
The format is optimized for memory-efficient storage and fast retrieval of sorted uint64 k-mers in genomic or sequence analysis pipelines.
|
||||
@@ -0,0 +1,29 @@
|
||||
# KDX Index Format and Functionality
|
||||
|
||||
The `obikmer` package provides a sparse indexing mechanism for `.kdi` files, which store sorted k-mers with delta encoding. The **`.kdx` file** serves as a fast lookup table to accelerate k-mer searches.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
- **Magic bytes**: `KDX\x01` validates file integrity.
|
||||
- **Stride-based sparsity**: One index entry every *N* k-mers (default: 4096), balancing memory vs. search speed.
|
||||
- **Entry structure**: Each entry stores:
|
||||
- `kmer`: the k-mer value at that index position.
|
||||
- `offset`: absolute byte offset in the corresponding `.kdi` file.
|
||||
|
||||
## Key Operations
|
||||
|
||||
- **Loading**: `LoadKdxIndex()` reads and validates a `.kdx` file; returns `(nil, nil)` if missing (graceful degradation).
|
||||
- **Searching**: `FindOffset(target uint64)` performs binary search over index entries to find the *best jump point*:
|
||||
- Returns `offset`, `skipCount` (k-mer count already passed), and a boolean success flag.
|
||||
- Enables efficient seeking: after `offset`, only up to *stride* k-mers need linear scanning.
|
||||
- **Writing**: `WriteKdxIndex()` serializes an in-memory index to disk (for building indexes).
|
||||
- **Helper**: `KdxPathForKdi()` derives the `.kdx` path from a given `.kdi` file.
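The binary search behind `FindOffset` can be sketched with `sort.Search`. The `entry` and `index` types are hypothetical, and the exact edge-case behavior of the real function (for example, a target smaller than the first indexed k-mer) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

type entry struct {
	kmer   uint64 // k-mer stored at this index position
	offset uint64 // byte offset of that k-mer in the .kdi file
}

type index struct {
	stride  int // number of k-mers between two index entries
	entries []entry
}

// findOffset returns the jump point for target: the offset of the last
// indexed k-mer <= target and the number of k-mers that precede it.
func (ix index) findOffset(target uint64) (offset uint64, skip int, ok bool) {
	// i is the first entry whose k-mer is strictly greater than target.
	i := sort.Search(len(ix.entries), func(i int) bool {
		return ix.entries[i].kmer > target
	})
	if i == 0 {
		return 0, 0, false // target precedes every indexed k-mer
	}
	return ix.entries[i-1].offset, (i - 1) * ix.stride, true
}

func main() {
	ix := index{stride: 4096, entries: []entry{{10, 20}, {50, 300}, {90, 700}}}
	fmt.Println(ix.findOffset(60))
}
```

After jumping to the returned offset, a reader only has to decode deltas linearly until it reaches the first k-mer greater than or equal to the target, which is at most one stride away.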
|
||||
|
||||
## Performance
|
||||
|
||||
- Search complexity: **O(log M)** for the binary search (where *M* = #index entries), plus ≤ stride linear steps.
|
||||
- Memory footprint: Linear in index size (16 bytes per entry), highly scalable for large k-mer sets.
|
||||
|
||||
## Design Philosophy
|
||||
|
||||
Minimalist, binary-safe format with explicit endianness (little-endian), no external dependencies beyond `encoding/binary`, and robust error handling.
|
||||
@@ -0,0 +1,14 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
The `obikmer` package implements efficient k-mer matching between query sequences and an indexed reference using **canonical k-mers** partitioned by minimizer-based hashing.
|
||||
|
||||
- `QueryEntry` represents a single canonical k‑mer with its origin: sequence index and 1-based position.
|
||||
- `PreparedQueries` groups queries into sorted buckets per partition, enabling batched and parallelized matching.
|
||||
- `PrepareQueries` scans input sequences using *super-kmers* (with window size `m`) to compute minimizers, assigns each k‑mer to a partition via modulo hashing, and sorts buckets by k‑mer value.
|
||||
- `MergeQueries` combines two sets of prepared queries across batches using a merge-sort strategy, correctly offsetting sequence indices to preserve global ordering.
|
||||
- `MatchBatch` performs parallel matching per partition: each goroutine runs a **merge-scan** between sorted queries and the corresponding KDI (K-mer Disk Index) stream.
|
||||
- Efficient seeking is used only when beneficial, avoiding costly syscalls for small skips.
|
||||
- Matches are recorded with thread-safe per-sequence mutexes; final positions within each sequence are sorted post-match.
|
||||
- `matchPartition` implements the core merge-scan: it opens a KDI reader, seeks to relevant regions of the index, and walks both query list and k‑mer stream in lockstep.
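The merge-scan at the heart of `matchPartition` can be illustrated on in-memory data. In this sketch a sorted slice stands in for the KDI stream, the `query` type is a hypothetical simplification of `QueryEntry`, and seeking is left out.

```go
package main

import "fmt"

// query is a simplified stand-in for QueryEntry.
type query struct {
	kmer   uint64
	seqIdx int
	pos    int
}

// mergeScan walks sorted queries and a sorted k-mer stream in lockstep
// and returns the queries whose k-mer occurs in the stream.
func mergeScan(queries []query, stream []uint64) []query {
	var hits []query
	qi, si := 0, 0
	for qi < len(queries) && si < len(stream) {
		switch {
		case queries[qi].kmer < stream[si]:
			qi++ // query k-mer is absent from the stream
		case queries[qi].kmer > stream[si]:
			si++ // advance the stream
		default:
			hits = append(hits, queries[qi])
			qi++ // keep si: several queries may share the same k-mer
		}
	}
	return hits
}

func main() {
	queries := []query{{3, 0, 1}, {5, 0, 7}, {5, 1, 2}, {9, 1, 4}}
	stream := []uint64{1, 5, 7, 9, 11}
	for _, h := range mergeScan(queries, stream) {
		fmt.Println(h.kmer, h.seqIdx, h.pos)
	}
}
```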
|
||||
|
||||
The design supports **large-scale batch processing**, incremental query accumulation, and high-performance parallel lookup—ideal for metagenomic or biodiversity sequencing workflows.
|
||||
@@ -0,0 +1,49 @@
|
||||
# `obikmer` K-mer Set Group Builder — Functional Overview
|
||||
|
||||
The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).
|
||||
|
||||
## Core Features
|
||||
|
||||
- **K-mer & Minimizer Configuration**:
|
||||
Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).
|
||||
|
||||
- **Functional Options for Filtering**:
|
||||
- `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).
|
||||
- `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.
|
||||
- `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).
|
||||
- `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.
|
||||
|
||||
- **Concurrent & Pipeline-Aware Processing**:
|
||||
Uses a two-stage pipeline: *I/O-bound readers* (2–4 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.
|
||||
|
||||
- **Partitioned I/O & Thread Safety**:
|
||||
Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Build Phase**:
|
||||
- Input sequences → super-kmers extracted via minimizer-based partitioning.
|
||||
- Super-kmers written to `.build/set_*/part_*.skm`.
|
||||
|
||||
2. **Finalization (`Close()`)**:
|
||||
- `.skm` files loaded → canonical k-mers extracted.
|
||||
- K-mers sorted, counted (frequency spectrum), and filtered per config.
|
||||
- Final `.kdi` files are written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.
|
||||
- Metadata (`metadata.toml`) generated; `.build/` cleaned.
|
||||
|
||||
3. **Append Mode**:
|
||||
`AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
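The sort, count, and filter step of the finalization phase can be sketched on an in-memory slice. The `finalize` function is hypothetical, treats `maxFreq == 0` as "no upper limit" (an assumption), and leaves out partitioning, entropy filtering, and file output.

```go
package main

import (
	"fmt"
	"sort"
)

// finalize sorts the k-mers of one partition, counts each distinct value,
// builds the frequency spectrum (frequency -> number of distinct k-mers),
// and keeps the k-mers whose frequency passes the filters.
func finalize(kmers []uint64, minFreq, maxFreq int) (kept []uint64, spectrum map[int]int) {
	sorted := append([]uint64(nil), kmers...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	spectrum = map[int]int{}
	for i := 0; i < len(sorted); {
		j := i
		for j < len(sorted) && sorted[j] == sorted[i] {
			j++ // extend the run of identical k-mers
		}
		freq := j - i
		spectrum[freq]++
		if freq >= minFreq && (maxFreq == 0 || freq <= maxFreq) {
			kept = append(kept, sorted[i])
		}
		i = j
	}
	return kept, spectrum
}

func main() {
	kept, spectrum := finalize([]uint64{7, 3, 7, 9, 3, 7, 1}, 2, 0)
	fmt.Println(kept, spectrum)
}
```

Because the kept values come out sorted and deduplicated, they are already in the strictly increasing order that a `.kdi` writer expects.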
|
||||
|
||||
## Output Artifacts
|
||||
|
||||
- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.
|
||||
- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).
|
||||
- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.
|
||||
- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.
|
||||
- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.
|
||||
- **Robust error handling**: Early termination on first failure; cleanup of partial state.
|
||||
|
||||
@@ -0,0 +1,44 @@
|
||||
# K-mer Set Group Builder — Semantic Description
|
||||
|
||||
This Go module (`obikmer`) provides a **disk-backed builder and accessor** for managing *k-mer sets* across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical *k*-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
|
||||
|
||||
### Core Functionalities
|
||||
|
||||
- **K-mer Set Group Construction**:
  `NewKmerSetGroupBuilder` creates a builder configured with:
  - *k* (k-mer length),
  - *m* (minimizer length used for partitioning),
  - number of sets (`nSets`),
  - and optional parameters like `WithMinFrequency`.

- **Sequence Ingestion**:
  Sequences are added per set via `AddSequence(setID, bioseq)`. Internally:
  - Canonical *k*-mers are extracted (using `IterCanonicalKmers`),
  - then deduplicated and optionally filtered by occurrence frequency.

- **Persistence & Round-Trip**:
  `builder.Close()` materializes the *k*-mer sets to disk (in a temporary or user-specified directory).
  `OpenKmerSetGroup(dir)` reloads them, preserving all metadata and structure.

- **Metadata & Attributes**:
  Supports custom identifiers (`SetId`) and key-value attributes (e.g., `"organism": "test"`), saved to disk via `SaveMetadata`.

- **Efficient Iteration**:
  The iterator (`ksg.Iterator(setID)`) yields *sorted*, deduplicated canonical *k*-mers, using a k-way merge across internal partitions.

- **Frequency Filtering**:
  `WithMinFrequency(n)` keeps only *k*-mers that appear at least *n* times across the inputs, which suppresses noise (e.g., for error correction or abundance-based filtering).

- **Multi-set Support**:
  Handles multiple independent *k*-mer sets (e.g., per sample or taxonomic group), verified via `Size()` and indexed access (`Len(setID)`).
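
The k-way merge mentioned under **Efficient Iteration** can be sketched with a min-heap of per-partition cursors. This is a generic illustration over in-memory `uint64` slices, not the `obikmer` iterator; `cursor`, `cursorHeap`, and `mergeSorted` are hypothetical names, and the duplicate check is a defensive choice of this sketch.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the read position inside one sorted partition.
type cursor struct {
	part []uint64
	pos  int
}

// cursorHeap orders cursors by the value they currently point at.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return h[i].part[h[i].pos] < h[j].part[h[j].pos]
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted merges already-sorted partitions into one sorted slice,
// emitting each distinct value once.
func mergeSorted(parts [][]uint64) []uint64 {
	h := &cursorHeap{}
	for _, p := range parts {
		if len(p) > 0 {
			*h = append(*h, &cursor{part: p})
		}
	}
	heap.Init(h)

	var out []uint64
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the smallest current value
		v := c.part[c.pos]
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		c.pos++
		if c.pos < len(c.part) {
			heap.Fix(h, 0) // the cursor advanced: restore heap order
		} else {
			heap.Pop(h) // partition exhausted
		}
	}
	return out
}

func main() {
	parts := [][]uint64{{1, 4, 9}, {2, 4, 7}, {}, {3, 9}}
	fmt.Println(mergeSorted(parts))
}
```

With *p* partitions, each emitted value costs O(log *p*) heap work, and only one cursor per partition has to be held in memory, which is what makes this approach suitable for streaming from disk.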

### Testing Coverage

Comprehensive unit tests validate:
- Basic construction & correctness,
- Multi-sequence ingestion and deduplication,
- Frequency-based inclusion/exclusion logic,
- Cross-set isolation (`nSets > 1`),
- Metadata round-trip integrity.

This module is designed for scalable, reproducible *k*-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).