⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,22 @@
+# `obialign` Package: Sequence Alignment Utilities
+
+The `obialign` package provides core functions for pairwise biological sequence alignment in Go, designed to work with `obiseq.BioSequence` objects.
+
+- **Core Alignment Construction**: `_BuildAlignment()` and `BuildAlignment()` reconstruct aligned sequences from a precomputed alignment path (e.g., output by dynamic programming). It supports gap characters and reuses buffers for efficiency.
+
+- **Quality-Aware Consensus Building**: `BuildQualityConsensus()` generates a consensus sequence from an alignment and per-base quality scores:
+  - At mismatches, it retains the higher-quality base.
+  - When qualities are equal and bases differ, an IUPAC ambiguity code is used (via `_FourBitsBaseCode`/`_Decode`).
+  - Quality values are combined and adjusted for mismatches using a Phred-like error probability model.
+  - Optionally records mismatch statistics in sequence attributes.
+
+- **Performance & Memory Efficiency**: Uses preallocated buffers (via `PEAlignArena`) or fallback allocation, with slice recycling to minimize GC pressure.
+
+- **Metadata Handling**: Preserves sequence IDs and definitions in output; supports optional mismatch reporting for downstream analysis.
+
+- **Alignment Path Format**: The path is a sequence of signed integers encoding:
+  - Negative steps → deletions in seqB (insertion in A),
+  - Positive steps → insertions in B,
+  - Consecutive pairs encode match/mismatch runs.
+
+This package is part of the OBITools4 ecosystem, targeting high-throughput amplicon or metagenomic data processing.
@@ -0,0 +1,30 @@
+# Semantic Description of `obialign` Backtracking Module
+
+The `_Backtracking` function implements a **traceback algorithm** for sequence alignment, reconstructing the optimal path through an alignment matrix.
+
+## Core Functionality
+
+- **Input**:  
+  - `pathMatrix`: Encodes alignment decisions (match/mismatch/gap) as integers.  
+  - `lseqA`, `lseqB`: Lengths of sequences A and B.  
+  - `path`: Pre-allocated slice to store the traceback path.
+
+- **Output**: A compact representation of alignment steps, alternating between:
+  - Diagonal moves (`ldiag`): Matches/mismatches (one step in both sequences).
+  - Horizontal/vertical moves (`lleft` or `lup`): Gaps in sequence B (horizontal) or A (vertical).
+
+## Algorithm Highlights
+
+- **Reverse traversal** from `(lseqA−1, lseqB−1)` to origin.
+- **Batching logic**: Consecutive gaps in same direction are aggregated (e.g., `lleft += step`) to compress run-length encoding.
+- **Path reconstruction**: Steps are pushed *backwards* into the `path` slice using a moving pointer `p`.
+- **Memory efficiency**: Uses `slices.Grow()` to preallocate space and logs resizing for debugging.
+
+## Encoded Path Semantics
+
+Each pair in the returned slice encodes:
+- `[diag_count, move_type]`, where `move_type` is either a gap length (`lleft > 0`: horizontal, or `lup < 0`: vertical) or zero (end of diagonal run).
+
+## Use Case
+
+Enables efficient reconstruction and serialization of alignment paths—ideal for tools requiring low-level control over dynamic programming backtracking (e.g., pairwise aligners, edit-distance decompositions).
@@ -0,0 +1,26 @@
+# Semantic Description of `obialign` Package
+
+This Go package provides core utilities for **DNA sequence alignment scoring**, leveraging probabilistic models and log-space computations to ensure numerical stability.
+
+## Key Functionalities
+
+- **Four-bit nucleotide encoding**: Uses `_FourBitsBaseCode` (implied but not shown) to encode DNA bases as 4-bit values, enabling bitwise operations for fast comparison.
+
+- **Bitwise match ratio (`_MatchRatio`)**: Computes a normalized overlap score between two encoded bases by counting shared bits, adjusting for presence/absence in each operand.
+
+- **Log-space arithmetic helpers**:
+  - `_Logaddexp`: Stable computation of `log(exp(a) + exp(b))`.
+  - `_Log1mexp`, `_Logdiffexp`: Accurate log-domain operations for `log(1 − exp(a))` and `log(exp(a) − exp(b))`, critical for probability transformations.
+
+- **Match/mismatch scoring (`_MatchScoreRatio`)**:
+  - Derives log-probability-based scores for observed matches/mismatches using Phred-quality inputs (`QF`, `QR`).
+  - Incorporates base composition priors (e.g., uniform 4-mer assumption via `log(3)`, `log(4)`).
+
+- **Precomputed scoring matrices**:
+  - `_NucPartMatch`: Precomputes match ratios for all base-pair combinations.
+  - `_NucScorePartMatch{Match,Mismatch}`: Stores integer-scaled alignment scores (×10) for all Phred-quality pairs, enabling fast lookup during dynamic programming.
+
+- **Thread-safe initialization**:
+  - `_InitDNAScoreMatrix` ensures one-time setup of all matrices using a mutex guard, preventing race conditions.
+
+All computations are designed for high performance and numerical robustness in large-scale sequence alignment tasks.
@@ -0,0 +1,23 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides low-level utilities for efficiently encoding, decoding, and manipulating alignment-related metrics—specifically **score**, **path length**, and an **out-flag**—within compact 64-bit integers. This design supports high-performance operations in sequence alignment pipelines (e.g., OBITools4).
+
+- **Core Encoding Strategy**:  
+  A `uint64` encodes three fields: a *score* (upper bits), an inverted path *length*, and a single-bit flag indicating whether the value represents an "out" (i.e., terminal/invalid) state.
+
+- **`encodeValues(score, length int, out bool)`**:  
+  Packs `score`, `-length-1` (to preserve ordering via unsigned comparison), and the `out` flag into one integer. The most significant bit (bit 32) marks out-values.
+
+- **`decodeValues(value uint64)`**:  
+  Reverses encoding: extracts score, reconstructs original length via `((value + 1) ^ mask)`, and checks the out-flag.
+
+- **Utility Bitwise Helpers**:
+  - `_incpath(value)`: decrements stored length (since it's negated, subtraction increases actual path).
+  - `_incscore(value)`: increments score by `1 << wsize`.
+  - `_setout(value)`: clears the out-flag, marking value as *not* terminal.
+
+- **Predefined Constants**:
+  - `_empty`: neutral state (score=0, length=0).
+  - `_out`/`_notavail`: sentinel values for invalid or unavailable paths (high length, score=0).
+
+This compact representation enables fast comparisons and updates during dynamic programming or alignment graph traversal—critical for scalability in large-scale metabarcoding analyses.
@@ -0,0 +1,42 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
+
+## Core Algorithm
+
+- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
+- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
+- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
+
+## Scoring Scheme
+
+- **Match**: +1 point  
+- **Mismatch or gap (indel)**: 0 points  
+
+## Key Functions
+
+1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`  
+   - Computes LCS score and alignment length between raw byte sequences.  
+   - If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).  
+   - Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.  
+   - Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
+
+2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`  
+   - Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.  
+   - Designed for standard biosequence inputs.
+
+3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`  
+   - Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
+
+## Features
+
+- **Error-bounded**: Supports `maxError = -1` (unlimited) or a fixed max number of mismatches + gaps.
+- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
+- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
+- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
+
+## Use Cases
+
+- Molecular barcode/UMI clustering  
+- Read-to-reference alignment in amplicon sequencing  
+- Similarity filtering of biological sequences
@@ -0,0 +1,15 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
+
+- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
+- **Binary encoding scheme**: Each bit in a byte corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).  
+- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
+- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
+- **Memory efficiency**: The encoding avoids allocations via optional buffer reuse.
+- **Lookup tables**:
+  - `_FourBitsBaseCode`: Maps ASCII nucleotide characters (lowercased via `nuc & 31`) to their binary code.
+  - `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
+- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
+
+The `Encode4bits` function enables fast, space-efficient sequence processing—ideal for high-throughput sequencing data where alignment speed and memory usage are critical.
@@ -0,0 +1,19 @@
+## `obialign` Package: Semantic Overview (≤50 lines)
+
+The `obialign` package provides a lightweight, high-performance utility for **detecting single-edit-distance relationships** between biological sequences (`obiseq.BioSequence`). Its core function, `D1Or0`, determines whether two sequences are either **identical** or differ by exactly **one substitution, insertion, or deletion (indel)**.
+
+- `abs[k]`: A generic helper computing absolute values for integers or floats (via Go generics).
+- `D1Or0(...)`: Returns a 4-tuple:
+  - **`int` (first)**: `0` if identical, `1` if differing by one edit, `-1` otherwise.
+  - **`int` (second)**: Position of the differing site (`-1` if identical).
+  - **`byte`, `byte`**: Mismatched characters (or `'-'` for gaps indicating indels).
+
+**Algorithmic strategy:**
+1. Early rejection if length difference exceeds 1.
+2. Forward scan until first mismatch → identifies left boundary of divergence.
+3. Backward scan from ends to find rightmost match boundary.
+4. Validates whether the mismatch region allows exactly one edit:
+   - Single substitution: equal lengths, single divergent position.
+   - Insertion/deletion: length differs by 1 and only one non-overlapping character remains.
+
+Designed for speed in **OTU/ASV dereplication or error correction** pipelines (e.g., metabarcoding), where rapid filtering of near-identical sequences is critical. Does *not* compute full alignments; optimized for binary decision-making under strict edit constraints.
@@ -0,0 +1,29 @@
+# `LocatePattern` Functionality Overview
+
+The `obialign.LocatePattern` function implements a **local alignment algorithm** to find the best approximate match of a short DNA pattern (e.g., primer) within a longer biological sequence, using **dynamic programming**.
+
+- **Input**:  
+  - `id`: identifier for logging/error reporting.  
+  - `pattern []byte`: the query sequence (e.g., primer).  
+  - `sequence []byte`: the target read/contig.  
+
+- **Constraints**:  
+  - Pattern must be strictly shorter than the sequence (`len(pattern) < len(sequence)`).  
+
+- **Scoring Scheme**:  
+  - Match: `+0` (using IUPAC compatibility via `obiseq.SameIUPACNuc`).  
+  - Mismatch/Gap: `-1`.  
+
+- **Algorithm Features**:  
+  - End-gap free alignment (no penalty for gaps at sequence ends), enabling flexible primer positioning.  
+  - Uses a flattened buffer (`buffIndex`) for memory-efficient matrix storage (width × height).  
+  - Tracks alignment path via `path` array: diagonal (`0`, match/mismatch), up (`+1`, deletion in pattern/left gap), left (`-1`, insertion/deletion).  
+  - Backtracks from the bottom-right to find optimal local alignment start/end coordinates.  
+
+- **Output**:  
+  - `start`: starting index in `sequence`.  
+  - `end+1`: ending index (exclusive) of best match.  
+  - Error count: `-score`, i.e., number of mismatches/gaps in alignment.  
+
+- **Use Case**:  
+  Designed for high-throughput amplicon processing (e.g., primer trimming in metabarcoding pipelines like OBITools4).
@@ -0,0 +1,37 @@
+# Semantic Description of `obialign` Package
+
+The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
+
+## Core Functionalities
+
+### 1. **Memory Arena Management**
+- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
+- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
+
+### 2. **Dynamic Programming Alignment Functions**
+Implements three specialized global alignment variants using Needleman–Wunsch with affine gap penalties (scaled per mismatch):
+
+- **`PELeftAlign`**: Free gaps at the *start* of `seqB` and end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
+- **`PERightAlign`**: Free gaps at start of `seqA` and end of `seqB`. Suited when the second read extends beyond the first.
+- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA ≥ seqB`. Designed for full overlap scenarios (e.g., merging paired-end reads).
+
+All use column-major matrix storage and efficient index arithmetic via helper functions `_GetMatrix`, `_SetMatrices`, etc.
+
+### 3. **Scoring & Quality Integration**
+- Pairwise base/quality scores computed by `_PairingScorePeAlign`, combining:
+  - Nucleotide compatibility (via precomputed `_NucPartMatch`)
+  - Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
+  - A user-defined `scale` factor to modulate mismatch penalties.
+
+### 4. **Fast Heuristic Pre-Alignment**
+The main `PEAlign` function integrates a kmer-based fast pre-screening:
+- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
+- If overlap is significant (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
+- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
+
+### 5. **Backtracking & Path Output**
+- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
+- Paths encoded as alternating `(offset, length)` pairs for aligned segments (diagonal = 0), with gaps encoded as `-1`/`+1`.
+
+### Use Case
+Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., OBITOOLS4 ecosystem). Efficient, scalable for large batch processing via arena reuse.
@@ -0,0 +1,58 @@
+# Semantic Description of `obialign.ReadAlign`
+
+The `ReadAlign` function performs **paired-end read alignment** with quality-aware scoring, optimized for overlapping consensus construction in NGS data processing.
+
+## Core Functionality
+
+- **Input**: Two biological sequences (`seqA`, `seqB`) as `BioSequence` objects, plus alignment parameters:  
+  - `gap`: gap penalty (linear)  
+  - `scale`: scaling factor for quality scores  
+  - `delta`: extension buffer around initial overlap estimate  
+  - `fastScoreRel`: use relative vs absolute k-mer matching score  
+
+## Algorithm Overview
+
+1. **Preprocessing & Initialization**  
+   - Ensures DNA scoring matrix is initialized (`_InitDNAScoreMatrix`).  
+
+2. **Fast Overlap Estimation via 4-mer Indexing**  
+   - Builds a k-mer index of `seqA` using `obikmer.Index4mer`.  
+   - Computes optimal shift via `_FastShiftFourMer` in both forward and reverse-complement orientations.  
+   - Selects orientation (direct or reversed) yielding highest k-mer match count (`fastCount`) and score (`fastScore`).  
+
+3. **Overlap Computation**  
+   - Determines overlap length `over` based on shift:  
+     ```text
+       over = |seqA| - shift    if shift > 0  
+              |seqB| + shift    if shift < 0  
+              min(|seqA|,|seqB)| otherwise
+     ```
+
+4. **Dynamic Programming Alignment**  
+   - If overlap is *not* identical (`fastCount + 3 < over`):  
+     - Extracts subregions with `delta`-buffered boundaries.  
+     - Calls either `_FillMatrixPeLeftAlign` (left-aligned case) or `_FillMatrixPERightAlign`.  
+     - Backtracks via `_Backtracking` to produce alignment path.  
+   - Else (near-perfect overlap):  
+     - Skips DP; computes score directly from quality scores using `_NucScorePartMatchMatch`.  
+     - Returns trivial path `[extra5, partLen]`.
+
+## Output
+
+Returns:  
+
+| Index | Type     | Meaning |
+|-------|----------|---------|
+| 0️⃣    | `int`     | Final alignment score (weighted by quality) |
+| 1️⃣    | `[]int`   | Alignment path (list of positions: `[startA, endA, startB, endB]` or similar) |
+| 2️⃣    | `int`     | K-mer match count (`fastCount`) |
+| 3️⃣    | `int`     | Overlap length (`over`) |
+| 4️⃣    | `float64` | K-mer-based score (`fastScore`) |
+| 5️⃣    | `bool`    | Whether alignment was performed in direct orientation (`true`) or on reverse-complement of `seqB` |
+
+## Key Design Highlights
+
+- **Efficient pre-filtering** using 4-mers avoids full DP for nearly identical reads.  
+- **Quality-aware scoring**, leveraging Phred scores via `_NucScorePartMatchMatch`.  
+- Supports **asymmetric overlaps** (left/right alignment) with boundary padding (`delta`).  
+- Uses preallocated memory arenas to minimize GC pressure in high-throughput pipelines.