mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
This commit is contained in:
@@ -0,0 +1,22 @@
|
||||
# `obialign` Package: Sequence Alignment Utilities
|
||||
|
||||
The `obialign` package provides core functions for pairwise biological sequence alignment in Go, designed to work with `obiseq.BioSequence` objects.
|
||||
|
||||
- **Core Alignment Construction**: `_BuildAlignment()` and `BuildAlignment()` reconstruct aligned sequences from a precomputed alignment path (e.g., output by dynamic programming). It supports gap characters and reuses buffers for efficiency.
|
||||
|
||||
- **Quality-Aware Consensus Building**: `BuildQualityConsensus()` generates a consensus sequence from an alignment and per-base quality scores:
|
||||
- At mismatches, it retains the higher-quality base.
|
||||
- When qualities are equal and bases differ, an IUPAC ambiguity code is used (via `_FourBitsBaseCode`/`_Decode`).
|
||||
- Quality values are combined and adjusted for mismatches using a Phred-like error probability model.
|
||||
- Optionally records mismatch statistics in sequence attributes.
|
||||
|
||||
- **Performance & Memory Efficiency**: Uses preallocated buffers (via `PEAlignArena`) or fallback allocation, with slice recycling to minimize GC pressure.
|
||||
|
||||
- **Metadata Handling**: Preserves sequence IDs and definitions in output; supports optional mismatch reporting for downstream analysis.
|
||||
|
||||
- **Alignment Path Format**: The path is a sequence of signed integers encoding:
|
||||
- Negative steps → deletions in seqB (insertion in A),
|
||||
- Positive steps → insertions in B,
|
||||
- Consecutive pairs encode match/mismatch runs.
|
||||
|
||||
This package is part of the OBITools4 ecosystem, targeting high-throughput amplicon or metagenomic data processing.
|
||||
@@ -0,0 +1,30 @@
|
||||
# Semantic Description of `obialign` Backtracking Module
|
||||
|
||||
The `_Backtracking` function implements a **traceback algorithm** for sequence alignment, reconstructing the optimal path through an alignment matrix.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Input**:
|
||||
- `pathMatrix`: Encodes alignment decisions (match/mismatch/gap) as integers.
|
||||
- `lseqA`, `lseqB`: Lengths of sequences A and B.
|
||||
- `path`: Pre-allocated slice to store the traceback path.
|
||||
|
||||
- **Output**: A compact representation of alignment steps, alternating between:
|
||||
- Diagonal moves (`ldiag`): Matches/mismatches (one step in both sequences).
|
||||
- Horizontal/vertical moves (`lleft` or `lup`): Gaps in sequence B (horizontal) or A (vertical).
|
||||
|
||||
## Algorithm Highlights
|
||||
|
||||
- **Reverse traversal** from `(lseqA−1, lseqB−1)` to origin.
|
||||
- **Batching logic**: Consecutive gaps in same direction are aggregated (e.g., `lleft += step`) to compress run-length encoding.
|
||||
- **Path reconstruction**: Steps are pushed *backwards* into the `path` slice using a moving pointer `p`.
|
||||
- **Memory efficiency**: Uses `slices.Grow()` to preallocate space and logs resizing for debugging.
|
||||
|
||||
## Encoded Path Semantics
|
||||
|
||||
Each pair in the returned slice encodes:
|
||||
- `[diag_count, move_type]`, where `move_type` is either a gap length (`lleft > 0`: horizontal, or `lup < 0`: vertical) or zero (end of diagonal run).
|
||||
|
||||
## Use Case
|
||||
|
||||
Enables efficient reconstruction and serialization of alignment paths—ideal for tools requiring low-level control over dynamic programming backtracking (e.g., pairwise aligners, edit-distance decompositions).
|
||||
@@ -0,0 +1,26 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
This Go package provides core utilities for **DNA sequence alignment scoring**, leveraging probabilistic models and log-space computations to ensure numerical stability.
|
||||
|
||||
## Key Functionalities
|
||||
|
||||
- **Four-bit nucleotide encoding**: Uses `_FourBitsBaseCode` (implied but not shown) to encode DNA bases as 4-bit values, enabling bitwise operations for fast comparison.
|
||||
|
||||
- **Bitwise match ratio (`_MatchRatio`)**: Computes a normalized overlap score between two encoded bases by counting shared bits, adjusting for presence/absence in each operand.
|
||||
|
||||
- **Log-space arithmetic helpers**:
|
||||
- `_Logaddexp`: Stable computation of `log(exp(a) + exp(b))`.
|
||||
- `_Log1mexp`, `_Logdiffexp`: Accurate log-domain operations for `log(1 − exp(a))` and `log(exp(a) − exp(b))`, critical for probability transformations.
|
||||
|
||||
- **Match/mismatch scoring (`_MatchScoreRatio`)**:
|
||||
- Derives log-probability-based scores for observed matches/mismatches using Phred-quality inputs (`QF`, `QR`).
|
||||
- Incorporates base composition priors (e.g., uniform 4-mer assumption via `log(3)`, `log(4)`).
|
||||
|
||||
- **Precomputed scoring matrices**:
|
||||
- `_NucPartMatch`: Precomputes match ratios for all base-pair combinations.
|
||||
- `_NucScorePartMatch{Match,Mismatch}`: Stores integer-scaled alignment scores (×10) for all Phred-quality pairs, enabling fast lookup during dynamic programming.
|
||||
|
||||
- **Thread-safe initialization**:
|
||||
- `_InitDNAScoreMatrix` ensures one-time setup of all matrices using a mutex guard, preventing race conditions.
|
||||
|
||||
All computations are designed for high performance and numerical robustness in large-scale sequence alignment tasks.
|
||||
@@ -0,0 +1,23 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides low-level utilities for efficiently encoding, decoding, and manipulating alignment-related metrics—specifically **score**, **path length**, and an **out-flag**—within compact 64-bit integers. This design supports high-performance operations in sequence alignment pipelines (e.g., OBITools4).
|
||||
|
||||
- **Core Encoding Strategy**:
|
||||
A `uint64` encodes three fields: a *score* (upper bits), an inverted path *length*, and a single-bit flag indicating whether the value represents an "out" (i.e., terminal/invalid) state.
|
||||
|
||||
- **`encodeValues(score, length int, out bool)`**:
|
||||
Packs `score`, `-length-1` (to preserve ordering via unsigned comparison), and the `out` flag into one integer. The most significant bit (bit 32) marks out-values.
|
||||
|
||||
- **`decodeValues(value uint64)`**:
|
||||
Reverses encoding: extracts score, reconstructs original length via `((value + 1) ^ mask)`, and checks the out-flag.
|
||||
|
||||
- **Utility Bitwise Helpers**:
|
||||
- `_incpath(value)`: decrements stored length (since it's negated, subtraction increases actual path).
|
||||
- `_incscore(value)`: increments score by `1 << wsize`.
|
||||
- `_setout(value)`: clears the out-flag, marking value as *not* terminal.
|
||||
|
||||
- **Predefined Constants**:
|
||||
- `_empty`: neutral state (score=0, length=0).
|
||||
- `_out`/`_notavail`: sentinel values for invalid or unavailable paths (high length, score=0).
|
||||
|
||||
This compact representation enables fast comparisons and updates during dynamic programming or alignment graph traversal—critical for scalability in large-scale metabarcoding analyses.
|
||||
@@ -0,0 +1,42 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
|
||||
|
||||
## Core Algorithm
|
||||
|
||||
- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
|
||||
- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
|
||||
- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
|
||||
|
||||
## Scoring Scheme
|
||||
|
||||
- **Match**: +1 point
|
||||
- **Mismatch or gap (indel)**: 0 points
|
||||
|
||||
## Key Functions
|
||||
|
||||
1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`
|
||||
- Computes LCS score and alignment length between raw byte sequences.
|
||||
- If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).
|
||||
- Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.
|
||||
- Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
|
||||
|
||||
2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
|
||||
- Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.
|
||||
- Designed for standard biosequence inputs.
|
||||
|
||||
3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
|
||||
- Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
|
||||
|
||||
## Features
|
||||
|
||||
- **Error-bounded**: Supports `maxError = -1` (unlimited) or a fixed max number of mismatches + gaps.
|
||||
- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
|
||||
- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
|
||||
- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
|
||||
|
||||
## Use Cases
|
||||
|
||||
- Molecular barcode/UMI clustering
|
||||
- Read-to-reference alignment in amplicon sequencing
|
||||
- Similarity filtering of biological sequences
|
||||
@@ -0,0 +1,15 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
|
||||
|
||||
- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
|
||||
- **Binary encoding scheme**: Each bit in a byte corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).
|
||||
- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
|
||||
- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
|
||||
- **Memory efficiency**: The encoding avoids allocations via optional buffer reuse.
|
||||
- **Lookup tables**:
|
||||
- `_FourBitsBaseCode`: Maps ASCII nucleotide characters (lowercased via `nuc & 31`) to their binary code.
|
||||
- `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
|
||||
- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
|
||||
|
||||
The `Encode4bits` function enables fast, space-efficient sequence processing—ideal for high-throughput sequencing data where alignment speed and memory usage are critical.
|
||||
@@ -0,0 +1,19 @@
|
||||
## `obialign` Package: Semantic Overview (≤50 lines)
|
||||
|
||||
The `obialign` package provides a lightweight, high-performance utility for **detecting single-edit-distance relationships** between biological sequences (`obiseq.BioSequence`). Its core function, `D1Or0`, determines whether two sequences are either **identical** or differ by exactly **one substitution, insertion, or deletion (indel)**.
|
||||
|
||||
- `abs[k]`: A generic helper computing absolute values for integers or floats (via Go generics).
|
||||
- `D1Or0(...)`: Returns a 4-tuple:
|
||||
- **`int` (first)**: `0` if identical, `1` if differing by one edit, `-1` otherwise.
|
||||
- **`int` (second)**: Position of the differing site (`-1` if identical).
|
||||
- **`byte`, `byte`**: Mismatched characters (or `'-'` for gaps indicating indels).
|
||||
|
||||
**Algorithmic strategy:**
|
||||
1. Early rejection if length difference exceeds 1.
|
||||
2. Forward scan until first mismatch → identifies left boundary of divergence.
|
||||
3. Backward scan from ends to find rightmost match boundary.
|
||||
4. Validates whether the mismatch region allows exactly one edit:
|
||||
- Single substitution: equal lengths, single divergent position.
|
||||
- Insertion/deletion: length differs by 1 and only one non-overlapping character remains.
|
||||
|
||||
Designed for speed in **OTU/ASV dereplication or error correction** pipelines (e.g., metabarcoding), where rapid filtering of near-identical sequences is critical. Does *not* compute full alignments; optimized for binary decision-making under strict edit constraints.
|
||||
@@ -0,0 +1,29 @@
|
||||
# `LocatePattern` Functionality Overview
|
||||
|
||||
The `obialign.LocatePattern` function implements a **local alignment algorithm** to find the best approximate match of a short DNA pattern (e.g., primer) within a longer biological sequence, using **dynamic programming**.
|
||||
|
||||
- **Input**:
|
||||
- `id`: identifier for logging/error reporting.
|
||||
- `pattern []byte`: the query sequence (e.g., primer).
|
||||
- `sequence []byte`: the target read/contig.
|
||||
|
||||
- **Constraints**:
|
||||
- Pattern must be strictly shorter than the sequence (`len(pattern) < len(sequence)`).
|
||||
|
||||
- **Scoring Scheme**:
|
||||
- Match: `+0` (using IUPAC compatibility via `obiseq.SameIUPACNuc`).
|
||||
- Mismatch/Gap: `-1`.
|
||||
|
||||
- **Algorithm Features**:
|
||||
- End-gap free alignment (no penalty for gaps at sequence ends), enabling flexible primer positioning.
|
||||
- Uses a flattened buffer (`buffIndex`) for memory-efficient matrix storage (width × height).
|
||||
- Tracks alignment path via `path` array: diagonal (`0`, match/mismatch), up (`+1`, deletion in pattern/left gap), left (`-1`, insertion/deletion).
|
||||
- Backtracks from the bottom-right to find optimal local alignment start/end coordinates.
|
||||
|
||||
- **Output**:
|
||||
- `start`: starting index in `sequence`.
|
||||
- `end+1`: ending index (exclusive) of best match.
|
||||
- Error count: `-score`, i.e., number of mismatches/gaps in alignment.
|
||||
|
||||
- **Use Case**:
|
||||
Designed for high-throughput amplicon processing (e.g., primer trimming in metabarcoding pipelines like OBITools4).
|
||||
@@ -0,0 +1,37 @@
|
||||
# Semantic Description of `obialign` Package
|
||||
|
||||
The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
### 1. **Memory Arena Management**
|
||||
- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
|
||||
- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
|
||||
|
||||
### 2. **Dynamic Programming Alignment Functions**
|
||||
Implements three specialized global alignment variants using Needleman–Wunsch with affine gap penalties (scaled per mismatch):
|
||||
|
||||
- **`PELeftAlign`**: Free gaps at the *start* of `seqB` and end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
|
||||
- **`PERightAlign`**: Free gaps at start of `seqA` and end of `seqB`. Suited when the second read extends beyond the first.
|
||||
- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA ≥ seqB`. Designed for full overlap scenarios (e.g., merging paired-end reads).
|
||||
|
||||
All use column-major matrix storage and efficient index arithmetic via helper functions `_GetMatrix`, `_SetMatrices`, etc.
|
||||
|
||||
### 3. **Scoring & Quality Integration**
|
||||
- Pairwise base/quality scores computed by `_PairingScorePeAlign`, combining:
|
||||
- Nucleotide compatibility (via precomputed `_NucPartMatch`)
|
||||
- Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
|
||||
- A user-defined `scale` factor to modulate mismatch penalties.
|
||||
|
||||
### 4. **Fast Heuristic Pre-Alignment**
|
||||
The main `PEAlign` function integrates a kmer-based fast pre-screening:
|
||||
- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
|
||||
- If overlap is significant (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
|
||||
- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
|
||||
|
||||
### 5. **Backtracking & Path Output**
|
||||
- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
|
||||
- Paths encoded as alternating `(offset, length)` pairs for aligned segments (diagonal = 0), with gaps encoded as `-1`/`+1`.
|
||||
|
||||
### Use Case
|
||||
Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., OBITOOLS4 ecosystem). Efficient, scalable for large batch processing via arena reuse.
|
||||
@@ -0,0 +1,58 @@
|
||||
# Semantic Description of `obialign.ReadAlign`
|
||||
|
||||
The `ReadAlign` function performs **paired-end read alignment** with quality-aware scoring, optimized for overlapping consensus construction in NGS data processing.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Input**: Two biological sequences (`seqA`, `seqB`) as `BioSequence` objects, plus alignment parameters:
|
||||
- `gap`: gap penalty (linear)
|
||||
- `scale`: scaling factor for quality scores
|
||||
- `delta`: extension buffer around initial overlap estimate
|
||||
- `fastScoreRel`: use relative vs absolute k-mer matching score
|
||||
|
||||
## Algorithm Overview
|
||||
|
||||
1. **Preprocessing & Initialization**
|
||||
- Ensures DNA scoring matrix is initialized (`_InitDNAScoreMatrix`).
|
||||
|
||||
2. **Fast Overlap Estimation via 4-mer Indexing**
|
||||
- Builds a k-mer index of `seqA` using `obikmer.Index4mer`.
|
||||
- Computes optimal shift via `_FastShiftFourMer` in both forward and reverse-complement orientations.
|
||||
- Selects orientation (direct or reversed) yielding highest k-mer match count (`fastCount`) and score (`fastScore`).
|
||||
|
||||
3. **Overlap Computation**
|
||||
- Determines overlap length `over` based on shift:
|
||||
```text
|
||||
over = |seqA| - shift if shift > 0
|
||||
|seqB| + shift if shift < 0
|
||||
min(|seqA|,|seqB)| otherwise
|
||||
```
|
||||
|
||||
4. **Dynamic Programming Alignment**
|
||||
- If overlap is *not* identical (`fastCount + 3 < over`):
|
||||
- Extracts subregions with `delta`-buffered boundaries.
|
||||
- Calls either `_FillMatrixPeLeftAlign` (left-aligned case) or `_FillMatrixPERightAlign`.
|
||||
- Backtracks via `_Backtracking` to produce alignment path.
|
||||
- Else (near-perfect overlap):
|
||||
- Skips DP; computes score directly from quality scores using `_NucScorePartMatchMatch`.
|
||||
- Returns trivial path `[extra5, partLen]`.
|
||||
|
||||
## Output
|
||||
|
||||
Returns:
|
||||
|
||||
| Index | Type | Meaning |
|
||||
|-------|----------|---------|
|
||||
| 0️⃣ | `int` | Final alignment score (weighted by quality) |
|
||||
| 1️⃣ | `[]int` | Alignment path (list of positions: `[startA, endA, startB, endB]` or similar) |
|
||||
| 2️⃣ | `int` | K-mer match count (`fastCount`) |
|
||||
| 3️⃣ | `int` | Overlap length (`over`) |
|
||||
| 4️⃣ | `float64` | K-mer-based score (`fastScore`) |
|
||||
| 5️⃣ | `bool` | Whether alignment was performed in direct orientation (`true`) or on reverse-complement of `seqB` |
|
||||
|
||||
## Key Design Highlights
|
||||
|
||||
- **Efficient pre-filtering** using 4-mers avoids full DP for nearly identical reads.
|
||||
- **Quality-aware scoring**, leveraging Phred scores via `_NucScorePartMatchMatch`.
|
||||
- Supports **asymmetric overlaps** (left/right alignment) with boundary padding (`delta`).
|
||||
- Uses preallocated memory arenas to minimize GC pressure in high-throughput pipelines.
|
||||
Reference in New Issue
Block a user