mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
## Core Functionalities
### 1. **Memory Arena Management**
- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
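
The arena idea can be sketched as follows. This is a minimal illustration, not the package's implementation: `alignArena`, `newAlignArena`, and `resize` are hypothetical names, and only the two matrices named above are modeled.

```go
package main

import "fmt"

// alignArena is a hypothetical stand-in for PEAlignArena: it owns the
// buffers one alignment needs so they can be reused across calls.
type alignArena struct {
	scoreMatrix []int
	pathMatrix  []int
}

// newAlignArena reserves capacity for sequences of up to maxLen bases.
func newAlignArena(maxLen int) *alignArena {
	n := (maxLen + 1) * (maxLen + 1)
	return &alignArena{
		scoreMatrix: make([]int, 0, n),
		pathMatrix:  make([]int, 0, n),
	}
}

// resize returns both matrices sized for a lenA x lenB alignment and
// reallocates only when the reserved capacity is too small.
// Reused cells keep stale values, so the caller must overwrite them.
func (a *alignArena) resize(lenA, lenB int) (score, path []int) {
	n := (lenA + 1) * (lenB + 1)
	if cap(a.scoreMatrix) < n {
		a.scoreMatrix = make([]int, n)
		a.pathMatrix = make([]int, n)
	}
	a.scoreMatrix = a.scoreMatrix[:n]
	a.pathMatrix = a.pathMatrix[:n]
	return a.scoreMatrix, a.pathMatrix
}

func main() {
	arena := newAlignArena(150)
	s1, _ := arena.resize(100, 120)
	s2, _ := arena.resize(150, 150)
	// Both slices share one backing array, so no new allocation happened.
	fmt.Println(len(s1), len(s2), &s1[0] == &s2[0]) // prints: 12221 22801 true
}
```

Sizing the arena once for the longest expected read pair means a batch of alignments triggers no further allocations unless a longer pair shows up.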
### 2. **Dynamic Programming Alignment Functions**
The package implements three Needleman–Wunsch variants with affine gap penalties (scaled per mismatch). Each variant leaves certain end gaps unpenalized, which makes the alignments semi-global rather than strictly global:

- **`PELeftAlign`**: Free gaps at the start of `seqB` and the end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
- **`PERightAlign`**: Free gaps at the start of `seqA` and the end of `seqB`. Suited to cases where the second read extends beyond the first.
- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA` to be at least as long as `seqB`. Designed for full-overlap scenarios (e.g., merging paired-end reads).

All three variants use column-major matrix storage and compute cell offsets through helper functions such as `_GetMatrix` and `_SetMatrices`.
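
To make the free-end-gap idea and the column-major layout concrete, here is a minimal sketch of a left-style overlap score. It is deliberately simplified and rests on assumptions: it uses a linear gap cost and fixed match/mismatch scores rather than the affine, quality-aware scoring described in this document, and `at` and `leftOverlapScore` are hypothetical names, not the package's helpers.

```go
package main

import "fmt"

// at maps (row i, column j) to a flat offset in column-major order:
// all cells of one column are contiguous. Hypothetical helper.
func at(i, j, rows int) int { return j*rows + i }

// leftOverlapScore scores an alignment in which leading gaps in b and
// trailing gaps in a are free. Simplification: linear gap cost.
func leftOverlapScore(a, b string, match, mismatch, gap int) int {
	rows, cols := len(a)+1, len(b)+1
	m := make([]int, rows*cols)
	// Column 0 stays 0: a may start before b at no cost.
	for j := 1; j < cols; j++ {
		m[at(0, j, rows)] = j * gap // b starting before a is penalized
	}
	for j := 1; j < cols; j++ {
		for i := 1; i < rows; i++ {
			s := mismatch
			if a[i-1] == b[j-1] {
				s = match
			}
			best := m[at(i-1, j-1, rows)] + s
			if v := m[at(i-1, j, rows)] + gap; v > best {
				best = v
			}
			if v := m[at(i, j-1, rows)] + gap; v > best {
				best = v
			}
			m[at(i, j, rows)] = best
		}
	}
	// Last row: a is fully consumed and the rest of b overhangs for free.
	best := m[at(rows-1, 0, rows)]
	for j := 1; j < cols; j++ {
		if v := m[at(rows-1, j, rows)]; v > best {
			best = v
		}
	}
	return best
}

func main() {
	// The three-base overlap "CGT" is found without paying for the overhangs.
	fmt.Println(leftOverlapScore("AAACGT", "CGTTT", 1, -1, -2)) // prints: 3
}
```

Swapping which border is initialized to zero and which border is scanned for the maximum gives the right-style variant; freeing both borders on both sequences gives the center-style one.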
### 3. **Scoring & Quality Integration**
- Pairwise base/quality scores are computed by `_PairingScorePeAlign`, which combines:
  - Nucleotide compatibility (via precomputed `_NucPartMatch`)
  - Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
  - A user-defined `scale` factor that modulates mismatch penalties
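
The sketch below shows one way a quality-aware score can be served from precomputed tables. The weighting formula is an illustrative assumption, not the formula used by `_PairingScorePeAlign`; `maxQual`, `initTables`, and `pairScore` are hypothetical, and ambiguous (IUPAC) bases are ignored.

```go
package main

import (
	"fmt"
	"math"
)

const maxQual = 41 // assumed upper bound on Phred scores for this sketch

// Both tables are indexed by the two quality values and filled once.
var matchTable, mismatchTable [maxQual + 1][maxQual + 1]float64

// errProb converts a Phred score into an error probability.
func errProb(q int) float64 { return math.Pow(10, -float64(q)/10) }

// initTables fills both tables. The weight (probability that both base
// calls are correct) is an illustrative assumption.
func initTables(scale float64) {
	for qa := 0; qa <= maxQual; qa++ {
		for qb := 0; qb <= maxQual; qb++ {
			w := (1 - errProb(qa)) * (1 - errProb(qb))
			matchTable[qa][qb] = w
			mismatchTable[qa][qb] = -scale * w
		}
	}
}

// pairScore looks up the score for two base calls and their qualities.
func pairScore(a, b byte, qa, qb int) float64 {
	if a == b {
		return matchTable[qa][qb]
	}
	return mismatchTable[qa][qb]
}

func main() {
	initTables(1.5)
	// A confident mismatch costs much more than a low-quality one.
	fmt.Printf("%.3f\n", pairScore('A', 'C', 30, 30))
	fmt.Printf("%.3f\n", pairScore('A', 'C', 3, 3))
}
```

Because the tables depend only on the two quality values, the inner loop of the dynamic programming step reduces to an array lookup.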
### 4. **Fast Heuristic Pre-Alignment**
The main `PEAlign` function integrates a fast k-mer-based pre-screening step:

- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
- If overlap is significant (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
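
Shift estimation from shared 4-mers can be sketched as a simple voting scheme. This is an illustration under stated assumptions, not the `obikmer` API: `bestShift` is a hypothetical function that uses a map where a real implementation would more likely use a compact integer index.

```go
package main

import "fmt"

// bestShift returns the offset (position in a minus position in b) that
// is supported by the most shared 4-mers, and the number of supporters.
// Ties are resolved in favor of the smaller offset.
func bestShift(a, b string) (shift, count int) {
	const k = 4
	index := make(map[string][]int) // 4-mer -> start positions in a
	for i := 0; i+k <= len(a); i++ {
		index[a[i:i+k]] = append(index[a[i:i+k]], i)
	}
	votes := make(map[int]int) // offset -> number of supporting 4-mers
	for j := 0; j+k <= len(b); j++ {
		for _, i := range index[b[j:j+k]] {
			votes[i-j]++
		}
	}
	for s, c := range votes {
		if c > count || (c == count && s < shift) {
			shift, count = s, c
		}
	}
	return shift, count
}

func main() {
	a, b := "TTTTACGTACCA", "ACGTACCAGGGG"
	shift, count := bestShift(a, b)
	over := len(a) - shift // overlap length when a starts first (shift > 0)
	fmt.Println(shift, count, over) // prints: 4 5 8
}
```

An overlap of `over` bases contains `over - 3` 4-mers, so in this example the 5 supporting 4-mers are exactly what a mismatch-free 8-base overlap yields. That arithmetic is why a test such as `fastCount + 3 < over` compares the matched 4-mer count with the overlap length.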
### 5. **Backtracking & Path Output**
- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
- Paths are encoded as alternating `(offset, length)` pairs: aligned (diagonal) segments use offset `0`, and gaps use `-1` or `+1`.
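
As an illustration of how such a path can be consumed, the sketch below expands a flat slice of `(offset, length)` pairs into two gapped strings. The convention is an assumption made for this example (offset `0` consumes both sequences, `-1` consumes only the first, `+1` only the second); the package's actual layout may differ, and `render` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// render expands a flat path of (offset, length) pairs into gapped strings.
// Assumed convention: 0 = aligned run, -1 = gap in b, +1 = gap in a.
// The path is assumed to be valid for the two sequences.
func render(a, b string, path []int) (string, string) {
	var ra, rb strings.Builder
	i, j := 0, 0
	for p := 0; p+1 < len(path); p += 2 {
		offset, n := path[p], path[p+1]
		switch offset {
		case 0: // diagonal: copy n bases from both sequences
			ra.WriteString(a[i : i+n])
			rb.WriteString(b[j : j+n])
			i, j = i+n, j+n
		case -1: // only a advances, b receives gap characters
			ra.WriteString(a[i : i+n])
			rb.WriteString(strings.Repeat("-", n))
			i += n
		case 1: // only b advances, a receives gap characters
			ra.WriteString(strings.Repeat("-", n))
			rb.WriteString(b[j : j+n])
			j += n
		}
	}
	return ra.String(), rb.String()
}

func main() {
	ra, rb := render("AAACGT", "CGTTT", []int{-1, 3, 0, 3, 1, 2})
	fmt.Println(ra) // prints: AAACGT--
	fmt.Println(rb) // prints: ---CGTTT
}
```

Storing runs instead of one entry per column keeps the path short for typical read overlaps, which consist of a few long diagonal segments.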
### Use Case
Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., the OBITools4 ecosystem). Reusing the arena across alignments keeps large batch runs efficient.