2026-04-07 08:36:50 +02:00
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
## Core Algorithm
- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
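
To make the bit-packed encoding concrete, here is a minimal sketch of how a score, a path length, and a gap flag could share one `uint64`. The field widths, the field order, and the names `pack` and `unpack` are assumptions chosen for illustration; the layout actually used inside `obialign` may differ.

```go
package main

import "fmt"

// Hypothetical bit layout, chosen for illustration only:
//   bit  0      : gap flag (1 if the path ends in a gap)
//   bits 1..32  : path length
//   bits 33..63 : score
const (
	lengthBits  = 32
	lengthShift = 1
	scoreShift  = lengthShift + lengthBits
	lengthMask  = uint64(1)<<lengthBits - 1
)

// pack stores score, path length, and gap status in a single uint64.
// It assumes score and length are non-negative and fit in their fields.
func pack(score, length int, gap bool) uint64 {
	v := uint64(score)<<scoreShift | (uint64(length)&lengthMask)<<lengthShift
	if gap {
		v |= 1
	}
	return v
}

// unpack reverses pack.
func unpack(v uint64) (score, length int, gap bool) {
	score = int(v >> scoreShift)
	length = int((v >> lengthShift) & lengthMask)
	gap = v&1 == 1
	return
}

func main() {
	score, length, gap := unpack(pack(5, 7, true))
	fmt.Println(score, length, gap) // prints "5 7 true"
}
```

Placing the score in the high bits means that, in this sketch, comparing two packed values as plain integers ranks them by score first, so one integer comparison can pick the better of two candidate cells.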
## Scoring Scheme
- **Match**: +1 point
- **Mismatch or gap (indel)**: 0 points
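
The scoring scheme above can be illustrated with a plain, unbanded dynamic program. The sketch below is not the package's implementation: the names `cell`, `better`, and `lcs` are hypothetical, and the tie-breaking rule (prefer the shorter alignment when scores are equal) is an assumption. It returns the LCS score together with the alignment length.

```go
package main

import "fmt"

// cell holds the best (score, alignment length) pair found for a prefix pair.
type cell struct{ score, length int }

// better prefers the higher score, then the shorter alignment (assumed rule).
func better(a, b cell) cell {
	if a.score > b.score || (a.score == b.score && a.length < b.length) {
		return a
	}
	return b
}

// lcs scores a match as +1 and a mismatch or gap as 0, and returns the
// score and the number of alignment columns.
func lcs(a, b []byte) (int, int) {
	prev := make([]cell, len(b)+1)
	curr := make([]cell, len(b)+1)
	for j := range prev {
		prev[j] = cell{0, j} // aligning an empty prefix of a costs j gaps
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = cell{0, i}
		for j := 1; j <= len(b); j++ {
			diag := cell{prev[j-1].score, prev[j-1].length + 1}
			if a[i-1] == b[j-1] {
				diag.score++ // match: +1; mismatch adds nothing
			}
			up := cell{prev[j].score, prev[j].length + 1}      // gap in b
			left := cell{curr[j-1].score, curr[j-1].length + 1} // gap in a
			curr[j] = better(diag, better(up, left))
		}
		prev, curr = curr, prev
	}
	return prev[len(b)].score, prev[len(b)].length
}

func main() {
	score, length := lcs([]byte("ACGT"), []byte("AGT"))
	fmt.Println(score, length) // prints "3 4"
}
```

Because only matches score points, the number of errors in an alignment (mismatches plus gaps) is the alignment length minus the score: 4 - 3 = 1 in the example.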
## Key Functions
1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`
   - Computes LCS score and alignment length between raw byte sequences.
   - If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).
   - Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.
   - Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
   - Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.
   - Designed for standard biosequence inputs.
3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
   - Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
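
The `maxError` contract and the diagonal band can be sketched together. The following is a simplified illustration, not the package's code: `bandedLCS` and its helpers are hypothetical names, it handles only the global (not end-gap-free) case, it omits the end position, and it signals rejection with `-1, -1`. It evaluates only cells with `|i - j| <= maxError`, then rejects the result if the error count (length minus score) exceeds the bound.

```go
package main

import "fmt"

type cell struct{ score, length int }

// unreachable marks cells that lie outside the band.
var unreachable = cell{score: -1}

func better(a, b cell) cell {
	if a.score > b.score || (a.score == b.score && a.length < b.length) {
		return a
	}
	return b
}

// step extends a path by one column, adding gain to the score.
func step(c cell, gain int) cell {
	if c.score < 0 {
		return unreachable
	}
	return cell{c.score + gain, c.length + 1}
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// bandedLCS returns (score, length), or (-1, -1) when the bound is exceeded.
// maxError < 0 means unlimited.
func bandedLCS(a, b []byte, maxError int) (int, int) {
	n, m := len(a), len(b)
	if maxError < 0 {
		maxError = n + m // band wide enough to cover the whole matrix
	}
	if n-m > maxError || m-n > maxError {
		return -1, -1 // the length difference alone needs too many gaps
	}
	prev, curr := make([]cell, m+1), make([]cell, m+1)
	first := minInt(maxError, m)
	for j := 0; j <= first; j++ {
		prev[j] = cell{0, j}
	}
	if first < m {
		prev[first+1] = unreachable
	}
	for i := 1; i <= n; i++ {
		lo, hi := i-maxError, minInt(i+maxError, m)
		if lo <= 0 {
			curr[0] = cell{0, i}
			lo = 1
		} else {
			curr[lo-1] = unreachable // left edge of the band
		}
		for j := lo; j <= hi; j++ {
			gain := 0
			if a[i-1] == b[j-1] {
				gain = 1
			}
			best := better(step(prev[j-1], gain), step(prev[j], 0))
			curr[j] = better(best, step(curr[j-1], 0))
		}
		if hi < m {
			curr[hi+1] = unreachable // right edge of the band
		}
		prev, curr = curr, prev
	}
	res := prev[m]
	if res.score < 0 || res.length-res.score > maxError {
		return -1, -1
	}
	return res.score, res.length
}

func main() {
	fmt.Println(bandedLCS([]byte("ACGT"), []byte("AGT"), 1)) // prints "3 4"
	fmt.Println(bandedLCS([]byte("ACGT"), []byte("AGT"), 0)) // prints "-1 -1"
}
```

An alignment with at most `maxError` errors contains at most `maxError` gaps, so its path never leaves the band; this is why cells outside it can be skipped. Each row then costs time proportional to the band width rather than to the length of sequence B.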
## Features
- **Error-bounded**: Accepts `maxError = -1` (no limit) or a fixed maximum number of errors, counted as mismatches plus gaps.
- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
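
As an illustration of IUPAC-aware matching, one common approach encodes each nucleotide code as a 4-bit set over A, C, G, T and treats two codes as compatible when their sets intersect. Whether `obiseq.SameIUPACNuc()` applies exactly this rule is an assumption; the table and the function `compatible` below are a hypothetical sketch, not the library's code.

```go
package main

import "fmt"

// Bit sets over the four bases: A=1, C=2, G=4, T=8.
var iupac = map[byte]uint8{
	'a': 1, 'c': 2, 'g': 4, 't': 8, 'u': 8,
	'r': 1 | 4, 'y': 2 | 8, 's': 2 | 4, 'w': 1 | 8,
	'k': 4 | 8, 'm': 1 | 2,
	'b': 2 | 4 | 8, 'd': 1 | 4 | 8, 'h': 1 | 2 | 8, 'v': 1 | 2 | 4,
	'n': 1 | 2 | 4 | 8,
}

// lower converts an ASCII upper-case letter to lower case.
func lower(c byte) byte {
	if c >= 'A' && c <= 'Z' {
		return c + 'a' - 'A'
	}
	return c
}

// compatible reports whether two IUPAC codes share at least one base.
// Unknown symbols map to the empty set and never match.
func compatible(x, y byte) bool {
	return iupac[lower(x)]&iupac[lower(y)] != 0
}

func main() {
	fmt.Println(compatible('R', 'A')) // R = A or G, so this prints "true"
	fmt.Println(compatible('R', 'Y')) // purine vs pyrimidine, prints "false"
}
```

In a dynamic program, a test like this would replace the plain byte equality used to decide whether a diagonal step earns the match point.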
## Use Cases
- Molecular barcode/UMI clustering
- Read-to-reference alignment in amplicon sequencing
- Similarity filtering of biological sequences