# Semantic Description of `obialign` Package The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment. ## Core Algorithm - Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency. - Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form. - Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity. ## Scoring Scheme - **Match**: +1 point - **Mismatch or gap (indel)**: 0 points ## Key Functions 1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)` - Computes LCS score and alignment length between raw byte sequences. - If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment). - Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A. - Returns `-1, -1, -1` if the actual error count exceeds `maxError`. 2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)` - Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default. - Designed for standard biosequence inputs. 3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)` - Computes standard LCS (including end gaps). Returns `(score, alignment_length)`. ## Features - **Error-bounded**: Supports `maxError = -1` (unlimited) or a fixed max number of mismatches + gaps. - **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls. - **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`). - **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4). ## Use Cases - Molecular barcode/UMI clustering - Read-to-reference alignment in amplicon sequencing - Similarity filtering of biological sequences