# Semantic Description of `obialign` Package

The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.

## Core Algorithm

- Implements a **Needleman-Wunsch**-style dynamic programming algorithm, applied with the LCS scoring scheme and optimized for speed and memory efficiency.
- Packs the score, path length, and gap status of each cell into a single `uint64`.
- Uses **diagonal banding** to restrict computation to cells within the allowed error margin, reducing both time and memory use.
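
To make the bit-packing idea concrete, here is a minimal sketch of how a score, a path length, and a gap flag could share one `uint64`. The bit layout and the names `pack` and `unpack` are assumptions made for illustration and are not taken from the package source. Storing the length inverted means a plain integer comparison prefers the higher score first and the shorter path second.

```go
package main

import "fmt"

// Hypothetical layout, chosen for illustration only:
//
//	bits 32..63  score
//	bits  1..31  inverted path length (larger value = shorter path)
//	bit   0      gap flag
const (
	flagBits   = 1
	lengthBits = 31
	lengthMask = uint64(1)<<lengthBits - 1
)

// pack combines the three values into one comparable word.
func pack(score, length int, gap bool) uint64 {
	v := uint64(score)<<(lengthBits+flagBits) | (lengthMask-uint64(length))<<flagBits
	if gap {
		v |= 1
	}
	return v
}

// unpack reverses pack.
func unpack(v uint64) (score, length int, gap bool) {
	score = int(v >> (lengthBits + flagBits))
	length = int(lengthMask - (v>>flagBits)&lengthMask)
	gap = v&1 == 1
	return
}

func main() {
	candidates := []uint64{
		pack(5, 7, false),
		pack(5, 6, true),
		pack(4, 4, false),
	}
	// Picking the best cell is a single unsigned comparison per candidate.
	best := candidates[0]
	for _, v := range candidates[1:] {
		if v > best {
			best = v
		}
	}
	s, l, g := unpack(best)
	fmt.Println(s, l, g)
}
```

With this layout the candidate with score 5 and length 6 wins over the one with score 5 and length 7, and both win over any candidate with score 4.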

## Scoring Scheme

- **Match**: +1 point  
- **Mismatch or gap (indel)**: 0 points  
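
As a worked example of this scheme, the sketch below scores two already-aligned rows, using `-` for a gap. The helper `scoreAlignment` is hypothetical and exists only for this illustration. Because every column contributes either 1 or 0, the number of mismatch and gap columns is simply the alignment length minus the score.

```go
package main

import "fmt"

// scoreAlignment is a hypothetical helper: it scores two gapped rows of an
// alignment column by column (match = 1, mismatch or gap = 0).
// The rows are assumed to have the same length; any excess is ignored.
func scoreAlignment(rowA, rowB string) (score, length, errors int) {
	for i := 0; i < len(rowA) && i < len(rowB); i++ {
		length++
		if rowA[i] == rowB[i] && rowA[i] != '-' {
			score++
		}
	}
	// Every non-matching column is one error.
	errors = length - score
	return
}

func main() {
	// Columns: A/A, C/C, G/-, T/T, -/T, A/A
	score, length, errors := scoreAlignment("ACGT-A", "AC-TTA")
	fmt.Println(score, length, errors)
}
```

Here four columns match out of six, so the score is 4, the alignment length is 6, and the error count is 2.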

## Key Functions

1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`  
   - Computes LCS score and alignment length between raw byte sequences.  
   - If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).  
   - Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.  
   - Returns `-1, -1, -1` if the actual error count exceeds `maxError`.

2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`  
   - Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled.  
   - Takes `*obiseq.BioSequence` inputs instead of raw byte slices.

3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`  
   - Computes the standard LCS, with end gaps counted in the alignment. Returns `(score, alignment_length)`.
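
The sketch below illustrates what end-gap-free scoring and the `-1, -1, -1` convention could look like in a plain, unbanded dynamic program. It is not the package's implementation: `egfLCS`, `cell`, and `better` are hypothetical names, bytes are compared exactly (no IUPAC handling), and the error bound is checked after the full table is filled instead of by banding. The end position is taken here as the number of bases of A consumed when the overlap ends, which is an assumption about the convention.

```go
package main

import "fmt"

type cell struct{ score, length int }

// better prefers the higher score, then the shorter alignment.
func better(x, y cell) bool {
	if x.score != y.score {
		return x.score > y.score
	}
	return x.length < y.length
}

// egfLCS is a hypothetical, unbanded sketch of end-gap-free LCS scoring.
// maxError < 0 disables the error bound.
func egfLCS(a, b []byte, maxError int) (score, length, endA int) {
	prev := make([]cell, len(b)+1) // row 0 stays zero: leading gaps are free
	curr := make([]cell, len(b)+1)
	best := cell{}
	for i := 1; i <= len(a); i++ {
		curr[0] = cell{} // column 0 stays zero: leading gaps are free
		for j := 1; j <= len(b); j++ {
			d := prev[j-1]
			d.length++
			if a[i-1] == b[j-1] {
				d.score++
			}
			u := cell{prev[j].score, prev[j].length + 1}
			l := cell{curr[j-1].score, curr[j-1].length + 1}
			c := d
			if better(u, c) {
				c = u
			}
			if better(l, c) {
				c = l
			}
			curr[j] = c
		}
		// Ending in the last column: the rest of A overhangs for free.
		if better(curr[len(b)], best) {
			best, endA = curr[len(b)], i
		}
		prev, curr = curr, prev
	}
	// Ending in the last row: the rest of B overhangs for free.
	for j := 0; j <= len(b); j++ {
		if better(prev[j], best) {
			best, endA = prev[j], len(a)
		}
	}
	if maxError >= 0 && best.length-best.score > maxError {
		return -1, -1, -1
	}
	return best.score, best.length, endA
}

func main() {
	// The suffix ACGT of A overlaps the prefix ACGT of B.
	fmt.Println(egfLCS([]byte("TTTACGT"), []byte("ACGTGGG"), -1))
	// One mismatch is more than maxError = 0 allows.
	fmt.Println(egfLCS([]byte("ACGT"), []byte("ACCT"), 0))
}
```

In the first call the unaligned `TTT` and `GGG` overhangs do not count toward the alignment length, so only the four overlapping columns are reported. In the second call the single mismatch exceeds the bound and all three return values are `-1`.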

## Features

- **Error-bounded**: `maxError = -1` disables the limit; otherwise `maxError` caps the combined number of mismatches and gaps.
- **Memory-efficient**: Reuses a caller-provided buffer to avoid allocations across repeated calls, and creates one automatically when none is supplied.
- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
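
To illustrate what IUPAC-aware matching means, the sketch below treats each code as a set of bases and calls two codes compatible when their sets intersect. The set definitions follow the standard IUPAC nucleotide table; the function `compatible` and the bitmask representation are assumptions for illustration, and `obiseq.SameIUPACNuc()` may define a match differently.

```go
package main

import "fmt"

// One bit per unambiguous base.
const (
	bitA = 1 << iota
	bitC
	bitG
	bitT
)

// iupacBits maps each IUPAC nucleotide code to the set of bases it can represent.
var iupacBits = map[byte]uint8{
	'A': bitA, 'C': bitC, 'G': bitG, 'T': bitT, 'U': bitT,
	'R': bitA | bitG, 'Y': bitC | bitT,
	'S': bitC | bitG, 'W': bitA | bitT,
	'K': bitG | bitT, 'M': bitA | bitC,
	'B': bitC | bitG | bitT, 'D': bitA | bitG | bitT,
	'H': bitA | bitC | bitT, 'V': bitA | bitC | bitG,
	'N': bitA | bitC | bitG | bitT,
}

// upper folds ASCII lowercase letters to uppercase.
func upper(c byte) byte {
	if c >= 'a' && c <= 'z' {
		return c - 'a' + 'A'
	}
	return c
}

// compatible is a hypothetical helper: two codes match when they share at
// least one possible base. Unknown characters never match.
func compatible(x, y byte) bool {
	return iupacBits[upper(x)]&iupacBits[upper(y)] != 0
}

func main() {
	fmt.Println(compatible('R', 'A')) // R = A or G, shares A
	fmt.Println(compatible('R', 'Y')) // purine vs pyrimidine, nothing shared
	fmt.Println(compatible('N', 't')) // N covers every base
}
```

Under this definition `R` matches `A` and `G` but not `Y`, and `N` matches any valid code.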

## Use Cases

- Molecular barcode/UMI clustering  
- Read-to-reference alignment in amplicon sequencing  
- Similarity filtering of biological sequences