obitools4/autodoc/docmd/pkg_obisuffix.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# obisuffix: Suffix Array Package for Biological Sequence Analysis
The `obisuffix` package implements a suffix array tailored to biological sequences, enabling efficient lexicographic ordering and prefix analysis across multiple input sequences. It supports DNA, RNA, and protein data via integration with `obiseq.BioSequenceSlice`, making it suitable for repeat detection, k-mer mining, and alignment-free comparison workflows.
## Core Data Structures
### `Suffix`
Represents a single suffix by storing:
- `Idx int`: Index of the source sequence in the input slice.
- `Pos int`: Starting position (0-based) within that sequence.
### `SuffixArray`
Encapsulates:
- `Data []Suffix`: Sorted list of all suffixes.
- `Sequences obiseq.BioSequenceSlice`: Original input sequences (immutable reference).
- `Common []int`: Cached longest common prefix lengths between adjacent suffixes (`Data[i]` and `Data[i+1]`). Lazily computed.
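The two structures above can be pictured with the following minimal sketch. It is an illustration only: `Sequences` is a plain `[]string` standing in for `obiseq.BioSequenceSlice`, and the `Text` helper is hypothetical, added here to show how an `(Idx, Pos)` pair resolves to actual characters without copying them into the array.

```go
package main

// Suffix identifies one suffix by where it starts; no sequence data is copied.
type Suffix struct {
	Idx int // index of the source sequence in the input slice
	Pos int // 0-based start position within that sequence
}

// SuffixArray mirrors the fields described above.
// Assumption: Sequences is a []string here; the real package
// uses obiseq.BioSequenceSlice.
type SuffixArray struct {
	Data      []Suffix // sorted suffixes
	Sequences []string // original input sequences
	Common    []int    // LCP of Data[i] and Data[i+1]
}

// Text resolves a Suffix to the characters it denotes.
// Hypothetical helper, not part of the documented API.
func (sa *SuffixArray) Text(s Suffix) string {
	return sa.Sequences[s.Idx][s.Pos:]
}
```

With `Sequences` set to `{"acgt", "ttga"}`, the suffix `{Idx: 1, Pos: 2}` resolves to `"ga"`. Slicing a Go string shares the underlying bytes, which matches the index-only storage described above.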
## Public Functions
### `BuildSuffixArray(data obiseq.BioSequenceSlice) *SuffixArray`
Constructs a suffix array from one or more biological sequences:
- Enumerates **every** suffix of every sequence (i.e., for a sequence `s`, adds all `(Idx, Pos)` where `0 ≤ Pos < len(s)`).
- Sorts suffixes lexicographically using a deterministic comparator (`SuffixLess`):
  - Primary: Compare nucleotide/amino-acid content character by character.
  - Tie-breakers (applied when the two suffixes agree over the full length of the shorter one):
    1. Shorter suffix comes first.
    2. Lower `Idx` (sequence index).
    3. Earlier `Pos`.
- Precomputes and caches the common-prefix array via internal call to `CommonSuffix()`.
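The enumeration and ordering rules listed above can be sketched as follows. This is not the package's implementation: sequences are plain strings rather than `obiseq.BioSequenceSlice`, and `suffixLess` and `buildSuffixes` are hypothetical names chosen for this example.

```go
package main

import "sort"

// Suffix identifies a suffix by source sequence and start position.
type Suffix struct{ Idx, Pos int }

// suffixLess orders two suffixes: character content first, then the
// tie-breakers (shorter suffix, lower Idx, earlier Pos).
// Assumption: sequences are plain strings compared byte by byte.
func suffixLess(seqs []string, a, b Suffix) bool {
	sa, sb := seqs[a.Idx][a.Pos:], seqs[b.Idx][b.Pos:]
	n := len(sa)
	if len(sb) < n {
		n = len(sb)
	}
	for i := 0; i < n; i++ {
		if sa[i] != sb[i] {
			return sa[i] < sb[i]
		}
	}
	if len(sa) != len(sb) {
		return len(sa) < len(sb) // shorter suffix first
	}
	if a.Idx != b.Idx {
		return a.Idx < b.Idx // lower sequence index first
	}
	return a.Pos < b.Pos // earlier position first
}

// buildSuffixes enumerates every (Idx, Pos) pair and sorts the result.
func buildSuffixes(seqs []string) []Suffix {
	var data []Suffix
	for idx, s := range seqs {
		for pos := 0; pos < len(s); pos++ {
			data = append(data, Suffix{Idx: idx, Pos: pos})
		}
	}
	sort.Slice(data, func(i, j int) bool {
		return suffixLess(seqs, data[i], data[j])
	})
	return data
}
```

For the input `{"aca", "ca"}`, the sorted suffixes read `a`, `a`, `aca`, `ca`, `ca`; the two `a` entries and the two `ca` entries are separated by `Idx`. Because no two suffixes share both `Idx` and `Pos`, the comparator defines a total order, so the result does not depend on whether the sort is stable.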
### `(*SuffixArray) CommonSuffix() []int`
Despite its name, computes the length of the longest common *prefix* (LCP) between each adjacent pair of suffixes in `Data`:
- Returns a slice of length `len(Data)-1`, where `Common[i] = LCP(Data[i], Data[i+1])`.
- Uses memoization: If already computed (e.g., after `BuildSuffixArray`), returns the cached result.
- Limits work by comparing only adjacent entries of the sorted array and stopping each comparison at the first mismatching character.
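A minimal sketch of the adjacent-pair LCP computation with caching, under the same simplifying assumptions as the text above allows: sequences are plain strings, and `commonPrefixes` is a hypothetical stand-in for `CommonSuffix()`, not the package's actual code.

```go
package main

// Suffix identifies a suffix by source sequence and start position.
type Suffix struct{ Idx, Pos int }

// SuffixArray is a simplified stand-in for the documented type.
type SuffixArray struct {
	Data      []Suffix
	Sequences []string // assumption: plain strings, not obiseq.BioSequenceSlice
	Common    []int    // nil until first computed
}

func (sa *SuffixArray) text(s Suffix) string {
	return sa.Sequences[s.Idx][s.Pos:]
}

// commonPrefixes returns the LCP length of each adjacent pair in Data.
// The result is stored in Common and reused on later calls.
func (sa *SuffixArray) commonPrefixes() []int {
	if sa.Common != nil {
		return sa.Common // cached result
	}
	n := len(sa.Data) - 1
	if n < 0 {
		n = 0
	}
	common := make([]int, n)
	for i := range common {
		a, b := sa.text(sa.Data[i]), sa.text(sa.Data[i+1])
		k := 0
		// Stop at the first mismatch or at the end of the shorter suffix.
		for k < len(a) && k < len(b) && a[k] == b[k] {
			k++
		}
		common[i] = k
	}
	sa.Common = common
	return sa.Common
}
```

For the sorted suffixes `a`, `a`, `aca`, `ca`, `ca` (from the sequences `aca` and `ca`), the result is `[1 1 0 2]`.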
### `(*SuffixArray) String() string`
Returns a formatted, human-readable table for inspection:
- Columns: `Common`, `Idx`, `Pos`, and the actual suffix string (via `.Substring()`).
- Useful for debugging, educational demos, or visualizing repeat patterns and overlaps.
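The exact layout produced by `String()` is not specified here, so the following is only a guess at what such a table might look like. The tab-separated format, the `-` placeholder on the last row (which has no following neighbour), and the function name `formatTable` are all assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Suffix identifies a suffix by source sequence and start position.
type Suffix struct{ Idx, Pos int }

// formatTable renders one row per suffix: Common, Idx, Pos, suffix text.
// Assumption: common[i] is shown on row i; the last row prints "-".
func formatTable(seqs []string, data []Suffix, common []int) string {
	var sb strings.Builder
	sb.WriteString("Common\tIdx\tPos\tSuffix\n")
	for i, s := range data {
		c := "-"
		if i < len(common) {
			c = strconv.Itoa(common[i])
		}
		fmt.Fprintf(&sb, "%s\t%d\t%d\t%s\n", c, s.Idx, s.Pos, seqs[s.Idx][s.Pos:])
	}
	return sb.String()
}
```

Runs of rows with large `Common` values stand out in such a table, which is what makes it handy for spotting repeats by eye.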
## Semantic Guarantees & Design Choices
- **Deterministic ordering**: Tie-breaking rules ensure reproducibility across runs and platforms.
- **Memory efficiency**: Stores only indices (not copies of suffixes), critical for large genomic datasets.
- **Biological fidelity**: Respects alphabet semantics (e.g., `A < C < G < T` for DNA) via underlying sequence comparison.
- **Lazy evaluation**: `CommonSuffix()` is invoked only when needed (e.g., on first call to `.String()`, or explicitly), avoiding unnecessary work.
- **Transparency**: All public fields are accessible, enabling downstream analysis without encapsulation barriers.
## Typical Use Cases
- Detecting tandem repeats or low-complexity regions across multi-sequence datasets.
- Building suffix arrays for *de novo* assembly validation or error correction.
- Serving as a building block in alignment-free metrics (e.g., Jaccard similarity over shared *k*-mers).
- Supporting pattern mining in metagenomic or pangenome collections.
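As one illustration of how sorted suffixes and adjacent LCP values might feed repeat detection, the sketch below finds the longest substring that occurs at least twice, within one sequence or across several. It is a self-contained toy that uses plain strings and a hypothetical `longestRepeat` function; it does not call the `obisuffix` API.

```go
package main

import "sort"

// Suffix identifies a suffix by source sequence and start position.
type Suffix struct{ Idx, Pos int }

// longestRepeat returns the longest substring occurring at least twice
// in the input, or "" if no character is repeated.
func longestRepeat(seqs []string) string {
	var data []Suffix
	for idx, s := range seqs {
		for pos := 0; pos < len(s); pos++ {
			data = append(data, Suffix{Idx: idx, Pos: pos})
		}
	}
	text := func(s Suffix) string { return seqs[s.Idx][s.Pos:] }

	// Go compares strings bytewise, and a proper prefix sorts first,
	// so only Idx and Pos are needed as explicit tie-breakers.
	sort.Slice(data, func(i, j int) bool {
		a, b := text(data[i]), text(data[j])
		if a != b {
			return a < b
		}
		if data[i].Idx != data[j].Idx {
			return data[i].Idx < data[j].Idx
		}
		return data[i].Pos < data[j].Pos
	})

	// The largest adjacent LCP marks the longest repeated substring.
	best := ""
	for i := 0; i+1 < len(data); i++ {
		a, b := text(data[i]), text(data[i+1])
		n := 0
		for n < len(a) && n < len(b) && a[n] == b[n] {
			n++
		}
		if n > len(best) {
			best = a[:n]
		}
	}
	return best
}
```

For the pair `ttgacc` and `aagacg`, the function returns `gac`, the only length-3 substring the two sequences share. Materialising every suffix this way costs memory proportional to the total sequence length, so it is meant as a demonstration rather than a recipe for genome-scale input.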
> **Note**: This package focuses on *exact* suffix matching; probabilistic or approximate extensions are out of scope.