# Suffix Array Implementation for Biological Sequences

This Go package (`obisuffix`) provides a suffix array data structure tailored for biological sequence analysis. It supports lexicographic sorting of, and common-prefix computation over, all suffixes of a set of sequences.

## Core Types

- **`Suffix`**: Represents one suffix by storing the index of its source sequence (`Idx`) and its starting position within that sequence (`Pos`).
- **`SuffixArray`**: Holds the collection of `Suffix` entries, the original sequences (`Sequences`), and the cached common-prefix lengths (`Common`).
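The two types can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the package's actual declarations: the field types, the `Suffixes` field name, and the `Text` helper are hypothetical, and plain `[]string` stands in for `obiseq.BioSequenceSlice`.

```go
package main

// Suffix identifies one suffix by where it comes from.
// Field types are an assumption made for this sketch.
type Suffix struct {
	Idx int // index of the source sequence
	Pos int // start offset of the suffix within that sequence
}

// SuffixArray groups the suffixes with the sequences they point into.
// Sequences is a []string here (assumption); the README describes the
// package as working with obiseq.BioSequenceSlice.
type SuffixArray struct {
	Sequences []string
	Suffixes  []Suffix // hypothetical field name
	Common    []int    // cached common-prefix lengths; nil until computed
}

// Text returns the suffix string that s designates (hypothetical helper).
func (sa *SuffixArray) Text(s Suffix) string {
	return sa.Sequences[s.Idx][s.Pos:]
}
```

Because a `Suffix` stores only two integers, the array never copies sequence data; each suffix string is a slice of the original sequence.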
## Key Functions

- **`BuildSuffixArray(data)`**: Constructs a suffix array by enumerating *all* suffixes of all input sequences, then sorting them lexicographically with a custom comparator (`SuffixLess`).
- **`CommonSuffix()`**: Computes the length of the prefix shared by each pair of adjacent entries in the sorted suffix array (`LCP`-like values) and caches the results in `Common` for reuse.
- **`String()`**: Returns a human-readable table with four columns: `Common`, sequence index, position, and suffix string.
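The three functions can be sketched in miniature as follows. This is a simplified illustration, not the package's code. It assumes plain `[]string` input instead of `obiseq.BioSequenceSlice`, a hypothetical `Suffixes` field and `text`/`less` helpers, and a convention where `Common[i]` compares entry `i` with entry `i-1`; the real implementation may differ on each point. The comparator follows the tie-breaking rule this README gives for lexicographic ordering.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Suffix struct{ Idx, Pos int }

type SuffixArray struct {
	Sequences []string
	Suffixes  []Suffix // hypothetical field name
	Common    []int    // nil until CommonSuffix has run
}

func (sa *SuffixArray) text(s Suffix) string { return sa.Sequences[s.Idx][s.Pos:] }

// less orders by content (a suffix that is a prefix of another sorts
// first), then by lower sequence index, then by earlier position.
func (sa *SuffixArray) less(a, b Suffix) bool {
	ta, tb := sa.text(a), sa.text(b)
	if ta != tb {
		return ta < tb
	}
	if a.Idx != b.Idx {
		return a.Idx < b.Idx
	}
	return a.Pos < b.Pos
}

// BuildSuffixArray enumerates every suffix of every sequence and sorts them.
func BuildSuffixArray(seqs []string) *SuffixArray {
	sa := &SuffixArray{Sequences: seqs}
	for i, s := range seqs {
		for p := 0; p < len(s); p++ {
			sa.Suffixes = append(sa.Suffixes, Suffix{Idx: i, Pos: p})
		}
	}
	sort.SliceStable(sa.Suffixes, func(i, j int) bool {
		return sa.less(sa.Suffixes[i], sa.Suffixes[j])
	})
	return sa
}

// CommonSuffix returns, for each entry i > 0, the length of the prefix it
// shares with entry i-1 (entry 0 gets 0). The result is cached in Common.
func (sa *SuffixArray) CommonSuffix() []int {
	if sa.Common != nil {
		return sa.Common
	}
	common := make([]int, len(sa.Suffixes))
	for i := 1; i < len(sa.Suffixes); i++ {
		a, b := sa.text(sa.Suffixes[i-1]), sa.text(sa.Suffixes[i])
		n := 0
		for n < len(a) && n < len(b) && a[n] == b[n] {
			n++
		}
		common[i] = n
	}
	sa.Common = common
	return common
}

// String renders one row per suffix; the tab-separated layout is an assumption.
func (sa *SuffixArray) String() string {
	common := sa.CommonSuffix()
	var b strings.Builder
	b.WriteString("Common\tIdx\tPos\tSuffix\n")
	for i, s := range sa.Suffixes {
		fmt.Fprintf(&b, "%d\t%d\t%d\t%s\n", common[i], s.Idx, s.Pos, sa.text(s))
	}
	return b.String()
}
```

For the input `[]string{"ACA", "CA"}`, this sketch orders the suffixes as `A`, `A`, `ACA`, `CA`, `CA` and yields common-prefix lengths `0, 1, 1, 0, 2`. The sort compares whole suffix strings, which is simple but quadratic in the worst case; it is meant to show the behaviour, not to be fast.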
## Semantic Features

- **Lexicographic ordering**: Suffixes are sorted by their nucleotide or amino-acid content; ties are broken first by shorter length, then by lower sequence index, and finally by earlier position.
- **Efficiency**: `Common` values are memoized after they are first computed, which avoids redundant comparisons; sorting is stable.
- **Biological relevance**: Designed for use with `obiseq.BioSequenceSlice`, supporting DNA, RNA, or protein sequences.
- **Transparency**: The `String()` method enables quick inspection of suffix relationships and overlaps.

This structure is foundational for tasks such as repeat detection, alignment-free comparison, and pattern mining in multi-sequence datasets.
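As one illustration of the repeat-detection use case, the longest substring that occurs at two or more positions in a sequence set is the longest common prefix between adjacent entries of the sorted suffix array. The sketch below is self-contained and does not use the `obisuffix` API; `LongestRepeat` and the lowercase `suffix` type are hypothetical names introduced for this example, and sequences are plain strings.

```go
package main

import "sort"

// suffix is a (sequence index, start position) pair, as in the README.
type suffix struct{ idx, pos int }

// LongestRepeat returns the longest substring found at two or more
// distinct positions across seqs, or "" if no character repeats.
// Overlapping occurrences count. Hypothetical helper for illustration.
func LongestRepeat(seqs []string) string {
	var sfx []suffix
	for i, s := range seqs {
		for p := 0; p < len(s); p++ {
			sfx = append(sfx, suffix{i, p})
		}
	}
	text := func(s suffix) string { return seqs[s.idx][s.pos:] }

	// Sort suffixes lexicographically by content.
	sort.Slice(sfx, func(i, j int) bool { return text(sfx[i]) < text(sfx[j]) })

	// The best repeat is the longest prefix shared by two neighbours.
	best := ""
	for i := 1; i < len(sfx); i++ {
		a, b := text(sfx[i-1]), text(sfx[i])
		n := 0
		for n < len(a) && n < len(b) && a[n] == b[n] {
			n++
		}
		if n > len(best) {
			best = a[:n]
		}
	}
	return best
}
```

With this sketch, `LongestRepeat([]string{"GATTACA", "TTAC"})` returns `"TTAC"`, and a sequence with no repeated character, such as `"ACGT"`, returns the empty string.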