Suffix Array Implementation for Biological Sequences
This Go package (obisuffix) provides a suffix array data structure tailored for biological sequence analysis. It supports efficient lexicographic sorting and common-prefix computation over all suffixes of a set of sequences.
Core Types
- Suffix: Represents one suffix by storing the sequence index (Idx) and starting position (Pos).
- SuffixArray: Holds a collection of Suffix values, the original sequences (Sequences), and cached common-prefix lengths (Common).
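As a rough illustration of how these two types relate, here is a minimal sketch. It is not the package's actual declaration: the field types, the Suffixes field name, and the Text helper are assumptions made for the example, and a plain []string stands in for obiseq.BioSequenceSlice so the code has no external dependency.

```go
package main

import "fmt"

// Suffix identifies one suffix: which sequence it comes from and where it starts.
// Assumption: both fields are plain ints in this sketch.
type Suffix struct {
	Idx int // index of the sequence within Sequences
	Pos int // 0-based start position of the suffix in that sequence
}

// SuffixArray groups the three pieces of state described above.
// Assumption: Sequences is a []string standing in for obiseq.BioSequenceSlice,
// and the field name Suffixes is hypothetical.
type SuffixArray struct {
	Sequences []string
	Suffixes  []Suffix
	Common    []int // cached common-prefix lengths; nil until computed
}

// Text resolves a Suffix back to the characters it denotes (hypothetical helper).
func (sa *SuffixArray) Text(s Suffix) string {
	return sa.Sequences[s.Idx][s.Pos:]
}

func main() {
	sa := SuffixArray{
		Sequences: []string{"ACGT", "GT"},
		Suffixes:  []Suffix{{Idx: 0, Pos: 2}, {Idx: 1, Pos: 0}},
	}
	for _, s := range sa.Suffixes {
		fmt.Println(s.Idx, s.Pos, sa.Text(s))
	}
}
```

Storing only (Idx, Pos) pairs keeps each entry small: the suffix text is never copied, it is recovered on demand from the original sequences.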
Key Functions
- BuildSuffixArray(data): Constructs a suffix array by enumerating all suffixes from all input sequences, then sorts them lexicographically using a custom comparator (SuffixLess).
- CommonSuffix(): Computes the length of the prefix shared by each adjacent pair in the sorted suffix array (i.e., LCP-like values), caching the results for reuse.
- String(): Returns a human-readable table with columns: Common, sequence index, position, and suffix string.
Semantic Features
- Lexicographic ordering: Suffixes are sorted by their nucleotide/amino-acid content; ties break first by shorter length, then lower index, finally earlier position.
- Efficiency: Avoids redundant comparisons via memoization of Common values and stable sorting.
- Biological relevance: Designed for use with obiseq.BioSequenceSlice, supporting DNA, RNA, or protein sequences.
- Transparency: The String() method enables quick inspection of suffix relationships and overlaps.
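The tie-breaking order in the first bullet can be made concrete with a small comparator. The sketch below is one possible reading of that rule, written against plain strings; it is not the package's SuffixLess, and the lowercase suffixLess name and its signature are assumptions for illustration.

```go
package main

import "fmt"

// suffix is a (sequence index, start position) pair.
type suffix struct{ idx, pos int }

// suffixLess reports whether suffix a sorts before suffix b.
// Order: character content, then shorter suffix, then lower sequence index,
// then earlier position.
func suffixLess(seqs []string, a, b suffix) bool {
	sa, sb := seqs[a.idx], seqs[b.idx]
	i, j := a.pos, b.pos
	// Walk both suffixes in place; no substring is allocated.
	for i < len(sa) && j < len(sb) {
		if sa[i] != sb[j] {
			return sa[i] < sb[j]
		}
		i++
		j++
	}
	// No mismatch found: one suffix is a prefix of the other, or they are equal.
	la, lb := len(sa)-a.pos, len(sb)-b.pos
	if la != lb {
		return la < lb
	}
	if a.idx != b.idx {
		return a.idx < b.idx
	}
	// Same sequence and same length imply the same start, so this last step
	// is only kept to mirror the stated order.
	return a.pos < b.pos
}

func main() {
	seqs := []string{"ACA", "CA"}
	fmt.Println(suffixLess(seqs, suffix{0, 2}, suffix{0, 0})) // "A" vs "ACA"
	fmt.Println(suffixLess(seqs, suffix{0, 1}, suffix{1, 0})) // "CA" vs "CA"
	fmt.Println(suffixLess(seqs, suffix{1, 0}, suffix{0, 1})) // reversed
}
```

Because identical suffixes from different sequences are ordered by index, the resulting sort order is fully deterministic, which keeps the adjacent-pair common-prefix values reproducible across runs.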
This structure is foundational for tasks like repeat detection, alignment-free comparison, or pattern mining in multi-sequence datasets.