mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Files

T

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

1.7 KiB

Raw Blame History

Suffix Array Implementation for Biological Sequences

This Go package (obisuffix) provides a suffix array data structure tailored for biological sequence analysis. It supports efficient lexicographic sorting and common-prefix computation over all suffixes of a set of sequences.

Core Types

Suffix: Represents one suffix by storing the sequence index (Idx) and starting position (Pos).
SuffixArray: Holds a collection of Suffix, the original sequences (Sequences), and cached common-prefix lengths (Common).

Key Functions

BuildSuffixArray(data): Constructs a suffix array by enumerating all suffixes from all input sequences, then sorts them lexicographically using a custom comparator (SuffixLess).
CommonSuffix(): Computes the length of shared prefix between each adjacent pair in the sorted suffix array (i.e., LCP-like values), caching results for reuse.
String(): Returns a human-readable table with columns: Common, sequence index, position, and suffix string.

Semantic Features

Lexicographic ordering: Suffixes are sorted by their nucleotide/amino-acid content; ties break first by shorter length, then lower index, finally earlier position.
Efficiency: Avoids redundant comparisons via memoization of Common values and stable sorting.
Biological relevance: Designed for use with obiseq.BioSequenceSlice, supporting DNA, RNA, or protein sequences.
Transparency: The String() method enables quick inspection of suffix relationships and overlaps.

This structure is foundational for tasks like repeat detection, alignment-free comparison, or pattern mining in multi-sequence datasets.

1.7 KiB Raw Blame History

Suffix Array Implementation for Biological Sequences

Core Types

Key Functions

Semantic Features

1.7 KiB

Raw Blame History