obitools4/autodoc/docmd/pkg_obisuffix.md

# obisuffix: Suffix Array Package for Biological Sequence Analysis

The `obisuffix` package implements a suffix array tailored to biological sequences, enabling efficient lexicographic ordering and prefix analysis across multiple input sequences. It supports DNA, RNA, and protein data via integration with `obiseq.BioSequenceSlice`, making it suitable for repeat detection, k-mer mining, and alignment-free comparison workflows.

## Core Data Structures

### `Suffix`
Represents a single suffix by storing:
- `Idx int`: Index of the source sequence in the input slice.
- `Pos int`: Starting position (0-based) within that sequence.

### `SuffixArray`
Encapsulates:
- `Data []Suffix`: Sorted list of all suffixes.
- `Sequences obiseq.BioSequenceSlice`: Original input sequences (immutable reference).
- `Common []int`: Cached longest common prefix lengths between adjacent suffixes (`Data[i]` and `Data[i+1]`). Lazily computed.

## Public Functions

### `BuildSuffixArray(data obiseq.BioSequenceSlice) *SuffixArray`
Constructs a suffix array from one or more biological sequences:
- Enumerates **every** suffix of every sequence (i.e., for a sequence `s`, adds all `(Idx, Pos)` where `0 ≤ Pos < len(s)`).
- Sorts suffixes lexicographically using a deterministic comparator (`SuffixLess`):
  - Primary: Compare nucleotide/amino-acid content character-by-character.
  - Tie-breakers (if prefixes match up to min length):
    1. Shorter suffix comes first.
    2. Lower `Idx` (sequence index).
    3. Earlier `Pos`.
- Precomputes and caches the common-prefix array via internal call to `CommonSuffix()`.

### `(*SuffixArray) CommonSuffix() []int`
Computes the length of the longest common prefix (LCP) between each adjacent pair in `Data`:
- Returns a slice of length `len(Data)-1`, where `Common[i] = LCP(Data[i], Data[i+1])`.
- Uses memoization: If already computed (e.g., after `BuildSuffixArray`), returns the cached result.
- Avoids redundant comparisons by leveraging sorted order and early termination.

### `(*SuffixArray) String() string`
Returns a formatted, human-readable table for inspection:
- Columns: `Common`, `Idx`, `Pos`, and the actual suffix string (via `.Substring()`).
- Useful for debugging, educational demos, or visualizing repeat patterns and overlaps.

## Semantic Guarantees & Design Choices

- **Deterministic ordering**: Tie-breaking rules ensure reproducibility across runs and platforms.
- **Memory efficiency**: Stores only indices (not copies of suffixes), critical for large genomic datasets.
- **Biological fidelity**: Respects alphabet semantics (e.g., `A < C < G < T` for DNA) via underlying sequence comparison.
- **Lazy evaluation**: `CommonSuffix()` is invoked only when needed (e.g., on first call to `.String()`, or explicitly), avoiding unnecessary work.
- **Transparency**: All public fields are accessible, enabling downstream analysis without encapsulation barriers.

## Typical Use Cases

- Detecting tandem repeats or low-complexity regions across multi-sequence datasets.
- Building suffix arrays for *de novo* assembly validation or error correction.
- Serving as a building block in alignment-free metrics (e.g., Jaccard similarity over shared *k*-mers).
- Supporting pattern mining in metagenomic or pangenome collections.

> **Note**: This package focuses on *exact* suffix matching; probabilistic or approximate extensions are out of scope.