mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
40 lines
1.8 KiB
Markdown
# Semantic Description of `obikmer` Package Functionalities

The `obikmer` package provides tools for **super k-mer extraction and minimizer-based sequence analysis** in bioinformatics.

## Core Concepts

A **super *k*-mer** is a maximal contiguous subsequence of a DNA sequence in which *all* embedded *k*-mers share the **same minimizer**. The minimizer of a *k*-mer is a compact representative chosen among its *m*-mers, typically the lexicographically smallest one, with both the forward and reverse-complement strands taken into account.
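
To make the definition concrete, here is a minimal sketch of a canonical minimizer computed on plain strings. The helper names (`reverseComplement`, `canonical`, `minimizer`) are hypothetical and are not taken from the `obikmer` package, which is described as working with encoded values rather than strings.

```go
package main

import "fmt"

// reverseComplement returns the reverse complement of an upper-case DNA string.
// Bytes other than A, C, G, T become 'N' (an assumption of this sketch).
func reverseComplement(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c := byte('N')
		switch s[i] {
		case 'A':
			c = 'T'
		case 'C':
			c = 'G'
		case 'G':
			c = 'C'
		case 'T':
			c = 'A'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// canonical returns the lexicographically smaller of s and its reverse complement.
func canonical(s string) string {
	if rc := reverseComplement(s); rc < s {
		return rc
	}
	return s
}

// minimizer returns the smallest canonical m-mer contained in kmer.
// It expects 1 <= m <= len(kmer).
func minimizer(kmer string, m int) string {
	best := canonical(kmer[:m])
	for i := 1; i+m <= len(kmer); i++ {
		if c := canonical(kmer[i : i+m]); c < best {
			best = c
		}
	}
	return best
}

func main() {
	// A k-mer and its reverse complement yield the same canonical minimizer.
	fmt.Println(minimizer("GATTACA", 3)) // AAT
	fmt.Println(minimizer("TGTAATC", 3)) // AAT
}
```

Because every *m*-mer is replaced by its canonical form before comparison, the result does not depend on which strand was read.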

## Key Functions & Features

- **`IterSuperKmers(seq, k, m)`**:
  An iterator over all super *k*-mers in the input sequence `seq`, parameterized by:
  - `k`: length of the embedded *k*-mers,
  - `m`: size of the minimizer window (`m ≤ k`).

  Yields structured objects with:
  - `Sequence`: the super *k*-mer substring,
  - `Start`/`End`: coordinates of the super *k*-mer in `seq` (0-based, half-open),
  - `Minimizer`: canonical hash of the shared minimizer.

- **`ExtractSuperKmers(...)`**:
  Synchronous counterpart returning a slice of all super *k*-mers.
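
The following naive sketch shows how a slice of super *k*-mers with the fields listed above could be produced. It is an illustration under stated assumptions, not the package's implementation: the `SuperKmer` type and the `extractSuperKmers`, `minimizer`, and `canonical` functions are hypothetical, `Minimizer` stores the *m*-mer string instead of a hash, and runs are split only when the minimizer value changes.

```go
package main

import "fmt"

// SuperKmer mirrors the fields described above (hypothetical type).
type SuperKmer struct {
	Sequence  string
	Start     int // 0-based, inclusive
	End       int // exclusive
	Minimizer string
}

// canonical returns the smaller of s and its reverse complement.
// Only upper-case A, C, G, T are handled; other bytes become 'N'.
func canonical(s string) string {
	rc := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c := byte('N')
		switch s[i] {
		case 'A':
			c = 'T'
		case 'C':
			c = 'G'
		case 'G':
			c = 'C'
		case 'T':
			c = 'A'
		}
		rc[len(s)-1-i] = c
	}
	if string(rc) < s {
		return string(rc)
	}
	return s
}

// minimizer returns the smallest canonical m-mer contained in kmer.
func minimizer(kmer string, m int) string {
	best := canonical(kmer[:m])
	for i := 1; i+m <= len(kmer); i++ {
		if c := canonical(kmer[i : i+m]); c < best {
			best = c
		}
	}
	return best
}

// extractSuperKmers groups consecutive k-mers that share a minimizer.
// It recomputes each minimizer from scratch, so it is simple but not fast.
func extractSuperKmers(seq string, k, m int) []SuperKmer {
	var out []SuperKmer
	if m < 1 || m > k || len(seq) < k {
		return out
	}
	start, current := 0, minimizer(seq[:k], m)
	for i := 1; i+k <= len(seq); i++ {
		next := minimizer(seq[i:i+k], m)
		if next != current {
			end := i - 1 + k // the run ends with the k-mer starting at i-1
			out = append(out, SuperKmer{seq[start:end], start, end, current})
			start, current = i, next
		}
	}
	// The last run always extends to the end of the sequence.
	return append(out, SuperKmer{seq[start:], start, len(seq), current})
}

func main() {
	for _, sk := range extractSuperKmers("AAAACCCC", 4, 2) {
		fmt.Printf("%s [%d,%d) minimizer=%s\n", sk.Sequence, sk.Start, sk.End, sk.Minimizer)
	}
	// Prints:
	// AAAACC [0,6) minimizer=AA
	// ACCC [3,7) minimizer=AC
	// CCCC [4,8) minimizer=CC
}
```

Consecutive super *k*-mers overlap by `k-1` bases, because the first *k*-mer of a run starts one position after the last *k*-mer of the previous run.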

## Verified Properties (via Tests)

1. **Boundary correctness**: Each extracted subsequence equals `seq[start:end]`.
2. **Consistency between iterator and slice versions**: Both APIs produce identical results.
3. **Bijection property**:
   - Each unique super *k*-mer sequence maps to exactly one minimizer.
   - All embedded *k*-mers within a super *k*-mer share the same minimizer.
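
Two of these properties can be expressed as small checker functions. The sketch below is an assumption about how such checks might look, not a copy of the package's tests: the `SuperKmer` type, `checkBoundaries`, and `checkSingleMinimizer` are hypothetical names, and the sample records are written by hand for `k = 4`, `m = 2`.

```go
package main

import "fmt"

// SuperKmer is a hypothetical record with the fields described in this document.
type SuperKmer struct {
	Sequence  string
	Start     int
	End       int
	Minimizer string
}

// checkBoundaries reports the first record whose Sequence differs from seq[Start:End].
func checkBoundaries(seq string, sks []SuperKmer) error {
	for i, sk := range sks {
		if sk.Start < 0 || sk.Start > sk.End || sk.End > len(seq) {
			return fmt.Errorf("super k-mer %d: invalid range [%d,%d)", i, sk.Start, sk.End)
		}
		if want := seq[sk.Start:sk.End]; want != sk.Sequence {
			return fmt.Errorf("super k-mer %d: have %q, want %q", i, sk.Sequence, want)
		}
	}
	return nil
}

// checkSingleMinimizer verifies that no sequence is associated with two different minimizers.
func checkSingleMinimizer(sks []SuperKmer) error {
	seen := make(map[string]string, len(sks))
	for _, sk := range sks {
		if prev, ok := seen[sk.Sequence]; ok && prev != sk.Minimizer {
			return fmt.Errorf("sequence %q has minimizers %q and %q", sk.Sequence, prev, sk.Minimizer)
		}
		seen[sk.Sequence] = sk.Minimizer
	}
	return nil
}

func main() {
	seq := "AAAACCCC"
	sks := []SuperKmer{
		{"AAAACC", 0, 6, "AA"},
		{"ACCC", 3, 7, "AC"},
		{"CCCC", 4, 8, "CC"},
	}
	fmt.Println(checkBoundaries(seq, sks), checkSingleMinimizer(sks))
}
```

Returning an `error` instead of a boolean keeps the failing index and values available when a check is called from a test.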

## Implementation Notes

- Minimizers are computed canonically, as the minimum of the forward and reverse-complement encodings.
- Bases are encoded via `__single_base_code__` (assumed to be a helper mapping A/C/G/T → 0/1/2/3).
- Tests cover simple, homopolymer-rich, and complex genomic patterns.
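
The canonical encoding described in these notes can be sketched as follows, assuming a 2-bit code per base with the first base in the highest bits. The names `baseCode`, `encode`, `reverseComplementCode`, and `canonicalCode` are hypothetical; `baseCode` only stands in for the helper mentioned above, whose actual behavior is not shown in this document.

```go
package main

import "fmt"

// baseCode maps A/C/G/T to 0/1/2/3 (assumed mapping, case-insensitive here).
func baseCode(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// encode packs a word of at most 32 bases into 2 bits per base.
func encode(word string) (uint64, bool) {
	if len(word) > 32 {
		return 0, false
	}
	var code uint64
	for i := 0; i < len(word); i++ {
		c, ok := baseCode(word[i])
		if !ok {
			return 0, false
		}
		code = code<<2 | c
	}
	return code, true
}

// reverseComplementCode returns the encoding of the reverse complement of an n-base word.
func reverseComplementCode(code uint64, n int) uint64 {
	var rc uint64
	for i := 0; i < n; i++ {
		rc = rc<<2 | (3 - (code & 3)) // with this mapping, the complement of x is 3-x
		code >>= 2
	}
	return rc
}

// canonicalCode returns the smaller of the forward and reverse-complement encodings.
func canonicalCode(word string) (uint64, bool) {
	fwd, ok := encode(word)
	if !ok {
		return 0, false
	}
	if rc := reverseComplementCode(fwd, len(word)); rc < fwd {
		return rc, true
	}
	return fwd, true
}

func main() {
	a, _ := canonicalCode("ACG") // forward encoding 0b000110 = 6
	b, _ := canonicalCode("CGT") // reverse complement of ACG
	fmt.Println(a, b)            // 6 6
}
```

With this bit layout, comparing two encodings of equal length as integers gives the same order as comparing the underlying strings lexicographically, so taking the integer minimum matches the lexicographic notion of a canonical word.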

## Design Rationale

Super *k*-mers enable efficient compression, indexing (e.g., in minimizer spaces), and alignment-free comparisons, all of which are crucial for scalable genomic analysis.