mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
# Super K-mers Extraction Module (`obikmer`)

This Go package provides efficient tools for extracting **super k-mers** from DNA sequences using *minimizer-based sliding windows*. A super k-mer is a maximal contiguous subsequence whose consecutive k-mers (windows of size `k`) all share the same canonical minimizer, i.e. the smallest canonical m-mer found in each window.

## Core Functionality

- **`IterSuperKmers(seq, k, m)`**
  Returns an iterator over `SuperKmer` structs. Each struct contains:
  - `Start`, `End`: positions of the super k-mer in the original sequence
  - `Minimizer`: canonical minimizer value (`uint64`) for that segment
  - `Sequence`: the DNA subsequence itself

- **`SuperKmer.ToBioSequence(...)`**
  Converts a raw `SuperKmer` into an enriched `obiseq.BioSequence`, embedding metadata:
  - ID: `{parentID}_superkmer_{start}_{end}`
  - Attributes: minimizer sequence (`minimizer_seq`), minimizer value, `k`, `m`, positions, and parent ID

- **`SuperKmerWorker(k, m)`**
  A `SeqWorker` adapter for pipeline integration (e.g., with `obiiter`). It processes a full `BioSequence` and returns all extracted super k-mers as a slice of `BioSequence`s.
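
To make the fields and the ID pattern above concrete, here is a minimal, deliberately naive sketch of super k-mer extraction. It is not the package's implementation and does not call `obikmer`. The `superKmer` type, the helper names, and the `read1` parent ID are hypothetical. The sketch also assumes that `End` is exclusive, that runs are split whenever the minimizer *value* changes, and it keeps the minimizer as text rather than as a `uint64` for readability.

```go
package main

import "fmt"

// superKmer is a hypothetical stand-in for the struct described above.
type superKmer struct {
	Start, End int    // half-open interval [Start, End) in the parent (assumed convention)
	Minimizer  string // canonical minimizer, kept as text here (the package uses uint64)
	Sequence   string
}

// revComp returns the reverse complement of an upper-case A/C/G/T string.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// canonical returns the lexicographically smaller of s and its reverse complement.
func canonical(s string) string {
	if rc := revComp(s); rc < s {
		return rc
	}
	return s
}

// minimizer returns the smallest canonical m-mer contained in kmer.
func minimizer(kmer string, m int) string {
	best := canonical(kmer[:m])
	for i := 1; i+m <= len(kmer); i++ {
		if c := canonical(kmer[i : i+m]); c < best {
			best = c
		}
	}
	return best
}

// naiveSuperKmers groups consecutive k-mers that share a minimizer value.
// It recomputes every minimizer from scratch, so it runs in O(n*k) time.
func naiveSuperKmers(seq string, k, m int) []superKmer {
	var out []superKmer
	if m < 1 || m >= k || len(seq) < k {
		return out
	}
	start, cur := 0, minimizer(seq[:k], m)
	for i := 1; i+k <= len(seq); i++ {
		if mz := minimizer(seq[i:i+k], m); mz != cur {
			// k-mers start..i-1 share cur, so the run covers seq[start : i-1+k].
			out = append(out, superKmer{start, i - 1 + k, cur, seq[start : i-1+k]})
			start, cur = i, mz
		}
	}
	return append(out, superKmer{start, len(seq), cur, seq[start:]})
}

func main() {
	for _, sk := range naiveSuperKmers("ACGTTGCA", 4, 2) {
		// ID pattern from the list above: {parentID}_superkmer_{start}_{end}
		id := fmt.Sprintf("%s_superkmer_%d_%d", "read1", sk.Start, sk.End)
		fmt.Println(id, sk.Minimizer, sk.Sequence)
	}
}
```

With `k = 4` and `m = 2`, the five k-mers of `ACGTTGCA` have minimizers `AC`, `AA`, `AA`, `AA`, `CA`, so the sketch yields three super k-mers: `ACGT`, `CGTTGC`, and `TGCA`. Adjacent super k-mers overlap by `k - 1` bases because each one must contain its k-mers in full.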

## Algorithm Highlights

- Uses **canonical minimizers** (the smaller of each m-mer and its reverse complement) so that results are strand-invariant
- Maintains a monotonic deque for efficient *sliding-window minimizer* tracking (O(n) time overall)
- Supports DNA bases `A/C/G/T/U` case-insensitively via bitmasking (`seq[i] & 31`)
- Enforces parameter constraints: `1 ≤ m < k ≤ 31`, sequence length ≥ `k`
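
The points above can be sketched in a few self-contained functions. This is an illustration under stated assumptions, not the package's code. The lookup table indexed by `seq[i] & 31`, the 2-bit codes (A=0, C=1, G=2, T/U=3), and all function names are hypothetical. The `k ≤ 31` bound is plausibly there so that a k-mer at two bits per base fits in a `uint64`, but the source does not state the reason.

```go
package main

import "fmt"

// code2bit maps (base & 31) to an assumed 2-bit code: A=0, C=1, G=2, T/U=3.
// Masking with 31 drops the ASCII case bit, so 'a' and 'A' share a slot.
// Any other character silently maps to 0 in this sketch.
var code2bit = [32]uint64{1: 0, 3: 1, 7: 2, 20: 3, 21: 3}

// checkParams applies the constraints listed above to k, m and sequence length n.
func checkParams(k, m, n int) error {
	if m < 1 || m >= k || k > 31 {
		return fmt.Errorf("need 1 <= m < k <= 31, got k=%d m=%d", k, m)
	}
	if n < k {
		return fmt.Errorf("sequence length %d is shorter than k=%d", n, k)
	}
	return nil
}

// canonicalMmers returns, for each start position, the smaller of the forward
// and reverse-complement encodings of the m-mer, updated in O(1) per base.
func canonicalMmers(seq string, m int) []uint64 {
	if m < 1 || m > 31 || len(seq) < m {
		return nil
	}
	mask := uint64(1)<<(2*uint(m)) - 1
	shift := 2 * uint(m-1)
	out := make([]uint64, 0, len(seq)-m+1)
	var fwd, rev uint64
	for i := 0; i < len(seq); i++ {
		c := code2bit[seq[i]&31]
		fwd = (fwd<<2 | c) & mask     // append the new base on the right
		rev = rev>>2 | (3-c)<<shift   // prepend its complement on the left
		if i >= m-1 {
			if rev < fwd {
				out = append(out, rev)
			} else {
				out = append(out, fwd)
			}
		}
	}
	return out
}

// slidingMin returns the minimum of every window of w consecutive values.
// The deque holds indices whose values increase from front to back; each
// index is pushed and popped at most once, which gives O(n) time overall.
func slidingMin(vals []uint64, w int) []uint64 {
	if w < 1 || len(vals) < w {
		return nil
	}
	out := make([]uint64, 0, len(vals)-w+1)
	deque := make([]int, 0, w)
	for i, v := range vals {
		// Drop candidates that can never be the minimum again.
		for len(deque) > 0 && vals[deque[len(deque)-1]] >= v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front index once it has left the window.
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			out = append(out, vals[deque[0]])
		}
	}
	return out
}

func main() {
	const k, m = 4, 2
	seq := "acgtTGCA" // mixed case on purpose
	if err := checkParams(k, m, len(seq)); err != nil {
		fmt.Println(err)
		return
	}
	// Each k-mer contains k-m+1 m-mers, so that is the window width.
	fmt.Println(slidingMin(canonicalMmers(seq, m), k-m+1)) // prints [1 0 0 0 4]
}
```

A super k-mer boundary then falls wherever two neighbouring values in the resulting slice differ. In the example, `1` encodes `AC`, `0` encodes `AA`, and `4` encodes `CA`.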

## Use Cases

- Read partitioning in metagenomics (e.g., for error correction or clustering)
- Efficient k-mer space segmentation without storing all individual k-mers
- Integration into modular bioinformatics pipelines via the `SeqWorker` interface