mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,32 @@
# Super K-mers Extraction Module (`obikmer`)

This Go package provides efficient tools for extracting **super k-mers** from DNA sequences using *minimizer-based sliding windows*. A super k-mer is a maximal run of consecutive, overlapping k-mers that all share the same canonical minimizer, i.e. the smallest canonical m-mer found inside each window of size `k`.

## Core Functionality

- **`IterSuperKmers(seq, k, m)`**
  Returns an iterator over `SuperKmer` structs. Each struct contains:
  - `Start`, `End`: start and end positions of the super k-mer in the original sequence
  - `Minimizer`: canonical minimizer value (`uint64`) for that segment
  - `Sequence`: the actual DNA subsequence

- **`SuperKmer.ToBioSequence(...)`**
  Converts a raw `SuperKmer` into an enriched `obiseq.BioSequence`, embedding metadata:
  - ID: `{parentID}_superkmer_{start}_{end}`
  - Attributes: minimizer sequence (`minimizer_seq`), value, `k`, `m`, positions, and parent ID

- **`SuperKmerWorker(k, m)`**
  A `SeqWorker` adapter for pipeline integration (e.g., with `obiiter`). Processes a full BioSequence and returns all extracted super k-mers as a slice of `BioSequence`s.
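
The metadata described for `ToBioSequence` can be sketched without the real `obiseq` types. The `SuperKmer` struct below is a stand-in that only mirrors the fields listed above, not the package's actual definition. The 2-bit decoding (`A=0, C=1, G=2, T=3`), the half-open `End`, and every attribute key except `minimizer_seq` are assumptions made for illustration.

```go
package main

import "fmt"

// SuperKmer is a hypothetical stand-in mirroring the documented fields.
type SuperKmer struct {
	Start, End int    // End is assumed to be exclusive here
	Minimizer  uint64 // canonical minimizer value
	Sequence   []byte
}

// decodeKmer turns a 2-bit packed value back into m bases.
// Assumed encoding: A=0, C=1, G=2, T=3, first base in the highest bits.
func decodeKmer(v uint64, m int) string {
	out := make([]byte, m)
	for i := m - 1; i >= 0; i-- {
		out[i] = "ACGT"[v&3]
		v >>= 2
	}
	return string(out)
}

// superKmerID follows the documented pattern {parentID}_superkmer_{start}_{end}.
func superKmerID(parentID string, sk SuperKmer) string {
	return fmt.Sprintf("%s_superkmer_%d_%d", parentID, sk.Start, sk.End)
}

// superKmerAttributes gathers the documented metadata.
// Only "minimizer_seq" is a key named in the README; the others are guesses.
func superKmerAttributes(parentID string, sk SuperKmer, k, m int) map[string]any {
	return map[string]any{
		"minimizer_seq":   decodeKmer(sk.Minimizer, m),
		"minimizer_value": sk.Minimizer,
		"k":               k,
		"m":               m,
		"start":           sk.Start,
		"end":             sk.End,
		"parent_id":       parentID,
	}
}

func main() {
	sk := SuperKmer{Start: 1, End: 7, Minimizer: 0, Sequence: []byte("CGTTGC")}
	fmt.Println(superKmerID("read1", sk)) // prints read1_superkmer_1_7
	fmt.Println(superKmerAttributes("read1", sk, 4, 2)["minimizer_seq"]) // prints AA
}
```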

## Algorithm Highlights

- Uses **canonical minimizers** (the smaller of each m-mer and its reverse complement) so that results are strand-independent
- Maintains a monotonic deque for efficient *sliding-window minimizer* tracking (O(n) total time for a sequence of length n)
- Supports DNA bases `A/C/G/T/U` case-insensitively via bitmasking (`seq[i] & 31`); the mask keeps the low five bits, which are identical for an upper-case ASCII letter and its lower-case form
- Enforces parameter constraints: `1 ≤ m < k ≤ 31`, sequence length ≥ `k`

## Use Cases

- Read partitioning in metagenomics (e.g., for error correction or clustering)
- Efficient k-mer space segmentation without storing all individual k-mers
- Integration into modular bioinformatics pipelines via the `SeqWorker` interface