0d9be53d1f
Introduces `CommonArgs::validate()` to enforce strict constraints on `--kmer-size` (odd, 11–31), `--minimizer-size` (odd, 3–k−1), and `z` (strictly less than k). This validation is applied at the entry point of the `superkmer` and `index` commands to prevent invalid configurations, avoid palindromes, prevent u64 overflow, and ensure positive effective indexing sizes. Documentation is updated to reflect these runtime checks and immediate termination on invalid input.
# Kmers and super-kmers

## Kmers

A **kmer** is a DNA subsequence of fixed length k. Two constraints govern the choice of k:

- **k ∈ [11, 31]**: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word (a u64 at 2 bits/base holds at most 32 bases, so k ≤ 32; k < 11 yields insufficient specificity).
- **k is odd**: an odd-length sequence cannot equal its own reverse complement (no palindromes), because its middle base would have to be its own complement. This guarantees that the canonical form `min(kmer, revcomp(kmer))` is always strictly defined, since the two orientations are always distinct, which is required for strand-independent counting.

Both constraints are **enforced at CLI entry** by `CommonArgs::validate()` in `superkmer` and `index`. Passing an invalid k exits immediately with an error message.
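The two checks on k can be sketched as below. This is a minimal illustration, not the project's `CommonArgs::validate()`: the function name `validate_kmer_size`, the `Result<(), String>` return type, and the error messages are all assumptions. Only the rules themselves (k odd, k in [11, 31]) come from the text above.

```rust
/// Hypothetical stand-in for the k checks described above.
/// Returns an error message instead of exiting, so the caller decides
/// how to terminate (the real CLI is described as exiting immediately).
fn validate_kmer_size(k: usize) -> Result<(), String> {
    // Range check: specific enough (k >= 11) and fits in a u64 (k <= 31).
    if !(11..=31).contains(&k) {
        return Err(format!("--kmer-size must be in [11, 31], got {k}"));
    }
    // Parity check: an odd k rules out reverse-complement palindromes.
    if k % 2 == 0 {
        return Err(format!("--kmer-size must be odd, got {k}"));
    }
    Ok(())
}
```

A caller would typically run this once at command entry, print the message to stderr on `Err`, and exit with a non-zero status.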

## Super-kmers

A **super-kmer** is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides and all sharing the same **canonical minimizer**. The **canonical minimizer** of a kmer is the m-mer (m < k) whose canonical hash `hash_kmer(min(m-mer, revcomp(m-mer)))` is smallest over all m-mers in the kmer window. The hash function is a `mix64`-based bijection; selection is purely hash-ordered with no degeneracy filter. A super-kmer is capped at 256 nucleotides; a longer run is split at that boundary.
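The definition above can be illustrated with a naive sketch that recomputes the minimizer for every kmer. Several parts are assumptions rather than the project's code: `assumed_hash` uses the splitmix64 finalizer as a stand-in for the `mix64`-based hash, the helper names (`pack`, `revcomp`, `canonical_minimizer`, `super_kmers`) are hypothetical, the exact rule for splitting at the length cap is a guess, and reads containing anything other than A/C/G/T are simply rejected.

```rust
/// Pack an m-mer (1 <= m <= 32) into a u64 at 2 bits per base: A=0, C=1, G=2, T=3.
/// Returns None if a base is not A/C/G/T.
fn pack(seq: &[u8]) -> Option<u64> {
    seq.iter().try_fold(0u64, |acc, &b| {
        let code = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        Some((acc << 2) | code)
    })
}

/// Reverse complement of a packed m-mer of length m.
fn revcomp(mut x: u64, m: usize) -> u64 {
    let mut rc = 0u64;
    for _ in 0..m {
        rc = (rc << 2) | (3 - (x & 3)); // complement of the last base
        x >>= 2;
    }
    rc
}

/// ASSUMPTION: splitmix64 finalizer, a bijection on u64, standing in
/// for the project's `mix64`-based hash.
fn assumed_hash(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Canonical minimizer of one kmer: the canonical m-mer with the smallest hash.
/// Requires m >= 1 (`windows` panics on 0); returns None if m > kmer.len().
fn canonical_minimizer(kmer: &[u8], m: usize) -> Option<u64> {
    let mut best: Option<(u64, u64)> = None; // (hash, canonical m-mer)
    for w in kmer.windows(m) {
        let fwd = pack(w)?;
        let canon = fwd.min(revcomp(fwd, m));
        let h = assumed_hash(canon);
        if best.map_or(true, |(bh, _)| h < bh) {
            best = Some((h, canon));
        }
    }
    best.map(|(_, canon)| canon)
}

/// Split a read into super-kmers, returned as (start, end) nucleotide
/// ranges with end exclusive. A run is closed when the minimizer changes
/// or when adding the next kmer would exceed `max_nt` nucleotides
/// (ASSUMPTION about how the cap is applied).
fn super_kmers(read: &[u8], k: usize, m: usize, max_nt: usize) -> Option<Vec<(usize, usize)>> {
    let mut out = Vec::new();
    if read.len() < k {
        return Some(out);
    }
    let mut start = 0;
    let mut current = canonical_minimizer(&read[..k], m)?;
    for i in 1..=read.len() - k {
        let mz = canonical_minimizer(&read[i..i + k], m)?;
        if mz != current || i + k - start > max_nt {
            out.push((start, i + k - 1)); // the previous kmer ends at i - 1 + k
            start = i;
            current = mz;
        }
    }
    out.push((start, read.len()));
    Some(out)
}
```

Consecutive ranges overlap by k−1 nucleotides, since the kmer that opens a new run starts one base after the kmer that closed the previous one. A production implementation would more likely keep a sliding window of hashes than rehash each kmer from scratch.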

### Canonical super-kmers

A **canonical super-kmer** is the lexicographic minimum of a super-kmer and its reverse complement:

```
canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
```

When a read and its reverse-complement are both sequenced, they produce super-kmers that are reverse complements of each other. Both map to the same canonical form: the same genomic region is represented by a single canonical super-kmer regardless of which strand was read.
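As a minimal sketch of this rule on ASCII sequences (the function names are hypothetical, and the project may well compare a packed 2-bit representation instead, which gives the same order because A < C < G < T holds in both encodings):

```rust
/// Reverse complement of an ASCII DNA sequence.
/// ASSUMPTION: bytes other than A/C/G/T are passed through unchanged.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other,
        })
        .collect()
}

/// Lexicographic minimum of a sequence and its reverse complement.
fn canonical(seq: &[u8]) -> Vec<u8> {
    let rc = revcomp(seq);
    // Byte slices compare lexicographically.
    if rc.as_slice() < seq {
        rc
    } else {
        seq.to_vec()
    }
}
```

For example, `TTGCA` and its reverse complement `TGCAA` both map to `TGCAA`, so the two strands yield a single key.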

### Expected length of a super-kmer

For a random minimizer of length m over kmers of length k, the density of minimizer positions is approximately 2/(k−m+2) [@Zheng2020-ji; @Golan2025-xf], so the expected number of consecutive kmers per super-kmer is (k−m+2)/2. A run of n kmers spans n + k − 1 nucleotides, giving:

$$L_{\text{nt}} = \frac{k-m+2}{2} + k - 1$$
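The formula can be evaluated directly for any (k, m) pair. The helper below is a hypothetical sketch, not part of the tool, and it inherits the random-minimizer approximation from the formula itself.

```rust
/// Expected super-kmer length in nucleotides, L_nt = (k - m + 2)/2 + k - 1.
/// Panics if m >= k, since a minimizer must be shorter than the kmer.
fn expected_superkmer_nt(k: u32, m: u32) -> f64 {
    assert!(m < k, "minimizer length must be smaller than k");
    // Expected number of consecutive kmers sharing one minimizer.
    let kmers_per_run = f64::from(k - m + 2) / 2.0;
    // A run of n kmers spans n + k - 1 nucleotides.
    kmers_per_run + f64::from(k) - 1.0
}
```

With k=21 and m=11, for instance, the run holds 6 kmers on average and spans 6 + 20 = 26 nucleotides.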

For k=31, m=13: expected ≈ 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.[^superkmer_length]

[^superkmer_length]: The expected length formula and the density approximation 2/(k−m+2) should be verified against the values reported in [@Zheng2020-ji] and [@Golan2025-xf].