# Kmers and super-kmers

## Kmers

A **kmer** is a DNA subsequence of fixed length k. Two constraints govern the choice of k:

- **k ∈ [11, 31]**: the lower bound keeps the kmer long enough to be specific; the upper bound keeps it short enough to fit in a single machine word (assuming the usual 2-bit nucleotide encoding, 31 nucleotides occupy 62 bits of a 64-bit word).
- **k is odd**: an odd-length sequence cannot equal its own reverse complement (no palindromes), because its middle nucleotide would have to be its own complement. The two orientations of a kmer are therefore always distinct, so the canonical form `min(kmer, revcomp(kmer))` always selects exactly one of them, which is required for strand-independent counting.
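The two constraints on k can be illustrated with a minimal Python sketch. It assumes uppercase `ACGT` input and a 2-bit encoding (A=0, C=1, G=2, T=3); the helper names `revcomp`, `check_k`, `encode`, `canonical`, and `has_palindrome` are illustrative and are not taken from the project's code.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def check_k(k: int) -> None:
    """Enforce the two constraints: 11 <= k <= 31 and k odd."""
    if not 11 <= k <= 31:
        raise ValueError(f"k={k} is outside [11, 31]")
    if k % 2 == 0:
        raise ValueError(f"k={k} is even; palindromic kmers would be possible")


def encode(kmer: str) -> int:
    """Pack a kmer into an integer, 2 bits per nucleotide (assumed encoding)."""
    code = 0
    for base in kmer:
        code = (code << 2) | "ACGT".index(base)
    return code


def canonical(kmer: str) -> str:
    """Lexicographic minimum of a kmer and its reverse complement."""
    check_k(len(kmer))
    return min(kmer, revcomp(kmer))


def has_palindrome(k: int) -> bool:
    """Exhaustively test whether any length-k sequence equals its own revcomp."""
    return any("".join(p) == revcomp("".join(p))
               for p in product("ACGT", repeat=k))
```

With these definitions, `canonical("ACGTACGTACG")` and `canonical(revcomp("ACGTACGTACG"))` return the same string, `encode("T" * 31)` needs 62 bits, and `has_palindrome(4)` is true (for example `ACGT`) while `has_palindrome(5)` is false.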
## Super-kmers

A **super-kmer** is a maximal run of consecutive kmers from a DNA read that all share the same **canonical minimizer**; consecutive kmers in the run overlap by k−1 nucleotides. The **canonical minimizer** of a kmer is the m-mer (m < k) whose canonical hash `hash_kmer(min(m-mer, revcomp(m-mer)))` is smallest over all m-mers in the kmer window. The hash function is a `mix64`-based bijection; selection is purely hash-ordered with no degeneracy filter. A super-kmer is capped at 256 nucleotides; a longer run is split at that boundary.
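The definition above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: `mix64` here is the SplitMix64 finalizer, used as a stand-in for whatever `hash_kmer` actually does; the 2-bit encoding and the function names `canonical_minimizer` and `super_kmers` are likewise hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
MASK64 = (1 << 64) - 1


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def encode(seq: str) -> int:
    """2-bit packing (A=0, C=1, G=2, T=3); an assumed encoding."""
    code = 0
    for base in seq:
        code = (code << 2) | "ACGT".index(base)
    return code


def mix64(x: int) -> int:
    """SplitMix64 finalizer, a bijection on 64-bit integers (stand-in hash)."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    x ^= x >> 31
    return x


def canonical_minimizer(kmer: str, m: int) -> str:
    """Canonical m-mer with the smallest hash among all m-mers of the kmer."""
    canon = (min(kmer[i:i + m], revcomp(kmer[i:i + m]))
             for i in range(len(kmer) - m + 1))
    return min(canon, key=lambda c: mix64(encode(c)))


def super_kmers(read: str, k: int, m: int, max_len: int = 256) -> list[str]:
    """Split a read into maximal runs of kmers sharing a canonical minimizer."""
    out = []
    start = 0        # read offset of the first kmer in the current run
    current = None   # canonical minimizer of the current run
    for i in range(len(read) - k + 1):
        mini = canonical_minimizer(read[i:i + k], m)
        too_long = i + k - start > max_len  # length if kmer i joined the run
        if current is not None and (mini != current or too_long):
            out.append(read[start:i + k - 1])  # run ends with kmer i-1
            start = i
        current = mini
    if current is not None:
        out.append(read[start:])
    return out
```

Because the hash is a bijection, two distinct canonical m-mers never tie, so comparing minimizers by sequence is equivalent to comparing them by hash. Consecutive super-kmers returned by this sketch overlap by k−1 nucleotides, and a read shorter than k yields no super-kmer.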
### Canonical super-kmers

A **canonical super-kmer** is the lexicographic minimum of a super-kmer and its reverse complement:

```
canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
```

When a read and its reverse complement are both sequenced, they produce super-kmers that are reverse complements of each other. Both map to the same canonical form, so the same genomic region is represented by a single canonical super-kmer regardless of which strand was read.
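A short Python sketch of this strand independence, assuming uppercase `ACGT` strings and plain string comparison as the lexicographic order; `canonical_super_kmer` is an illustrative name, and the 7-nucleotide input is a toy value far shorter than a real super-kmer.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_super_kmer(super_kmer: str) -> str:
    """Lexicographic minimum of a super-kmer and its reverse complement."""
    return min(super_kmer, revcomp(super_kmer))


forward = "TTGACCA"          # toy super-kmer as read from one strand
reverse = revcomp(forward)   # the same region read from the other strand

# Both orientations collapse to the same canonical representative.
same = canonical_super_kmer(forward) == canonical_super_kmer(reverse)
```

Here `reverse` is `TGGTCAA`, which sorts before `TTGACCA`, so both calls return `TGGTCAA` and `same` is true.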
### Expected length of a super-kmer

For a random minimizer of length m over kmers of length k, the density of minimizer positions is approximately 2/(k−m+2) [@Zheng2020-ji; @Golan2025-xf], so the expected number of consecutive kmers per super-kmer is (k−m+2)/2. A run of n kmers spans n + k − 1 nucleotides, giving an expected length in nucleotides of:

$$L_{\text{nt}} = \frac{k-m+2}{2} + k - 1$$

For k=31, m=13: (31−13+2)/2 + 31 − 1 = 10 + 30 = 40, so the expected length is ≈ 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.[^superkmer_length]
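The arithmetic can be reproduced with a few lines of Python. The function names are illustrative, and the sketch simply evaluates the approximation quoted above; it does not validate that approximation against the cited papers.

```python
def minimizer_density(k: int, m: int) -> float:
    """Approximate density of minimizer positions, 2 / (k - m + 2)."""
    if not 0 < m < k:
        raise ValueError("expected 0 < m < k")
    return 2 / (k - m + 2)


def expected_kmers_per_super_kmer(k: int, m: int) -> float:
    """Expected run length in kmers: the inverse of the density."""
    return 1 / minimizer_density(k, m)


def expected_super_kmer_length(k: int, m: int) -> float:
    """Expected length in nucleotides: n kmers span n + k - 1 nucleotides."""
    return expected_kmers_per_super_kmer(k, m) + k - 1
```

For k=31 and m=13 this gives a density of 0.1, about 10 kmers per super-kmer, and an expected length of about 40 nt, well below the 256-nucleotide cap.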
[^superkmer_length]: The expected length formula and the density approximation 2/(k−m+2) should be verified against the values reported in [@Zheng2020-ji] and [@Golan2025-xf].