# Kmers and super-kmers
## Kmers
A **kmer** is a DNA subsequence of fixed length k. Two constraints govern the choice of k:
- **k ∈ [11, 31]**: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.
- **k is odd**: an odd-length sequence cannot equal its own reverse complement, because its middle nucleotide would have to be its own complement, and no base is. Odd k therefore rules out reverse-complement palindromes: the two orientations of a kmer are always distinct, so the canonical form `min(kmer, revcomp(kmer))` is unambiguous, which is required for strand-independent counting.
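
The canonical form described above can be sketched in a few lines of Python. This is only an illustration: the helper names `revcomp` and `canonical` are hypothetical, sequences are assumed to be uppercase A/C/G/T strings, and the comparison is plain lexicographic string order rather than whatever integer encoding an implementation may use.

```python
# Illustrative helpers; the names are assumptions, not part of any library.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Lexicographically smaller of a kmer and its reverse complement."""
    return min(kmer, revcomp(kmer))

# With an even k, a kmer can equal its own reverse complement.
print(revcomp("ACGT") == "ACGT")  # True

# With an odd k, the two orientations differ but share one canonical form.
kmer = "GGTCA"
print(revcomp(kmer) != kmer)                         # True
print(canonical(kmer) == canonical(revcomp(kmer)))   # True
```
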
## Super-kmers
A **super-kmer** is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k - 1 nucleotides. Each kmer of the run carries the same **canonical minimizer**. The **canonical minimizer** of a kmer is the smallest value of `min(m-mer, revcomp(m-mer))` over all m-mers within the kmer (m < k, m odd).
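
As a minimal sketch of this definition, the following Python code splits a read into super-kmers. It is deliberately naive (the minimizer is recomputed for every kmer) and makes several assumptions: the function names are hypothetical, the read contains only uppercase A/C/G/T, minimizers are compared as strings in lexicographic order, and two consecutive kmers are grouped whenever their minimizer values are equal, without tracking the minimizer's position.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_minimizer(kmer, m):
    """Smallest canonical m-mer found inside the kmer."""
    return min(
        min(kmer[i:i + m], revcomp(kmer[i:i + m]))
        for i in range(len(kmer) - m + 1)
    )

def super_kmers(read, k, m):
    """Split a read into maximal runs of consecutive kmers sharing a minimizer.

    Returns a list of (minimizer, super-kmer sequence) pairs.
    """
    runs = []
    start = 0        # position of the first kmer of the current run
    current = None   # minimizer shared by the current run
    for i in range(len(read) - k + 1):
        mini = canonical_minimizer(read[i:i + k], m)
        if current is None:
            current = mini
        elif mini != current:
            # The run holds the kmers at positions start .. i-1,
            # so it spans read[start : (i - 1) + k].
            runs.append((current, read[start:i + k - 1]))
            start, current = i, mini
    if current is not None:
        runs.append((current, read[start:]))
    return runs

for minimizer, sequence in super_kmers("ACGTTGCATGGCATTAGC", k=5, m=3):
    print(minimizer, sequence)
```

Consecutive super-kmers produced this way overlap by k - 1 nucleotides, so the read can be rebuilt by concatenating the first super-kmer with the non-overlapping suffix of each following one.
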
### Canonical super-kmers
A **canonical super-kmer** is the lexicographic minimum of a super-kmer and its reverse complement:
```
canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
```
When a read and its reverse-complement are both sequenced, they produce super-kmers that are reverse complements of each other. Both map to the same canonical form: the same genomic region is represented by a single canonical super-kmer regardless of which strand was read.
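
The strand independence stated above can be checked with a short Python sketch. The function names and the example sequence are hypothetical, and lexicographic string comparison is assumed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_super_kmer(super_kmer):
    """Lexicographic minimum of a super-kmer and its reverse complement."""
    return min(super_kmer, revcomp(super_kmer))

forward = "TTGACCATGGA"     # super-kmer as read from one strand (made-up example)
reverse = revcomp(forward)  # the same region read from the other strand

# Both orientations collapse to the same canonical representative.
print(canonical_super_kmer(forward) == canonical_super_kmer(reverse))  # True
```
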
### Expected length of a super-kmer
For a random minimizer of length m over k-mers of length k, the density of minimizer positions is approximately 2/(k - m + 2) [@Zheng2020-ji; @Golan2025-xf], so the expected number of consecutive k-mers per super-kmer is (k - m + 2)/2. A run of n k-mers spans n + k - 1 nucleotides, giving:
$$L_{\text{nt}} = \frac{k-m+2}{2} + k - 1$$
For k=31, m=13, the expected run is (31 - 13 + 2)/2 = 10 k-mers, which spans 10 + 31 - 1 = 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.[^superkmer_length]
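
The formula is easy to evaluate for other parameter pairs. The helper below is a hypothetical convenience function that simply applies the approximation above; it inherits that approximation's assumptions (random minimizers, density 2/(k - m + 2)).

```python
def expected_superkmer_length(k, m):
    """Expected super-kmer length in nucleotides under the density approximation."""
    n_kmers = (k - m + 2) / 2   # expected number of kmers per super-kmer
    return n_kmers + k - 1      # n kmers span n + k - 1 nucleotides

print(expected_superkmer_length(31, 13))  # 40.0
```
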
[^superkmer_length]: The expected length formula and the density approximation 2/(k - m + 2) should be verified against the values reported in [@Zheng2020-ji] and [@Golan2025-xf].