obikmer/docmd/theory/kmers.md
Eric Coissac 380b5a6f94 📖 Update super-kmer theory and implementation to prefer non-degenerate m-mers
- Update super-kmer definition in `kmERS.md` to specify that non-degenerate m-mers are preferred over degenerate ones (degeneracy = homopolymer).
- Refactor `superkmer.rs`: change `.canonical()` to mutate in-place and return bool.
- Add `m` field & canonical-aware minimizer position calculation to SuperKmerIter in obiskbuilder.
- Add helper functions `is_degenerate` and minimizer comparison logic to rolling_stat.rs for consistent tie-breaking.
- Minor formatting cleanup in superkmer command and chunk processing.
2026-04-20 17:50:09 +02:00


Kmers and super-kmers

Kmers

A kmer is a DNA subsequence of fixed length k. Two constraints govern the choice of k:

  • k ∈ [11, 31]: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.
  • k is odd: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form min(kmer, revcomp(kmer)) is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.
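Both constraints can be illustrated with a minimal sketch. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3), which this page does not specify; under that assumption k = 31 occupies 62 bits of a `u64`. The names `encode_kmer`, `revcomp_2bit` and `canonical_kmer` are hypothetical and are not taken from the obikmer code.

```rust
/// Encode a DNA k-mer with 2 bits per nucleotide (assumed: A=0, C=1, G=2, T=3).
/// Returns None for a non-ACGT symbol or a length that cannot fit in a u64.
fn encode_kmer(seq: &str) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut value = 0u64;
    for b in seq.bytes() {
        let code = match b.to_ascii_uppercase() {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        value = (value << 2) | code;
    }
    Some(value)
}

/// Reverse complement of a 2-bit encoded k-mer of length k.
/// With the assumed encoding, the complement of a code c is 3 - c.
fn revcomp_2bit(kmer: u64, k: usize) -> u64 {
    let mut rc = 0u64;
    let mut x = kmer;
    for _ in 0..k {
        // Take the last nucleotide, complement it, append it to the result.
        rc = (rc << 2) | (3 - (x & 3));
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
/// For equal-length k-mers, integer order matches lexicographic order (A < C < G < T).
fn canonical_kmer(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp_2bit(kmer, k))
}
```

Under these assumptions, `ACG` and its reverse complement `CGT` both map to the same canonical value, whereas the even-length `ACGT` is its own reverse complement, which is the case an odd k rules out.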

Super-kmers

A super-kmer is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k − 1 nucleotides, in which every kmer carries the same canonical minimizer. The canonical minimizer of a kmer is the smallest value of min(m-mer, revcomp(m-mer)) over all m-mers within the kmer (m < k, m odd), with the constraint that non-degenerate m-mers are always preferred over degenerate ones. A degenerate m-mer is one composed of a single repeated nucleotide (all-A, all-C, all-G, or all-T), that is, a homopolymer; such m-mers are selected only if no non-degenerate candidate exists in the window.
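The definition above can be sketched naively as follows. This is an illustration under stated assumptions, not the obikmer implementation: it works on uppercase ACGT byte strings, recomputes the minimizer of every kmer from scratch instead of using a rolling window, and groups kmers by minimizer value rather than by minimizer position. The function names are hypothetical, and the code assumes 0 < m ≤ k.

```rust
/// Reverse complement of an uppercase ACGT byte string.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other, // left unchanged; input is assumed to be ACGT only
        })
        .collect()
}

/// True when the m-mer is a single repeated nucleotide (degenerate).
fn is_homopolymer(mmer: &[u8]) -> bool {
    mmer.windows(2).all(|w| w[0] == w[1])
}

/// Canonical minimizer of one k-mer: the smallest canonical m-mer,
/// with non-degenerate candidates ranked before degenerate ones.
/// Panics if m is 0 or larger than the k-mer.
fn canonical_minimizer(kmer: &[u8], m: usize) -> Vec<u8> {
    kmer.windows(m)
        .map(|w| {
            let rc = revcomp(w);
            let canon = if rc.as_slice() < w { rc } else { w.to_vec() };
            // Tuple order: `false` (non-degenerate) sorts before `true`.
            (is_homopolymer(w), canon)
        })
        .min()
        .map(|(_, canon)| canon)
        .expect("k-mer must be at least m long")
}

/// Split a read into super-kmers: maximal runs of consecutive k-mers
/// that share the same canonical minimizer.
fn super_kmers(read: &[u8], k: usize, m: usize) -> Vec<&[u8]> {
    let mut out = Vec::new();
    if read.len() < k {
        return out;
    }
    let mut start = 0; // index of the first k-mer of the current run
    let mut current = canonical_minimizer(&read[0..k], m);
    for i in 1..=(read.len() - k) {
        let mini = canonical_minimizer(&read[i..i + k], m);
        if mini != current {
            // The run covers k-mers start..=i-1, so it ends at (i - 1) + k.
            out.push(&read[start..i - 1 + k]);
            start = i;
            current = mini;
        }
    }
    out.push(&read[start..]);
    out
}
```

With k = 5 and m = 3, the kmer `AAAAC` gets the minimizer `AAC` even though `AAA` is lexicographically smaller, because `AAA` is degenerate. On the read `AAAACGGTT`, the sketch yields the super-kmers `AAAACGG`, `ACGGT` and `CGGTT`; the first one groups three kmers and therefore spans 3 + 5 − 1 = 7 nucleotides.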

Canonical super-kmers

A canonical super-kmer is the lexicographic minimum of a super-kmer and its reverse complement:

canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))

When a read and its reverse complement are both sequenced, they produce super-kmers that are reverse complements of each other. Both map to the same canonical form, so the same genomic region is represented by a single canonical super-kmer regardless of which strand was read.
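A minimal sketch of this rule on uppercase ACGT byte strings is given below. The function names are hypothetical and the representation (plain bytes rather than a packed encoding) is an assumption made for readability.

```rust
/// Reverse complement of an uppercase ACGT byte string.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other, // left unchanged; input is assumed to be ACGT only
        })
        .collect()
}

/// Lexicographic minimum of a super-kmer and its reverse complement.
fn canonical_super_kmer(super_kmer: &[u8]) -> Vec<u8> {
    let rc = revcomp(super_kmer);
    if rc.as_slice() < super_kmer {
        rc
    } else {
        super_kmer.to_vec()
    }
}
```

For example, `TTGCA` and its reverse complement `TGCAA` both map to `TGCAA`, so either strand gives the same canonical super-kmer.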

Expected length of a super-kmer

For a random minimizer of length m over k-mers of length k, the density of minimizer positions is approximately 2/(k − m + 2) [@Zheng2020-ji; @Golan2025-xf], so the expected number of consecutive k-mers per super-kmer is (k − m + 2)/2. A run of n k-mers spans n + k − 1 nucleotides, giving:

L_{\text{nt}} = \frac{k-m+2}{2} + k - 1

For k=31, m=13: expected ≈ 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.1
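The arithmetic can be checked with a short sketch that simply evaluates the formula as stated on this page; it inherits the same caveat about the density approximation. The function names are hypothetical.

```rust
/// Expected number of consecutive k-mers per super-kmer, (k - m + 2) / 2,
/// using the random-minimizer density approximation quoted in the text.
fn expected_kmers_per_super_kmer(k: u32, m: u32) -> f64 {
    (f64::from(k) - f64::from(m) + 2.0) / 2.0
}

/// Expected super-kmer length in nucleotides: a run of n k-mers spans n + k - 1.
fn expected_super_kmer_length_nt(k: u32, m: u32) -> f64 {
    expected_kmers_per_super_kmer(k, m) + f64::from(k) - 1.0
}
```

For k = 31 and m = 13 this gives (31 − 13 + 2)/2 = 10 k-mers per super-kmer and 10 + 31 − 1 = 40 nucleotides.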


  1. The expected length formula and the density approximation 2/(k − m + 2) should be verified against the values reported in [@Zheng2020-ji] and [@Golan2025-xf].