Kmers and super-kmers

Kmers

A kmer is a DNA subsequence of fixed length k. Two constraints govern the choice of k:

k ∈ [11, 31]: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word (u64 at 2 bits/base requires k ≤ 32; k < 11 yields insufficient specificity).
k is odd: an odd-length sequence cannot equal its own reverse complement (no palindromes), because the middle base would have to be its own complement and no nucleotide is. The two orientations of a kmer are therefore always distinct, so the canonical form min(kmer, revcomp(kmer)) never involves a tie. Strand-independent counting relies on this.
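
As an illustration of the two constraints, the sketch below packs a kmer into a u64 at 2 bits per base and computes its canonical form. The encoding A=0, C=1, G=2, T=3 and the function names are assumptions made for this example, not necessarily what the tool uses. With this encoding, integer order on packed kmers of equal length matches lexicographic order on the sequences.

```rust
/// Map a nucleotide to a 2-bit code (assumed encoding: A=0, C=1, G=2, T=3).
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a kmer into a u64, first base in the most significant used bits.
/// Returns None for empty input, more than 32 bases, or a non-ACGT base.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut word = 0u64;
    for &b in seq {
        word = (word << 2) | encode_base(b)?;
    }
    Some(word)
}

/// Reverse complement of a packed kmer of length k.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut x = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With the assumed encoding, the complement of code c is 3 - c.
        rc = (rc << 2) | (3 - (x & 3));
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the two orientations.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

For example, ACG packs to 6 and its reverse complement CGT to 27, so both orientations share the canonical value 6. The even-length sequence ACGT packs to 27 and is its own reverse complement, which is the tie case that an odd k rules out.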

Both constraints are enforced at CLI entry by CommonArgs::validate() in superkmer and index. Passing an invalid k exits immediately with an error message.
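
The check itself amounts to two tests. The function below is a hypothetical stand-in written for this page; its name, signature, and error messages are assumptions and do not reproduce CommonArgs::validate().

```rust
/// Hypothetical check of the two constraints on k (range and parity).
fn validate_k(k: usize) -> Result<(), String> {
    if !(11..=31).contains(&k) {
        return Err(format!("k must be in [11, 31], got {}", k));
    }
    if k % 2 == 0 {
        return Err(format!("k must be odd, got {}", k));
    }
    Ok(())
}
```

A caller would typically print the error and exit with a non-zero status when this returns Err.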

Super-kmers

A super-kmer is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides, sharing the same canonical minimizer. The canonical minimizer of a kmer is the m-mer (m < k) whose canonical hash hash_kmer(min(m-mer, revcomp(m-mer))) is smallest over all m-mers in the kmer window. The hash function is a mix64-based bijection; selection is purely hash-ordered with no degeneracy filter. A super-kmer is capped at 256 nucleotides; a longer run is split at that boundary.
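
The definition can be sketched as a straightforward, unoptimized scan over a read. Several points are assumptions made for this example: the SplitMix64 finalizer stands in for the mix64-based hash, the read contains only upper-case A, C, G, T, m is at most 32, and the function names are hypothetical. Because the hash is a bijection, two kmers have the same canonical minimizer exactly when their minimum hashes are equal, so the sketch compares hash values directly.

```rust
/// SplitMix64 finalizer, an assumed stand-in for the tool's mix64-based hash.
/// Each step (xor-shift, multiplication by an odd constant) is invertible.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Hash of the canonical form of an m-mer (2 bits per base, m <= 32).
fn canonical_mmer_hash(mmer: &[u8]) -> u64 {
    let (mut fwd, mut rev) = (0u64, 0u64);
    for (i, &b) in mmer.iter().enumerate() {
        let c: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("expected upper-case A, C, G or T"),
        };
        fwd = (fwd << 2) | c; // forward strand, first base most significant
        rev |= (3 - c) << (2 * i); // reverse complement, built in place
    }
    mix64(fwd.min(rev))
}

/// Split a read into super-kmers: maximal runs of consecutive kmers with the
/// same minimum canonical m-mer hash, each run capped at 256 nucleotides.
fn super_kmers(read: &[u8], k: usize, m: usize) -> Vec<&[u8]> {
    const MAX_NT: usize = 256;
    assert!(m >= 1 && m < k && m <= 32 && k <= MAX_NT);
    let mut out = Vec::new();
    if read.len() < k {
        return out;
    }
    // One hash per m-mer position in the read.
    let hashes: Vec<u64> = read.windows(m).map(canonical_mmer_hash).collect();
    let w = k - m + 1; // number of m-mers inside one kmer
    let window_min = |i: usize| *hashes[i..i + w].iter().min().unwrap();

    let mut start = 0; // index of the first kmer in the current run
    let mut run_min = window_min(0);
    for i in 1..=(read.len() - k) {
        let h = window_min(i);
        // Appending kmer i would make the run span i - start + k nucleotides.
        if h != run_min || i - start + k > MAX_NT {
            out.push(&read[start..i + k - 1]);
            start = i;
            run_min = h;
        }
    }
    out.push(&read[start..]);
    out
}
```

Consecutive super-kmers returned by this sketch overlap by k−1 nucleotides, so every kmer of the read belongs to exactly one of them. A production implementation would normally maintain the window minimum incrementally instead of rescanning k−m+1 hashes per kmer.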

Canonical super-kmers

A canonical super-kmer is the lexicographic minimum of a super-kmer and its reverse complement:

canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))

When a read and its reverse complement are both sequenced, they produce super-kmers that are reverse complements of each other. Both map to the same canonical form, so the same genomic region is represented by a single canonical super-kmer regardless of which strand was read.
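
A minimal sketch of this rule on plain byte strings follows. The function names are hypothetical, and the comparison uses byte order, which agrees with A < C < G < T for upper-case ASCII input.

```rust
/// Reverse complement of an ASCII DNA sequence (upper-case A, C, G, T).
fn revcomp_seq(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other, // any other byte is passed through unchanged
        })
        .collect()
}

/// Lexicographic minimum of a super-kmer and its reverse complement.
fn canonical_super_kmer(seq: &[u8]) -> Vec<u8> {
    let rc = revcomp_seq(seq);
    if rc.as_slice() < seq {
        rc
    } else {
        seq.to_vec()
    }
}
```

For instance, TTGCA and its reverse complement TGCAA both yield TGCAA, so either strand gives the same canonical super-kmer.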

Expected length of a super-kmer

For a random minimizer of length m over k-mers of length k, each k-mer contains w = k−m+1 candidate m-mers, and the density of minimizer positions is approximately 2/(w+1) = 2/(k−m+2) (Golan & Shur 2025³; Zheng et al. 2020²). The expected number of consecutive k-mers per super-kmer is the inverse of that density, (k−m+2)/2. A run of n k-mers spans n + k − 1 nucleotides, giving:

\[L_{\text{nt}} = \frac{k-m+2}{2} + k - 1\]

For k=31, m=13: (31−13+2)/2 = 10 k-mers per super-kmer, so the expected length is 10 + 31 − 1 = 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.¹
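
The formula is easy to evaluate for other parameter choices. The helper names below are hypothetical, and the result inherits the approximation 2/(k−m+2) for the minimizer density.

```rust
/// Expected number of consecutive k-mers per super-kmer: (k - m + 2) / 2.
/// Assumes m < k.
fn expected_kmers_per_super_kmer(k: u32, m: u32) -> f64 {
    f64::from(k - m + 2) / 2.0
}

/// Expected super-kmer length in nucleotides: a run of n k-mers spans n + k - 1.
fn expected_super_kmer_len(k: u32, m: u32) -> f64 {
    expected_kmers_per_super_kmer(k, m) + f64::from(k) - 1.0
}
```

With k=31 and m=13 these return 10 and 40 respectively, matching the figures above.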

¹ The expected length formula and the density approximation 2/(k−m+2) should be verified against the values reported in (Zheng et al. 2020)² and (Golan & Shur 2025)³.
² Zheng, H., Kingsford, C. & Marçais, G. (2020). Improved design and analysis of practical minimizers. Bioinformatics (Oxford, England), 36, i119–i127.
³ Golan, S. & Shur, A.M. (2025). Expected density of random minimizers. In: Lecture Notes in Computer Science. Springer Nature Switzerland, Cham, pp. 347–360.