Kmer — implementation
Types and layout
KmerOf<L> is a #[repr(transparent)] newtype over u64 parameterized by a KmerLength marker:
#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);
Three marker types implement KmerLength:
| Marker | len() source | Used for |
|---|---|---|
| KLen | params::k() | k-mers |
| MLen | params::m() | minimizers |
| ConstLen<N> | const generic N | tests |
Public aliases:
pub type Kmer = KmerOf<KLen>; // k-mer, global k
pub type Minimizer = CanonicalKmerOf<MLen>; // canonical m-mer, global m
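The marker-type arrangement can be sketched as follows. This is a minimal illustration rather than the crate's source: the trait body (a single associated function `len()`) is an assumption inferred from the `L::len()` calls described in this section, and `KLen` / `MLen` are left out because they depend on `params`.

```rust
use std::marker::PhantomData;

/// Assumed shape of the trait: one associated function returning the length.
pub trait KmerLength {
    fn len() -> usize;
}

/// Compile-time length marker, the variant used for tests.
pub struct ConstLen<const N: usize>;

impl<const N: usize> KmerLength for ConstLen<N> {
    fn len() -> usize {
        N
    }
}

/// Same declaration as shown above. PhantomData is zero-sized, so
/// #[repr(transparent)] gives the struct exactly the layout of a u64.
#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);
```

Because the length lives in the type, `KmerOf<ConstLen<5>>` and `KmerOf<ConstLen<7>>` are distinct types of the same 8-byte size, and mixing them is a compile-time error.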
Nucleotides are packed 2 bits each, left-aligned, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is not stored — every operation reads it from L::len().
| 63–62 | 61–60 | … | 63−2(k−1) to 62−2(k−1) | 63−2k down to 0 |
|---|---|---|---|---|
| nt 0 | nt 1 | … | nt k−1 | zero padding |
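The layout above can be checked with two small helpers. They are hypothetical (neither name comes from the crate) and operate on a bare `u64` plus an explicit `k`, purely to make the bit positions concrete.

```rust
/// Extract the 2-bit code of nucleotide `i` from a left-aligned packed word.
/// Nucleotide i sits in bits 63-2i and 62-2i, so shifting right by 62-2i
/// brings it down to the bottom two bits.
fn nucleotide_at(raw: u64, i: usize) -> u8 {
    debug_assert!(i < 32);
    ((raw >> (62 - 2 * i)) & 0b11) as u8
}

/// Check that the 64-2k padding bits below the last nucleotide are all zero.
fn padding_is_zero(raw: u64, k: usize) -> bool {
    debug_assert!((1..=32).contains(&k));
    // Ones in the low 64-2k bits; the mask is empty when k == 32.
    let pad_mask = if k == 32 { 0 } else { u64::MAX >> (2 * k) };
    raw & pad_mask == 0
}
```

Assuming the common coding A=0, C=1, G=2, T=3 (the actual table is defined elsewhere), the 5-mer ACGTA packs to `0x1B00_0000_0000_0000`: the top byte holds ACGT as `00 01 10 11`, the next two bits hold the final A, and the remaining 54 bits are padding.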
Global parameters
params::set_k(k) / params::k() and params::set_m(m) / params::m() are backed by OnceLock<usize> in production (write-once, panic on conflict) and by thread_local! { Cell<usize> } in test builds (per-thread, freely writable). params::init(k, m) sets both in one call.
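A write-once parameter of this kind might look like the sketch below, using `std::sync::OnceLock`. It covers only the production path for `k`. Accepting a repeated call with the same value and panicking only on a different one is an assumption about what "conflict" means here.

```rust
use std::sync::OnceLock;

static K: OnceLock<usize> = OnceLock::new();

/// Set k once. A repeated call with the same value is accepted in this
/// sketch; a call with a different value panics.
pub fn set_k(k: usize) {
    // get_or_init stores `k` only if the cell is still empty.
    let stored = *K.get_or_init(|| k);
    assert_eq!(
        stored, k,
        "k is already set to {}, cannot change it to {}",
        stored, k
    );
}

/// Read k. Panics if set_k has not been called yet.
pub fn k() -> usize {
    *K.get().expect("k has not been set")
}
```

A process-wide `OnceLock` cannot be reset, which is presumably why the test build swaps in a per-thread `Cell<usize>`: each test thread can then choose its own `k` without interfering with others.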
Encoding
KmerOf::<L>::from_ascii(ascii) encodes the first L::len() bytes using the shared ENC table (see SuperKmer — ASCII encoding):
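let k = L::len();
let mut val: u64 = 0;
for i in 0..k {
    // Append the next 2-bit code at the low end.
    val = (val << 2) | encode_base(ascii[i]) as u64;
}
// Left-align: move the 2k used bits up to the MSB.
KmerOf(val << (64 - 2 * k), PhantomData)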
Zero allocation — result lives on the stack.
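For experimenting outside the crate, the encoding loop can be reproduced as a free function. This is a sketch under stated assumptions: `from_ascii_raw` is a hypothetical name, the coding A=0, C=1, G=2, T=3 is assumed, and `encode_base` is a stand-in `match` rather than the shared ENC table, so its handling of other bytes (a panic) may differ from the real one.

```rust
/// Stand-in for the ENC lookup. Assumed coding: A=0, C=1, G=2, T=3,
/// case-insensitive. Any other byte panics in this sketch.
fn encode_base(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        other => panic!("unsupported base: {}", other as char),
    }
}

/// Pack the first `k` bytes of `ascii` into a left-aligned u64.
fn from_ascii_raw(ascii: &[u8], k: usize) -> u64 {
    assert!((1..=32).contains(&k) && ascii.len() >= k);
    let mut val: u64 = 0;
    for &b in &ascii[..k] {
        // Shift previous nucleotides up and append the new 2-bit code.
        val = (val << 2) | encode_base(b) as u64;
    }
    // Move the 2k used bits to the top; the low 64-2k bits stay zero.
    val << (64 - 2 * k)
}
```

With these assumptions, `from_ascii_raw(b"ACGTA", 5)` yields `0x1B00_0000_0000_0000`, and any bytes beyond the first `k` are ignored.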
Decoding
write_ascii(writer) writes k ASCII characters to any W: Write using the shared DEC4 table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.
to_ascii() is a convenience wrapper that allocates and returns a Vec<u8>; intended for tests and display only.
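The four-at-a-time decoding can be illustrated with a 256-entry table indexed by one packed byte. The table construction and the A=0, C=1, G=2, T=3 coding below are assumptions made for this sketch; the real DEC4 table may be built differently.

```rust
use std::io::{self, Write};

const BASES: [u8; 4] = *b"ACGT";

/// Build a table mapping each packed byte to its 4 ASCII bases (MSB first).
const fn build_dec4() -> [[u8; 4]; 256] {
    let mut table = [[0u8; 4]; 256];
    let mut byte = 0;
    while byte < 256 {
        let mut j = 0;
        while j < 4 {
            table[byte][j] = BASES[(byte >> (6 - 2 * j)) & 3];
            j += 1;
        }
        byte += 1;
    }
    table
}

static DEC4: [[u8; 4]; 256] = build_dec4();

/// Write the k bases of a left-aligned packed word to `w` without allocating.
fn write_ascii<W: Write>(raw: u64, k: usize, w: &mut W) -> io::Result<()> {
    assert!(k <= 32);
    let bytes = raw.to_be_bytes(); // bytes[0] holds nucleotides 0..4
    let full = k / 4;
    for &b in &bytes[..full] {
        w.write_all(&DEC4[b as usize])?; // one lookup per 4 nucleotides
    }
    let rem = k % 4;
    if rem > 0 {
        // Partial lookup: keep only the first `rem` bases of the last group.
        w.write_all(&DEC4[bytes[full] as usize][..rem])?;
    }
    Ok(())
}
```

Since a `Vec<u8>` implements `Write`, a `to_ascii`-style wrapper only needs to create a vector and pass it to this function.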
Reverse complement
Computed as pure arithmetic — no lookup table, no memory access:
let k = L::len();
let x = !self.0; // complement
let x = x.swap_bytes(); // reverse bytes
let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2); // swap 2-bit groups
KmerOf(x << (64 - 2 * k), PhantomData)
After complementing, bytes are reversed (swap_bytes), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.
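The same steps work on a bare `u64`, which makes them easy to test in isolation. `revcomp_raw` is a hypothetical free-function version with `k` passed explicitly. Using bitwise NOT as the complement assumes a coding in which complementary bases sum to 3, such as A=0, C=1, G=2, T=3.

```rust
/// Reverse complement of a left-aligned packed k-mer (1 <= k <= 32).
fn revcomp_raw(raw: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    // Complement every 2-bit code; the zero padding turns into ones.
    let x = !raw;
    // Reverse the order of all 32 two-bit groups in three stages.
    let x = x.swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The k nucleotides now occupy the low 2k bits, below the former padding.
    // Shifting left realigns them to the MSB and pushes the ones out.
    x << (64 - 2 * k)
}
```

Under the assumed coding, ACGTA (`0x1B00_0000_0000_0000`, k = 5) maps to TACGT (`0xC6C0_0000_0000_0000`), and applying the function twice returns the original value. The final shift is what discards the complemented padding, so the low bits of the result are zero again.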
Canonical form and CanonicalKmerOf
canonical() returns a CanonicalKmerOf<L> — a distinct newtype that carries the same u64 layout but enforces the invariant that the stored value equals min(kmer, revcomp):
pub fn canonical(&self) -> CanonicalKmerOf<L> {
let rc = self.revcomp();
CanonicalKmerOf(if self.0 <= rc.0 { self.0 } else { rc.0 }, PhantomData)
}
The result is the lexicographic minimum of the forward and reverse-complement sequences. Comparing the raw u64 values directly is sufficient: with the left-aligned encoding, integer order coincides with nucleotide-wise order of the 2-bit codes for k-mers of the same length. Zero allocation: the result lives on the stack.
CanonicalKmerOf::from_raw_unchecked(raw) is the only other public constructor, for trusted paths such as deserialisation.
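A stripped-down model of the canonical newtype is sketched below. `Canonical` and `canonical_raw` are hypothetical stand-ins without the length parameter, and the A=0, C=1, G=2, T=3 coding is assumed. The point is that only two paths create the wrapper: one establishes the invariant, the other trusts the caller.

```rust
/// Hypothetical stand-in for CanonicalKmerOf<L>: holds min(kmer, revcomp).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Canonical(u64);

impl Canonical {
    /// Trusted path: the caller promises `raw` is already canonical.
    fn from_raw_unchecked(raw: u64) -> Self {
        Canonical(raw)
    }
}

/// Reverse complement of a left-aligned packed k-mer (1 <= k <= 32).
fn revcomp_raw(raw: u64, k: usize) -> u64 {
    let x = (!raw).swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    x << (64 - 2 * k)
}

/// Checked path: pick the smaller of the two strands by integer comparison.
fn canonical_raw(raw: u64, k: usize) -> Canonical {
    Canonical(raw.min(revcomp_raw(raw, k)))
}
```

A k-mer and its reverse complement always produce the same `Canonical` value, which is what lets both strands be treated as a single key.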
Sliding window helpers
push_right(nuc) / push_left(nuc) shift the window by one base in O(1). is_overlapping(other) checks whether the last k−1 nucleotides of self equal the first k−1 of other.
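One way to realise these helpers on the left-aligned layout is sketched below. The semantics are an assumption: `push_right` is taken to drop the first nucleotide and append `nuc` at the end, and `push_left` to drop the last nucleotide and prepend `nuc`. The free functions take `k` explicitly and require 1 <= k <= 32.

```rust
/// Drop nucleotide 0 and append `nuc` as the new last nucleotide.
fn push_right(raw: u64, nuc: u8, k: usize) -> u64 {
    // After the shift, slot k-1 (bits 65-2k and 64-2k) is empty.
    (raw << 2) | ((nuc as u64 & 0b11) << (64 - 2 * k))
}

/// Drop the last nucleotide and prepend `nuc` as the new nucleotide 0.
fn push_left(raw: u64, nuc: u8, k: usize) -> u64 {
    // The old last nucleotide slides into the padding; mask it away.
    let keep = u64::MAX << (64 - 2 * k);
    ((raw >> 2) & keep) | ((nuc as u64 & 0b11) << 62)
}

/// True if the last k-1 nucleotides of `a` equal the first k-1 of `b`.
fn is_overlapping(a: u64, b: u64, k: usize) -> bool {
    // a << 2 aligns a's suffix with b's prefix and leaves slot k-1 zero,
    // so b's last nucleotide is cleared before comparing.
    let last_slot = 0b11u64 << (64 - 2 * k);
    (a << 2) == (b & !last_slot)
}
```

With the assumed A=0, C=1, G=2, T=3 coding and k = 4, pushing A onto the right of ACGT gives CGTA, pushing T onto the left gives TACG, and ACGT overlaps CGTA. Under these definitions, `is_overlapping(a, b, k)` holds exactly when some `nuc` satisfies `push_right(a, nuc, k) == b`.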
Hashing
hash_kmer(raw: u64) -> u64 computes mix64(raw ^ 0x9e3779b97f4a7c15): the splitmix64 finalizer applied to the raw value XORed with a fixed seed constant. CanonicalKmerOf::seq_hash() delegates to hash_kmer.
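The body of `mix64` is not shown in this section. Assuming it is the standard splitmix64 finalizer (two xor-shift-multiply rounds followed by a final xor-shift), the hash can be sketched as:

```rust
/// Standard splitmix64 finalizer; assumed to match the crate's mix64.
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Hash a packed k-mer: XOR with the seed constant, then mix.
fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ 0x9e37_79b9_7f4a_7c15)
}
```

The finalizer maps 0 to 0, so without the seed the all-A k-mer (raw value 0 under an A=0 coding) would hash to 0. XORing with the seed first moves that fixed point to a raw value that is unlikely to matter in practice. Each step of the finalizer is invertible, so distinct raw values always produce distinct hashes.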