obikmer/docmd/implementation/kmer.md
2026-04-19 12:17:16 +02:00

Kmer — implementation

Memory layout

Kmer is a #[repr(transparent)] newtype over u64:

#[repr(transparent)]
pub struct Kmer(u64);

Nucleotides are packed 2 bits each, left-aligned, MSB-first. Nucleotide 0 occupies bits 63 and 62; nucleotide i occupies bits 63 - 2i and 62 - 2i. The low 64 - 2k bits are always zero. k is not stored — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.

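| Bits | 63, 62 | 61, 60 | ... | 63 - 2(k - 1) - 1 to 63 - 2(k - 1) | 63 - 2k down to 0 |
|---|---|---|---|---|---|
| Content | nt 0 | nt 1 | ... | nt k - 1 | zero padding |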
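
As a small illustration of this layout, the sketch below reads a single nucleotide back out of the packed word and checks the zero-padding invariant. Both helpers are hypothetical (they are not part of the documented API) and work directly on the raw u64.

```rust
/// Hypothetical helper: returns the 2-bit code of nucleotide `i` from a
/// left-aligned, MSB-first packed word.
fn nucleotide_at(raw: u64, i: usize) -> u8 {
    // Nucleotide i occupies bits 63 - 2i and 62 - 2i, so shifting right by
    // 62 - 2i brings it down to the two lowest bits.
    ((raw >> (62 - 2 * i)) & 0b11) as u8
}

/// Hypothetical helper: true when the low 64 - 2k padding bits are all zero.
fn padding_is_zero(raw: u64, k: usize) -> bool {
    // For k = 32 there is no padding, and a shift by 64 would overflow.
    // Otherwise, shifting out the 2k nucleotide bits leaves only the padding.
    k == 32 || raw << (2 * k) == 0
}
```

For a 4-mer whose 2-bit codes are 0, 1, 2, 3, the raw value is 0x1B00_0000_0000_0000, and nucleotide_at returns 0, 1, 2, 3 for i = 0..4.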

Encoding

Kmer::from_ascii(ascii, k) encodes the first k bytes of an ASCII slice using the shared ENC table (see SuperKmer — ASCII encoding):

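let mut val: u64 = 0;
for i in 0..k {
    // append the 2-bit code of base i below the bases already packed
    val = (val << 2) | encode_base(ascii[i]) as u64;
}
// left-align: move the 2k used bits up to the MSB (requires 1 <= k <= 32)
Kmer(val << (64 - 2 * k))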

Zero allocation — result lives on the stack.
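
A self-contained sketch of the encoding step follows. The encode_base function here is a stand-in with assumed codes (A=0, C=1, G=2, T=3); the shared ENC table is not shown in this page and may map bases differently. The raw accessor is also hypothetical, added only so the packed value can be inspected.

```rust
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Kmer(u64);

/// Assumed 2-bit codes (A=0, C=1, G=2, T=3); the real ENC table may differ.
fn encode_base(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        _ => 3, // sketch only: T and every other byte map to 3
    }
}

impl Kmer {
    /// Packs the first `k` bytes of `ascii`, MSB-first and left-aligned.
    pub fn from_ascii(ascii: &[u8], k: usize) -> Self {
        // k = 0 would shift by 64 below, and k > 32 does not fit in a u64.
        assert!((1..=32).contains(&k) && ascii.len() >= k);
        let mut val: u64 = 0;
        for &b in &ascii[..k] {
            val = (val << 2) | encode_base(b) as u64;
        }
        Kmer(val << (64 - 2 * k))
    }

    /// Hypothetical accessor for the packed value.
    pub fn raw(&self) -> u64 {
        self.0
    }
}
```

Under these assumed codes, Kmer::from_ascii(b"ACGT", 4).raw() is 0x1B00_0000_0000_0000, and bytes past the first k are ignored.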

Decoding

write_ascii(k, buf) appends k ASCII characters to a caller-supplied Vec<u8> using the shared DEC4 table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.

to_ascii(k) is a convenience wrapper that allocates and returns a Vec<u8>; intended for tests and display only.
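
The table-driven decoding can be sketched as below. This is an assumption-laden illustration, not the crate's code: the DEC4 table is rebuilt here from an assumed code order (A, C, G, T), the function takes the raw u64 rather than a Kmer, and the remainder is handled with one partial lookup instead of the two described above.

```rust
const BASES: [u8; 4] = *b"ACGT"; // assumed code order: A=0, C=1, G=2, T=3

/// Builds a 256-entry table mapping one packed byte to its 4 ASCII bases.
const fn build_dec4() -> [[u8; 4]; 256] {
    let mut table = [[0u8; 4]; 256];
    let mut byte = 0usize;
    while byte < 256 {
        let mut j = 0usize;
        while j < 4 {
            // Within a byte, nucleotide j sits at bits 7 - 2j and 6 - 2j.
            table[byte][j] = BASES[(byte >> (6 - 2 * j)) & 0b11];
            j += 1;
        }
        byte += 1;
    }
    table
}

static DEC4: [[u8; 4]; 256] = build_dec4();

/// Appends the `k` decoded bases of a left-aligned packed word to `buf`.
fn write_ascii(raw: u64, k: usize, buf: &mut Vec<u8>) {
    // Big-endian bytes follow the MSB-first layout: byte 0 holds nt 0..4.
    let bytes = raw.to_be_bytes();
    let full = k / 4;
    for &b in &bytes[..full] {
        buf.extend_from_slice(&DEC4[b as usize]);
    }
    let rem = k % 4;
    if rem > 0 {
        // Partial byte: keep only the first `rem` decoded bases.
        buf.extend_from_slice(&DEC4[bytes[full] as usize][..rem]);
    }
}
```

If the caller reserves capacity once (for example with Vec::with_capacity) and reuses the buffer, the appends in this sketch do not reallocate.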

Reverse complement

Computed as pure arithmetic — no lookup table, no memory access:

let x = !self.0;                                                               // complement
let x = x.swap_bytes();                                                        // reverse bytes
let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2);   // swap 2-bit groups
Kmer(x << (64 - 2 * k))

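After complementing, bytes are reversed (swap_bytes), then the nibbles within each byte, then the 2-bit groups within each nibble, which leaves each 2-bit nucleotide intact but in reverse order. The reversal moves the k nucleotides into the low 2k bits, with the complemented padding (all ones) above them; the final left shift by 64 - 2k discards that padding and realigns the result to the MSB. Zero allocation — result lives on the stack.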
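
One way to gain confidence in the bit manipulation is to compare it with a naive nucleotide-by-nucleotide loop. The sketch below works on the raw u64 and assumes that complementary bases have 2-bit codes summing to 3 (for instance A=0, C=1, G=2, T=3), which is what makes a bitwise NOT act as the complement. revcomp_naive is a hypothetical reference function, not part of the documented API.

```rust
/// Bit-trick reverse complement on the raw left-aligned word, following the
/// steps described in the text.
fn revcomp_raw(raw: u64, k: usize) -> u64 {
    let x = !raw; // complement (also turns the zero padding into ones)
    let x = x.swap_bytes(); // reverse the 8 bytes
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4); // swap nibbles
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2); // swap 2-bit groups
    x << (64 - 2 * k) // drop the former padding, realign to the MSB
}

/// Naive reference: walk nucleotides from last to first, complementing each.
fn revcomp_naive(raw: u64, k: usize) -> u64 {
    let mut out: u64 = 0;
    for i in (0..k).rev() {
        let code = (raw >> (62 - 2 * i)) & 0b11;
        // Assumption: code(X) + code(complement of X) == 3.
        out = (out << 2) | (3 - code);
    }
    out << (64 - 2 * k)
}
```

Both functions require 1 <= k <= 32 and agree for every such k. Applying either one twice returns the original word, provided its padding bits are zero.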

Canonical form

pub fn canonical(&self, k: usize) -> Self {
    let rc = self.revcomp(k);
    if self.0 <= rc.0 { *self } else { rc }
}

Lexicographic minimum of forward and reverse-complement, comparing the raw u64 values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack.
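
The following sketch puts the pieces together to show that a k-mer and its reverse complement share one canonical form, and that comparing raw values matches comparing the sequences themselves. It assumes the codes A=0, C=1, G=2, T=3 (so code order coincides with ASCII order); encode_base and same_order are illustrative stand-ins, not the crate's own functions.

```rust
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Kmer(u64);

/// Assumed 2-bit codes (A=0, C=1, G=2, T=3); the real ENC table may differ.
fn encode_base(b: u8) -> u8 {
    match b {
        b'A' | b'a' => 0,
        b'C' | b'c' => 1,
        b'G' | b'g' => 2,
        _ => 3, // sketch only: T and every other byte map to 3
    }
}

impl Kmer {
    pub fn from_ascii(ascii: &[u8], k: usize) -> Self {
        let mut val: u64 = 0;
        for &b in &ascii[..k] {
            val = (val << 2) | encode_base(b) as u64;
        }
        Kmer(val << (64 - 2 * k))
    }

    pub fn revcomp(&self, k: usize) -> Self {
        let x = !self.0;
        let x = x.swap_bytes();
        let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
        let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
        Kmer(x << (64 - 2 * k))
    }

    pub fn canonical(&self, k: usize) -> Self {
        let rc = self.revcomp(k);
        if self.0 <= rc.0 { *self } else { rc }
    }
}

/// Hypothetical check: the order of the raw u64 values agrees with the
/// byte-wise order of two uppercase sequences of the same length k.
fn same_order(a: &[u8], b: &[u8], k: usize) -> bool {
    Kmer::from_ascii(a, k).0.cmp(&Kmer::from_ascii(b, k).0) == a[..k].cmp(&b[..k])
}
```

With these assumptions, the 3-mers TTG and CAA are reverse complements of each other and both canonicalize to CAA. The equivalence with nucleotide-wise comparison holds only between k-mers of the same k.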