# SuperKmer — implementation ## Memory layout `SuperKmer` holds two separate fields: ```rust pub struct SuperKmer { pub(crate) count: u32, pub(crate) inner: PackedSeq, } ``` `PackedSeq` stores a 2-bit packed DNA sequence as a heap-allocated `Box<[u8]>` plus a `tail: u8` field: | Field | Type | Role | |-------|------|------| | `tail` | `u8` | Number of valid nucleotides in the last byte: 0 encodes 4, 1–3 are identity | | `seq` | `Box<[u8]>` | 2-bit packed bytes, nucleotide 0 at bits 7–6 of `seq[0]` | Nucleotide length is recovered without storing it explicitly: ```text seql = (seq.len() - 1) * 4 + tail_count(tail) ``` There is no packed header word — `count` and the sequence live in separate fields. The on-disk binary format (produced by `write_to_binary`) is: ```text [varint(count)] [u8: seql − k] [packed bytes…] ``` `seql − k` fits in a `u8` when `n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256)`. If a super-kmer exceeds 256 kmers, `write_to_binary` splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by `read_from_binary`. The public accessors operate on the struct fields directly: ```rust fn seql(&self) -> usize { self.inner.seql() } fn count(&self) -> u32 { self.count } fn increment(&mut self) { self.count += 1; } fn add(&mut self, n: u32) { self.count += n; } fn set_count(&mut self, n: u32) { self.count = n; } ``` ## ASCII encoding and decoding Two lookup tables handle ASCII ↔ 2-bit conversion: - **`ENC: [u8; 32]`** — indexed by `b & 0x1F` (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically. - **`DEC4: [u32; 256]`** — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian `u32`. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters. Encoding 4 nucleotides into one byte: ```rust byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F] ``` Decoding one byte into 4 ASCII characters: ```rust DEC4[byte].to_be_bytes() // [nuc0, nuc1, nuc2, nuc3] in ASCII ``` ## Reverse complement The reverse complement is computed **in place** with zero allocation in two steps. **Step 1 — byte swap with `REVCOMP4`.** A 256-byte lookup table `REVCOMP4` maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying `REVCOMP4` to each: ```rust const fn revcomp4(x: u8) -> u8 { let x = !x; // complement all bases let x = (x >> 4) | (x << 4); // swap nibbles let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2); // swap 2-bit groups x } ``` `REVCOMP4` is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values. **Step 2 — realignment.** After step 1, `padding = n × 8 − seql × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed: ```rust let seql = self.seql(); shift = n * 8 - seql * 2 // number of padding bits bits.rotate_left(shift) bits[len - shift..].fill(false) ``` `Msb0` ordering makes the bit layout hardware-independent. !!! abstract "Algorithm — Super-kmer canonisation" ```text procedure SuperKmerCanonical(seq, SEQL): for i ← 0 to SEQL − 1: fwd ← nucleotide(seq, i) rev ← complement(nucleotide(seq, SEQL − 1 − i)) if fwd < rev: return seq -- forward is canonical if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical return seq -- palindrome: either orientation valid ``` ## Minimizer sliding window Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which tracks the current minimizer with a **monotonic deque** (`Ring`) inside `RollingStat`, a rolling-window entropy and minimizer tracker. Each deque entry stores: | Field | Type | Purpose | |------------|-------|----------------------------------------------| | `position` | usize | 0-based start of this m-mer in the segment | | `canonical`| u64 | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key | | `hash` | u64 | `hash_kmer(canonical << (64 − 2m))` — ordering key for random minimizer selection | The hash uses the seeded splitmix64 finalizer (`mix64(raw ^ 0x9e3779b97f4a7c15)`), the same function as `kmer::hash_kmer`. On each new nucleotide, once the window is full, the deque is updated: !!! abstract "Algorithm — minimizer deque update" ```text procedure UpdateMinimizer(deque, position, canonical, hash, k, received): -- pop dominated entries from the back while deque.back.hash ≥ hash: deque.pop_back() deque.push_back({position, canonical, hash}) -- evict expired entries from the front while deque.front.position + k < received: deque.pop_front() ``` The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide. A super-kmer boundary is emitted when the minimizer changes: `current_minimizer != prev_minimizer`. `SuperKmerIter` also emits a boundary when: - entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k−1) - super-kmer length reaches 256 nucleotides (cursor retreated by k) ## Kmer extraction A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i)`, which delegates to `PackedSeq::extract::(i)` and returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)): ```rust pub fn kmer(&self, i: usize) -> Result ``` The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation. --- !!! abstract "Algorithm — Super-kmer reverse complement" ```text procedure SuperKmerRevcomp(seq, SEQL): seql ← nucleotide length n ← ⌈seql / 4⌉ -- number of bytes shift ← n × 8 − seql × 2 -- padding bits to flush -- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table) lo ← 0 ; hi ← n − 1 while lo < hi: seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]] lo ← lo + 1 ; hi ← hi − 1 if lo == hi: seq[lo] ← REVCOMP4[seq[lo]] -- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec) if shift > 0: bits.rotate_left(shift) bits[n×8 − shift .. n×8].fill(0) ```