SuperKmer — implementation
Memory layout
A super-kmer is stored as a 32-bit header followed by a byte-aligned nucleotide sequence (2 bits/base, nucleotide 0 at the MSB of the first byte):
| Field | Bits | Role |
|---|---|---|
| COUNT | 24 | Occurrence count (≤ 16 M) |
| NKMERS | 8 | Number of kmers (= seq_length − k + 1, range 1–255) |
Bit layout (MSB to LSB): [31:8] COUNT [7:0] NKMERS
NKMERS is stored as a raw u8 in kmer units, not nucleotides. The nucleotide length is recovered as NKMERS + k − 1. This avoids the awkward wrapping convention (0 = 256) that would be needed if nucleotide length were stored directly, and gains k − 1 units of headroom (30 for k = 31):
| Unit stored | u8 covers | Equivalent to |
|---|---|---|
| nucleotides | 255 nt | 225 kmers |
| kmers | 255 kmers | 285 nt |
The public accessors:
fn n_kmers(&self) -> usize { (self.0 & 0xFF) as usize }
fn seql(&self) -> usize { self.n_kmers() + K - 1 }
fn count(&self) -> u32 { self.0 >> 8 }
fn increment(&mut self) { self.0 += 1 << 8; }
fn add(&mut self, n: u32) { self.0 += n << 8; }
fn set_count(&mut self, n: u32) { self.0 = (self.0 & 0xFF) | (n << 8); }
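As a usage illustration, the sketch below wraps these accessors in a self-contained newtype and exercises them. The type name `Header`, the constructor `new`, and the fixed `K = 31` are assumptions made for the example; they are not taken from the crate.

```rust
/// Hypothetical stand-in for the 32-bit header:
/// COUNT in bits [31:8], NKMERS in bits [7:0].
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Header(u32);

const K: usize = 31; // assumed k-mer size for this example

impl Header {
    /// Hypothetical constructor (not part of the original text).
    fn new(n_kmers: u8, count: u32) -> Self {
        assert!(count < (1 << 24), "COUNT must fit in 24 bits");
        Header((count << 8) | n_kmers as u32)
    }
    fn n_kmers(&self) -> usize { (self.0 & 0xFF) as usize }
    fn seql(&self) -> usize { self.n_kmers() + K - 1 }
    fn count(&self) -> u32 { self.0 >> 8 }
    fn increment(&mut self) { self.0 += 1 << 8; }
    fn add(&mut self, n: u32) { self.0 += n << 8; }
    fn set_count(&mut self, n: u32) { self.0 = (self.0 & 0xFF) | (n << 8); }
}
```

With NKMERS = 25 and k = 31, `seql()` returns 55. Note that `increment` and `add` do not saturate: once COUNT reaches 2^24 − 1, the addition overflows the u32 (a panic in debug builds, wrap-around in release), so callers are assumed to stay below the cap.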
In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers) — far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).
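The splitting rule can be sketched as a function that cuts a long sequence into nucleotide ranges of at most 255 kmers each, with consecutive ranges sharing k − 1 nucleotides. The name `split_ranges` and its signature are hypothetical; the sketch assumes `seql >= k` and only computes the ranges, not the packed output.

```rust
/// Returns half-open nucleotide ranges [start, end) such that each range
/// holds at most `max_kmers` k-mers and every k-mer of the input appears
/// in exactly one range. Consecutive ranges overlap by k - 1 nucleotides.
/// Hypothetical helper; assumes seql >= k and max_kmers >= 1.
fn split_ranges(seql: usize, k: usize, max_kmers: usize) -> Vec<(usize, usize)> {
    assert!(k >= 1 && seql >= k && max_kmers >= 1);
    let total_kmers = seql - k + 1;
    let mut ranges = Vec::new();
    let mut first = 0; // index of the first k-mer of the current chunk
    while first < total_kmers {
        let n = max_kmers.min(total_kmers - first);
        // n k-mers starting at `first` span n + k - 1 nucleotides
        ranges.push((first, first + n + k - 1));
        first += n;
    }
    ranges
}
```

For a 600-nucleotide sequence with k = 31 (570 kmers), this yields three chunks of 255, 255, and 60 kmers; the first two ranges are [0, 285) and [255, 540), which share 30 nucleotides.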
The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
ASCII encoding and decoding
Two lookup tables handle ASCII ↔ 2-bit conversion:
- ENC: [u8; 32], indexed by b & 0x1F (the lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, and T/t or U/u→3; ambiguous bases and unknown characters silently map to 0 (A). Masking to 5 bits makes upper- and lowercase letters index the same entry. The 32 entries fit entirely in L1 cache.
- DEC4: [u32; 256], maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian u32. 1 KB total, fits in L1 cache. One lookup per packed byte yields 4 decoded characters.
Encoding 4 nucleotides into one byte:
byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]
Decoding one byte into 4 ASCII characters:
DEC4[byte].to_be_bytes() // [nuc0, nuc1, nuc2, nuc3] in ASCII
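One way to build both tables at compile time and round-trip a group of four bases is sketched below. The builder functions and the `encode4`/`decode4` wrappers are hypothetical names introduced for the example; only the table shapes and the two expressions above come from the text.

```rust
/// ENC: indexed by the lower 5 bits of an ASCII byte.
/// Everything except C, G, T and U stays at 0 (A).
const fn build_enc() -> [u8; 32] {
    let mut t = [0u8; 32];
    t[(b'C' & 0x1F) as usize] = 1;
    t[(b'G' & 0x1F) as usize] = 2;
    t[(b'T' & 0x1F) as usize] = 3;
    t[(b'U' & 0x1F) as usize] = 3;
    t
}
const ENC: [u8; 32] = build_enc();

/// DEC4: packed byte -> 4 ASCII characters stored as a big-endian u32.
const fn build_dec4() -> [u32; 256] {
    const NUC: [u8; 4] = *b"ACGT";
    let mut t = [0u32; 256];
    let mut b = 0usize;
    while b < 256 {
        t[b] = u32::from_be_bytes([
            NUC[(b >> 6) & 3],
            NUC[(b >> 4) & 3],
            NUC[(b >> 2) & 3],
            NUC[b & 3],
        ]);
        b += 1;
    }
    t
}
const DEC4: [u32; 256] = build_dec4();

/// Packs 4 ASCII nucleotides into one byte, nucleotide 0 in the top 2 bits.
fn encode4(c: [u8; 4]) -> u8 {
    ENC[(c[0] & 0x1F) as usize] << 6
        | ENC[(c[1] & 0x1F) as usize] << 4
        | ENC[(c[2] & 0x1F) as usize] << 2
        | ENC[(c[3] & 0x1F) as usize]
}

/// Unpacks one byte into 4 uppercase ASCII nucleotides.
fn decode4(byte: u8) -> [u8; 4] {
    DEC4[byte as usize].to_be_bytes()
}
```

Decoding always produces uppercase A/C/G/T, so a round trip normalises case, turns U into T, and turns ambiguous bases into A.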
Reverse complement
The reverse complement is computed in place with zero allocation in two steps.
Step 1 — byte swap with REVCOMP4. A 256-byte lookup table REVCOMP4 maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying REVCOMP4 to each:
const fn revcomp4(x: u8) -> u8 {
    let x = !x;                                     // complement all bases
    let x = (x >> 4) | (x << 4);                    // swap nibbles
    let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2);  // swap 2-bit groups
    x
}
REVCOMP4 is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
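A minimal sketch of the compile-time construction follows. The builder name `build_revcomp4` is an assumption; `revcomp4` is repeated so the block is self-contained.

```rust
/// Reverse complement of 4 packed nucleotides (A=0, C=1, G=2, T=3).
const fn revcomp4(x: u8) -> u8 {
    let x = !x;                                    // complement: b -> 3 - b
    let x = (x >> 4) | (x << 4);                   // swap nibbles
    ((x >> 2) & 0x33) | ((x & 0x33) << 2)          // swap 2-bit groups
}

/// Fills the 256-entry table in a const context (hypothetical builder).
const fn build_revcomp4() -> [u8; 256] {
    let mut t = [0u8; 256];
    let mut i = 0usize;
    while i < 256 {
        t[i] = revcomp4(i as u8);
        i += 1;
    }
    t
}

const REVCOMP4: [u8; 256] = build_revcomp4();
```

ACGT (0x1B) is its own reverse complement, AAAA (0x00) maps to TTTT (0xFF), and applying the table twice returns the original byte.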
Step 2 — realignment. After step 1, padding = n × 8 − seql × 2 spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using BitSlice<u8, Msb0>::rotate_left(padding) from the bitvec crate, which is SIMD-accelerated. The trailing padding bits are then zeroed:
let seql = self.n_kmers() + k - 1;   // nucleotide length
let n = (seql + 3) / 4;              // bytes in the packed sequence
let shift = n * 8 - seql * 2;        // number of padding bits
bits.rotate_left(shift);
bits[n * 8 - shift..].fill(false);
Msb0 ordering makes the bit layout hardware-independent.
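The two steps can be checked end to end with a std-only sketch. It is not the bitvec-based code described above: the rotate-then-zero pair is replaced by a plain byte-wise left shift, which gives the same result because the bits rotated to the end are zeroed anyway and the padding is always 0, 2, 4, or 6 bits. The function name `revcomp_in_place` is hypothetical.

```rust
/// Reverse complement of 4 packed nucleotides (A=0, C=1, G=2, T=3).
const fn revcomp4(x: u8) -> u8 {
    let x = !x;
    let x = (x >> 4) | (x << 4);
    ((x >> 2) & 0x33) | ((x & 0x33) << 2)
}

/// In-place reverse complement of a packed sequence of `seql` nucleotides.
/// Assumes seq.len() == ceil(seql / 4) and zeroed padding bits on entry.
fn revcomp_in_place(seq: &mut [u8], seql: usize) {
    let n = seq.len();
    assert_eq!(n, (seql + 3) / 4);

    // Step 1: reverse the byte order and reverse-complement each byte.
    seq.reverse();
    for b in seq.iter_mut() {
        *b = revcomp4(*b);
    }

    // Step 2: the complemented padding now sits in the top bits of byte 0.
    // Shift the whole array left by `shift` bits; zeros enter at the end.
    let shift = n * 8 - seql * 2; // always < 8
    if shift > 0 {
        for i in 0..n {
            let next = if i + 1 < n { seq[i + 1] } else { 0 };
            seq[i] = (seq[i] << shift) | (next >> (8 - shift));
        }
    }
}
```

For ACGTA, packed as [0x1B, 0x00], the result is TACGT, packed as [0xC6, 0xC0]; applying the function a second time restores the input.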
Algorithm — Super-kmer canonisation
procedure SuperKmerCanonical(seq, SEQL):
    for i ← 0 to SEQL − 1:
        fwd ← nucleotide(seq, i)
        rev ← complement(nucleotide(seq, SEQL − 1 − i))
        if fwd < rev: return seq                          -- forward is canonical
        if fwd > rev: return SuperKmerRevcomp(seq, SEQL)  -- revcomp is canonical
    return seq                                            -- palindrome: either orientation valid
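The comparison loop translates almost line for line into Rust. In this sketch the helper `nucleotide` and the function `orientation` are hypothetical names; the function only reports which strand is smaller and leaves the actual reverse complement to the caller.

```rust
use std::cmp::Ordering;

/// Reads nucleotide `i` from a packed sequence (nucleotide 0 at the MSB of byte 0).
fn nucleotide(seq: &[u8], i: usize) -> u8 {
    (seq[i / 4] >> (6 - 2 * (i % 4))) & 3
}

/// Compares the forward strand with its reverse complement without building it.
/// Less    -> forward is canonical
/// Greater -> reverse complement is canonical
/// Equal   -> palindrome, either orientation is valid
fn orientation(seq: &[u8], seql: usize) -> Ordering {
    for i in 0..seql {
        let fwd = nucleotide(seq, i);
        let rev = 3 - nucleotide(seq, seql - 1 - i); // complement of the mirrored base
        match fwd.cmp(&rev) {
            Ordering::Equal => continue,
            decided => return decided,
        }
    }
    Ordering::Equal
}
```

The loop usually exits after one or two positions, since a tie requires each base to be the complement of its mirror image.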
Minimizer sliding window
Super-kmers are built by SuperKmerIter (crate obiskbuilder), which maintains the current minimizer with a monotonic deque over a sliding window of W = k − m + 1 m-mer positions.
Each deque entry stores:
| Field | Type | Purpose |
|---|---|---|
| position | usize | 0-based start of this m-mer in the segment |
| canonical | u64 | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
| hash | u64 | \(H(\text{canonical})\); ordering key for random minimizer selection |
The hash \(H\) is the seeded splitmix64 finalizer (see Minimizer selection):
fn hash_mmer(canonical: u64) -> u64 {
    let x = canonical ^ 0x9e3779b97f4a7c15; // seed: eliminates fixed point at 0
    let x = x ^ (x >> 30);
    let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    let x = x ^ (x >> 27);
    let x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}
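To see what the seed buys, the sketch below separates the unseeded finalizer from the seeded hash. The helper name `mix64` is an assumption introduced for the comparison; `hash_mmer` is the function above expressed through it.

```rust
/// Unseeded splitmix64 finalizer (hypothetical helper name).
/// Every step maps 0 to 0, so 0 is a fixed point.
fn mix64(x: u64) -> u64 {
    let x = x ^ (x >> 30);
    let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    let x = x ^ (x >> 27);
    let x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Seeded variant: XOR with a constant before mixing moves the fixed point
/// away from the all-zero input.
fn hash_mmer(canonical: u64) -> u64 {
    mix64(canonical ^ 0x9e3779b97f4a7c15)
}
```

The all-A m-mer encodes to canonical value 0. Without the seed it would hash to 0, the smallest possible ordering key, and would win every window it appears in; with the seed it is ordered like any other m-mer.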
On each new nucleotide, once the window is full, the deque is updated:
Algorithm — minimizer deque update
procedure UpdateMinimizer(deque, position, canonical, hash, k, received):
    -- pop dominated entries from the back
    while deque is not empty and deque.back.hash ≥ hash:
        deque.pop_back()
    deque.push_back({position, canonical, hash})
    -- evict expired entries from the front
    -- (the entry just pushed is never expired, so the deque cannot become empty here)
    while deque.front.position + k < received:
        deque.pop_front()
The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.
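The update can be sketched with `std::collections::VecDeque`. To keep the example short it takes precomputed m-mer hashes as input and omits the canonical field; the `Entry` struct and the driver `minimizer_positions` are hypothetical, and `received` is taken to be the number of nucleotides consumed so far, so the m-mer at position p is complete when received = p + m.

```rust
use std::collections::VecDeque;

/// Deque entry (the `canonical` field is omitted in this sketch).
#[derive(Clone, Copy, Debug)]
struct Entry {
    position: usize, // 0-based start of the m-mer
    hash: u64,       // ordering key
}

/// One deque update, following the procedure above.
fn update_minimizer(deque: &mut VecDeque<Entry>, e: Entry, k: usize, received: usize) {
    // Pop dominated entries from the back (ties favour the newest m-mer).
    while deque.back().map_or(false, |b| b.hash >= e.hash) {
        deque.pop_back();
    }
    deque.push_back(e);
    // Evict m-mers that start before the current k-mer window.
    while deque.front().map_or(false, |f| f.position + k < received) {
        deque.pop_front();
    }
}

/// Hypothetical driver: returns the minimizer position for every k-mer,
/// given the hash of each m-mer in sequence order.
fn minimizer_positions(hashes: &[u64], k: usize, m: usize) -> Vec<usize> {
    let mut deque = VecDeque::new();
    let mut out = Vec::new();
    for (position, &hash) in hashes.iter().enumerate() {
        let received = position + m; // nucleotides consumed so far
        update_minimizer(&mut deque, Entry { position, hash }, k, received);
        if received >= k {
            // Window is full: the front entry is the current minimizer.
            out.push(deque.front().expect("deque holds the entry just pushed").position);
        }
    }
    out
}
```

With k = 4 and m = 2 (W = 3), the hashes [5, 3, 8, 1, 9, 7] give minimizer positions [1, 3, 3, 3]: the minimizer changes once, between the first and second k-mer.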
A super-kmer boundary is emitted when the minimizer changes: deque.front.hash ≠ prev_hash. The canonical field of the front entry is not used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key \(H(\text{canonical})\) can be recomputed independently at routing time from the stored minimizer_pos, without inheriting the minimum-order-statistic bias (see Minimizer selection — partition key independence).
Kmer extraction
A k-mer is extracted from a super-kmer with SuperKmer::kmer(i, k), which returns a Kmer — a left-aligned u64 newtype (see Kmer implementation):
pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
The bit slice seq[i*2 .. (i+k)*2] (Msb0 order) is loaded as a big-endian u64 via load_be from bitvec's BitField trait, which places the 2k bits at the low end of the word; a left shift by 64 − 2k then produces the canonical left-aligned layout. One call: no loop, no allocation.
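For readers without bitvec at hand, the same extraction can be sketched with the standard library only. This is not the crate's implementation: it returns a bare u64 instead of `Result<Kmer, KmerError>`, the name `kmer_left_aligned` is hypothetical, and it assumes k ≤ 32 and that the requested k-mer lies inside the packed buffer.

```rust
/// Extracts k-mer `i` from a packed sequence (2 bits/base, nucleotide 0 at the
/// MSB of byte 0) as a left-aligned u64: the k-mer occupies the top 2k bits,
/// the remaining low bits are zero.
fn kmer_left_aligned(seq: &[u8], i: usize, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    assert!((i + k) * 2 <= seq.len() * 8, "k-mer out of range");

    let bit = 2 * i;      // bit offset of the first nucleotide
    let first = bit / 8;  // byte holding that bit

    // Copy up to 16 bytes into a zero-padded buffer and view it as a
    // big-endian u128; 2k bits plus a sub-byte offset need at most 9 bytes.
    let mut buf = [0u8; 16];
    let avail = (seq.len() - first).min(16);
    buf[..avail].copy_from_slice(&seq[first..first + avail]);

    // Drop the leading bits that belong to earlier nucleotides.
    let window = u128::from_be_bytes(buf) << (bit % 8);

    // Keep the top 64 bits, then clear everything past the k-mer.
    let top = (window >> 64) as u64;
    top & (u64::MAX << (64 - 2 * k))
}
```

For ACGTA packed as [0x1B, 0x00] and k = 3, k-mer 1 (CGT) comes out as 0b011011 in the top six bits of the word.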
Algorithm — Super-kmer reverse complement
procedure SuperKmerRevcomp(seq, SEQL):
    seql ← SEQL                   -- nucleotide length (= NKMERS + k − 1)
    n ← ⌈seql / 4⌉                -- number of bytes
    shift ← n × 8 − seql × 2      -- padding bits to flush
    -- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
    lo ← 0 ; hi ← n − 1
    while lo < hi:
        seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
        lo ← lo + 1 ; hi ← hi − 1
    if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]
    -- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
    -- bits is the Msb0 bit view of seq[0 .. n]
    if shift > 0:
        bits.rotate_left(shift)
        bits[n×8 − shift .. n×8].fill(0)