Files
obikmer/docmd/implementation/superkmer.md
T
2026-04-19 12:17:16 +02:00

5.2 KiB
Raw Blame History

SuperKmer — implementation

Memory layout

A super-kmer is stored as a 32-bit header followed by a byte-aligned nucleotide sequence (2 bits/base, nucleotide 0 at the MSB of the first byte, max 256 nt):

Field Bits Role
COUNT 24 Occurrence count (≤ 16 M)
SEQL 8 Sequence length in nucleotides (1256)

Bit layout (MSB to LSB): [31:8] COUNT [7:0] SEQL

SEQL is stored as a raw u8: values 1255 represent lengths 1255; 0 represents 256 (wrapping convention). The public accessor returns a usize and performs the conversion:

fn seql(&self)               -> usize { if s == 0 { 256 } else { s as usize } }
fn count(&self)              -> u32   { self.0 >> 8 }
fn increment(&mut self)               { self.0 += 1 << 8; }
fn add(&mut self, n: u32)             { self.0 += n << 8; }
fn set_count(&mut self, n: u32)       { self.0 = (self.0 & 0xFF) | (n << 8); }

The SEQL field is 8 bits, capping the stored sequence at 256 nt. Given the expected length of ~40 nt, this cap is almost never reached; when it is, the super-kmer is split at 256 nt with a k1 overlap, preserving all kmers without duplication.

The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.

ASCII encoding and decoding

Two lookup tables handle ASCII ↔ 2-bit conversion:

  • ENC: [u8; 32] — indexed by b & 0x1F (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically.
  • DEC4: [u32; 256] — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian u32. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters.

Encoding 4 nucleotides into one byte:

byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]

Decoding one byte into 4 ASCII characters:

DEC4[byte].to_be_bytes()   // [nuc0, nuc1, nuc2, nuc3] in ASCII

Reverse complement

The reverse complement is computed in place with zero allocation in two steps.

Step 1 — byte swap with REVCOMP4. A 256-byte lookup table REVCOMP4 maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying REVCOMP4 to each:

const fn revcomp4(x: u8) -> u8 {
    let x = !x;                                       // complement all bases
    let x = (x >> 4) | (x << 4);                     // swap nibbles
    let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2);  // swap 2-bit groups
    x
}

REVCOMP4 is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.

Step 2 — realignment. After step 1, padding = n × 8 SEQL × 2 spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using BitSlice<u8, Msb0>::rotate_left(padding) from the bitvec crate, which is SIMD-accelerated. The trailing padding bits are then zeroed:

shift = n * 8 - SEQL * 2          // number of padding bits
bits.rotate_left(shift)
bits[len - shift..].fill(false)

Msb0 ordering makes the bit layout hardware-independent.

!!! abstract "Algorithm — Super-kmer canonisation" text procedure SuperKmerCanonical(seq, SEQL): for i ← 0 to SEQL 1: fwd ← nucleotide(seq, i) rev ← complement(nucleotide(seq, SEQL 1 i)) if fwd < rev: return seq -- forward is canonical if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical return seq -- palindrome: either orientation valid

Kmer extraction

A k-mer is extracted from a super-kmer with SuperKmer::kmer(i, k), which returns a Kmer — a left-aligned u64 newtype (see Kmer implementation):

pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>

The bit slice seq[i*2 .. (i+k)*2] (Msb0 order) is loaded as a big-endian u64 via bitvec::load_be, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.


!!! abstract "Algorithm — Super-kmer reverse complement" ```text procedure SuperKmerRevcomp(seq, SEQL): n ← ⌈SEQL / 4⌉ -- number of bytes shift ← n × 8 SEQL × 2 -- padding bits to flush

    -- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
    lo ← 0 ; hi ← n  1
    while lo < hi:
        seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
        lo ← lo + 1 ; hi ← hi  1
    if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]

    -- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
    if shift > 0:
        bits.rotate_left(shift)
        bits[n×8  shift .. n×8].fill(0)
```