Files

T

Eric Coissac 036d044291 refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.

2026-05-26 10:04:25 +02:00

7.1 KiB

Raw Blame History

SuperKmer — implementation

Memory layout

SuperKmer holds two separate fields:

pub struct SuperKmer {
    pub(crate) count: u32,
    pub(crate) inner: PackedSeq,
}

PackedSeq stores a 2-bit packed DNA sequence as a heap-allocated Box<[u8]> plus a tail: u8 field:

Field	Type	Role
`tail`	`u8`	Number of valid nucleotides in the last byte: 0 encodes 4, 1–3 are identity
`seq`	`Box<[u8]>`	2-bit packed bytes, nucleotide 0 at bits 7–6 of `seq[0]`

Nucleotide length is recovered without storing it explicitly:

seql = (seq.len() - 1) * 4 + tail_count(tail)

There is no packed header word — count and the sequence live in separate fields.

The on-disk binary format (produced by write_to_binary) is:

[varint(count)] [u8: seql − k] [packed bytes…]

seql − k fits in a u8 when n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256). If a super-kmer exceeds 256 kmers, write_to_binary splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by read_from_binary.

The public accessors operate on the struct fields directly:

fn seql(&self)            -> usize { self.inner.seql() }
fn count(&self)           -> u32   { self.count }
fn increment(&mut self)            { self.count += 1; }
fn add(&mut self, n: u32)          { self.count += n; }
fn set_count(&mut self, n: u32)    { self.count = n; }

ASCII encoding and decoding

Two lookup tables handle ASCII ↔ 2-bit conversion:

ENC: [u8; 32] — indexed by b & 0x1F (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically.
DEC4: [u32; 256] — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian u32. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters.

Encoding 4 nucleotides into one byte:

byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]

Decoding one byte into 4 ASCII characters:

DEC4[byte].to_be_bytes()   // [nuc0, nuc1, nuc2, nuc3] in ASCII

Reverse complement

The reverse complement is computed in place with zero allocation in two steps.

Step 1 — byte swap with REVCOMP4. A 256-byte lookup table REVCOMP4 maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying REVCOMP4 to each:

const fn revcomp4(x: u8) -> u8 {
    let x = !x;                                       // complement all bases
    let x = (x >> 4) | (x << 4);                     // swap nibbles
    let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2);  // swap 2-bit groups
    x
}

REVCOMP4 is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.

Step 2 — realignment. After step 1, padding = n × 8 − seql × 2 spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using BitSlice<u8, Msb0>::rotate_left(padding) from the bitvec crate, which is SIMD-accelerated. The trailing padding bits are then zeroed:

let seql = self.seql();
shift = n * 8 - seql * 2          // number of padding bits
bits.rotate_left(shift)
bits[len - shift..].fill(false)

Msb0 ordering makes the bit layout hardware-independent.

!!! abstract "Algorithm — Super-kmer canonisation" text procedure SuperKmerCanonical(seq, SEQL): for i ← 0 to SEQL − 1: fwd ← nucleotide(seq, i) rev ← complement(nucleotide(seq, SEQL − 1 − i)) if fwd < rev: return seq -- forward is canonical if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical return seq -- palindrome: either orientation valid

Minimizer sliding window

Super-kmers are built by SuperKmerIter (crate obiskbuilder), which tracks the current minimizer with a monotonic deque (Ring<MmerItem, 32>) inside RollingStat, a rolling-window entropy and minimizer tracker.

Each deque entry stores:

Field	Type	Purpose
`position`	usize	0-based start of this m-mer in the segment
`canonical`	u64	right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key
`hash`	u64	`hash_kmer(canonical << (64 − 2m))` — ordering key for random minimizer selection

The hash uses the seeded splitmix64 finalizer (mix64(raw ^ 0x9e3779b97f4a7c15)), the same function as kmer::hash_kmer.

On each new nucleotide, once the window is full, the deque is updated:

!!! abstract "Algorithm — minimizer deque update" ```text procedure UpdateMinimizer(deque, position, canonical, hash, k, received): -- pop dominated entries from the back while deque.back.hash ≥ hash: deque.pop_back() deque.push_back({position, canonical, hash})

    -- evict expired entries from the front
    while deque.front.position + k < received:
        deque.pop_front()
```

The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.

A super-kmer boundary is emitted when the minimizer changes: current_minimizer != prev_minimizer. SuperKmerIter also emits a boundary when:

entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k−1)
super-kmer length reaches 256 nucleotides (cursor retreated by k)

Kmer extraction

A k-mer is extracted from a super-kmer with SuperKmer::kmer(i), which delegates to PackedSeq::extract::<KLen>(i) and returns a Kmer — a left-aligned u64 newtype (see Kmer implementation):

pub fn kmer(&self, i: usize) -> Result<Kmer, KmerError>

The bit slice seq[i*2 .. (i+k)*2] (Msb0 order) is loaded as a u64 via bitvec::load_be, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.

!!! abstract "Algorithm — Super-kmer reverse complement" ```text procedure SuperKmerRevcomp(seq, SEQL): seql ← nucleotide length n ← ⌈seql / 4⌉ -- number of bytes shift ← n × 8 − seql × 2 -- padding bits to flush

    -- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
    lo ← 0 ; hi ← n − 1
    while lo < hi:
        seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
        lo ← lo + 1 ; hi ← hi − 1
    if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]

    -- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
    if shift > 0:
        bits.rotate_left(shift)
        bits[n×8 − shift .. n×8].fill(0)
```

7.1 KiB Raw Blame History Unescape Escape