Files
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00

172 lines
7.1 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# SuperKmer — implementation
## Memory layout
`SuperKmer` holds two separate fields:
```rust
pub struct SuperKmer {
pub(crate) count: u32,
pub(crate) inner: PackedSeq,
}
```
`PackedSeq` stores a 2-bit packed DNA sequence as a heap-allocated `Box<[u8]>` plus a `tail: u8` field:
| Field | Type | Role |
|-------|------|------|
| `tail` | `u8` | Number of valid nucleotides in the last byte: 0 encodes 4, 13 are identity |
| `seq` | `Box<[u8]>` | 2-bit packed bytes, nucleotide 0 at bits 76 of `seq[0]` |
Nucleotide length is recovered without storing it explicitly:
```text
seql = (seq.len() - 1) * 4 + tail_count(tail)
```
There is no packed header word — `count` and the sequence live in separate fields.
The on-disk binary format (produced by `write_to_binary`) is:
```text
[varint(count)] [u8: seql k] [packed bytes…]
```
`seql k` fits in a `u8` when `n_kmers = seql k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256)`. If a super-kmer exceeds 256 kmers, `write_to_binary` splits it into overlapping chunks (k1 nucleotide overlap, same count per chunk), each a self-contained record readable by `read_from_binary`.
The public accessors operate on the struct fields directly:
```rust
fn seql(&self) -> usize { self.inner.seql() }
fn count(&self) -> u32 { self.count }
fn increment(&mut self) { self.count += 1; }
fn add(&mut self, n: u32) { self.count += n; }
fn set_count(&mut self, n: u32) { self.count = n; }
```
## ASCII encoding and decoding
Two lookup tables handle ASCII ↔ 2-bit conversion:
- **`ENC: [u8; 32]`** — indexed by `b & 0x1F` (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically.
- **`DEC4: [u32; 256]`** — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian `u32`. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters.
Encoding 4 nucleotides into one byte:
```rust
byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]
```
Decoding one byte into 4 ASCII characters:
```rust
DEC4[byte].to_be_bytes() // [nuc0, nuc1, nuc2, nuc3] in ASCII
```
## Reverse complement
The reverse complement is computed **in place** with zero allocation in two steps.
**Step 1 — byte swap with `REVCOMP4`.** A 256-byte lookup table `REVCOMP4` maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying `REVCOMP4` to each:
```rust
const fn revcomp4(x: u8) -> u8 {
let x = !x; // complement all bases
let x = (x >> 4) | (x << 4); // swap nibbles
let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2); // swap 2-bit groups
x
}
```
`REVCOMP4` is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
**Step 2 — realignment.** After step 1, `padding = n × 8 seql × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:
```rust
let seql = self.seql();
shift = n * 8 - seql * 2 // number of padding bits
bits.rotate_left(shift)
bits[len - shift..].fill(false)
```
`Msb0` ordering makes the bit layout hardware-independent.
!!! abstract "Algorithm — Super-kmer canonisation"
```text
procedure SuperKmerCanonical(seq, SEQL):
for i ← 0 to SEQL 1:
fwd ← nucleotide(seq, i)
rev ← complement(nucleotide(seq, SEQL 1 i))
if fwd < rev: return seq -- forward is canonical
if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical
return seq -- palindrome: either orientation valid
```
## Minimizer sliding window
Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which tracks the current minimizer with a **monotonic deque** (`Ring<MmerItem, 32>`) inside `RollingStat`, a rolling-window entropy and minimizer tracker.
Each deque entry stores:
| Field | Type | Purpose |
|------------|-------|----------------------------------------------|
| `position` | usize | 0-based start of this m-mer in the segment |
| `canonical`| u64 | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
| `hash` | u64 | `hash_kmer(canonical << (64 2m))` — ordering key for random minimizer selection |
The hash uses the seeded splitmix64 finalizer (`mix64(raw ^ 0x9e3779b97f4a7c15)`), the same function as `kmer::hash_kmer`.
On each new nucleotide, once the window is full, the deque is updated:
!!! abstract "Algorithm — minimizer deque update"
```text
procedure UpdateMinimizer(deque, position, canonical, hash, k, received):
-- pop dominated entries from the back
while deque.back.hash ≥ hash:
deque.pop_back()
deque.push_back({position, canonical, hash})
-- evict expired entries from the front
while deque.front.position + k < received:
deque.pop_front()
```
The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.
A super-kmer boundary is emitted when the minimizer changes: `current_minimizer != prev_minimizer`. `SuperKmerIter` also emits a boundary when:
- entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k1)
- super-kmer length reaches 256 nucleotides (cursor retreated by k)
## Kmer extraction
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i)`, which delegates to `PackedSeq::extract::<KLen>(i)` and returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
```rust
pub fn kmer(&self, i: usize) -> Result<Kmer, KmerError>
```
The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
---
!!! abstract "Algorithm — Super-kmer reverse complement"
```text
procedure SuperKmerRevcomp(seq, SEQL):
seql ← nucleotide length
n ← ⌈seql / 4⌉ -- number of bytes
shift ← n × 8 seql × 2 -- padding bits to flush
-- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
lo ← 0 ; hi ← n 1
while lo < hi:
seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
lo ← lo + 1 ; hi ← hi 1
if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]
-- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
if shift > 0:
bits.rotate_left(shift)
bits[n×8 shift .. n×8].fill(0)
```