obikmer/docmd/implementation/superkmer.md
Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperkMer routing with a new RoutableSuperKimer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00

# SuperKmer — implementation
## Memory layout
A super-kmer is stored as a **32-bit header** followed by a **byte-aligned nucleotide sequence** (2 bits/base, nucleotide 0 at the MSB of the first byte):
| Field | Bits | Role |
|-------|------|------|
| COUNT | 24 | Occurrence count (≤ 16 M) |
| NKMERS | 8 | Number of kmers (= seq_length − k + 1, range 1–255) |
Bit layout (MSB to LSB): `[31:8] COUNT [7:0] NKMERS`
NKMERS is stored as a raw `u8` in **kmer units**, not nucleotides. The nucleotide length is recovered as `NKMERS + k − 1`. This avoids the awkward wrapping convention (`0 = 256`) that would be needed if nucleotide length were stored directly, and gains k − 1 = 30 nucleotides of headroom (for k = 31):
| unit | u8 covers | equivalent (k = 31) |
|---|---|---|
| nucleotides | 255 nt | 225 kmers |
| **kmers** | **255 kmers** | **285 nt** |
The public accessors:
```rust
fn n_kmers(&self) -> usize { (self.0 & 0xFF) as usize }
fn seql(&self) -> usize { self.n_kmers() + K - 1 }
fn count(&self) -> u32 { self.0 >> 8 }
fn increment(&mut self) { self.0 += 1 << 8; }
fn add(&mut self, n: u32) { self.0 += n << 8; }
fn set_count(&mut self, n: u32) { self.0 = (self.0 & 0xFF) | (n << 8); }
```
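As a minimal sketch of how such a header could be packed and checked, the snippet below wraps the `u32` in a hypothetical `Header` newtype. The constructor `new`, its range checks, and the constant `K = 31` are assumptions made for illustration; only the three read accessors mirror the ones listed above.

```rust
/// Hypothetical header newtype: COUNT in bits [31:8], NKMERS in bits [7:0].
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Header(u32);

const K: usize = 31; // assumed k-mer size, as in the examples of this page

impl Header {
    /// Pack a count (< 2^24) and a number of k-mers (1..=255).
    /// Returns None when either value does not fit its field.
    fn new(count: u32, n_kmers: usize) -> Option<Self> {
        if count >= 1 << 24 || n_kmers == 0 || n_kmers > 255 {
            return None;
        }
        Some(Header((count << 8) | n_kmers as u32))
    }

    fn n_kmers(&self) -> usize { (self.0 & 0xFF) as usize }
    fn seql(&self) -> usize { self.n_kmers() + K - 1 }
    fn count(&self) -> u32 { self.0 >> 8 }
}
```

With these assumptions, `Header::new(3, 25)` packs to `0x0000_0319` and reports a nucleotide length of 55.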
In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers), far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k − 1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).
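The split rule can be sketched as follows. The function name `split_ranges` and its `(start, length)` return convention are hypothetical; the sketch only illustrates how chunks of at most 255 kmers overlap by k − 1 nucleotides so that every kmer lands in exactly one chunk.

```rust
const MAX_KMERS: usize = 255; // largest value NKMERS can hold

/// Split a run of `seql` nucleotides into (start, length) nucleotide ranges,
/// each holding at most MAX_KMERS k-mers. Consecutive ranges overlap by
/// k - 1 nucleotides, so no k-mer is lost or duplicated.
fn split_ranges(seql: usize, k: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    if k == 0 || seql < k {
        return out; // too short to contain a single k-mer
    }
    let mut remaining = seql - k + 1; // k-mers still to assign
    let mut start = 0;                // nucleotide offset of the next chunk
    while remaining > 0 {
        let nk = remaining.min(MAX_KMERS);
        out.push((start, nk + k - 1));
        start += nk; // the next chunk starts at the first unassigned k-mer
        remaining -= nk;
    }
    out
}
```

For a 300-nucleotide run with k = 31 (270 kmers), this yields one chunk of 285 nucleotides starting at 0 and one of 45 nucleotides starting at 255.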
The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
## ASCII encoding and decoding
Two lookup tables handle ASCII ↔ 2-bit conversion:
- **`ENC: [u8; 32]`**: indexed by `b & 0x1F` (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknown characters silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically because they share the same lower 5 bits.
- **`DEC4: [u32; 256]`**: maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian `u32`. 1 KB total, fits in L1 cache. One lookup per packed byte yields 4 decoded characters.
Encoding 4 nucleotides into one byte:
```rust
let byte = ENC[(c0 & 0x1F) as usize] << 6 | ENC[(c1 & 0x1F) as usize] << 4
         | ENC[(c2 & 0x1F) as usize] << 2 | ENC[(c3 & 0x1F) as usize];
```
Decoding one byte into 4 ASCII characters:
```rust
DEC4[byte].to_be_bytes() // [nuc0, nuc1, nuc2, nuc3] in ASCII
```
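The two tables and their use can be sketched end to end as below. The builder functions `build_enc` and `build_dec4` and the helpers `encode` and `decode` are hypothetical names; the actual crate may construct its tables differently. The sketch pads the last byte with zeros (A) and truncates on decoding.

```rust
/// Build ENC: index = ASCII byte & 0x1F; anything that is not C, G, T or U maps to 0 (A).
const fn build_enc() -> [u8; 32] {
    let mut t = [0u8; 32];
    t[(b'C' & 0x1F) as usize] = 1;
    t[(b'G' & 0x1F) as usize] = 2;
    t[(b'T' & 0x1F) as usize] = 3;
    t[(b'U' & 0x1F) as usize] = 3;
    t
}

/// Build DEC4: packed byte -> 4 ASCII letters stored as a big-endian u32.
const fn build_dec4() -> [u32; 256] {
    let alphabet = [b'A', b'C', b'G', b'T'];
    let mut t = [0u32; 256];
    let mut b = 0;
    while b < 256 {
        t[b] = u32::from_be_bytes([
            alphabet[(b >> 6) & 3],
            alphabet[(b >> 4) & 3],
            alphabet[(b >> 2) & 3],
            alphabet[b & 3],
        ]);
        b += 1;
    }
    t
}

static ENC: [u8; 32] = build_enc();
static DEC4: [u32; 256] = build_dec4();

/// Pack ASCII nucleotides 4 per byte, nucleotide 0 at the MSB; the tail is padded with A.
fn encode(ascii: &[u8]) -> Vec<u8> {
    ascii
        .chunks(4)
        .map(|chunk| {
            let mut byte = 0u8;
            for (j, &c) in chunk.iter().enumerate() {
                byte |= ENC[(c & 0x1F) as usize] << (6 - 2 * j);
            }
            byte
        })
        .collect()
}

/// Decode `seql` nucleotides: one DEC4 lookup per packed byte, then drop the padding.
fn decode(packed: &[u8], seql: usize) -> Vec<u8> {
    let mut out: Vec<u8> = packed
        .iter()
        .flat_map(|&b| DEC4[b as usize].to_be_bytes())
        .collect();
    out.truncate(seql);
    out
}
```

Decoding always produces uppercase A/C/G/T, so a round trip normalises case and replaces ambiguous bases with A.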
## Reverse complement
The reverse complement is computed **in place** with zero allocation in two steps.
**Step 1 — byte swap with `REVCOMP4`.** A 256-byte lookup table `REVCOMP4` maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying `REVCOMP4` to each:
```rust
const fn revcomp4(x: u8) -> u8 {
    let x = !x;                                     // complement all bases
    let x = (x >> 4) | (x << 4);                    // swap nibbles
    let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2);  // swap 2-bit groups
    x
}
```
`REVCOMP4` is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
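One way the table could be filled at compile time is sketched below. The builder name `build_revcomp4` is hypothetical; `revcomp4` is repeated so the snippet stands alone.

```rust
const fn revcomp4(x: u8) -> u8 {
    let x = !x;                                     // complement all bases
    let x = (x >> 4) | (x << 4);                    // swap nibbles
    let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2);  // swap 2-bit groups
    x
}

/// Evaluate revcomp4 for every possible byte during constant evaluation.
const fn build_revcomp4() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = revcomp4(i as u8);
        i += 1;
    }
    table
}

static REVCOMP4: [u8; 256] = build_revcomp4();
```

For example, `0x1B` (ACGT) maps to itself because ACGT is its own reverse complement, and `0x00` (AAAA) maps to `0xFF` (TTTT).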
**Step 2 — realignment.** After step 1, `padding = n × 8 − seql × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:
```rust
let seql = self.n_kmers() + k - 1;
let shift = n * 8 - seql * 2;        // number of padding bits
bits.rotate_left(shift);             // flush the spurious leading bits
bits[n * 8 - shift..].fill(false);   // zero the trailing padding
```
`Msb0` ordering makes the bit layout hardware-independent.
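The two steps can be sketched without any external crate. This is a substitution, not the implementation described above: instead of `bitvec`'s `rotate_left`, the sketch uses a plain byte-wise left shift, which gives the same result here because the rotated-in bits are zeroed anyway. The function name `revcomp_in_place` is hypothetical.

```rust
const fn revcomp4(x: u8) -> u8 {
    let x = !x;                                     // complement all bases
    let x = (x >> 4) | (x << 4);                    // swap nibbles
    let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2);  // swap 2-bit groups
    x
}

/// In-place reverse complement of `seql` nucleotides packed 4 per byte
/// (nucleotide 0 at the MSB of byte 0, zero padding at the end).
fn revcomp_in_place(seq: &mut [u8], seql: usize) {
    let n = (seql + 3) / 4; // number of bytes in use
    let seq = &mut seq[..n];

    // Step 1: reverse the byte order and reverse-complement each byte.
    seq.reverse();
    for b in seq.iter_mut() {
        *b = revcomp4(*b);
    }

    // Step 2: the complemented padding now leads the array; shift it out.
    let shift = n * 8 - seql * 2; // 0, 2, 4 or 6 bits
    if shift > 0 {
        for i in 0..n {
            let next = if i + 1 < n { seq[i + 1] } else { 0 };
            // Bits shifted out of the top are dropped; zeros enter at the tail.
            seq[i] = (seq[i] << shift) | (next >> (8 - shift));
        }
    }
}
```

For instance, ACGTA packed as `[0x1B, 0x00]` becomes `[0xC6, 0xC0]`, which is TACGT followed by three padding A's.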
!!! abstract "Algorithm — Super-kmer canonisation"
    ```text
    procedure SuperKmerCanonical(seq, SEQL):
      for i ← 0 to SEQL − 1:
        fwd ← nucleotide(seq, i)
        rev ← complement(nucleotide(seq, SEQL − 1 − i))
        if fwd < rev: return seq                          -- forward is canonical
        if fwd > rev: return SuperKmerRevcomp(seq, SEQL)  -- revcomp is canonical
      return seq                                          -- palindrome: either orientation valid
    ```
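The comparison loop of the canonisation procedure can be sketched in Rust as a predicate. The helper names `nucleotide` and `forward_is_canonical` are hypothetical, and the sketch relies on the 2-bit code A=0, C=1, G=2, T=3, for which the complement of a code `c` is `3 - c`.

```rust
/// 2-bit code of nucleotide `i` in a packed sequence (nucleotide 0 at the MSB of byte 0).
fn nucleotide(seq: &[u8], i: usize) -> u8 {
    (seq[i / 4] >> (6 - 2 * (i % 4))) & 3
}

/// True when the forward strand is lexicographically <= its reverse complement.
/// No reverse complement is materialised: position i of the reverse complement
/// is the complement of position seql - 1 - i of the forward strand.
fn forward_is_canonical(seq: &[u8], seql: usize) -> bool {
    for i in 0..seql {
        let fwd = nucleotide(seq, i);
        let rev = 3 - nucleotide(seq, seql - 1 - i);
        if fwd != rev {
            return fwd < rev; // first differing position decides
        }
    }
    true // palindrome: both orientations are identical
}
```

A caller would reverse-complement the sequence only when this predicate returns `false`.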
## Minimizer sliding window
Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which maintains the current minimizer with a **monotonic deque** over a sliding window of W = k − m + 1 m-mer positions.
Each deque entry stores:
| Field | Type | Purpose |
|------------|-------|----------------------------------------------|
| `position` | usize | 0-based start of this m-mer in the segment |
| `canonical`| u64 | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
| `hash` | u64 | $H(\text{canonical})$ — ordering key for random minimizer selection |
The hash $H$ is the seeded splitmix64 finalizer (see [Minimizer selection](../theory/minimizer.md)):
```rust
fn hash_mmer(canonical: u64) -> u64 {
    let x = canonical ^ 0x9e3779b97f4a7c15; // seed: eliminates fixed point at 0
    let x = x ^ (x >> 30);
    let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    let x = x ^ (x >> 27);
    let x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}
```
On each new nucleotide, once the window is full, the deque is updated:
!!! abstract "Algorithm — minimizer deque update"
    ```text
    procedure UpdateMinimizer(deque, position, canonical, hash, k, received):
      -- pop dominated entries from the back
      while deque not empty and deque.back.hash ≥ hash:
        deque.pop_back()
      deque.push_back({position, canonical, hash})
      -- evict expired entries from the front
      while deque.front.position + k < received:
        deque.pop_front()
    ```
The front of the deque is always the current minimizer, because hashes increase strictly from front to back. Each entry is pushed once and popped at most once, so the update costs O(1) amortized per nucleotide.
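The deque update can be sketched with `std::collections::VecDeque`. The function `minimizers` and its input format (a slice of precomputed `(canonical, hash)` pairs, one per m-mer position) are assumptions made to keep the example self-contained; `SuperKmerIter` itself works nucleotide by nucleotide. Here `received` is the number of nucleotides consumed once the m-mer at `position` is complete.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Entry {
    position: usize, // 0-based start of the m-mer
    canonical: u64,  // canonical m-mer value
    hash: u64,       // ordering key
}

/// Return the minimizer entry of every successive k-mer.
/// `mmers[p]` holds (canonical, hash) for the m-mer starting at position p.
fn minimizers(mmers: &[(u64, u64)], k: usize, m: usize) -> Vec<Entry> {
    let mut deque: VecDeque<Entry> = VecDeque::new();
    let mut out = Vec::new();
    for (position, &(canonical, hash)) in mmers.iter().enumerate() {
        let received = position + m; // nucleotides seen so far
        // Pop dominated entries from the back.
        while deque.back().map_or(false, |e| e.hash >= hash) {
            deque.pop_back();
        }
        deque.push_back(Entry { position, canonical, hash });
        // Evict entries that start before the current k-mer window.
        while deque.front().map_or(false, |e| e.position + k < received) {
            deque.pop_front();
        }
        // Once a full k-mer has been received, the front is its minimizer.
        if received >= k {
            out.push(*deque.front().unwrap());
        }
    }
    out
}
```

With k = 5, m = 3 and hashes `[5, 3, 4, 1, 2, 6, 7]`, the minimizer positions of the five kmers are `[1, 3, 3, 3, 4]`.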
A super-kmer boundary is emitted when the minimizer changes: `deque.front.hash ≠ prev_hash`. The `canonical` field of the front entry is **not** used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key $H(\text{canonical})$ can be recomputed independently at routing time from the stored `minimizer_pos`, without inheriting the minimum-order-statistic bias (see [Minimizer selection — partition key independence](../theory/minimizer.md#partition-key-independence)).
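The boundary rule can be illustrated on a precomputed list of minimizer hashes, one per kmer. The function `superkmer_ranges` and its `(start, length)` output are hypothetical; the sketch only shows that a change of hash closes the current super-kmer and that adjacent super-kmers overlap by k − 1 nucleotides.

```rust
/// Given the minimizer hash of each successive k-mer, return the
/// (start, length) nucleotide range of each super-kmer.
fn superkmer_ranges(min_hashes: &[u64], k: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut start = 0; // index of the first k-mer of the current super-kmer
    for i in 1..=min_hashes.len() {
        // Close the current super-kmer at the end of input or when the hash changes.
        if i == min_hashes.len() || min_hashes[i] != min_hashes[i - 1] {
            let n_kmers = i - start;
            out.push((start, n_kmers + k - 1));
            start = i;
        }
    }
    out
}
```

For hashes `[7, 7, 3, 3, 3, 9]` and k = 5, the three super-kmers cover nucleotides 0..6, 2..9 and 5..10 of a 10-nucleotide segment.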
## Kmer extraction
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
```rust
pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
```
The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a big-endian `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
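The extraction can be sketched without `bitvec`, using plain integer shifts in place of `load_be`. This is a swapped-in technique for illustration: the function name `kmer_left_aligned`, the `Option<u64>` return type, and the bounds check against the byte capacity (rather than the true nucleotide length) are all assumptions, and the sketch supports k ≤ 32 only.

```rust
/// Extract the k-mer starting at nucleotide `i` as a left-aligned u64
/// (first nucleotide in the two most significant bits, unused low bits zero).
fn kmer_left_aligned(seq: &[u8], i: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 || (i + k) * 2 > seq.len() * 8 {
        return None;
    }
    let first = i / 4;          // byte holding nucleotide i
    let last = (i + k - 1) / 4; // byte holding the last nucleotide (at most first + 8)
    // Gather the covering bytes, big-endian, into a 128-bit accumulator.
    let mut acc: u128 = 0;
    for &b in &seq[first..=last] {
        acc = (acc << 8) | b as u128;
    }
    let nbytes = last - first + 1;
    // Move nucleotide i to the top of the accumulator, dropping what precedes it.
    let acc = acc << (128 - nbytes * 8 + (i % 4) * 2);
    let top = (acc >> 64) as u64;
    // Clear the bits that belong to nucleotides after the k-mer.
    let mask = if k == 32 { u64::MAX } else { !(u64::MAX >> (2 * k)) };
    Some(top & mask)
}
```

On ACGTAC packed as `[0x1B, 0x10]`, extracting 3 nucleotides at position 1 gives CGT, that is `0b011011` in the top six bits.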
---
!!! abstract "Algorithm — Super-kmer reverse complement"
    ```text
    procedure SuperKmerRevcomp(seq, SEQL):
      seql  ← NKMERS + k − 1           -- nucleotide length
      n     ← ⌈seql / 4⌉               -- number of bytes
      shift ← n × 8 − seql × 2         -- padding bits to flush
      -- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
      lo ← 0 ; hi ← n − 1
      while lo < hi:
        seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
        lo ← lo + 1 ; hi ← hi − 1
      if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]
      -- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
      if shift > 0:
        bits.rotate_left(shift)
        bits[n×8 − shift .. n×8].fill(0)
    ```