refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
Eric Coissac
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
+48 -45
@@ -2,36 +2,47 @@
## Memory layout
A super-kmer is stored as a **32-bit header** followed by a **byte-aligned nucleotide sequence** (2 bits/base, nucleotide 0 at the MSB of the first byte):
| Field | Bits | Role |
|-------|------|------|
| COUNT | 24 | Occurrence count (≤ 16 M) |
| NKMERS | 8 | Number of kmers (= seq_length − k + 1, range 1–255) |
Bit layout (MSB to LSB): `[31:8] COUNT [7:0] NKMERS`
NKMERS is stored as a raw `u8` in **kmer units**, not nucleotides. The nucleotide length is recovered as `NKMERS + k − 1`. This avoids the awkward wrapping convention (`0 = 256`) that would be needed if nucleotide length were stored directly, and gains k − 1 = 30 units of headroom:
| unit | u8 covers | max nucleotides |
|---|---|---|
| nucleotides | 255 nt | 225 kmers |
| **kmers** | **255 kmers** | **285 nt** |
The public accessors:
`SuperKmer` holds two separate fields:
```rust
fn n_kmers(&self) -> usize { (self.0 & 0xFF) as usize }
fn seql(&self) -> usize { self.n_kmers() + K - 1 }
fn count(&self) -> u32 { self.0 >> 8 }
fn increment(&mut self) { self.0 += 1 << 8; }
fn add(&mut self, n: u32) { self.0 += n << 8; }
fn set_count(&mut self, n: u32) { self.0 = (self.0 & 0xFF) | (n << 8); }
pub struct SuperKmer {
pub(crate) count: u32,
pub(crate) inner: PackedSeq,
}
```
In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers), far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).
`PackedSeq` stores a 2-bit packed DNA sequence as a heap-allocated `Box<[u8]>` plus a `tail: u8` field:
The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
| Field | Type | Role |
|-------|------|------|
| `tail` | `u8` | Number of valid nucleotides in the last byte: 0 encodes 4, 1–3 are identity |
| `seq` | `Box<[u8]>` | 2-bit packed bytes, nucleotide 0 at bits 7–6 of `seq[0]` |
Nucleotide length is recovered without storing it explicitly:
```text
seql = (seq.len() - 1) * 4 + tail_count(tail)
```
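The length formula can be sketched in Rust as follows. This is a minimal illustration, not the crate's code: `PackedSeqSketch` is a hypothetical stand-in for `PackedSeq`, and only the `tail` convention (0 encodes 4, 1–3 are identity) and the formula itself come from the text above. The sketch also assumes the sequence holds at least one byte.

```rust
/// Hypothetical stand-in for `PackedSeq`; field names follow the table above.
pub struct PackedSeqSketch {
    pub tail: u8,
    pub seq: Box<[u8]>,
}

/// Decode `tail`: 0 means the last byte is full (4 nucleotides),
/// 1..=3 are taken literally.
fn tail_count(tail: u8) -> usize {
    if tail == 0 { 4 } else { tail as usize }
}

impl PackedSeqSketch {
    /// Nucleotide length: every byte but the last holds 4 nucleotides,
    /// the last one holds `tail_count(tail)`.
    /// Assumption: `seq` is non-empty.
    pub fn seql(&self) -> usize {
        (self.seq.len() - 1) * 4 + tail_count(self.tail)
    }
}
```

For example, 8 bytes with `tail = 3` give 7 × 4 + 3 = 31 nucleotides, and the same 8 bytes with `tail = 0` give 32.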
There is no packed header word — `count` and the sequence live in separate fields.
The on-disk binary format (produced by `write_to_binary`) is:
```text
[varint(count)] [u8: seql − k] [packed bytes…]
```
`seql − k` fits in a `u8` when `n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256)`. If a super-kmer exceeds 256 kmers, `write_to_binary` splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by `read_from_binary`.
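The splitting arithmetic can be illustrated with a short sketch. `chunk_ranges` and `length_byte` are hypothetical helpers, not functions from the crate, and the sketch assumes every chunk except possibly the last is filled to the 256-kmer cap. The real `write_to_binary` may place chunk boundaries differently.

```rust
const MAX_KMERS_PER_CHUNK: usize = 256;

/// Split a super-kmer of `seql` nucleotides into chunks of at most
/// `MAX_KMERS_PER_CHUNK` k-mers. Returns `(start, len)` pairs in
/// nucleotides; consecutive chunks overlap by k - 1 nucleotides so that
/// every k-mer appears in exactly one chunk.
pub fn chunk_ranges(seql: usize, k: usize) -> Vec<(usize, usize)> {
    assert!(k >= 1 && seql >= k, "need at least one k-mer");
    let n_kmers = seql - k + 1;
    let mut ranges = Vec::new();
    let mut first_kmer = 0;
    while first_kmer < n_kmers {
        let kmers_here = (n_kmers - first_kmer).min(MAX_KMERS_PER_CHUNK);
        // A chunk holding `kmers_here` k-mers spans kmers_here + k - 1 nucleotides.
        ranges.push((first_kmer, kmers_here + k - 1));
        first_kmer += kmers_here;
    }
    ranges
}

/// The `seql - k` length byte of one record; fits in a `u8` because a
/// chunk never holds more than 256 k-mers.
pub fn length_byte(len: usize, k: usize) -> u8 {
    u8::try_from(len - k).expect("chunk holds at most 256 k-mers")
}
```

With k = 31, a 600-nucleotide super-kmer (570 kmers) yields three chunks starting at nucleotides 0, 256, and 512, with lengths 286, 286, and 88. A full chunk stores the length byte 286 − 31 = 255.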
The public accessors operate on the struct fields directly:
```rust
fn seql(&self) -> usize { self.inner.seql() }
fn count(&self) -> u32 { self.count }
fn increment(&mut self) { self.count += 1; }
fn add(&mut self, n: u32) { self.count += n; }
fn set_count(&mut self, n: u32) { self.count = n; }
```
## ASCII encoding and decoding
@@ -72,7 +83,7 @@ const fn revcomp4(x: u8) -> u8 {
**Step 2: realignment.** After step 1, `padding = n × 8 − seql × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:
```rust
let seql = self.n_kmers() + k - 1;
let seql = self.seql();
shift = n * 8 - seql * 2 // number of padding bits
bits.rotate_left(shift)
bits[len - shift..].fill(false)
@@ -93,7 +104,7 @@ bits[len - shift..].fill(false)
## Minimizer sliding window
Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which maintains the current minimizer with a **monotonic deque** over a sliding window of W = k − m + 1 m-mer positions.
Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which tracks the current minimizer with a **monotonic deque** (`Ring<MmerItem, 32>`) inside `RollingStat`, a rolling-window entropy and minimizer tracker.
Each deque entry stores:
@@ -101,20 +112,9 @@ Each deque entry stores:
|------------|-------|----------------------------------------------|
| `position` | usize | 0-based start of this m-mer in the segment |
| `canonical`| u64 | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
| `hash` | u64 | $H(\text{canonical})$ — ordering key for random minimizer selection |
| `hash` | u64 | `hash_kmer(canonical << (64 − 2m))`: ordering key for random minimizer selection |
The hash $H$ is the seeded splitmix64 finalizer (see [Minimizer selection](../theory/minimizer.md)):
```rust
fn hash_mmer(canonical: u64) -> u64 {
let x = canonical ^ 0x9e3779b97f4a7c15; // seed: eliminates fixed point at 0
let x = x ^ (x >> 30);
let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
let x = x ^ (x >> 27);
let x = x.wrapping_mul(0x94d049bb133111eb);
x ^ (x >> 31)
}
```
The hash uses the seeded splitmix64 finalizer (`mix64(raw ^ 0x9e3779b97f4a7c15)`), the same function as `kmer::hash_kmer`.
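As a rough sketch of how the pieces fit together: the body of `mix64` below is the splitmix64 finalizer from the removed `hash_mmer`, and the seed is the constant quoted above. `mmer_hash` is a hypothetical wrapper name, and the signatures are assumptions rather than the crate's actual API.

```rust
const SEED: u64 = 0x9e3779b97f4a7c15;

/// splitmix64 finalizer (unseeded). Note that mix64(0) == 0.
fn mix64(x: u64) -> u64 {
    let x = x ^ (x >> 30);
    let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    let x = x ^ (x >> 27);
    let x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Seeded hash of a left-aligned value; the seed removes the fixed point at 0.
pub fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ SEED)
}

/// Hypothetical wrapper: left-align a right-aligned canonical m-mer
/// (2 bits per base, 1 <= m <= 32) and hash it like a k-mer.
pub fn mmer_hash(canonical: u64, m: u32) -> u64 {
    assert!((1..=32).contains(&m), "m-mer must fit in 64 bits");
    hash_kmer(canonical << (64 - 2 * m))
}
```

Left-aligning first means an m-mer hashes to the same value as a `Kmer` holding the same bits, which is what lets both paths share one function.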
On each new nucleotide, once the window is full, the deque is updated:
@@ -133,24 +133,27 @@ On each new nucleotide, once the window is full, the deque is updated:
The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.
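The amortized O(1) argument can be seen in a minimal sliding-window-minimum sketch. This uses `std::collections::VecDeque` rather than the crate's `Ring<MmerItem, 32>`, and `window_minima` is a hypothetical name. On equal hashes the sketch keeps the most recent position, which is an assumption, since the tie rule is not shown here.

```rust
use std::collections::VecDeque;

/// For each full window of `w` consecutive hashes, return the position
/// of the minimum hash.
pub fn window_minima(hashes: &[u64], w: usize) -> Vec<usize> {
    assert!(w >= 1, "window must hold at least one m-mer");
    let mut deque: VecDeque<(usize, u64)> = VecDeque::new();
    let mut minima = Vec::new();
    for (pos, &hash) in hashes.iter().enumerate() {
        // Drop entries that can never be the minimum again: anything at
        // the back with a hash >= the incoming one.
        while deque.back().map_or(false, |&(_, h)| h >= hash) {
            deque.pop_back();
        }
        deque.push_back((pos, hash));
        if pos + 1 >= w {
            let window_start = pos + 1 - w;
            // Drop entries that slid out of the window on the left.
            while deque.front().map_or(false, |&(p, _)| p < window_start) {
                deque.pop_front();
            }
            // The front is the minimum of the current window.
            minima.push(deque.front().unwrap().0);
        }
    }
    minima
}
```

Each position is pushed once and popped at most once, from either end, so the total work over a segment is linear in its length.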
A super-kmer boundary is emitted when the minimizer changes: `deque.front.hash ≠ prev_hash`. The `canonical` field of the front entry is **not** used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key $H(\text{canonical})$ can be recomputed independently at routing time from the stored `minimizer_pos`, without inheriting the minimum-order-statistic bias (see [Minimizer selection — partition key independence](../theory/minimizer.md#partition-key-independence)).
A super-kmer boundary is emitted when the minimizer changes: `current_minimizer != prev_minimizer`. `SuperKmerIter` also emits a boundary when:
- entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k−1)
- super-kmer length reaches 256 nucleotides (cursor retreated by k)
## Kmer extraction
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i)`, which delegates to `PackedSeq::extract::<KLen>(i)` and returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
```rust
pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
pub fn kmer(&self, i: usize) -> Result<Kmer, KmerError>
```
The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a big-endian `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
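To make the left-aligned layout concrete, here is a plain-Rust sketch of the same extraction. It loops nucleotide by nucleotide for readability, so it is not the single `load_be` call described above. `extract_kmer` is a hypothetical free function, and the 2-bit code A=00, C=01, G=10, T=11 is an assumption (the text only implies that padding zeros read as A).

```rust
/// Extract the k-mer starting at nucleotide `i` from a 2-bit packed,
/// MSB-first byte slice holding `seql` nucleotides. Returns the k-mer
/// left-aligned in a u64 (first nucleotide in bits 63..62), or None if
/// the request is out of range.
pub fn extract_kmer(seq: &[u8], seql: usize, i: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 || i + k > seql || seql > seq.len() * 4 {
        return None;
    }
    let mut value: u64 = 0;
    for pos in i..i + k {
        let byte = seq[pos / 4];
        // Nucleotide 0 of each byte sits in bits 7..6, nucleotide 3 in bits 1..0.
        let shift = 6 - 2 * (pos % 4);
        let code = (byte >> shift) & 0b11;
        value = (value << 2) | u64::from(code);
    }
    // Move the 2k significant bits to the top of the word.
    Some(value << (64 - 2 * k))
}
```

Under the assumed encoding, the byte `0x1B` packs `ACGT`. Extracting 2 nucleotides at position 1 (`CG`, bits `0110`) gives `0x6000_0000_0000_0000`.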
---
!!! abstract "Algorithm — Super-kmer reverse complement"
```text
procedure SuperKmerRevcomp(seq, SEQL):
seql ← NKMERS + k − 1 -- nucleotide length
seql ← nucleotide length
n ← ⌈seql / 4⌉ -- number of bytes
shift ← n × 8 − seql × 2 -- padding bits to flush