refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and the chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside the existing exact mode. Update the CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
@@ -2,36 +2,47 @@

 ## Memory layout

-A super-kmer is stored as a **32-bit header** followed by a **byte-aligned nucleotide sequence** (2 bits/base, nucleotide 0 at the MSB of the first byte):
-
-| Field | Bits | Role |
-|-------|------|------|
-| COUNT | 24   | Occurrence count (≤ 16 M) |
-| NKMERS | 8   | Number of kmers (= seq_length − k + 1, range 1–255) |
-
-Bit layout (MSB to LSB): `[31:8] COUNT  [7:0] NKMERS`
-
-NKMERS is stored as a raw `u8` in **kmer units**, not nucleotides. The nucleotide length is recovered as `NKMERS + k − 1`. This avoids the awkward wrapping convention (`0 = 256`) that would be needed if nucleotide length were stored directly, and gains k−1 = 30 units of headroom:
-
-| unit | u8 covers | max nucleotides |
-|---|---|---|
-| nucleotides | 255 nt | 225 kmers |
-| **kmers** | **255 kmers** | **285 nt** |
-
-The public accessors:
+`SuperKmer` holds two separate fields:

 ```rust
-fn n_kmers(&self)            -> usize { (self.0 & 0xFF) as usize }
-fn seql(&self)               -> usize { self.n_kmers() + K - 1 }
-fn count(&self)              -> u32   { self.0 >> 8 }
-fn increment(&mut self)               { self.0 += 1 << 8; }
-fn add(&mut self, n: u32)             { self.0 += n << 8; }
-fn set_count(&mut self, n: u32)       { self.0 = (self.0 & 0xFF) | (n << 8); }
+pub struct SuperKmer {
+    pub(crate) count: u32,
+    pub(crate) inner: PackedSeq,
+}
 ```

-In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers) — far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).
+`PackedSeq` stores a 2-bit packed DNA sequence as a heap-allocated `Box<[u8]>` plus a `tail: u8` field:

-The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
+| Field | Type | Role |
+|-------|------|------|
+| `tail` | `u8` | Number of valid nucleotides in the last byte: 0 encodes 4; 1–3 are stored as-is |
+| `seq`  | `Box<[u8]>` | 2-bit packed bytes, nucleotide 0 at bits 7–6 of `seq[0]` |
+
+Nucleotide length is recovered without storing it explicitly:
+
+```text
+seql = (seq.len() - 1) * 4 + tail_count(tail)
+```
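As a rough illustration of the length formula, the following Rust sketch recomputes `seql` from a byte count and a `tail` value. The free functions `tail_count` and `seql_from_parts` are hypothetical stand-ins for illustration, not the crate's actual methods, and the sketch assumes a non-empty byte array.

```rust
/// Decode the `tail` field: 0 stands for a full last byte (4 nucleotides),
/// while 1..=3 are taken literally. Hypothetical helper for illustration.
fn tail_count(tail: u8) -> usize {
    if tail == 0 { 4 } else { tail as usize }
}

/// Nucleotide length from the number of packed bytes and the tail field.
/// Assumes `n_bytes >= 1` (an empty sequence is not handled here).
fn seql_from_parts(n_bytes: usize, tail: u8) -> usize {
    (n_bytes - 1) * 4 + tail_count(tail)
}
```

For example, a 31-nucleotide sequence occupies 8 bytes with `tail = 3`, and a 32-nucleotide sequence occupies 8 bytes with `tail = 0`.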
+
+There is no packed header word — `count` and the sequence live in separate fields.
+
+The on-disk binary format (produced by `write_to_binary`) is:
+
+```text
+[varint(count)] [u8: seql − k] [packed bytes…]
+```
+
+`seql − k` fits in a `u8` when `n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256)`. If a super-kmer exceeds 256 kmers, `write_to_binary` splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by `read_from_binary`.
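The splitting rule can be sketched as a computation of nucleotide ranges, one per on-disk record. This is only an illustration under the assumptions stated in the text (at most 256 k-mers per chunk, k−1 nucleotides of overlap); `chunk_ranges` is a hypothetical helper and does not reproduce the byte-level work done by `write_to_binary`.

```rust
/// Taken from the text above: at most 256 k-mers per on-disk record.
const MAX_KMERS_PER_CHUNK: usize = 256;

/// Returns `(start, length)` nucleotide ranges, one per record.
/// Hypothetical helper: it only computes ranges and writes nothing.
fn chunk_ranges(seql: usize, k: usize) -> Vec<(usize, usize)> {
    assert!(k >= 1 && seql >= k, "a super-kmer holds at least one k-mer");
    let n_kmers = seql - k + 1;
    let mut ranges = Vec::new();
    let mut first_kmer = 0;
    while first_kmer < n_kmers {
        // Each chunk takes as many k-mers as allowed, so `length - k` fits in a u8.
        let kmers_here = (n_kmers - first_kmer).min(MAX_KMERS_PER_CHUNK);
        ranges.push((first_kmer, kmers_here + k - 1));
        // The next chunk starts at the next k-mer, which leaves k-1 shared nucleotides.
        first_kmer += kmers_here;
    }
    ranges
}
```

With k = 31 and a 300-nucleotide super-kmer (270 k-mers), this yields the ranges `(0, 286)` and `(256, 44)`: the two chunks share nucleotides 256 to 285, and every k-mer falls in exactly one chunk.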
+
+The public accessors operate on the struct fields directly:
+
+```rust
+fn seql(&self)            -> usize { self.inner.seql() }
+fn count(&self)           -> u32   { self.count }
+fn increment(&mut self)            { self.count += 1; }
+fn add(&mut self, n: u32)          { self.count += n; }
+fn set_count(&mut self, n: u32)    { self.count = n; }
+```

 ## ASCII encoding and decoding

@@ -72,7 +83,7 @@ const fn revcomp4(x: u8) -> u8 {
 **Step 2 — realignment.**  After step 1, `padding = n × 8 − seql × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:

 ```rust
-let seql = self.n_kmers() + k - 1;
+let seql = self.seql();
 shift = n * 8 - seql * 2          // number of padding bits
 bits.rotate_left(shift)
 bits[len - shift..].fill(false)
@@ -93,7 +104,7 @@ bits[len - shift..].fill(false)

 ## Minimizer sliding window

-Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which maintains the current minimizer with a **monotonic deque** over a sliding window of W = k − m + 1 m-mer positions.
+Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which tracks the current minimizer with a **monotonic deque** (`Ring<MmerItem, 32>`) inside `RollingStat`, a rolling-window entropy and minimizer tracker.

 Each deque entry stores:

@@ -101,20 +112,9 @@ Each deque entry stores:
 |------------|-------|----------------------------------------------|
 | `position` | usize | 0-based start of this m-mer in the segment   |
 | `canonical`| u64   | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
-| `hash`     | u64   | $H(\text{canonical})$ — ordering key for random minimizer selection |
+| `hash`     | u64   | `hash_kmer(canonical << (64 − 2m))` — ordering key for random minimizer selection |

-The hash $H$ is the seeded splitmix64 finalizer (see [Minimizer selection](../theory/minimizer.md)):
-
-```rust
-fn hash_mmer(canonical: u64) -> u64 {
-    let x = canonical ^ 0x9e3779b97f4a7c15;   // seed: eliminates fixed point at 0
-    let x = x ^ (x >> 30);
-    let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
-    let x = x ^ (x >> 27);
-    let x = x.wrapping_mul(0x94d049bb133111eb);
-    x ^ (x >> 31)
-}
-```
+The hash uses the seeded splitmix64 finalizer (`mix64(raw ^ 0x9e3779b97f4a7c15)`), the same function as `kmer::hash_kmer`.
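For reference, here is a minimal sketch of the hash described above, built from the standard splitmix64 finalizer and the seed quoted in the text. The function name `hash_mmer` and its `(canonical, m)` signature are assumptions made for illustration; the crate's own `hash_kmer` may be organised differently.

```rust
/// Seed quoted in the text; XOR-ing it in removes the fixed point at 0.
const SEED: u64 = 0x9e3779b97f4a7c15;

/// Standard splitmix64 finalizer (a bijection on u64 that maps 0 to 0).
fn mix64(x: u64) -> u64 {
    let x = x ^ (x >> 30);
    let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    let x = x ^ (x >> 27);
    let x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Hash a right-aligned canonical m-mer (1 <= m <= 32).
/// Hypothetical wrapper: left-aligns the value, then applies the seeded mixer.
fn hash_mmer(canonical: u64, m: u32) -> u64 {
    debug_assert!((1..=32).contains(&m));
    let raw = canonical << (64 - 2 * m); // left-align, as a `Kmer` would be
    mix64(raw ^ SEED)
}
```

Because the mixer is a bijection, distinct m-mers of the same length always receive distinct hashes, and the all-A m-mer (value 0) no longer hashes to 0.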

 On each new nucleotide, once the window is full, the deque is updated:

@@ -133,24 +133,27 @@ On each new nucleotide, once the window is full, the deque is updated:

 The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.
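The amortized bound is easier to see on a stripped-down version of the technique. The sketch below is a generic sliding-window minimum over precomputed hashes using `std::collections::VecDeque`; it is not the crate's `Ring`-based `RollingStat`, and its tie-breaking (an equal hash evicts the older entry) is an assumption chosen to keep the deque strictly increasing.

```rust
use std::collections::VecDeque;

/// For each full window of `w` consecutive hashes, return the index of the minimum.
/// Illustrative helper only; assumes `w >= 1`.
fn window_minima(hashes: &[u64], w: usize) -> Vec<usize> {
    let mut deque: VecDeque<usize> = VecDeque::new();
    let mut out = Vec::new();
    for (i, &h) in hashes.iter().enumerate() {
        // Drop front entries that have slid out of the window.
        while let Some(&front) = deque.front() {
            if front + w <= i { deque.pop_front(); } else { break; }
        }
        // Drop back entries that can never be the minimum again.
        while let Some(&back) = deque.back() {
            if hashes[back] >= h { deque.pop_back(); } else { break; }
        }
        deque.push_back(i);
        // Once the window is full, the front holds the current minimum.
        if i + 1 >= w {
            out.push(*deque.front().unwrap());
        }
    }
    out
}
```

Each index is pushed once and popped at most once, which is where the O(1) amortized cost per position comes from.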

-A super-kmer boundary is emitted when the minimizer changes: `deque.front.hash ≠ prev_hash`. The `canonical` field of the front entry is **not** used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key $H(\text{canonical})$ can be recomputed independently at routing time from the stored `minimizer_pos`, without inheriting the minimum-order-statistic bias (see [Minimizer selection — partition key independence](../theory/minimizer.md#partition-key-independence)).
+A super-kmer boundary is emitted when the minimizer changes: `current_minimizer != prev_minimizer`. `SuperKmerIter` also emits a boundary when:
+
+- the entropy of the current k-mer falls at or below the threshold θ (the cursor moves back by k−1)
+- the super-kmer length reaches 256 nucleotides (the cursor moves back by k)

 ## Kmer extraction

-A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
+A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i)`, which delegates to `PackedSeq::extract::<KLen>(i)` and returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):

 ```rust
-pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
+pub fn kmer(&self, i: usize) -> Result<Kmer, KmerError>
 ```

-The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a big-endian `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
+The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
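To make the resulting bit layout concrete, here is a plain-Rust sketch that produces a left-aligned value from 2-bit packed bytes. It deliberately uses a per-nucleotide loop instead of the single `bitvec` load described above, so it illustrates the layout rather than the implementation; `extract_kmer` is a hypothetical free function that returns `Option<u64>` instead of `Result<Kmer, KmerError>` and checks bounds against the byte array only.

```rust
/// Extract the k-mer starting at nucleotide `i` from 2-bit packed bytes
/// (nucleotide 0 at bits 7-6 of `seq[0]`) as a left-aligned u64.
/// Illustrative only: loops per nucleotide and ignores padding in the last byte.
fn extract_kmer(seq: &[u8], i: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 || (i + k) * 2 > seq.len() * 8 {
        return None;
    }
    let mut value = 0u64;
    for pos in i..i + k {
        let byte = seq[pos / 4];
        let shift = 6 - 2 * (pos % 4); // Msb0 order inside each byte
        let nuc = ((byte >> shift) & 0b11) as u64;
        value = (value << 2) | nuc;
    }
    // Left-align: the first nucleotide ends up in bits 63-62.
    Some(value << (64 - 2 * k))
}
```

With the bytes `[0b00011011, 0b11100100]` (ACGT TGCA under the usual A=00, C=01, G=10, T=11 coding, assumed here), extracting 4 nucleotides at position 1 gives CGTT, i.e. `0x6F` shifted into the top byte.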

 ---

 !!! abstract "Algorithm — Super-kmer reverse complement"
    ```text
    procedure SuperKmerRevcomp(seq, SEQL):
-        seql  ← NKMERS + k − 1               -- nucleotide length
+        seql  ← SEQL                        -- nucleotide length
        n     ← ⌈seql / 4⌉                  -- number of bytes
        shift ← n × 8 − seql × 2            -- padding bits to flush