refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and the chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside the existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
@@ -27,10 +27,10 @@ part_XXXXX/

 After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.

-`MphfLayer::build` is called on the unitig file:
+`MphfLayer::build(dir, block_bits, fill_slot)` is called on the unitig directory:

-1. **Pass 1**: iterate all canonical kmers from `unitigs.bin` in parallel, build and store `mphf.bin` (ptr_hash).
-2. **Pass 2**: iterate sequentially, fill `evidence.bin`, call the mode-specific `fill_slot` callback.
+1. **Pass 1**: build `.idx` via `build_unitig_idx(unitig_path, block_bits)`, then iterate over all canonical kmers in parallel, chunk by chunk, using `(0..unitigs.len()).into_par_iter()` + `unitigs.unitig(ci).into_canonical_kmers()`. Build and store `mphf.bin` (ptr_hash, `new_from_par_iter`).
+2. **Pass 2**: iterate sequentially with `iter_indexed_canonical_kmers`; fill `evidence.bin`; call `fill_slot(slot, kmer)` callback once per kmer for DataStore population.

 `mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.

@@ -105,9 +105,12 @@ Each layer is a self-contained unit. See [obilayeredmap](obilayeredmap.md) for t

 ```
 layer_i/
-  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence)
+  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence source)
+  unitigs.bin.idx  — random-access block index (block_bits controls granularity)
  mphf.bin         — ptr_hash phase-2 MPHF
-  evidence.bin     — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
+  evidence.bin     — n × (chunk_id: 25 bits | rank: 7 bits) per slot  [exact mode]
+  fingerprint.bin  — n × b-bit fingerprints per slot                  [approx mode]
+  layer_meta.json  — evidence kind, recorded at build time
 ```
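
The `(chunk_id: 25 bits | rank: 7 bits)` entry adds up to 32 bits, so one slot fits a single `u32` word. The following is a minimal sketch of such a packing, not the project's code: it assumes `chunk_id` occupies the high 25 bits and `rank` the low 7 bits, and the names `pack_evidence` and `unpack_evidence` are hypothetical.

```rust
/// Bits reserved for the rank of a kmer inside its chunk (from the layout above).
const RANK_BITS: u32 = 7;
/// Bits reserved for the chunk identifier (from the layout above).
const CHUNK_BITS: u32 = 25;
/// Mask selecting the low `RANK_BITS` bits of a word.
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;

/// Pack a `(chunk_id, rank)` pair into one 32-bit word.
/// Assumption: `chunk_id` sits in the high 25 bits and `rank` in the low 7 bits.
/// Returns `None` when either value does not fit in its field.
pub fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id >= (1 << CHUNK_BITS) || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Split a packed word back into `(chunk_id, rank)`.
pub fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

With this field split, a layer can address up to 2^25 chunks of at most 128 kmers each; values outside those ranges are rejected instead of being silently truncated.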

 Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
@@ -116,9 +119,45 @@ Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0
 2. Collect kmers of B not present in any layer → set `B \ A`.
 3. Build layer 1 from `B \ A` (dereplicate → count → De Bruijn → unitigs → `MphfLayer::build`).

+### Evidence modes
+
+Two evidence modes are supported, selected at build time via `EvidenceKind` and recorded in `layer_meta.json`.
+
+**Exact** (`EvidenceKind::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot, encoding the position of the corresponding kmer in `unitigs.bin`. Membership verification reconstructs the kmer from `(chunk_id, rank)` and compares it to the query. Zero false positives. Requires `.idx` for random access.
+
+**Approx** (`EvidenceKind::Approx { b, z }`): `fingerprint.bin` stores a `b`-bit hash of the kmer at each MPHF slot. Membership check compares `kmer.seq_hash()` against the stored fingerprint. False-positive rate: 1/2^b per probe of an absent kmer. No `.idx` is written or needed.
+
+### Build functions
+
+```
+MphfLayer::build(dir, block_bits, fill_slot)
+    Pass 1: par_iter over chunks via .idx → build mphf.bin
+    Pass 2: sequential iter → fill evidence.bin + call fill_slot
+
+MphfLayer::build_exact_evidence(dir, block_bits)
+    Standalone post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
+    Uses open_sequential(); no .idx required on entry
+
+MphfLayer::build_approx_evidence(dir, b, z)
+    Standalone post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
+    Uses open_sequential(); never writes .idx
+
+MphfLayer::build_evidence(dir, kind, block_bits)
+    Dispatch wrapper: routes to build_exact_evidence or build_approx_evidence
+```
+
+`build` always produces exact evidence. If approximate evidence is needed (`EvidenceKind::Approx`), the caller invokes `build_approx_evidence` after `build` to replace the evidence bundle.
+
+In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
+
 ### Membership verification

-ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(chunk_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
+ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:
+
+- **Exact**: decode `(chunk_id, rank)` from `evidence.bin`; reconstruct the kmer via `unitigs.verify_canonical_kmer`; compare to query.
+- **Approx**: compare `kmer.seq_hash()` to the b-bit fingerprint stored at the slot.
+
+A mismatch in either mode means the kmer is absent from this layer; probe the next layer.
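
The approximate check can be illustrated with a small sketch. Everything here is an assumption made for illustration: the kmer is modelled as a `u64`, `seq_hash` is a stand-in mixer (the SplitMix64 finalizer) and not the project's `kmer.seq_hash()`, and the fingerprint is taken to be the low `b` bits of the hash.

```rust
/// Stand-in 64-bit mixer (SplitMix64 finalizer).
/// Hypothetical replacement for `kmer.seq_hash()`, used only for illustration.
pub fn seq_hash(kmer: u64) -> u64 {
    let mut x = kmer.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Fingerprint of a kmer: the low `b` bits of its hash.
/// Assumption: 1 <= b <= 63, and the low bits are the ones kept.
pub fn fingerprint(kmer: u64, b: u32) -> u64 {
    assert!((1..64).contains(&b), "b must be in 1..=63");
    seq_hash(kmer) & ((1u64 << b) - 1)
}

/// Approximate membership check: compare the fingerprint stored at the
/// MPHF slot with the fingerprint of the query kmer.
pub fn matches_approx(stored: u64, query: u64, b: u32) -> bool {
    stored == fingerprint(query, b)
}

/// Expected false-positive rate when an absent kmer is probed: 1 / 2^b,
/// assuming the hash bits are uniformly distributed.
pub fn false_positive_rate(b: u32) -> f64 {
    0.5f64.powi(b as i32)
}
```

A kmer that is present always matches its own stored fingerprint, so this check produces no false negatives; an absent kmer is accepted only when its `b` hash bits happen to equal the bits stored at the slot it was mapped to.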

 ### Query algorithm

@@ -126,12 +165,12 @@ ptr_hash maps any input to a valid slot — it does not natively detect absent k
 fn query(kmer) → Option<(layer_index, slot)>:
    for (i, layer) in layers.iter().enumerate():
        slot = layer.mphf.index(kmer)
-        if layer.evidence.decode(slot) matches kmer:
+        if layer.evidence.matches(slot, kmer):   // exact or approx dispatch
            return Some((i, slot))
    return None
 ```

-Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.
+`MphfLayer::find` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.

 ### Merging layers