feat: centralize index configuration and add hybrid mode

Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
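
As a rough illustration of the per-mode file requirements described above, here is a minimal, std-only sketch. The enum mirrors the three `IndexMode` variants, but the serde derives are omitted, and the helper methods `needs_idx`, `needs_fingerprint`, and `find_fp_rate` are hypothetical names introduced for this example, not part of the crate.

```rust
/// Stand-in for the crate's `IndexMode` (serde derives omitted for brevity).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

impl IndexMode {
    /// Hypothetical helper: `evidence.bin` and `unitigs.bin.idx` are only
    /// written for Exact and Hybrid, at the end of Pass 2.
    pub fn needs_idx(&self) -> bool {
        matches!(self, IndexMode::Exact | IndexMode::Hybrid { .. })
    }

    /// Hypothetical helper: `fingerprint.bin` is only written for Approx and Hybrid.
    pub fn needs_fingerprint(&self) -> bool {
        matches!(self, IndexMode::Approx { .. } | IndexMode::Hybrid { .. })
    }

    /// Hypothetical helper: per-kmer false-positive rate of `find()`.
    /// Exact compares the reconstructed kmer, so the rate is 0; Approx and
    /// Hybrid compare a b-bit fingerprint, so the rate is 1/2^b.
    pub fn find_fp_rate(&self) -> f64 {
        match *self {
            IndexMode::Exact => 0.0,
            IndexMode::Approx { b, .. } | IndexMode::Hybrid { b, .. } => 0.5f64.powi(b as i32),
        }
    }
}
```

Under these assumptions, `IndexMode::Approx { b: 8, z: 3 }.needs_idx()` is `false`, while the Hybrid variant with the same parameters needs both the `.idx` and the fingerprint file.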
2026-05-26 14:26:19 +02:00
parent 6f7abddeaf 7501b6e854
commit 98c14aade9
19 changed files with 420 additions and 441 deletions
@@ -27,10 +27,10 @@ part_XXXXX/

 After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.

-`MphfLayer::build(dir, block_bits, fill_slot)` is called on the unitig directory:
+`MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)` is called on the unitig directory:

-1. **Pass 1**: build `.idx` via `build_unitig_idx(unitig_path, block_bits)`, then iterate all canonical kmers in parallel over chunks using `(0..unitigs.len()).into_par_iter()` + `unitigs.unitig(ci).into_canonical_kmers()`. Constructs and stores `mphf.bin` (ptr_hash, `new_from_par_iter`).
-2. **Pass 2**: iterate sequentially with `iter_indexed_canonical_kmers`; fill `evidence.bin`; call `fill_slot(slot, kmer)` callback once per kmer for DataStore population.
+1. **Pass 1** (parallel): a `CanonicalKmerIter` — clonable via `Arc<Mmap>`, no file reopening — is passed directly to `new_from_par_iter` via `par_bridge()`. No `.idx` is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces `mphf.bin`.
+2. **Pass 2** (sequential): iterate with `iter_indexed_canonical_kmers`; fill evidence files; call `fill_slot(slot, kmer)` callback per kmer. For Exact/Hybrid, `.idx` is written at the end of this pass — never earlier.

 `mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.

@@ -110,7 +110,7 @@ layer_i/
  mphf.bin         — ptr_hash phase-2 MPHF
  evidence.bin     — n × (chunk_id: 25 bits | rank: 7 bits) per slot  [exact mode]
  fingerprint.bin  — n × b-bit fingerprints per slot                  [approx mode]
-  layer_meta.json  — evidence kind, recorded at build time
+  [no layer_meta.json — mode stored once in partition-level meta.json]
 ```

 Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
@@ -121,32 +121,32 @@ Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0

 ### Evidence modes

-Two evidence modes are supported, selected at build time via `EvidenceKind` and recorded in `layer_meta.json`.
+Three evidence modes are supported via `IndexMode`, which is stored once in `PartitionMeta` at the partition root. There is no `layer_meta.json`.

-**Exact** (`EvidenceKind::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot, encoding the position of the corresponding kmer in `unitigs.bin`. Membership verification reconstructs the kmer from `(chunk_id, rank)` and compares it to the query. Zero false positives. Requires `.idx` for random access.
+**Exact** (`IndexMode::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. `.idx` required at query time.

-**Approx** (`EvidenceKind::Approx { b, z }`): `fingerprint.bin` stores a `b`-bit hash of the kmer at each MPHF slot. Membership check compares `kmer.seq_hash()` against the stored fingerprint. False-positive rate: 1/2^b per query. No `.idx` is written or needed.
+**Approx** (`IndexMode::Approx { b, z }`): `fingerprint.bin` stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No `.idx` written or needed.
+
+**Hybrid** (`IndexMode::Hybrid { b, z }`): both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (O(1)); `find_strict()` uses exact evidence (O(1)).

 ### Build functions

 ```
-MphfLayer::build(dir, block_bits, fill_slot)
-    Pass 1: par_iter over chunks via .idx → build mphf.bin
-    Pass 2: sequential iter → fill evidence.bin + call fill_slot
+MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
+    Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin  (no .idx used)
+    Pass 2: sequential iter → fill evidence files + call fill_slot
+            .idx written last for Exact/Hybrid (query-time only)

 MphfLayer::build_exact_evidence(dir, block_bits)
-    Standalone post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
+    Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
    Uses open_sequential(); no .idx required on entry

 MphfLayer::build_approx_evidence(dir, b, z)
-    Standalone post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
+    Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
    Uses open_sequential(); never writes .idx
-
-MphfLayer::build_evidence(dir, kind, block_bits)
-    Dispatch wrapper: routes to build_exact_evidence or build_approx_evidence
 ```

-`build` always produces exact evidence. If approximate evidence is needed (e.g. `EvidenceKind::Approx`), the caller invokes `build_approx_evidence` after `build` to replace the evidence bundle.
+There is no `build_evidence` dispatch wrapper. Callers choose the appropriate post-hoc build directly.

 In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.

@@ -170,7 +170,7 @@ fn query(kmer) → Option<(layer_index, slot)>:
    return None
 ```

-`MphfLayer::find` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
+`MphfLayer::find` dispatches on `LayerEvidence` in O(1); the panicking `find_exact`/`find_approx` methods are removed. `find_strict` always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.

 ### Merging layers

@@ -20,21 +20,26 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico

 ---

-## Evidence kinds
+## Index mode (homogeneity invariant)

-Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:
+A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at `LayeredMap::open()` from `PartitionMeta.mode` and passed to each `Layer::open()` — no per-layer file is read.

 ```rust
-pub enum EvidenceKind {
+#[derive(Serialize, Deserialize, Default)]
+#[serde(tag = "type", rename_all = "snake_case")]
+pub enum IndexMode {
+    #[default]
    Exact,
    Approx { b: u8, z: u8 },
+    Hybrid { b: u8, z: u8 },
 }
 ```

-`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.
+`IndexMode` is stored once in `PartitionMeta` (`meta.json` at partition root). There is no `layer_meta.json`.

- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required.
+- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
+- **Approx**: writes `fingerprint.bin` only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No `.idx` written or required.
+- **Hybrid**: writes both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (fast, O(1)); `find_strict()` uses exact evidence.

 ---

@@ -55,44 +60,44 @@ pub struct MphfLayer {
 ```rust
 enum LayerEvidence {
    Exact  { evidence: Evidence, unitigs: UnitigFileReader },
-    Approx { fingerprint: FingerprintVec },
+    Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
+    Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
 }
 ```

+`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read.
+
 ### Query API

-Three public query methods, all returning `Option<usize>` (slot index):
+Two public query methods, both returning `Option<usize>` (slot index):

 ```rust
 pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
-pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
-pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
+pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
 ```

- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
+- `find`: O(1) auto-dispatch. Exact → exact evidence check. Approx/Hybrid → fingerprint comparison.
+- `find_strict`: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no `.idx`).

-`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.
+There are no `find_exact`/`find_approx` methods; panicking dispatch is eliminated.

 ### Build surface

 ```rust
-// Full MPHF + exact evidence build (two-pass, parallel)
-pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>
+// Full MPHF + evidence build (two-pass)
+pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>

-// Evidence-only builds (MPHF already present in dir)
+// Evidence-only post-hoc builds (MPHF already present)
 pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
 pub fn build_approx_evidence(dir, b, z)      -> OLMResult<usize>
-pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize>  // dispatch
 ```

-`MphfLayer::build` runs two sequential passes over `unitigs.bin`:
+`MphfLayer::build` runs two passes over `unitigs.bin`:

-1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
-2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
+1. **Pass 1** (parallel via rayon): a `CanonicalKmerIter` (clonable, `Arc<Mmap>`, no file reopening) is passed to `new_from_par_iter` via `par_bridge()`. Produces `mphf.bin`. No `.idx` is read or created at this stage.
+2. **Pass 2** (sequential): fill evidence files; call `fill_slot(slot, kmer)` per kmer. `.idx` is written last for Exact/Hybrid modes (query-time only).

-`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.
+There is no `build_evidence` dispatch wrapper — callers invoke `build_exact_evidence` or `build_approx_evidence` directly.

 For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.

@@ -133,38 +138,37 @@ pub struct Hit<T = ()> {
 ```rust
 // mode 1
 impl Layer<()> {
-    pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
+    pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
 }

 // mode 2
 impl Layer<PersistentCompactIntMatrix> {
-    pub fn build(out_dir: &Path, block_bits: u8,
+    pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
-    pub fn build_from_map(out_dir: &Path, block_bits: u8,
+    pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
 }

 // mode 3
 impl Layer<PersistentBitMatrix> {
-    pub fn build_presence(out_dir: &Path, block_bits: u8,
+    pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                          n_genomes: usize,
                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
 }
 ```

-All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.
+All build impls delegate to `MphfLayer::build` via a mode-specific `fill_slot` callback. The `mode` parameter is forwarded directly — no `LayerMeta` is written.

-### Evidence build helpers on Layer
+Evidence-only post-hoc builds are accessible directly on `Layer<D>`:

 ```rust
 impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8)  -> OLMResult<usize>
-    pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
 }
 ```

-These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.
+There is no `build_evidence` dispatch wrapper.

 ---

@@ -212,14 +216,17 @@ pub struct LayeredMap<D: LayerData = ()> {
 ### Common methods

 ```rust
-pub fn open(root: &Path)   -> OLMResult<Self>
-pub fn create(root: &Path) -> OLMResult<Self>
-pub fn n_layers(&self)     -> usize
-pub fn layer(&self, i: usize) -> &Layer<D>
+pub fn open(root: &Path)              -> OLMResult<Self>
+pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
+pub fn n_layers(&self)                -> usize
+pub fn layer(&self, i: usize)         -> &Layer<D>
+pub fn mode(&self)                    -> &IndexMode
 pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
-pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
+pub fn next_layer_writer(&self)       -> OLMResult<UnitigFileWriter>
 ```

+`open` reads `PartitionMeta` once, extracts `mode`, and passes it to every `Layer::open` — no per-layer file is read. `create` stores the given mode in `PartitionMeta`.
+
 `query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.

 ### push_layer
@@ -272,14 +279,13 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful

 ```
 partition_root/                    ← LayeredMap (one partition)
-  meta.json                        — {"n_layers": N}
+  meta.json                        — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
  layer_0/                         ← Layer
-    layer_meta.json                — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
    mphf.bin                       — ptr_hash MPHF (epserde format)
    unitigs.bin                    — packed 2-bit nucleotide sequences
-    unitigs.bin.idx                — UIDX index (exact evidence only)
-    evidence.bin                   — [u32; n], LE  (exact evidence only)
-    fingerprint.bin                — packed b-bit array  (approx evidence only)
+    unitigs.bin.idx                — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
+    evidence.bin                   — [u32; n], LE  (Exact/Hybrid only)
+    fingerprint.bin                — packed b-bit array  (Approx/Hybrid only)
    counts/                        [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
@@ -290,7 +296,7 @@ partition_root/                    ← LayeredMap (one partition)
    …
 ```

-`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.
+There is no `layer_meta.json`. The mode is stored once in `PartitionMeta` and is valid for all layers. `unitigs.bin.idx` is written at the end of Pass 2 of `MphfLayer::build` (Exact/Hybrid) or by the post-hoc `build_exact_evidence`, never during MPHF construction, and is consumed at query time only.

 ---

@@ -387,4 +393,4 @@ Each partition's new layer is built independently; the operation is fully parall
 | `obiskio` | unitig file writer/reader + `.idx` build |
 | `obicompactvec` | payload types + aggregation traits |
 | `rayon 1` | parallel MPHF construction pass |
-| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |
+| `serde / serde_json` | `PartitionMeta` serialisation |
@@ -61,35 +61,24 @@ File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record

 Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) − 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.

-### `open()` vs `open_sequential()`
+### `open()`, `open_sequential()`, `open_direct_access()`

-`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
+`UnitigFileReader` has three constructors:

-`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
+- `open(path)` — smart default: if `unitigs.bin.idx` exists, delegates to `open_direct_access`; otherwise delegates to `open_sequential`. Prefer this in call sites that don't require one specific mode.
+- `open_sequential(path)` — never reads `.idx`. Sequential iterators only; `chunk_start(i)` falls back to an O(i) mmap scan rather than panicking.
+- `open_direct_access(path)` — requires `.idx` to be present. Enables O(1) or O(2^block_bits) `chunk_start(i)`, used by `verify_canonical_kmer` at query time.

-### `chunk_start(i)` — random access
+`CanonicalKmerIter` — a clonable sequential iterator returned by `UnitigFileReader::iter_canonical_kmers()`. It holds an `Arc<Mmap>` so cloning resets the cursor to the start without reopening the file. This makes it usable with `par_bridge()` for parallel MPHF construction without random access.

-```rust
-fn chunk_start(&self, i: usize) -> usize {
-    // block_bits=0: single table lookup, O(1) — hot path
-    if self.block_bits == 0 {
-        return self.block_offsets[i] as usize;
-    }
-    // block_bits>0: lookup block, then scan at most 2^block_bits − 1 records
-    let block = i >> self.block_bits;
-    let rem   = i &  self.mask;
-    let mut offset = self.block_offsets[block] as usize;
-    for _ in 0..rem {
-        let seql_minus_k = self.mmap[offset] as usize;
-        offset += 1 + (seql_minus_k + self.k + 3) / 4;
-    }
-    offset
-}
-```
+### `chunk_start(i)` — access modes

-With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
+When `.idx` is loaded (`open_direct_access`):

-With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
+- `block_bits = 0`: single array lookup, O(1).
+- `block_bits > 0`: lookup block, then scan ≤ 2^block_bits records, O(2^block_bits).
+
+When `.idx` is absent (`open_sequential`): `chunk_start(i)` performs an O(i) sequential mmap scan from offset 0. No panic — the function degrades gracefully. This degraded path is used by `find_strict()` on Approx layers (sequential scan of all canonical kmers).

 ### Decoding a kmer from slot `s`