feat: centralize index configuration and add hybrid mode

Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
Eric Coissac
2026-05-26 14:26:19 +02:00
19 changed files with 420 additions and 441 deletions
+17 -17
@@ -27,10 +27,10 @@ part_XXXXX/
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
`MphfLayer::build(dir, block_bits, fill_slot)` is called on the unitig directory:
`MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)` is called on the unitig directory:
1. **Pass 1**: build `.idx` via `build_unitig_idx(unitig_path, block_bits)`, then iterate all canonical kmers in parallel over chunks using `(0..unitigs.len()).into_par_iter()` + `unitigs.unitig(ci).into_canonical_kmers()`. Constructs and stores `mphf.bin` (ptr_hash, `new_from_par_iter`).
2. **Pass 2**: iterate sequentially with `iter_indexed_canonical_kmers`; fill `evidence.bin`; call `fill_slot(slot, kmer)` callback once per kmer for DataStore population.
1. **Pass 1** (parallel): a `CanonicalKmerIter` — clonable via `Arc<Mmap>`, no file reopening — is passed directly to `new_from_par_iter` via `par_bridge()`. No `.idx` is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces `mphf.bin`.
2. **Pass 2** (sequential): iterate with `iter_indexed_canonical_kmers`; fill evidence files; call `fill_slot(slot, kmer)` callback per kmer. For Exact/Hybrid, `.idx` is written at the end of this pass — never earlier.
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
@@ -110,7 +110,7 @@ layer_i/
mphf.bin — ptr_hash phase-2 MPHF
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
layer_meta.json evidence kind, recorded at build time
[no layer_meta.json — mode stored once in partition-level meta.json]
```
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
@@ -121,32 +121,32 @@ Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0
### Evidence modes
Two evidence modes are supported, selected at build time via `EvidenceKind` and recorded in `layer_meta.json`.
Three evidence modes are supported via `IndexMode`, stored once in `PartitionMeta` at partition root. There is no `layer_meta.json`.
**Exact** (`EvidenceKind::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot, encoding the position of the corresponding kmer in `unitigs.bin`. Membership verification reconstructs the kmer from `(chunk_id, rank)` and compares it to the query. Zero false positives. Requires `.idx` for random access.
**Exact** (`IndexMode::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. `.idx` required at query time.
**Approx** (`EvidenceKind::Approx { b, z }`): `fingerprint.bin` stores a `b`-bit hash of the kmer at each MPHF slot. Membership check compares `kmer.seq_hash()` against the stored fingerprint. False-positive rate: 1/2^b per query. No `.idx` is written or needed.
**Approx** (`IndexMode::Approx { b, z }`): `fingerprint.bin` stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No `.idx` written or needed.
**Hybrid** (`IndexMode::Hybrid { b, z }`): both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (O(1)); `find_strict()` uses exact evidence (O(1)).
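The false-positive figures quoted for the Approx mode can be checked numerically. The following is a minimal sketch, not code from the repository: the function names `kmer_fp_rate` and `window_fp_rate` are hypothetical, and the window estimate assumes that false matches on consecutive kmers are independent.

```rust
/// Per-kmer false-positive probability for b-bit fingerprints: 1 / 2^b.
/// (Hypothetical helper, for illustration only.)
fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Approximate false-positive probability for one window of z consecutive
/// kmers that must all match: 1 / 2^(b*z).
/// Assumes independent false matches across the z kmers.
fn window_fp_rate(b: u8, z: u8) -> f64 {
    0.5f64.powi(i32::from(b) * i32::from(z))
}
```

For instance, under these assumptions `b = 8` gives a per-kmer rate of 1/256, and adding `z = 2` lowers the per-window rate to 1/65536.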
### Build functions
```
MphfLayer::build(dir, block_bits, fill_slot)
Pass 1: par_iter over chunks via .idx → build mphf.bin
Pass 2: sequential iter → fill evidence.bin + call fill_slot
MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
Pass 2: sequential iter → fill evidence files + call fill_slot
.idx written last for Exact/Hybrid (query-time only)
MphfLayer::build_exact_evidence(dir, block_bits)
Standalone post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
Uses open_sequential(); no .idx required on entry
MphfLayer::build_approx_evidence(dir, b, z)
Standalone post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
Uses open_sequential(); never writes .idx
MphfLayer::build_evidence(dir, kind, block_bits)
Dispatch wrapper: routes to build_exact_evidence or build_approx_evidence
```
`build` always produces exact evidence. If approximate evidence is needed (e.g. `EvidenceKind::Approx`), the caller invokes `build_approx_evidence` after `build` to replace the evidence bundle.
There is no `build_evidence` dispatch wrapper. Callers choose the appropriate post-hoc build directly.
In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
@@ -170,7 +170,7 @@ fn query(kmer) → Option<(layer_index, slot)>:
return None
```
`MphfLayer::find` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
`MphfLayer::find` dispatches on `LayerEvidence` in O(1); there are no panicking `find_exact`/`find_approx` methods. `find_strict` always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
### Merging layers
+48 -42
@@ -20,21 +20,26 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico
---
## Evidence kinds
## Index mode (homogeneity invariant)
Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:
A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at `LayeredMap::open()` from `PartitionMeta.mode` and passed to each `Layer::open()` — no per-layer file is read.
```rust
pub enum EvidenceKind {
#[derive(Serialize, Deserialize, Default)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum IndexMode {
#[default]
Exact,
Approx { b: u8, z: u8 },
Hybrid { b: u8, z: u8 },
}
```
`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.
`IndexMode` is stored once in `PartitionMeta` (`meta.json` at partition root). There is no `layer_meta.json`.
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L - k - z + 2 is the number of windows in a read of length L. No `.idx` written or required.
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
- **Approx**: writes `fingerprint.bin` only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No `.idx` written or required.
- **Hybrid**: writes both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (fast, O(1)); `find_strict()` uses exact evidence.
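The per-mode file requirements listed above can be summarised as a single match. This is a hedged sketch: the enum mirrors the `IndexMode` shown earlier but omits the serde derives so that it compiles without external crates, and `evidence_files` is a hypothetical helper that does not appear in the source.

```rust
/// Mirror of the documented `IndexMode` (serde attributes omitted here).
#[derive(Debug, Clone, PartialEq, Default)]
pub enum IndexMode {
    #[default]
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

/// Hypothetical helper: evidence-related files a layer directory is
/// expected to contain for a given mode, per the list above.
fn evidence_files(mode: &IndexMode) -> Vec<&'static str> {
    match mode {
        IndexMode::Exact => vec!["evidence.bin", "unitigs.bin.idx"],
        IndexMode::Approx { .. } => vec!["fingerprint.bin"],
        IndexMode::Hybrid { .. } => {
            vec!["fingerprint.bin", "evidence.bin", "unitigs.bin.idx"]
        }
    }
}
```

Because the mode lives in one place, a check of this kind only needs the partition-level value and can be applied to every layer directory.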
---
@@ -55,44 +60,44 @@ pub struct MphfLayer {
```rust
enum LayerEvidence {
Exact { evidence: Evidence, unitigs: UnitigFileReader },
Approx { fingerprint: FingerprintVec },
Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
}
```
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read.
### Query API
Three public query methods, all returning `Option<usize>` (slot index):
Two public query methods, both returning `Option<usize>` (slot index):
```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
```
- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
- `find`: O(1) auto-dispatch. Exact → exact evidence check. Approx/Hybrid → fingerprint comparison.
- `find_strict`: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no `.idx`).
`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.
There are no `find_exact`/`find_approx` methods; panicking dispatch is eliminated.
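The difference between `find` and `find_strict` can be illustrated with a toy model. Everything below is an assumption made for illustration: kmers are plain `u64` values, the fingerprint is simply the low 8 bits, `ToyEvidence` stands in for `LayerEvidence`, and the MPHF lookup is replaced by a `slot` argument supplied by the caller. It follows the mode descriptions in which Hybrid `find()` uses the fingerprint.

```rust
/// Toy stand-in for `LayerEvidence`; `kmers[slot]` plays the role of
/// reconstructing a kmer from exact evidence (or, for Approx, of the
/// sequential unitig scan).
enum ToyEvidence {
    Exact { kmers: Vec<u64> },
    Approx { fingerprints: Vec<u8>, kmers: Vec<u64> },
    Hybrid { kmers: Vec<u64>, fingerprints: Vec<u8> },
}

/// Toy 8-bit fingerprint (low byte); the real hash is not shown in the source.
fn fingerprint(kmer: u64) -> u8 {
    (kmer & 0xFF) as u8
}

impl ToyEvidence {
    /// `slot` is what the MPHF returned for `kmer` (arbitrary for non-members).
    fn find(&self, slot: usize, kmer: u64) -> Option<usize> {
        let ok = match self {
            ToyEvidence::Exact { kmers } => kmers[slot] == kmer,
            ToyEvidence::Approx { fingerprints, .. }
            | ToyEvidence::Hybrid { fingerprints, .. } => {
                fingerprints[slot] == fingerprint(kmer)
            }
        };
        ok.then_some(slot)
    }

    /// Always exact: O(1) with exact evidence, O(n) scan for Approx.
    fn find_strict(&self, slot: usize, kmer: u64) -> Option<usize> {
        let ok = match self {
            ToyEvidence::Exact { kmers } | ToyEvidence::Hybrid { kmers, .. } => {
                kmers[slot] == kmer
            }
            ToyEvidence::Approx { kmers, .. } => kmers.iter().any(|&k| k == kmer),
        };
        ok.then_some(slot)
    }
}
```

In this model a non-member kmer whose low byte collides with a stored fingerprint is accepted by `find` on a Hybrid layer but rejected by `find_strict`, which is the trade-off the two methods expose.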
### Build surface
```rust
// Full MPHF + exact evidence build (two-pass, parallel)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>
// Full MPHF + evidence build (two-pass)
pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>
// Evidence-only builds (MPHF already present in dir)
// Evidence-only post-hoc builds (MPHF already present)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize> // dispatch
```
`MphfLayer::build` runs two sequential passes over `unitigs.bin`:
`MphfLayer::build` runs two passes over `unitigs.bin`:
1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
1. **Pass 1** (parallel via rayon): a `CanonicalKmerIter` (clonable, `Arc<Mmap>`, no file reopening) is passed to `new_from_par_iter` via `par_bridge()`. Produces `mphf.bin`. No `.idx` is read or created at this stage.
2. **Pass 2** (sequential): fill evidence files; call `fill_slot(slot, kmer)` per kmer. `.idx` is written last for Exact/Hybrid modes (query-time only).
`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.
There is no `build_evidence` dispatch wrapper — callers invoke `build_exact_evidence` or `build_approx_evidence` directly.
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
@@ -133,38 +138,37 @@ pub struct Hit<T = ()> {
```rust
// mode 1
impl Layer<()> {
pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
}
// mode 2
impl Layer<PersistentCompactIntMatrix> {
pub fn build(out_dir: &Path, block_bits: u8,
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn build_from_map(out_dir: &Path, block_bits: u8,
pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
// mode 3
impl Layer<PersistentBitMatrix> {
pub fn build_presence(out_dir: &Path, block_bits: u8,
pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
n_genomes: usize,
present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}
```
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.
All build impls delegate to `MphfLayer::build` via a mode-specific `fill_slot` callback. The `mode` parameter is forwarded directly — no `LayerMeta` is written.
### Evidence build helpers on Layer
Evidence-only post-hoc builds are accessible directly on `Layer<D>`:
```rust
impl<D: LayerData> Layer<D> {
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
}
```
These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.
There is no `build_evidence` dispatch wrapper.
---
@@ -212,14 +216,17 @@ pub struct LayeredMap<D: LayerData = ()> {
### Common methods
```rust
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn mode(&self) -> &IndexMode
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
```
`open` reads `PartitionMeta` once, extracts `mode`, and passes it to every `Layer::open` — no per-layer file is read. `create` stores the given mode in `PartitionMeta`.
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
### push_layer
@@ -272,14 +279,13 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful
```
partition_root/ ← LayeredMap (one partition)
meta.json — {"n_layers": N}
meta.json — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
layer_0/ ← Layer
layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
mphf.bin — ptr_hash MPHF (epserde format)
unitigs.bin — packed 2-bit nucleotide sequences
unitigs.bin.idx — UIDX index (exact evidence only)
evidence.bin — [u32; n], LE (exact evidence only)
fingerprint.bin — packed b-bit array (approx evidence only)
unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
evidence.bin — [u32; n], LE (Exact/Hybrid only)
fingerprint.bin — packed b-bit array (Approx/Hybrid only)
counts/ [mode 2] PersistentCompactIntMatrix
meta.json
col_000000.pciv
@@ -290,7 +296,7 @@ partition_root/ ← LayeredMap (one partition)
```
`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.
There is no `layer_meta.json`. The mode is stored once in `PartitionMeta` and is valid for all layers. `unitigs.bin.idx` is built at the end of `build_exact_evidence` — never during MPHF construction — and is consumed at query time only.
---
@@ -387,4 +393,4 @@ Each partition's new layer is built independently; the operation is fully parall
| `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |
| `serde / serde_json` | `PartitionMeta` serialisation |
+12 -23
@@ -61,35 +61,24 @@ File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) - 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
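The sampling rule described above can be sketched in a few lines. This is an illustrative reduction, not the repository's `build_unitig_idx`: the function name `sample_block_offsets` is hypothetical, the chunk offsets are passed in as a slice instead of being discovered by scanning the file, and nothing is written to disk.

```rust
/// Keep one offset every 2^block_bits chunks, then append the file size
/// as a sentinel. (Hypothetical helper; offsets are given, not scanned.)
fn sample_block_offsets(chunk_offsets: &[usize], file_size: usize, block_bits: u8) -> Vec<u32> {
    let mask = (1usize << block_bits) - 1;
    let mut block_offsets = Vec::new();
    for (chunk_count, &offset) in chunk_offsets.iter().enumerate() {
        // Chunk indices that are multiples of 2^block_bits start a block.
        if chunk_count & mask == 0 {
            block_offsets.push(offset as u32);
        }
    }
    block_offsets.push(file_size as u32); // sentinel
    block_offsets
}
```

With `block_bits = 0` the mask is zero, so every chunk gets its own entry; larger values trade index size for a short scan at lookup time.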
### `open()` vs `open_sequential()`
### `open()`, `open_sequential()`, `open_direct_access()`
`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
`UnitigFileReader` has three constructors:
`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
- `open(path)` — smart default: if `unitigs.bin.idx` exists, delegates to `open_direct_access`; otherwise delegates to `open_sequential`. Prefer this in call sites that don't require one specific mode.
- `open_sequential(path)` — never reads `.idx`. Sequential iterators only; `chunk_start(i)` falls back to an O(i) mmap scan rather than panicking.
- `open_direct_access(path)` — requires `.idx` to be present. Enables O(1) or O(2^block_bits) `chunk_start(i)`, used by `verify_canonical_kmer` at query time.
### `chunk_start(i)` — random access
`CanonicalKmerIter` — a clonable sequential iterator returned by `UnitigFileReader::iter_canonical_kmers()`. It holds an `Arc<Mmap>` so cloning resets the cursor to the start without reopening the file. This makes it usable with `par_bridge()` for parallel MPHF construction without random access.
```rust
fn chunk_start(&self, i: usize) -> usize {
// block_bits=0: single table lookup, O(1) — hot path
if self.block_bits == 0 {
return self.block_offsets[i] as usize;
}
// block_bits>0: lookup block, then scan at most 2^block_bits - 1 records
let block = i >> self.block_bits;
let rem = i & self.mask;
let mut offset = self.block_offsets[block] as usize;
for _ in 0..rem {
let seql_minus_k = self.mmap[offset] as usize;
offset += 1 + (seql_minus_k + self.k + 3) / 4;
}
offset
}
```
### `chunk_start(i)` — access modes
With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
When `.idx` is loaded (`open_direct_access`):
With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
- `block_bits = 0`: single array lookup, O(1).
- `block_bits > 0`: lookup block, then scan ≤ 2^block_bits records, O(2^block_bits).
When `.idx` is absent (`open_sequential`): `chunk_start(i)` performs an O(i) sequential mmap scan from offset 0. No panic — the function degrades gracefully. This degraded path is used by `find_strict()` on Approx layers (sequential scan of all canonical kmers).
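Both access paths can be modelled with one function. The sketch below is an assumption-laden toy: `ToyReader` is hypothetical, a `Vec<u8>` replaces the mmap, records are assumed to start at offset 0 with no file header, and the record layout (one length byte followed by the packed 2-bit sequence) is taken from the earlier version of `chunk_start` shown in this diff.

```rust
/// Toy stand-in for `UnitigFileReader` (hypothetical, for illustration).
struct ToyReader {
    data: Vec<u8>,           // stand-in for the mmap of unitigs.bin
    k: usize,                // kmer length
    block_bits: u8,
    block_offsets: Vec<u32>, // empty when no .idx was loaded
}

impl ToyReader {
    /// Byte length of the record at `offset`: 1 length byte plus the
    /// 2-bit packed sequence of (seql_minus_k + k) nucleotides.
    fn record_len(&self, offset: usize) -> usize {
        let seql_minus_k = self.data[offset] as usize;
        1 + (seql_minus_k + self.k + 3) / 4
    }

    fn chunk_start(&self, i: usize) -> usize {
        let (mut offset, skip) = if self.block_offsets.is_empty() {
            (0, i) // no index: O(i) scan from the first record
        } else {
            let mask = (1usize << self.block_bits) - 1;
            // indexed: jump to the block, then skip at most 2^block_bits - 1 records
            (self.block_offsets[i >> self.block_bits] as usize, i & mask)
        };
        for _ in 0..skip {
            offset += self.record_len(offset);
        }
        offset
    }
}
```

With `block_bits = 0` the skip count is always zero, so the indexed path reduces to a single table lookup, while the index-free path pays for every record before `i`.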
### Decoding a kmer from slot `s`