feat: centralize index configuration and add hybrid mode
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
@@ -27,10 +27,10 @@ part_XXXXX/
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
-`MphfLayer::build(dir, block_bits, fill_slot)` is called on the unitig directory:
+`MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)` is called on the unitig directory:

-1. **Pass 1**: build `.idx` via `build_unitig_idx(unitig_path, block_bits)`, then iterate all canonical kmers in parallel over chunks using `(0..unitigs.len()).into_par_iter()` + `unitigs.unitig(ci).into_canonical_kmers()`. Constructs and stores `mphf.bin` (ptr_hash, `new_from_par_iter`).
-2. **Pass 2**: iterate sequentially with `iter_indexed_canonical_kmers`; fill `evidence.bin`; call `fill_slot(slot, kmer)` callback once per kmer for DataStore population.
+1. **Pass 1** (parallel): a `CanonicalKmerIter` (clonable via `Arc<Mmap>`, so no file is reopened) is passed directly to `new_from_par_iter` via `par_bridge()`. No `.idx` is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces `mphf.bin`.
+2. **Pass 2** (sequential): iterate with `iter_indexed_canonical_kmers`; fill evidence files; call `fill_slot(slot, kmer)` callback per kmer. For Exact/Hybrid, `.idx` is written at the end of this pass, never earlier.
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
@@ -110,7 +110,7 @@ layer_i/
mphf.bin — ptr_hash phase-2 MPHF
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
-layer_meta.json — evidence kind, recorded at build time
+[no layer_meta.json; mode stored once in partition-level meta.json]
```
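As an illustration of the exact-mode slot layout listed above, the following Rust sketch packs a `(chunk_id, rank)` pair into one 32-bit word. The source only gives the field widths (25 and 7 bits). The field order (`chunk_id` in the upper bits, `rank` in the lower bits) and the names `pack_evidence` / `unpack_evidence` are assumptions made for this example.

```rust
// Hypothetical bit layout: [ chunk_id: 25 bits | rank: 7 bits ] in one u32.
// Only the field widths come from the document; the order is assumed.
const RANK_BITS: u32 = 7;
const CHUNK_BITS: u32 = 25;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << CHUNK_BITS) - 1;

/// Packs a pair into a single word, or returns None if a field overflows.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Splits a packed word back into (chunk_id, rank).
fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

Under this assumed layout, 7 rank bits address at most 128 kmers per chunk and 25 chunk bits allow 33,554,432 chunks.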
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
@@ -121,32 +121,32 @@ Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0
### Evidence modes
-Two evidence modes are supported, selected at build time via `EvidenceKind` and recorded in `layer_meta.json`.
+Three evidence modes are supported via `IndexMode`, stored once in `PartitionMeta` at the partition root. There is no `layer_meta.json`.

-**Exact** (`EvidenceKind::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot, encoding the position of the corresponding kmer in `unitigs.bin`. Membership verification reconstructs the kmer from `(chunk_id, rank)` and compares it to the query. Zero false positives. Requires `.idx` for random access.
+**Exact** (`IndexMode::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot. Verification reconstructs the kmer and compares it to the query. Zero false positives. `.idx` is required at query time.

-**Approx** (`EvidenceKind::Approx { b, z }`): `fingerprint.bin` stores a `b`-bit hash of the kmer at each MPHF slot. Membership check compares `kmer.seq_hash()` against the stored fingerprint. False-positive rate: 1/2^b per query. No `.idx` is written or needed.
+**Approx** (`IndexMode::Approx { b, z }`): `fingerprint.bin` stores a b-bit hash per slot. The false-positive rate is 1/2^b per query; the Findere z-parameter reduces the window false-positive rate to ≈ 1/2^(b·z). No `.idx` is written or needed.

+**Hybrid** (`IndexMode::Hybrid { b, z }`): stores both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (O(1)); `find_strict()` uses exact evidence (O(1)).
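To make the three modes concrete, here is a minimal Rust sketch of an `IndexMode`-like enum with helpers that encode the properties stated above. The variant names and the `b`, `z` parameters come from the text. The field types (`u8`) and the helper names `needs_idx`, `has_fingerprint`, `kmer_fp_rate`, and `window_fp_rate` are hypothetical, and applying the window estimate to Hybrid is an assumption.

```rust
/// Sketch of the three modes named in the text. Field types are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

impl IndexMode {
    /// Exact and Hybrid keep evidence.bin, so they need .idx at query time.
    fn needs_idx(self) -> bool {
        matches!(self, IndexMode::Exact | IndexMode::Hybrid { .. })
    }

    /// Approx and Hybrid store fingerprint.bin.
    fn has_fingerprint(self) -> bool {
        matches!(self, IndexMode::Approx { .. } | IndexMode::Hybrid { .. })
    }

    /// False-positive rate of one fingerprint check: 1 / 2^b (0 for Exact).
    /// For Hybrid this describes find(), not find_strict().
    fn kmer_fp_rate(self) -> f64 {
        match self {
            IndexMode::Exact => 0.0,
            IndexMode::Approx { b, .. } | IndexMode::Hybrid { b, .. } => {
                0.5f64.powi(i32::from(b))
            }
        }
    }

    /// Window-level estimate with the z parameter: about 1 / 2^(b * z).
    fn window_fp_rate(self) -> f64 {
        match self {
            IndexMode::Exact => 0.0,
            IndexMode::Approx { b, z } | IndexMode::Hybrid { b, z } => {
                0.5f64.powi(i32::from(b) * i32::from(z))
            }
        }
    }
}
```

For example, `b = 8` gives a per-kmer rate of 1/256 (about 0.39%), and with `z = 3` the window estimate drops to 1/2^24, roughly 6 × 10^-8.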
### Build functions
```
-MphfLayer::build(dir, block_bits, fill_slot)
-  Pass 1: par_iter over chunks via .idx → build mphf.bin
-  Pass 2: sequential iter → fill evidence.bin + call fill_slot
+MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
+  Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
+  Pass 2: sequential iter → fill evidence files + call fill_slot
+          .idx written last for Exact/Hybrid (query-time only)
MphfLayer::build_exact_evidence(dir, block_bits)
-  Standalone post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
+  Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
+  Uses open_sequential(); no .idx required on entry
MphfLayer::build_approx_evidence(dir, b, z)
-  Standalone post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
+  Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
+  Uses open_sequential(); never writes .idx
-MphfLayer::build_evidence(dir, kind, block_bits)
-  Dispatch wrapper: routes to build_exact_evidence or build_approx_evidence
```
-`build` always produces exact evidence. If approximate evidence is needed (e.g. `EvidenceKind::Approx`), the caller invokes `build_approx_evidence` after `build` to replace the evidence bundle.
+There is no `build_evidence` dispatch wrapper. Callers choose the appropriate post-hoc build directly.
In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
@@ -170,7 +170,7 @@ fn query(kmer) → Option<(layer_index, slot)>:
return None
```
-`MphfLayer::find` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
+`MphfLayer::find` dispatches on `LayerEvidence` in O(1); there are no separate `find_exact`/`find_approx` methods that can panic. `find_strict` always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
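The dispatch described above can be sketched with a toy layer in Rust. Everything here is a stand-in: `ToyLayer`, `toy_slot` (a modulo in place of the ptr_hash MPHF), `toy_fingerprint`, and the in-memory `kmers` vector (in place of `unitigs.bin` with `.idx`) are hypothetical, and the real `LayerEvidence` layout is not shown in the source. The sketch only illustrates which check each mode uses in `find` and `find_strict`.

```rust
/// Toy stand-in for the evidence loaded when a layer is opened.
enum LayerEvidence {
    Exact,
    Approx { fingerprints: Vec<u8> },
    Hybrid { fingerprints: Vec<u8> },
}

struct ToyLayer {
    /// Stand-in for unitigs.bin; kmers are stored in slot order for simplicity.
    kmers: Vec<u64>,
    evidence: LayerEvidence,
}

/// Toy "MPHF": only valid when kmers[i] % n == i for every i.
fn toy_slot(kmer: u64, n: usize) -> usize {
    (kmer % n as u64) as usize
}

/// Toy 8-bit fingerprint (multiplicative hash, top byte).
fn toy_fingerprint(kmer: u64) -> u8 {
    (kmer.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 56) as u8
}

fn fingerprints_for(kmers: &[u64]) -> Vec<u8> {
    kmers.iter().map(|&k| toy_fingerprint(k)).collect()
}

impl ToyLayer {
    /// O(1): exact comparison for Exact, fingerprint comparison otherwise.
    fn find(&self, kmer: u64) -> Option<usize> {
        if self.kmers.is_empty() {
            return None;
        }
        let slot = toy_slot(kmer, self.kmers.len());
        let hit = match &self.evidence {
            LayerEvidence::Exact => self.kmers[slot] == kmer,
            LayerEvidence::Approx { fingerprints }
            | LayerEvidence::Hybrid { fingerprints } => {
                fingerprints[slot] == toy_fingerprint(kmer)
            }
        };
        hit.then_some(slot)
    }

    /// Always exact: O(1) for Exact/Hybrid, O(n) sequential scan for Approx.
    fn find_strict(&self, kmer: u64) -> Option<usize> {
        if self.kmers.is_empty() {
            return None;
        }
        match &self.evidence {
            LayerEvidence::Exact | LayerEvidence::Hybrid { .. } => {
                let slot = toy_slot(kmer, self.kmers.len());
                (self.kmers[slot] == kmer).then_some(slot)
            }
            // No random access in Approx mode, so scan the kmers in order.
            LayerEvidence::Approx { .. } => self.kmers.iter().position(|&k| k == kmer),
        }
    }
}
```

In this toy, `find` on an Approx or Hybrid layer can return a false positive for an absent kmer whose fingerprint happens to match the probed slot, while `find_strict` never does.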
### Merging layers