98c14aade9
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
178 lines
9.8 KiB
Markdown
178 lines
9.8 KiB
Markdown
# MPHF selection — two-phase indexing architecture
|
||
|
||
## Why two phases are needed
|
||
|
||
Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.
|
||
|
||
### Phase 1 — provisional MPHF + kmer spectrum
|
||
|
||
Implemented in `obikpartitionner::KmerPartition::count_kmer()` → `count_partition()`.
|
||
|
||
1. **External sort**: read the dereplicated superkmer file; extract the raw `u64` canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: `sorted_unique.bin` — a flat array of f0 distinct sorted `u64` values. Exact kmer count f0 is known at this point.
|
||
2. **Build provisional MPHF** (ptr_hash, same configuration as phase 2) over `sorted_unique.bin` using `new_from_par_iter`. Delete `sorted_unique.bin` immediately after. Persist to `mphf1.bin`.
|
||
3. **Create `counts1.bin`**: `PersistentCompactIntVec` with f0 slots, zero-initialised.
|
||
4. **Accumulation pass**: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute `slot = mphf.index(kmer.raw())` and increment `counts1[slot]` by the superkmer's COUNT.
|
||
5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.
|
||
|
||
Files produced per partition:
|
||
|
||
```
|
||
part_XXXXX/
|
||
mphf1.bin — ptr_hash provisional MPHF (discarded after phase 2)
|
||
counts1.bin — PersistentCompactIntVec, f0 × u32 kmer counts
|
||
kmer_spectrum_raw.json — local frequency spectrum
|
||
```
|
||
|
||
### Phase 2 — definitive MPHF
|
||
|
||
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
|
||
|
||
`MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)` is called on the unitig directory:
|
||
|
||
1. **Pass 1** (parallel): a `CanonicalKmerIter` — clonable via `Arc<Mmap>`, no file reopening — is passed directly to `new_from_par_iter` via `par_bridge()`. No `.idx` is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces `mphf.bin`.
|
||
2. **Pass 2** (sequential): iterate with `iter_indexed_canonical_kmers`; fill evidence files; call `fill_slot(slot, kmer)` callback per kmer. For Exact/Hybrid, `.idx` is written at the end of this pass — never earlier.
|
||
|
||
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
|
||
|
||
---
|
||
|
||
## MPHF candidates
|
||
|
||
**boomphf** (BBHash algorithm, maintained by 10X Genomics):
|
||
|
||
- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
|
||
- Supports streaming construction (no exact count needed)
|
||
- Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available
|
||
|
||
**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):
|
||
|
||
- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64) and fastest construction (≥3.1×)
|
||
- Requires exact key count at construction — available at both phases after pass 1
|
||
- Published February 2025; accepted given performance profile and the fact that each MPHF is independently rebuildable from its unitig file
|
||
|
||
**FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
|
||
|
||
- ~2.1 bits/key — most compact; good query speed; deterministic construction
|
||
- `GOFunction` (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well
|
||
|
||
## MPHF choice per phase
|
||
|
||
**Both phases**: **ptr_hash**, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the `ph` crate dependency.
|
||
|
||
boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.
|
||
|
||
---
|
||
|
||
## Space at scale
|
||
|
||
For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):
|
||
|
||
| MPHF | bits/key | Total MPHF size |
|
||
|----------|----------|-----------------|
|
||
| boomphf | 3.7 | ~47 GB |
|
||
| ptr_hash | 2.4 | ~31 GB |
|
||
| FMPH | 2.1 | ~27 GB |
|
||
|
||
For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.
|
||
|
||
---
|
||
|
||
## ptr_hash configuration (phase 2)
|
||
|
||
```rust
|
||
type Mphf = PtrHash<
|
||
u64, // key: canonical kmer raw encoding
|
||
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
|
||
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
|
||
Xx64, // hasher: XXH3-64 with seed
|
||
Vec<u8>, // pilots
|
||
>;
|
||
```
|
||
|
||
**Hasher — `Xx64`**: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly; `Xx64` (XXH3-64, seeded) handles structured input correctly.
|
||
|
||
**Bucket function — `CubicEps`**: λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time — not chosen.
|
||
|
||
**Remap — `CachelineEfVec`**: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). One cacheline per query; space win dominates at billion-scale key counts.
|
||
|
||
---
|
||
|
||
## Multilayer index architecture
|
||
|
||
### Layer structure
|
||
|
||
Each layer is a self-contained unit. See [obilayeredmap](obilayeredmap.md) for the full on-disk layout. The MPHF-relevant files are:
|
||
|
||
```
|
||
layer_i/
|
||
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence source)
|
||
unitigs.bin.idx — random-access block index (block_bits controls granularity)
|
||
mphf.bin — ptr_hash phase-2 MPHF
|
||
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
|
||
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
|
||
[no layer_meta.json — mode stored once in partition-level meta.json]
|
||
```
|
||
|
||
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
|
||
|
||
1. For each kmer in B: probe existing layers. If found, the kmer is already indexed.
|
||
2. Collect kmers of B not present in any layer → set `B \ A`.
|
||
3. Build layer 1 from `B \ A` (dereplicate → count → De Bruijn → unitigs → `MphfLayer::build`).
|
||
|
||
### Evidence modes
|
||
|
||
Three evidence modes are supported via `IndexMode`, stored once in `PartitionMeta` at partition root. There is no `layer_meta.json`.
|
||
|
||
**Exact** (`IndexMode::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. `.idx` required at query time.
|
||
|
||
**Approx** (`IndexMode::Approx { b, z }`): `fingerprint.bin` stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No `.idx` written or needed.
|
||
|
||
**Hybrid** (`IndexMode::Hybrid { b, z }`): both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (O(1)); `find_strict()` uses exact evidence (O(1)).
|
||
|
||
### Build functions
|
||
|
||
```
|
||
MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
|
||
Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
|
||
Pass 2: sequential iter → fill evidence files + call fill_slot
|
||
.idx written last for Exact/Hybrid (query-time only)
|
||
|
||
MphfLayer::build_exact_evidence(dir, block_bits)
|
||
Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
|
||
Uses open_sequential(); no .idx required on entry
|
||
|
||
MphfLayer::build_approx_evidence(dir, b, z)
|
||
Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
|
||
Uses open_sequential(); never writes .idx
|
||
```
|
||
|
||
There is no `build_evidence` dispatch wrapper. Callers choose the appropriate post-hoc build directly.
|
||
|
||
In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
|
||
|
||
### Membership verification
|
||
|
||
ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:
|
||
|
||
- **Exact**: decode `(chunk_id, rank)` from `evidence.bin`; reconstruct the kmer via `unitigs.verify_canonical_kmer`; compare to query.
|
||
- **Approx**: compare `kmer.seq_hash()` to the b-bit fingerprint stored at the slot.
|
||
|
||
A mismatch in either mode means the kmer is absent from this layer; probe the next layer.
|
||
|
||
### Query algorithm
|
||
|
||
```
|
||
fn query(kmer) → Option<(layer_index, slot)>:
|
||
for (i, layer) in layers.iter().enumerate():
|
||
slot = layer.mphf.index(kmer)
|
||
if layer.evidence.matches(slot, kmer): // exact or approx dispatch
|
||
return Some((i, slot))
|
||
return None
|
||
```
|
||
|
||
`MphfLayer::find` dispatches on `LayerEvidence` at O(1) — no panicking `find_exact`/`find_approx` methods. `find_strict` always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
|
||
|
||
### Merging layers
|
||
|
||
Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.
|