Files
obikmer/docmd/implementation/mphf.md
T
Eric Coissac 4736a7b6de refactor: restructure k-mer partitioning pipeline for memory efficiency
Replace in-memory hashing with a disk-backed external merge sort and `PersistentCompactIntVec` to drastically reduce peak RAM. Unify both phases using a custom `PtrHash` MPHF, eliminating `GOFunction` and `boomphf`. Introduce a concrete three-step `count_partition()` pipeline with adaptive chunk sizing based on available system memory. Update dependencies to `memmap2`, `ptr_hash`, and `obicompactvec`. Additionally, document strict genomics-only memory constraints and enforce an architectural feedback workflow requiring explicit user authorization before structural changes.
2026-05-17 16:08:47 +08:00

139 lines
7.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# MPHF selection — two-phase indexing architecture
## Why two phases are needed
Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.
### Phase 1 — provisional MPHF + kmer spectrum
Implemented in `obikpartitionner::KmerPartition::count_kmer()``count_partition()`.
1. **External sort**: read the dereplicated superkmer file; extract the raw `u64` canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: `sorted_unique.bin` — a flat array of f0 distinct sorted `u64` values. Exact kmer count f0 is known at this point.
2. **Build provisional MPHF** (ptr_hash, same configuration as phase 2) over `sorted_unique.bin` using `new_from_par_iter`. Delete `sorted_unique.bin` immediately after. Persist to `mphf1.bin`.
3. **Create `counts1.bin`**: `PersistentCompactIntVec` with f0 slots, zero-initialised.
4. **Accumulation pass**: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute `slot = mphf.index(kmer.raw())` and increment `counts1[slot]` by the superkmer's COUNT.
5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.
Files produced per partition:
```
part_XXXXX/
mphf1.bin — ptr_hash provisional MPHF (discarded after phase 2)
counts1.bin — PersistentCompactIntVec, f0 × u32 kmer counts
kmer_spectrum_raw.json — local frequency spectrum
```
### Phase 2 — definitive MPHF
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
`MphfLayer::build` is called on the unitig file:
1. **Pass 1**: iterate all canonical kmers from `unitigs.bin` in parallel, build and store `mphf.bin` (ptr_hash).
2. **Pass 2**: iterate sequentially, fill `evidence.bin`, call the mode-specific `fill_slot` callback.
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
---
## MPHF candidates
**boomphf** (BBHash algorithm, maintained by 10X Genomics):
- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
- Supports streaming construction (no exact count needed)
- Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available
**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):
- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 812 ns/key for u64) and fastest construction (≥3.1×)
- Requires exact key count at construction — available at both phases after pass 1
- Published February 2025; accepted given performance profile and the fact that each MPHF is independently rebuildable from its unitig file
**FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
- ~2.1 bits/key — most compact; good query speed; deterministic construction
- `GOFunction` (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well
## MPHF choice per phase
**Both phases**: **ptr_hash**, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the `ph` crate dependency.
boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.
---
## Space at scale
For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):
| MPHF | bits/key | Total MPHF size |
|----------|----------|-----------------|
| boomphf | 3.7 | ~47 GB |
| ptr_hash | 2.4 | ~31 GB |
| FMPH | 2.1 | ~27 GB |
For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 330 M unique kmers → 18 MB per phase-2 MPHF, well within RAM.
---
## ptr_hash configuration (phase 2)
```rust
type Mphf = PtrHash<
u64, // key: canonical kmer raw encoding
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
Xx64, // hasher: XXH3-64 with seed
Vec<u8>, // pilots
>;
```
**Hasher — `Xx64`**: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly; `Xx64` (XXH3-64, seeded) handles structured input correctly.
**Bucket function — `CubicEps`**: λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time — not chosen.
**Remap — `CachelineEfVec`**: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). One cacheline per query; space win dominates at billion-scale key counts.
---
## Multilayer index architecture
### Layer structure
Each layer is a self-contained unit. See [obilayeredmap](obilayeredmap.md) for the full on-disk layout. The MPHF-relevant files are:
```
layer_i/
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence)
mphf.bin — ptr_hash phase-2 MPHF
evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
```
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
1. For each kmer in B: probe existing layers. If found, the kmer is already indexed.
2. Collect kmers of B not present in any layer → set `B \ A`.
3. Build layer 1 from `B \ A` (dereplicate → count → De Bruijn → unitigs → `MphfLayer::build`).
### Membership verification
ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(chunk_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
### Query algorithm
```
fn query(kmer) → Option<(layer_index, slot)>:
for (i, layer) in layers.iter().enumerate():
slot = layer.mphf.index(kmer)
if layer.evidence.decode(slot) matches kmer:
return Some((i, slot))
return None
```
Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.
### Merging layers
Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.