first implementation, but far from optimal
@@ -0,0 +1,48 @@
# MPHF selection: analysis in progress

The choice of minimal perfect hash function (MPHF) for phase 6 is not yet settled. Three candidates were evaluated.

## Candidates

**boomphf** (BBHash algorithm, maintained by 10x Genomics):
- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
- Parallel construction; well-tested with DNA kmer data at scale
- Drawback: largest space footprint of the three

**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):
- ~2.4 bits/key; fastest queries (≥2.1× faster than the alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1× faster)
- Solid theoretical foundation; paper and Rust crate from the same author
- Drawback: published in February 2025, so very young, with no production track record

**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
- ~2.1 bits/key, the most compact of the three; good query speed; parallelisable construction
- More established than ptr_hash; actively maintained
- Currently the preferred candidate
## Space at scale

For 1 024 partitions × 100 M kmers per partition (about 1.02 × 10^11 keys in total):

| MPHF     | bits/key | Total MPHF size |
|----------|----------|-----------------|
| boomphf  | 3.7      | ~47 GB          |
| ptr_hash | 2.4      | ~31 GB          |
| FMPHGO   | 2.1      | ~27 GB          |

In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers, which gives roughly 1–8 MB per MPHF (at ~2.1 bits/key), well within RAM.
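The figures in this section follow from `keys × bits/key ÷ 8`. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes) and equally sized partitions; the function names and the `CANDIDATES` table are illustrative only and do not come from any of the crates:

```rust
/// Nominal bits/key for each candidate, as quoted in the table above.
const CANDIDATES: [(&str, f64); 3] = [
    ("boomphf", 3.7),
    ("ptr_hash", 2.4),
    ("FMPHGO", 2.1),
];

/// Size in bytes of one MPHF over `keys` keys at `bits_per_key` bits per key.
fn mphf_bytes(keys: u64, bits_per_key: f64) -> f64 {
    keys as f64 * bits_per_key / 8.0
}

/// Total size in decimal gigabytes for `partitions` equally sized partitions.
fn total_gb(partitions: u64, keys_per_partition: u64, bits_per_key: f64) -> f64 {
    mphf_bytes(partitions * keys_per_partition, bits_per_key) / 1e9
}

/// One (name, total GB) row per candidate for the given partition layout.
fn size_table(partitions: u64, keys_per_partition: u64) -> Vec<(&'static str, f64)> {
    CANDIDATES
        .iter()
        .map(|&(name, bits)| (name, total_gb(partitions, keys_per_partition, bits)))
        .collect()
}
```

For example, `total_gb(1_024, 100_000_000, 2.1)` evaluates to about 26.9, which the table rounds to ~27 GB, and `mphf_bytes(30_000_000, 2.1)` is just under 8 MB.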
## On-disk and mmap considerations

All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently; strict zero-copy is a refinement, not a blocker.

No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices, which makes count access cache-friendly), but it is C++-only and covers more than just the MPHF layer.
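To illustrate the "contiguous byte blob, loaded with a copy" path described in this section, here is a minimal standard-library sketch. It assumes the MPHF state can be represented as a flat `&[u64]`; the little-endian layout and the names `write_blob` and `read_blob` are hypothetical and are not the serialisation format of any of the three crates:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write a flat array of 64-bit words as one contiguous little-endian blob.
fn write_blob(path: &Path, words: &[u64]) -> io::Result<()> {
    let mut bytes = Vec::with_capacity(words.len() * 8);
    for w in words {
        bytes.extend_from_slice(&w.to_le_bytes());
    }
    fs::write(path, bytes)
}

/// Read the blob back into an owned Vec<u64>.
/// This copies the data out of the file: it is the non-zero-copy path.
fn read_blob(path: &Path) -> io::Result<Vec<u64>> {
    let bytes = fs::read(path)?;
    if bytes.len() % 8 != 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "blob length is not a multiple of 8 bytes",
        ));
    }
    let mut words = Vec::with_capacity(bytes.len() / 8);
    for chunk in bytes.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        words.push(u64::from_le_bytes(buf));
    }
    Ok(words)
}
```

At 1–8 MB per partition the copy in `read_blob` is cheap. A zero-copy variant would map the file and reinterpret the bytes in place, which is where an archive format such as rkyv would come in.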
## Open questions

- Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.
- Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.
- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.
- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
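The query-speed question could be checked on the project's own kmer data with a small timing harness. The sketch below is one possible shape, not a rigorous benchmark: `ns_per_key` is a hypothetical helper, and `lookup` is a stand-in closure that would have to wrap each crate's real query call, which is not shown here.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time `lookup` over all `keys` and return (nanoseconds per key, checksum).
/// The checksum keeps the compiler from optimising the lookups away.
fn ns_per_key<F: Fn(u64) -> u64>(lookup: F, keys: &[u64]) -> (f64, u64) {
    let start = Instant::now();
    let mut checksum = 0u64;
    for &k in keys {
        checksum = checksum.wrapping_add(black_box(lookup(black_box(k))));
    }
    let elapsed = start.elapsed();
    // Guard against division by zero when `keys` is empty.
    let n = keys.len().max(1) as f64;
    (elapsed.as_nanos() as f64 / n, checksum)
}
```

Running the same key set through each candidate, with keys drawn from a real partition rather than a synthetic sequence, would show whether the published 2.1–3.3× gap carries over to this workload.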