obikmer/docmd/implementation/mphf.md
2026-04-19 12:17:16 +02:00

# MPHF selection: analysis in progress

The choice of Minimal Perfect Hash Function (MPHF) for phase 6 is not yet settled. Three candidates have been evaluated so far.

## Candidates

**boomphf** (BBHash algorithm, maintained by 10X Genomics):

- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
- Parallel construction; well tested on DNA kmer data at scale
- Drawback: largest space footprint of the three

**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):

- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8-12 ns/key for u64 keys in tight loops) and fastest construction (≥3.1×)
- Solid theoretical foundation; paper and Rust crate come from the same author
- Drawback: published February 2025, so very young, with no production track record

**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

- ~2.1 bits/key, the most compact of the three; good query speed; parallelisable construction
- More established than ptr_hash; actively maintained
- Currently the preferred candidate

## Space at scale

For 1 024 partitions × 100 M kmers/partition:

| MPHF     | bits/key | Total MPHF size |
|----------|----------|-----------------|
| boomphf  | 3.7      | ~47 GB          |
| ptr_hash | 2.4      | ~31 GB          |
| FMPHGO   | 2.1      | ~27 GB          |

In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3-30 M kmers → 1-8 MB per MPHF, well within RAM.

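
The figures in this section follow from `keys × bits/key ÷ 8`. A minimal sketch of that arithmetic is given here so the table can be recomputed for other partition counts or sizes. The names `CANDIDATES`, `mphf_bytes` and `totals_gb` are hypothetical helpers, not part of any of the three crates, and GB is taken as 10^9 bytes.

```rust
/// Candidate names and the approximate bits/key figures quoted in this document.
const CANDIDATES: [(&str, f64); 3] = [("boomphf", 3.7), ("ptr_hash", 2.4), ("FMPHGO", 2.1)];

/// Bytes needed by an MPHF over `keys` keys at `bits_per_key` bits each.
fn mphf_bytes(keys: u64, bits_per_key: f64) -> f64 {
    keys as f64 * bits_per_key / 8.0
}

/// Total MPHF size in GB (10^9 bytes) summed over all partitions, per candidate.
fn totals_gb(partitions: u64, kmers_per_partition: u64) -> Vec<(&'static str, f64)> {
    CANDIDATES
        .iter()
        .map(|&(name, bpk)| {
            let keys = partitions * kmers_per_partition;
            (name, mphf_bytes(keys, bpk) / 1e9)
        })
        .collect()
}
```

With these helpers, `totals_gb(1 << 10, 100_000_000)` yields about 47.4, 30.7 and 26.9 GB for boomphf, ptr_hash and FMPHGO respectively. For a single partition of 30 M kmers (an illustrative size), `mphf_bytes(30_000_000, 2.1) / 1e6` comes to about 7.9 MB.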
## On-disk and mmap considerations

All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), which makes them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1-8 MB, the OS page cache handles this transparently; strict zero-copy is a refinement, not a blocker.

No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

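
To make the "contiguous byte blob, loaded with a copy" point concrete, here is a minimal std-only sketch. `FlatBits` is a hypothetical stand-in for an MPHF's internal bit array. It does not reproduce the actual layout or serialisation format of boomphf, ptr_hash or `ph`, and the little-endian word encoding is an assumption made for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for an MPHF's storage: a flat array of 64-bit words.
struct FlatBits {
    words: Vec<u64>,
}

impl FlatBits {
    /// Serialise as one contiguous little-endian byte blob.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.words.len() * 8);
        for w in &self.words {
            out.extend_from_slice(&w.to_le_bytes());
        }
        out
    }

    /// Rebuild from a blob. Every word is copied into a fresh Vec,
    /// which is the cost a copying (non zero-copy) load pays.
    fn from_bytes(blob: &[u8]) -> io::Result<Self> {
        if blob.len() % 8 != 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "blob length is not a multiple of 8",
            ));
        }
        let words = blob
            .chunks_exact(8)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                u64::from_le_bytes(buf)
            })
            .collect();
        Ok(FlatBits { words })
    }

    /// Write one partition's blob to disk.
    fn save(&self, path: &Path) -> io::Result<()> {
        fs::write(path, self.to_bytes())
    }

    /// Read one partition's blob back; the file is read fully, then copied into words.
    fn load(path: &Path) -> io::Result<Self> {
        Self::from_bytes(&fs::read(path)?)
    }
}
```

A zero-copy variant would instead map the file and interpret the bytes in place, which is where rkyv or a memory-mapping crate would come in; that is deliberately not shown here.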
## Open questions

- Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.
- Evaluate whether ptr_hash's query speed advantage (2.1-3.3×) justifies adopting a crate that is less than a year old.
- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.
- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.