# MPHF selection: analysis in progress

The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.

## Candidates

**boomphf** (BBHash algorithm, maintained by 10X Genomics):

- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
- Parallel construction; well-tested with DNA kmer data at scale
- Drawback: largest space footprint of the three

**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):

- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
- Theoretical foundation solid; paper and Rust crate from the same author
- Drawback: published February 2025; very young, no production track record

**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

- ~2.1 bits/key, the most compact of the three; good query speed; parallelisable construction
- More established than ptr_hash; actively maintained
- Currently preferred candidate

## Space at scale

For 1 024 partitions × 100 M kmers/partition:

| MPHF     | bits/key | Total MPHF size |
|----------|----------|-----------------|
| boomphf  | 3.7      | ~47 GB          |
| ptr_hash | 2.4      | ~31 GB          |
| FMPHGO   | 2.1      | ~27 GB          |

In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.

## On-disk and mmap considerations

All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently; strict zero-copy is a refinement, not a blocker.

No established Rust crate provides a natively on-disk MPHF.
**SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

## Open questions

- Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.
- Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.
- Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.
- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
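
For the first open question, the footprint of each candidate follows directly from the bits/key figures quoted under "Candidates" once partition sizes have been measured. The sketch below reproduces the sizing arithmetic used in this note. The names `mphf_bytes` and `CANDIDATES` are illustrative and do not belong to any crate. Sizes are in decimal units (1 GB = 10^9 bytes), and the calculation assumes the quoted bits/key are the whole cost, with no extra per-structure overhead.

```rust
/// Approximate bits per key for each candidate, as quoted in this note.
const CANDIDATES: [(&str, f64); 3] = [
    ("boomphf", 3.7),
    ("ptr_hash", 2.4),
    ("FMPHGO", 2.1),
];

/// Estimated MPHF size in bytes for `keys` keys at `bits_per_key`.
/// Assumption: no overhead beyond the quoted bits/key.
fn mphf_bytes(keys: u64, bits_per_key: f64) -> f64 {
    keys as f64 * bits_per_key / 8.0
}

fn main() {
    let partitions: u64 = 1 << 10; // p = 10 → 1 024 partitions

    // Upper-bound scenario from "Space at scale": 100 M kmers per partition.
    for &(name, bpk) in CANDIDATES.iter() {
        let total = mphf_bytes(partitions * 100_000_000, bpk);
        println!("{:<9} {:.1} bits/key  total ≈ {:.1} GB", name, bpk, total / 1e9);
    }

    // Per-partition size for the 3–30 M kmer range quoted for a human genome.
    for &keys in [3_000_000u64, 30_000_000].iter() {
        for &(name, bpk) in CANDIDATES.iter() {
            let mb = mphf_bytes(keys, bpk) / 1e6;
            println!("{:<9} {:>10} kmers  ≈ {:.1} MB per partition", name, keys, mb);
        }
    }
}
```

At 3–30 M kmers per partition this gives roughly 0.8–7.9 MB for FMPHGO and 1.4–13.9 MB for boomphf, so the 1–8 MB estimate above corresponds to the most compact candidate. Replacing the hard-coded partition sizes with measured ones is the only change needed once real data is available.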