2026-04-19 12:17:16 +02:00


MPHF selection — analysis in progress

The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.

Candidates

boomphf (BBHash algorithm, maintained by 10X Genomics):

  • ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
  • Parallel construction; well-tested with DNA kmer data at scale
  • Drawback: largest space footprint of the three

ptr_hash (PtrHash algorithm, Groot Koerkamp, SEA 2025):

  • ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
  • Theoretical foundation solid; paper and Rust crate from the same author
  • Drawback: published February 2025 — very young, no production track record

FMPHGO (ph crate, Beling, ACM JEA 2023):

  • ~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
  • More established than ptr_hash; actively maintained
  • Currently preferred candidate

Space at scale

For 1 024 partitions × 100 M kmers/partition:

| MPHF     | bits/key | Total MPHF size |
|----------|----------|-----------------|
| boomphf  | 3.7      | ~47 GB          |
| ptr_hash | 2.4      | ~31 GB          |
| FMPHGO   | 2.1      | ~27 GB          |
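
The totals above follow from partitions × keys per partition × bits/key ÷ 8. A minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and ignoring any fixed per-structure overhead; the function names are hypothetical and not part of any of the three crates:

```rust
/// Estimated MPHF size in bytes for `n_keys` keys at `bits_per_key`.
/// Assumption of this sketch: fixed overhead (headers, seeds) is ignored.
pub fn mphf_bytes(n_keys: u64, bits_per_key: f64) -> f64 {
    n_keys as f64 * bits_per_key / 8.0
}

/// Total size in decimal gigabytes (1 GB = 10^9 bytes) when every
/// partition holds `keys_per_partition` keys.
pub fn total_gb(partitions: u64, keys_per_partition: u64, bits_per_key: f64) -> f64 {
    partitions as f64 * mphf_bytes(keys_per_partition, bits_per_key) / 1e9
}
```

Calling `total_gb(1024, 100_000_000, bits)` with `bits` set to 3.7, 2.4 and 2.1 gives the three rows of the table before rounding; `mphf_bytes` alone gives the size of a single partition's MPHF.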

In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.

On-disk and mmap considerations

All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the ph crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently; strict zero-copy is a refinement, not a blocker.
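
To illustrate the difference between a copying load and zero-copy access, here is a small standard-library sketch. It assumes a hypothetical layout in which the bit array is stored as little-endian 64-bit words; this is not the on-disk format of ph, boomphf or ptr_hash, and the function names are invented for the example. `from_blob` rebuilds the word array, so it copies as a serde-style load does, while `get_bit` reads straight from the byte slice, which is what a view over a memory-mapped file would do.

```rust
/// Serialise a flat array of 64-bit words as one contiguous little-endian blob.
/// Hypothetical layout chosen for this sketch only.
pub fn to_blob(words: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(words.len() * 8);
    for w in words {
        out.extend_from_slice(&w.to_le_bytes());
    }
    out
}

/// Copying load: allocate a new Vec<u64> and decode every word.
/// Returns None if the blob length is not a multiple of 8 bytes.
pub fn from_blob(blob: &[u8]) -> Option<Vec<u64>> {
    if blob.len() % 8 != 0 {
        return None;
    }
    let mut words = Vec::with_capacity(blob.len() / 8);
    for chunk in blob.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        words.push(u64::from_le_bytes(buf));
    }
    Some(words)
}

/// Zero-copy style read: fetch bit `i` directly from the byte slice.
/// With little-endian words, bit `i` of the word array lives in byte `i / 8`
/// at position `i % 8`. Returns None when `i` is out of range.
pub fn get_bit(blob: &[u8], i: usize) -> Option<bool> {
    let byte = blob.get(i / 8)?;
    Some((*byte >> (i % 8)) & 1 == 1)
}
```

With a real memory-mapped file, the `&[u8]` passed to `get_bit` would be the mapped region, and only the pages actually touched would be brought in by the OS.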

No established Rust crate provides a natively on-disk MPHF. SSHash (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

Open questions

  • Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.
  • Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.
  • Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.
  • Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.