Eric Coissac ebbfe35cbc Refactor: Extract utility function for string reversal

Extracted `inverser_chaine` into a reusable utility function with docstring and added unit test to ensure correctness.

2026-04-30 06:58:46 +02:00

MPHF selection — two-phase indexing architecture

Indexing architecture

Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.

Superkmer vs kmer counts

The SKFileMeta sidecar written by SKFileWriter records instances (the number of unique superkmers) and length_sum (their total length in nucleotides). A superkmer of length L contains L − k + 1 kmers, so summing over the partition gives a kmer count of length_sum − instances × (k − 1). This is an upper bound on the number of unique kmers, not the exact value: two distinct superkmers (same minimizer, different flanking contexts) can share kmers, and each shared kmer is counted once per superkmer. The exact number of unique kmers is known only after enumerating and deduplicating them.
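As a minimal sketch of this estimate in Rust (the function name and the `Option` return type are illustrative assumptions, not part of SKFileMeta):

```rust
/// Upper bound on the number of kmers in a partition, computed from the two
/// sidecar fields: `length_sum` (total nucleotides over all unique
/// superkmers) and `instances` (number of unique superkmers).
///
/// Returns `None` when k is zero or when the arithmetic would overflow or
/// underflow, which indicates inconsistent metadata.
pub fn estimated_kmer_count(length_sum: u64, instances: u64, k: u64) -> Option<u64> {
    if k == 0 {
        return None;
    }
    // Each superkmer of length L contributes L - (k - 1) kmers, so the
    // partition total is length_sum - instances * (k - 1).
    let overhead = instances.checked_mul(k - 1)?;
    length_sum.checked_sub(overhead)
}
```

For example, three superkmers of lengths 31, 35 and 40 with k = 31 give length_sum = 106 and instances = 3, hence 106 − 3 × 30 = 16 kmers (1 + 5 + 10).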

Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.
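To illustrate why partitioning by minimizer keeps every kmer in a single partition, the sketch below derives the partition from the minimizer alone. The lexicographic minimizer ordering, the use of `DefaultHasher`, and the names `minimizer` and `partition_of` are assumptions made for illustration; the actual scheme may differ (for instance canonical kmers or a hashed ordering).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lexicographically smallest m-mer of `kmer` (assumed ordering).
/// Panics if m is zero or larger than the kmer length.
pub fn minimizer(kmer: &[u8], m: usize) -> &[u8] {
    kmer.windows(m)
        .min()
        .expect("kmer must be at least m nucleotides long")
}

/// Partition index derived only from the minimizer, so it is a pure
/// function of the kmer. `n_partitions` must be non-zero.
pub fn partition_of(kmer: &[u8], m: usize, n_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    minimizer(kmer, m).hash(&mut hasher);
    hasher.finish() % n_partitions
}
```

Because the partition depends only on the minimizer, every occurrence of a given kmer, whichever superkmer it was extracted from, maps to the same partition.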

Phase 1 — provisional index and spectrum

1. Enumerate all kmers from the dereplicated superkmers of the partition.
2. Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).
3. Accumulate counts: for each kmer in each superkmer, count[MPHF(kmer)] += sk.count().
4. Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).
5. Apply the count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.
6. Discard the provisional MPHF.
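The steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `SuperKmer`, `Phase1` and `phase1` are hypothetical names, kmers are kept as plain byte slices, and the rank of a kmer in a sorted, deduplicated key vector stands in for the provisional MPHF (it is a minimal perfect hash over the key set, only far less compact than FMPHGO).

```rust
use std::collections::BTreeMap;

/// Hypothetical dereplicated superkmer: sequence plus occurrence count.
pub struct SuperKmer {
    pub seq: Vec<u8>,
    pub count: u32,
}

/// Output of phase 1: the frequency spectrum and the surviving kmers.
pub struct Phase1 {
    /// occurrences -> number of distinct kmers seen that many times
    pub spectrum: BTreeMap<u64, u64>,
    /// kmers whose count is at least `min_count`, in sorted order
    pub survivors: Vec<Vec<u8>>,
}

/// Runs phase 1 on one partition. Panics if k is zero.
pub fn phase1(superkmers: &[SuperKmer], k: usize, min_count: u64) -> Phase1 {
    // Step 1: enumerate all kmers, then deduplicate to obtain the key set.
    let mut keys: Vec<&[u8]> = superkmers
        .iter()
        .flat_map(|sk| sk.seq.windows(k))
        .collect();
    keys.sort_unstable();
    keys.dedup();

    // Step 2: stand-in for the provisional MPHF (rank in the sorted keys).
    let slot = |kmer: &[u8]| -> usize {
        keys.binary_search(&kmer)
            .expect("every enumerated kmer is in the key set")
    };

    // Step 3: accumulate counts, weighting each kmer by its superkmer count.
    let mut counts = vec![0u64; keys.len()];
    for sk in superkmers {
        for kmer in sk.seq.windows(k) {
            counts[slot(kmer)] += u64::from(sk.count);
        }
    }

    // Step 4: frequency spectrum.
    let mut spectrum = BTreeMap::new();
    for &c in &counts {
        *spectrum.entry(c).or_insert(0u64) += 1;
    }

    // Step 5: count filter; the exact number of survivors is now known.
    let mut survivors = Vec::new();
    for (kmer, &c) in keys.iter().zip(&counts) {
        if c >= min_count {
            survivors.push(kmer.to_vec());
        }
    }

    // Step 6: `keys` and `counts` (the provisional index) are dropped here.
    Phase1 { spectrum, survivors }
}
```

The length of `survivors` is the exact key count that phase 2 needs.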

Phase 2 — definitive index

Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).

Candidates

boomphf (BBHash algorithm, maintained by 10X Genomics):

~3.7 bits/key; mature crate; the underlying BBHash algorithm is used in production bioinformatics tools (e.g. Pufferfish)
Parallel construction; well-tested with DNA kmer data at scale
Drawback: largest space footprint of the three. Its main differentiator, streaming construction without an exact key count, is irrelevant here because the exact count is available at phase 2

ptr_hash (PtrHash algorithm, Groot Koerkamp, SEA 2025):

~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
Requires exact key count at construction — available at phase 2
Drawback: published February 2025 — very young, no production track record

FMPHGO (ph crate, Beling, ACM JEA 2023):

~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
More established than ptr_hash; actively maintained
Works well with overestimated capacity → natural fit for phase 1

MPHF choice per phase

Phase 1 (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.

Phase 2 (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.

boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.

Space at scale

For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):

MPHF	bits/key	Total MPHF size
boomphf	3.7	~47 GB
ptr_hash	2.4	~31 GB
FMPHGO	2.1	~27 GB

For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers, i.e. roughly 0.8–8 MB per phase-2 MPHF at 2.1 bits/key, well within RAM.
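The sizes in the table and the per-partition range follow from keys × bits/key ÷ 8. A small helper (hypothetical name, decimal units assumed: 1 GB = 10^9 bytes) makes the arithmetic explicit:

```rust
/// Approximate MPHF size in bytes for `keys` keys at `bits_per_key`.
pub fn mphf_bytes(keys: u64, bits_per_key: f64) -> f64 {
    keys as f64 * bits_per_key / 8.0
}

// Worst case from the table: 1 024 partitions of 100 M kmers each.
//   mphf_bytes(1024 * 100_000_000, 3.7) / 1e9  is about 47.4 (boomphf)
//   mphf_bytes(1024 * 100_000_000, 2.4) / 1e9  is about 30.7 (ptr_hash)
//   mphf_bytes(1024 * 100_000_000, 2.1) / 1e9  is about 26.9 (FMPHGO)
// Single realistic partition:
//   mphf_bytes(30_000_000, 2.1) / 1e6          is about 7.9 (MB)
```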

On-disk and mmap considerations

All three are in-memory structures. Their internal representation is flat bit arrays with no pointer-linked nodes, which makes them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the ph crate currently uses serde, so loading involves a copy. With per-partition MPHF sizes of 1–8 MB, that copy is cheap and the OS page cache keeps recently used index files resident, so strict zero-copy is a refinement, not a blocker.
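To make the "contiguous byte blob" point concrete, here is a minimal std-only sketch that writes a flat array of u64 words and reads it back with a copy. The layout (little-endian word count followed by the words) and the function names are assumptions for illustration; this is not the on-disk format of ph, ptr_hash or boomphf.

```rust
use std::io::{self, Read, Write};

/// Writes `words` as a little-endian length prefix followed by the words.
pub fn write_words<W: Write>(mut out: W, words: &[u64]) -> io::Result<()> {
    out.write_all(&(words.len() as u64).to_le_bytes())?;
    for w in words {
        out.write_all(&w.to_le_bytes())?;
    }
    Ok(())
}

/// Reads a blob produced by `write_words` into a freshly allocated vector.
/// This is the copying load path; a zero-copy variant would instead
/// reinterpret an mmapped region in place.
pub fn read_words<R: Read>(mut input: R) -> io::Result<Vec<u64>> {
    let mut buf = [0u8; 8];
    input.read_exact(&mut buf)?;
    let n = u64::from_le_bytes(buf);
    let mut words = Vec::new();
    for _ in 0..n {
        input.read_exact(&mut buf)?;
        words.push(u64::from_le_bytes(buf));
    }
    Ok(words)
}
```

One such blob per partition keeps each index independently loadable.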

No established Rust crate provides a natively on-disk MPHF. SSHash (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

Open questions

Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.
Revisit ptr_hash for phase 2 once the crate has a broader production track record.
Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.
Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
