# MPHF selection — two-phase indexing architecture

## Indexing architecture

Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.

### Superkmer vs kmer counts

The `SKFileMeta` sidecar written by `SKFileWriter` records `instances` (unique superkmers) and `length_sum` (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so summing over the partition gives a kmer count of `length_sum − instances × (k − 1)`. This is an **overestimate** of unique kmers (an upper bound): two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.
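As a minimal sketch of the estimate, the function below takes the two sidecar totals as plain integers. The function name and the integer types are assumptions made for illustration; the actual `SKFileMeta` field types are not shown in this document.

```rust
/// Upper bound on the number of kmers in a partition, from sidecar totals.
/// `length_sum`: total nucleotides over all superkmers (assumed u64).
/// `instances`: number of unique superkmers (assumed u64).
/// Each superkmer of length L contributes L - (k - 1) kmers.
fn estimate_kmer_count(length_sum: u64, instances: u64, k: u64) -> u64 {
    // saturating_sub guards against inconsistent metadata or k == 0.
    length_sum.saturating_sub(instances * k.saturating_sub(1))
}
```

For three superkmers of lengths 40, 35 and 31 with k = 31, the totals are `length_sum = 106` and `instances = 3`, and the estimate is 10 + 5 + 1 = 16 kmers.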

Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.

### Phase 1 — provisional index and spectrum

1. Enumerate all kmers from the dereplicated superkmers of the partition.
2. Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).
3. Accumulate counts: for each kmer in each superkmer, `count[MPHF(kmer)] += sk.count()`.
4. Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).
5. Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.
6. Discard the provisional MPHF.
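
The six steps can be sketched on a toy partition as follows. This is an illustration under stated assumptions, not the pipeline's code: a `HashMap` stands in for the provisional MPHF plus its count array, kmers are plain string slices (no 2-bit packing, no canonical form), and `SuperKmer`, `phase1` and `min_count` are hypothetical names.

```rust
use std::collections::{BTreeMap, HashMap};

/// A dereplicated superkmer: its sequence and how many times it was observed.
struct SuperKmer<'a> {
    seq: &'a str,
    count: u32,
}

/// Phase 1 on one partition. Returns the frequency spectrum
/// (occurrences -> number of kmers) and the kmers that pass the filter.
/// The HashMap (stand-in for the provisional MPHF) is consumed on return.
fn phase1<'a>(
    superkmers: &[SuperKmer<'a>],
    k: usize,
    min_count: u32,
) -> (BTreeMap<u32, usize>, Vec<(&'a str, u32)>) {
    // Steps 1-3: enumerate kmers and accumulate superkmer counts.
    let mut counts: HashMap<&'a str, u32> = HashMap::new();
    for sk in superkmers {
        let seq: &'a str = sk.seq;
        if seq.len() < k {
            continue;
        }
        for i in 0..=seq.len() - k {
            *counts.entry(&seq[i..i + k]).or_insert(0) += sk.count;
        }
    }
    // Step 4: frequency spectrum.
    let mut spectrum = BTreeMap::new();
    for &c in counts.values() {
        *spectrum.entry(c).or_insert(0usize) += 1;
    }
    // Steps 5-6: apply the count filter; the exact key count is kept.len().
    let mut kept: Vec<(&'a str, u32)> =
        counts.into_iter().filter(|&(_, c)| c >= min_count).collect();
    kept.sort();
    (spectrum, kept)
}
```

With superkmers `ACGTA` (count 2) and `CGTAC` (count 1) and k = 4, the sidecar estimate is 4 kmers but only 3 are unique, because `CGTA` occurs in both. With `min_count = 2`, two kmers survive for phase 2.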

### Phase 2 — definitive index

Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).

---

## Candidates

**boomphf** (BBHash algorithm, maintained by 10X Genomics):

- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
- Parallel construction; well-tested with DNA kmer data at scale
- Drawback: largest space footprint; streaming construction (no exact count needed) was its main differentiator — irrelevant here since exact count is available at phase 2

**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):

- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
- Requires exact key count at construction — available at phase 2
- Drawback: published February 2025 — very young, no production track record

**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

- ~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
- More established than ptr_hash; actively maintained
- Works well with overestimated capacity → natural fit for phase 1

## MPHF choice per phase

**Phase 1** (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.

**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.

boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.

---

## Space at scale

For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):

| MPHF     | bits/key | Total MPHF size |
|----------|----------|-----------------|
| boomphf  | 3.7      | ~47 GB          |
| ptr_hash | 2.4      | ~31 GB          |
| FMPHGO   | 2.1      | ~27 GB          |

For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers, i.e. roughly 1–9 MB per phase-2 MPHF at 2.4 bits/key, well within RAM.
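
The sizes above follow from keys × bits/key ÷ 8. A small helper makes the arithmetic easy to redo for other partition counts; the function name is hypothetical and GB is taken as 10⁹ bytes.

```rust
/// MPHF size in bytes for `keys` keys at `bits_per_key` bits each.
fn mphf_bytes(keys: u64, bits_per_key: f64) -> f64 {
    keys as f64 * bits_per_key / 8.0
}
```

For 1 024 × 100 M keys this gives about 47.4, 30.7 and 26.9 GB at 3.7, 2.4 and 2.1 bits/key. A single 30 M-key partition comes to roughly 8 to 9 MB at 2.1 to 2.4 bits/key.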

## On-disk and mmap considerations

All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of a few megabytes, that copy is cheap and the OS page cache keeps the files resident: strict zero-copy is a refinement, not a blocker.

No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.

---

## Multilayer index architecture

### Motivation

An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.

### Layer structure

Each layer is a self-contained unit:

```
layer_i/
  unitigs.bin     — packed 2-bit nucleotide sequences
  mphf.bin        — ptr_hash index (phase-2, exact key count)
  evidence.bin    — [(unitig_id, rank)] per MPHF slot  (see unitig_evidence.md)
  counts.bin      — [u32] per MPHF slot
```

Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:

1. For each kmer in B: query layer 0 — if found, accumulate count into `counts_0[MPHF_0(kmer)]`.
2. Collect all kmers of B not present in any existing layer → set `B \ A`.
3. Build layer 1 from `B \ A` using the standard two-phase pipeline (spectrum, filter, ptr_hash).

Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from `C \ (A ∪ B)`.
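
The incremental procedure can be sketched with each layer reduced to a kmer → count map. This is a simplification for illustration only: a `HashMap` replaces the MPHF, evidence and count files, kmers are assumed to be already canonical and packed into `u64`, the count filter of the two-phase pipeline is skipped, and `Layer` and `add_dataset` are hypothetical names.

```rust
use std::collections::HashMap;

/// Stand-in for one layer: canonical kmer (packed in a u64) -> count.
type Layer = HashMap<u64, u32>;

/// Adds a dataset of (kmer, count) pairs to the layer chain.
/// Kmers found in an existing layer have their count accumulated in place;
/// the rest form a new layer appended at the end of the chain.
fn add_dataset(layers: &mut Vec<Layer>, dataset: &[(u64, u32)]) {
    let mut novel: Layer = HashMap::new();
    for &(kmer, count) in dataset {
        // Probe existing layers in order; the first hit owns the kmer.
        if let Some(slot) = layers.iter_mut().find_map(|l| l.get_mut(&kmer)) {
            *slot += count;
        } else {
            *novel.entry(kmer).or_insert(0) += count;
        }
    }
    if !novel.is_empty() {
        layers.push(novel);
    }
}
```

Because a kmer is only placed in the new layer when every existing layer misses, the layers stay disjoint by construction.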

### Membership verification

ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(unitig_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.

This makes the evidence layer load-bearing for correctness, not only for locality.

### Query algorithm

```
fn query(kmer) → Option<count>:
    for layer in layers:
        slot = layer.mphf.query(kmer)
        if layer.evidence.decode(slot) == kmer:
            return Some(layer.counts[slot])
    return None
```
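
A runnable toy version of the pseudocode above, showing why the evidence check is needed. Everything here is a stand-in chosen for illustration: the `mphf` method is not ptr_hash (it returns the stored slot for member keys and an arbitrary valid slot for any other key, which mimics how an MPHF behaves on absent keys), and `keys` replaces decoding `(unitig_id, rank)` against the unitig file.

```rust
use std::collections::HashMap;

/// Toy layer with stand-ins for mphf.bin, evidence.bin and counts.bin.
struct Layer {
    slots: HashMap<u64, usize>, // stand-in for the MPHF on member keys
    keys: Vec<u64>,             // stand-in for evidence decode: slot -> kmer
    counts: Vec<u32>,           // count per slot
}

impl Layer {
    fn new(entries: &[(u64, u32)]) -> Self {
        let slots = entries.iter().enumerate().map(|(i, &(k, _))| (k, i)).collect();
        let keys = entries.iter().map(|&(k, _)| k).collect();
        let counts = entries.iter().map(|&(_, c)| c).collect();
        Layer { slots, keys, counts }
    }

    /// Always returns a valid slot, even for keys that are not in the layer.
    /// Must not be called on an empty layer.
    fn mphf(&self, kmer: u64) -> usize {
        match self.slots.get(&kmer) {
            Some(&slot) => slot,
            None => (kmer % self.keys.len() as u64) as usize,
        }
    }
}

/// Probes the layers in order; the evidence comparison rejects false hits.
fn query(layers: &[Layer], kmer: u64) -> Option<u32> {
    for layer in layers {
        if layer.keys.is_empty() {
            continue;
        }
        let slot = layer.mphf(kmer);
        if layer.keys[slot] == kmer {
            return Some(layer.counts[slot]);
        }
    }
    None
}
```

With layer 0 holding kmers 10 and 20 and layer 1 holding kmer 30, a query for 30 lands on a valid slot in layer 0, fails the comparison there, and is answered by layer 1. A query for 40 fails the comparison in both layers and returns `None`.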

Expected probe depth is 1 for kmers present in layer 0 and grows for kmers that only appear in later layers. In practice, the dominant (largest) dataset should be layer 0 to minimise average probe depth.

### Layer count and probe cost

Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For a chain of n layers, the worst case is a kmer absent from every layer: n failed probes, then `None`. In practice n is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.
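
To compare layer orderings, the average number of probes per query can be estimated from the fraction of queries each layer answers. The sketch below is a back-of-the-envelope model, not a measurement; `expected_probes` and `hit_fraction` are hypothetical names, and queries that match no layer are assumed to probe every layer.

```rust
/// Average number of layer probes per query.
/// `hit_fraction[i]` is the fraction of queries answered by layer i;
/// the remaining fraction is absent from all layers and probes each one.
fn expected_probes(hit_fraction: &[f64]) -> f64 {
    let layer_count = hit_fraction.len() as f64;
    let absent = 1.0 - hit_fraction.iter().sum::<f64>();
    // A hit in layer i costs i + 1 probes.
    let hits: f64 = hit_fraction
        .iter()
        .enumerate()
        .map(|(i, f)| (i as f64 + 1.0) * f)
        .sum();
    hits + absent * layer_count
}
```

If one layer answers 80% of queries and another 10%, putting the dominant layer first gives 1.2 probes on average, against 1.9 with the order reversed.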

### Merging layers

Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.

## Open questions

- Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.
- **rkyv integration**: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to `rkyv::Archive`, since they have fixed-size element types and no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples ≈ 1.25 GB per partition, zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether the ptr_hash maintainers are willing to adopt rkyv upstream.
- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
- Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.