# MPHF selection — two-phase indexing architecture

## Why two phases are needed

Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.

### Phase 1 — provisional MPHF + kmer spectrum

Implemented in `obikpartitionner::KmerPartition::count_kmer()`.

1. **Pass 1**: read the dereplicated superkmer file; enumerate all unique canonical kmers into a `HashSet`. Exact count known after this pass.
2. **Build a provisional MPHF** (`GOFunction` from the `ph` crate) over the exact kmer set. Produces `mphf1.bin`.
3. **Create `counts1.bin`**: one zero-initialised `u32` per MPHF slot (mmap'd).
4. **Pass 2**: re-read the dereplicated file; for each kmer, query `mphf1.get(kmer)` and atomically accumulate the superkmer count into `counts1[slot]`.
5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.
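
The five steps above can be sketched in plain Rust. This is a minimal illustration, not the code of `count_kmer()`: `RankMphf` is a hypothetical stand-in for `GOFunction` (the rank of a key in a sorted, deduplicated vector is itself a minimal perfect hash), `SuperKmer` and `count_and_spectrum` are illustrative names, counts live in a `Vec` rather than an mmap'd file, and accumulation is sequential rather than atomic.

```rust
use std::collections::{BTreeMap, HashSet};

/// Stand-in for the provisional MPHF: the rank of a key in a sorted,
/// deduplicated vector is a minimal perfect hash over that exact key set.
struct RankMphf {
    keys: Vec<u64>,
}

impl RankMphf {
    fn build(set: &HashSet<u64>) -> Self {
        let mut keys: Vec<u64> = set.iter().copied().collect();
        keys.sort_unstable();
        RankMphf { keys }
    }

    /// Slot of `kmer` in `0..n`, or `None` for a key outside the build set.
    fn get(&self, kmer: u64) -> Option<usize> {
        self.keys.binary_search(&kmer).ok()
    }
}

/// One dereplicated superkmer: its canonical kmers and its occurrence count.
struct SuperKmer {
    kmers: Vec<u64>,
    count: u32,
}

/// Returns (spectrum {count -> n_kmers}, f0, f1).
fn count_and_spectrum(superkmers: &[SuperKmer]) -> (BTreeMap<u32, u64>, u64, u64) {
    // Pass 1: enumerate the exact set of unique canonical kmers.
    let unique: HashSet<u64> = superkmers
        .iter()
        .flat_map(|s| s.kmers.iter().copied())
        .collect();

    // Build the provisional MPHF and one zeroed counter per slot.
    let mphf = RankMphf::build(&unique);
    let mut counts = vec![0u32; unique.len()];

    // Pass 2: add each superkmer's count to the slot of every kmer it holds.
    for s in superkmers {
        for &kmer in &s.kmers {
            let slot = mphf.get(kmer).expect("kmer was seen in pass 1");
            counts[slot] = counts[slot].saturating_add(s.count);
        }
    }

    // Spectrum: histogram of counts, plus f0 (distinct) and f1 (abundance).
    let mut spectrum = BTreeMap::new();
    let mut f1 = 0u64;
    for &c in &counts {
        *spectrum.entry(c).or_insert(0u64) += 1;
        f1 += u64::from(c);
    }
    (spectrum, counts.len() as u64, f1)
}
```

For two superkmers holding kmers `{1, 2, 3}` with count 2 and `{3, 4}` with count 5, the sketch yields f0 = 4, f1 = 16 and the spectrum `{2 → 2, 5 → 1, 7 → 1}`.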

Files produced per partition:

```
part_XXXXX/
  mphf1.bin               — GOFunction (provisional MPHF, discarded after phase 2)
  counts1.bin             — [u32; n_kmers] kmer counts, mmap'd
  kmer_spectrum_raw.json  — local frequency spectrum
```

### Phase 2 — definitive MPHF

After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.

`MphfLayer::build` is called on the unitig file:

1. **Pass 1**: iterate all canonical kmers from `unitigs.bin` in parallel, build and store `mphf.bin` (ptr_hash).
2. **Pass 2**: iterate sequentially, fill `evidence.bin`, call the mode-specific `fill_slot` callback.

`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.

---

## MPHF candidates

**boomphf** (BBHash algorithm, maintained by 10X Genomics):

- ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
- Supports streaming construction (no exact count needed)
- Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available

**ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):

- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64) and fastest construction (≥3.1×)
- Requires exact key count at construction — available at both phases after pass 1
- Published February 2025, so the crate is young; this risk is accepted given the performance profile and the fact that each MPHF is independently rebuildable from its unitig file

**FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

- ~2.1 bits/key — most compact; good query speed; deterministic construction
- Works well from an exact or slightly overestimated count
- `GOFunction` (group-oriented variant) is the specific type used

## MPHF choice per phase

**Phase 1** (provisional, deletable once phase 2 completes): `ph::fmph::GOFunction`. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary because the structure is only queried during pass 2 of `count_kmer`.

**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.

boomphf is eliminated: largest space overhead, streaming advantage does not apply.

---

## Space at scale

For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):

| MPHF     | bits/key | Total MPHF size |
|----------|----------|-----------------|
| boomphf  | 3.7      | ~47 GB          |
| ptr_hash | 2.4      | ~31 GB          |
| FMPH     | 2.1      | ~27 GB          |

For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers, i.e. roughly 1–9 MB per phase-2 MPHF at 2.4 bits/key, well within RAM.
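
The figures in this section follow from `keys × bits/key ÷ 8`. A small sketch of that arithmetic, using decimal units (1 GB = 10^9 bytes); the function names are illustrative and not part of the codebase:

```rust
/// Size in bytes of an MPHF over `n_keys` keys at `bits_per_key`.
fn mphf_bytes(n_keys: u64, bits_per_key: f64) -> f64 {
    n_keys as f64 * bits_per_key / 8.0
}

/// Total MPHF size in decimal gigabytes across all partitions.
fn total_gb(partitions: u64, kmers_per_partition: u64, bits_per_key: f64) -> f64 {
    mphf_bytes(partitions * kmers_per_partition, bits_per_key) / 1e9
}
```

`total_gb(1024, 100_000_000, 2.4)` evaluates to about 30.7, matching the ptr_hash row of the table. For a single partition, `mphf_bytes(3_000_000, 2.4)` and `mphf_bytes(30_000_000, 2.4)` give about 0.9 MB and 9 MB.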

---

## ptr_hash configuration (phase 2)

```rust
type Mphf = PtrHash<
    u64,                              // key: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;
```

**Hasher — `Xx64`**: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly; `Xx64` (XXH3-64, seeded) handles structured input correctly.
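
To make the structural zeros concrete, here is a minimal sketch of a left-aligned 2-bit packing. The base codes (A=0, C=1, G=2, T=3) and both function names are assumptions for illustration; the real encoder may differ, but any left-aligned 2-bit layout leaves `64 - 2k` low bits at zero.

```rust
/// Pack a kmer over ACGT (1 <= k <= 32) into a left-aligned u64, 2 bits per base.
/// The codes A=0, C=1, G=2, T=3 are an assumption made for this sketch.
fn encode_left_aligned(kmer: &str) -> Option<u64> {
    let k = kmer.len();
    if k == 0 || k > 32 {
        return None;
    }
    let mut v = 0u64;
    for b in kmer.bytes() {
        let code = match b {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        v = (v << 2) | code;
    }
    // Move the 2k used bits to the top of the word; the rest stays zero.
    Some(v << (64 - 2 * k))
}

/// Number of low bits that are zero for every kmer of length `k` (k <= 32).
fn structural_zero_bits(k: usize) -> u32 {
    64u32.saturating_sub(2 * k as u32)
}
```

With this layout every k=11 key is a multiple of 2^42, so a hasher that does not mix high bits into low bits sees very little variation in the low part of the word.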

**Bucket function — `CubicEps`**: λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time — not chosen.

**Remap — `CachelineEfVec`**: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). One cacheline per query; space win dominates at billion-scale key counts.

---

## Multilayer index architecture

### Layer structure

Each layer is a self-contained unit. See [obilayeredmap](obilayeredmap.md) for the full on-disk layout. The MPHF-relevant files are:

```
layer_i/
  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence)
  mphf.bin         — ptr_hash phase-2 MPHF
  evidence.bin     — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
```
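
A sketch of how one evidence entry could be packed into and unpacked from a `u32`. The bit order is an assumption (chunk_id in the high 25 bits, rank in the low 7), and the constant and function names are illustrative; see the on-disk layout reference for the authoritative format.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 127
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one u32; `None` if either field overflows.
/// Assumed layout: chunk_id in bits 31..7, rank in bits 6..0.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (chunk_id, rank).
fn unpack_evidence(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MASK)
}
```

Seven rank bits address up to 128 kmer positions inside a chunk, and 25 chunk bits address up to about 33.5 million chunks per layer.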

Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:

1. For each kmer in B: probe existing layers. If found, the kmer is already indexed.
2. Collect kmers of B not present in any layer → set `B \ A`.
3. Build layer 1 from `B \ A` (dereplicate → count → De Bruijn → unitigs → `MphfLayer::build`).
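
Steps 1 and 2 amount to a set difference against every existing layer. A minimal sketch, in which each layer is modelled as a `HashSet<u64>` in place of an MPHF plus evidence lookup, and `new_layer_keys` is a hypothetical helper name:

```rust
use std::collections::HashSet;

/// Kmers of `dataset` that no existing layer contains, deduplicated and sorted.
/// These are the keys the next layer would be built from.
fn new_layer_keys(layers: &[HashSet<u64>], dataset: &[u64]) -> Vec<u64> {
    let mut fresh: Vec<u64> = dataset
        .iter()
        .copied()
        .filter(|kmer| !layers.iter().any(|layer| layer.contains(kmer)))
        .collect();
    fresh.sort_unstable();
    fresh.dedup();
    fresh
}
```

Because only kmers absent from all layers are kept, the new layer is disjoint from every earlier one by construction.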

### Membership verification

ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(chunk_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.

### Query algorithm

```
fn query(kmer) → Option<(layer_index, slot)>:
    for (i, layer) in layers.iter().enumerate():
        slot = layer.mphf.index(kmer)
        if layer.evidence.decode(slot) matches kmer:
            return Some((i, slot))
    return None
```

Expected probe depth: 1 for kmers in layer 0. A kmer stored in layer i costs i + 1 probes, and an absent kmer probes every layer. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.
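
The pseudocode above can be turned into a small runnable model. This is a sketch under stated assumptions, not the actual index code: `Layer` is a hypothetical stand-in in which a sorted key vector plays the role of the MPHF, absent keys are deliberately mapped to an arbitrary valid slot (as a real MPHF would do), and `evidence` stores the kmer itself instead of a `(chunk_id, rank)` reference into `unitigs.bin`.

```rust
/// Stand-in for one layer of the index.
struct Layer {
    keys: Vec<u64>,     // sorted, unique; rank in this vector acts as the MPHF
    evidence: Vec<u64>, // per slot: the kmer stored there
}

impl Layer {
    fn build(mut keys: Vec<u64>) -> Self {
        keys.sort_unstable();
        keys.dedup();
        let evidence = keys.clone();
        Layer { keys, evidence }
    }

    /// Like a real MPHF, returns a valid slot even for keys outside the set.
    /// Must not be called on an empty layer.
    fn index(&self, kmer: u64) -> usize {
        match self.keys.binary_search(&kmer) {
            Ok(slot) => slot,
            Err(_) => (kmer % self.keys.len() as u64) as usize,
        }
    }
}

/// Probe layers in order; the evidence comparison rejects false hits.
fn query(layers: &[Layer], kmer: u64) -> Option<(usize, usize)> {
    for (i, layer) in layers.iter().enumerate() {
        if layer.keys.is_empty() {
            continue;
        }
        let slot = layer.index(kmer);
        if layer.evidence[slot] == kmer {
            return Some((i, slot));
        }
    }
    None
}
```

With layers built from `[10, 20, 30]` and `[40, 50]`, `query(&layers, 20)` returns `Some((0, 1))`, `query(&layers, 40)` returns `Some((1, 0))` after one rejected probe in layer 0, and `query(&layers, 99)` returns `None` after probing both layers.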

### Merging layers

Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.