docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -1,57 +1,68 @@
 # MPHF selection — two-phase indexing architecture

-## Indexing architecture
+## Why two phases are needed

-Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.
+Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until the kmers have been counted and the low-abundance ones filtered out.

-### Superkmer vs kmer counts
+### Phase 1 — provisional MPHF + kmer spectrum

-The `SKFileMeta` sidecar written by `SKFileWriter` records `instances` (unique superkmers) and `length_sum` (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as `length_sum − instances × (k − 1)`. This is an **overestimate** of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.
+Implemented in `obikpartitionner::KmerPartition::count_kmer()`.

-Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.
+1. **Pass 1**: read the dereplicated superkmer file; enumerate all unique canonical kmers into a `HashSet`. The exact count is known after this pass.
+2. **Build a provisional MPHF** (`GOFunction` from the `ph` crate) over the exact kmer set. Produces `mphf1.bin`.
+3. **Create `counts1.bin`**: one zero-initialised `u32` per MPHF slot (mmap'd).
+4. **Pass 2**: re-read the dereplicated file; for each kmer, query `mphf1.get(kmer)` and atomically accumulate the superkmer count into `counts1[slot]`.
+5. **Build kmer frequency spectrum** from `counts1`: histogram `{count → n_kmers}`, totals f0 (distinct kmers) and f1 (total abundance). Written to `kmer_spectrum_raw.json` per partition, then merged globally.
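
The five steps above can be sketched in standard-library Rust. This is an illustration only, not the code in `count_kmer()`: a `HashMap` stands in for the provisional `GOFunction`, a `Vec<u32>` stands in for the mmap'd `counts1.bin`, accumulation is sequential rather than atomic, and `SuperKmer`, `assign_slots`, `accumulate`, `spectrum` and `totals` are hypothetical names.

```rust
use std::collections::{BTreeMap, HashMap};

/// A dereplicated superkmer: its canonical kmers plus the number of times
/// the superkmer was observed. Hypothetical type, for illustration only.
pub struct SuperKmer {
    pub kmers: Vec<u64>,
    pub count: u32,
}

/// Pass 1: enumerate the unique kmers and give each one a slot in 0..n.
/// The HashMap stands in for the provisional MPHF.
pub fn assign_slots(sks: &[SuperKmer]) -> HashMap<u64, usize> {
    let mut slots = HashMap::new();
    for sk in sks {
        for &kmer in &sk.kmers {
            let next = slots.len();
            slots.entry(kmer).or_insert(next);
        }
    }
    slots
}

/// Pass 2: add each superkmer's count to the slot of every kmer it contains.
pub fn accumulate(sks: &[SuperKmer], slots: &HashMap<u64, usize>) -> Vec<u32> {
    let mut counts = vec![0u32; slots.len()];
    for sk in sks {
        for kmer in &sk.kmers {
            let slot = slots[kmer];
            counts[slot] = counts[slot].saturating_add(sk.count);
        }
    }
    counts
}

/// Spectrum: histogram {count -> number of kmers with that count}.
pub fn spectrum(counts: &[u32]) -> BTreeMap<u32, u64> {
    let mut hist = BTreeMap::new();
    for &c in counts {
        *hist.entry(c).or_insert(0u64) += 1;
    }
    hist
}

/// Returns (f0, f1): the number of distinct kmers and the total abundance.
pub fn totals(hist: &BTreeMap<u32, u64>) -> (u64, u64) {
    let f0: u64 = hist.values().sum();
    let f1: u64 = hist.iter().map(|(&c, &n)| c as u64 * n).sum();
    (f0, f1)
}
```

Because the key set is fixed after pass 1, pass 2 never inserts; it only looks up slots. That property is what lets the real implementation use a minimal perfect hash instead of a general hash map.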

-### Phase 1 — provisional index and spectrum
+Files produced per partition:

-1. Enumerate all kmers from the dereplicated superkmers of the partition.
-2. Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).
-3. Accumulate counts: for each kmer in each superkmer, `count[MPHF(kmer)] += sk.count()`.
-4. Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).
-5. Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.
-6. Discard the provisional MPHF.
+```
+part_XXXXX/
+  mphf1.bin               — GOFunction (provisional MPHF, discarded after phase 2)
+  counts1.bin             — [u32; n_kmers] kmer counts, mmap'd
+  kmer_spectrum_raw.json  — local frequency spectrum
+```

-### Phase 2 — definitive index
+### Phase 2 — definitive MPHF

-Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).
+After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
+
+`MphfLayer::build` is called on the unitig file:
+
+1. **Pass 1**: iterate over all canonical kmers from `unitigs.bin` in parallel, then build and store `mphf.bin` (ptr_hash).
+2. **Pass 2**: iterate sequentially, fill `evidence.bin`, and call the mode-specific `fill_slot` callback.
+
+`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.

 ---

-## Candidates
+## MPHF candidates

 **boomphf** (BBHash algorithm, maintained by 10X Genomics):

 - ~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)
- Parallel construction; well-tested with DNA kmer data at scale
- Drawback: largest space footprint; streaming construction (no exact count needed) was its main differentiator — irrelevant here since exact count is available at phase 2
+- Supports streaming construction (no exact count needed)
+- Drawback: largest space footprint; streaming advantage is irrelevant at phase 2 since the exact count is available

 **ptr_hash** (PtrHash algorithm, Groot Koerkamp, SEA 2025):

- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)
- Requires exact key count at construction — available at phase 2
- Drawback: published February 2025 — very young, no production track record
+- ~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64) and fastest construction (≥3.1×)
+- Requires exact key count at construction — available at both phases after pass 1
+- Published February 2025, so the crate is young; accepted given its performance profile and because each MPHF can be rebuilt independently from its unitig file

-**FMPHGO** (`ph` crate, Beling, ACM JEA 2023):
+**FMPH/FMPHGO** (`ph` crate, Beling, ACM JEA 2023):

- ~2.1 bits/key — most compact of the three; good query speed; parallelisable construction
- More established than ptr_hash; actively maintained
- Works well with overestimated capacity → natural fit for phase 1
+- ~2.1 bits/key — most compact; good query speed; deterministic construction
+- Works well from an exact or slightly overestimated count
+- `GOFunction` (group-oriented variant) is the specific type used

 ## MPHF choice per phase

-**Phase 1** (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.
+**Phase 1** (provisional, discarded after spectrum computation): `ph::fmph::GOFunction`. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of `count_kmer`.

-**Phase 2** (persistent, queried repeatedly): **ptr_hash**. Exact key count is available at phase 2, so ptr_hash operates optimally. Its query speed (≥2.1× over FMPHGO) and construction speed (≥3.1×) are meaningful for the persistent index; the space overhead at 2.4 bits/key is acceptable. The crate's youth (Feb 2025) was previously a concern; it is now accepted given the performance profile and the fact that each layer MPHF is independently rebuildable from its unitig file if needed.
+**Phase 2** (persistent, queried repeatedly): **ptr_hash**. The exact key count is available from the unitig index; ptr_hash's query speed (≥2.1× over FMPH) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.

-boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.
+boomphf is eliminated: it has the largest space overhead, and its streaming advantage does not apply here.

 ---

@@ -63,74 +74,68 @@ For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):
 |----------|----------|-----------------|
 | boomphf  | 3.7      | ~47 GB          |
 | ptr_hash | 2.4      | ~31 GB          |
-| FMPHGO   | 2.1      | ~27 GB          |
+| FMPH     | 2.1      | ~27 GB          |

 For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.
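
The footprint figures above follow from keys × bits/key ÷ 8. A minimal helper to reproduce them, assuming decimal units (1 GB = 10^9 bytes); `mphf_bytes` is a hypothetical name:

```rust
/// Estimated MPHF size in bytes for `n_keys` keys at `bits_per_key`.
pub fn mphf_bytes(n_keys: u64, bits_per_key: f64) -> f64 {
    n_keys as f64 * bits_per_key / 8.0
}
```

For example, `mphf_bytes(1024 * 100_000_000, 2.4)` is about 30.7 × 10^9 bytes, matching the ptr_hash row. For a single partition of 30 M keys, 2.4 bits/key gives about 9 MB, while the 8 MB upper bound quoted above corresponds to roughly 2.1 bits/key.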

-## On-disk and mmap considerations
+---

-All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the `ph` crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.
+## ptr_hash configuration (phase 2)

-No established Rust crate provides a natively on-disk MPHF. **SSHash** (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.
+```rust
+type Mphf = PtrHash<
+    u64,                              // key: canonical kmer raw encoding
+    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
+    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
+    Xx64,                             // hasher: XXH3-64 with seed
+    Vec<u8>,                          // pilots
+>;
+```
+
+**Hasher — `Xx64`**: canonical kmer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly; `Xx64` (XXH3-64, seeded) handles structured input correctly.
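
The structural zeros can be made concrete with a small sketch. It assumes a left-aligned 2-bit encoding with A=0, C=1, G=2, T=3; that mapping and the names `low_zero_bits` and `encode_left_aligned` are illustrative and not taken from the codebase.

```rust
/// Number of low bits that are always zero when a kmer of length `k`
/// (1 <= k <= 32) is stored left-aligned in a u64 at 2 bits per base.
pub fn low_zero_bits(k: u32) -> u32 {
    64 - 2 * k
}

/// Left-aligned 2-bit encoding of a kmer given as ASCII bases.
/// Returns None for an empty input, more than 32 bases, or a non-ACGT byte.
/// The A=0, C=1, G=2, T=3 mapping is an assumption made for this sketch.
pub fn encode_left_aligned(kmer: &[u8]) -> Option<u64> {
    if kmer.is_empty() || kmer.len() > 32 {
        return None;
    }
    let mut v: u64 = 0;
    for &b in kmer {
        let code = match b {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        v = (v << 2) | code;
    }
    // Shift the packed bases up so the first base occupies the top two bits.
    Some(v << (64 - 2 * kmer.len() as u32))
}
```

For k=11 every encoded key has its low 42 bits clear, so all keys are multiples of 2^42. A hasher that does not mix the high bits into the low bits will therefore see very little variation in the low part of its input.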
+
+**Bucket function — `CubicEps`**: λ=3.5, α=0.99. A balanced tradeoff: construction is about 2× slower than with `Linear`/λ=3.0, in exchange for 20% less space. `default_compact` (λ=4.0) would save a further 12.5% of space at the cost of another 2× in construction time; it was not chosen.
+
+**Remap — `CachelineEfVec`**: Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec<u32>`). One cacheline per query; space win dominates at billion-scale key counts.

 ---

 ## Multilayer index architecture

-### Motivation
-
-An index built from a single dataset A can be extended with a new dataset B without rebuilding. This supports incremental construction (adding species, samples, or sequencing runs) and enables set operations across heterogeneous sources.
-
 ### Layer structure

-Each layer is a self-contained unit:
+Each layer is a self-contained unit. See [obilayeredmap](obilayeredmap.md) for the full on-disk layout. The MPHF-relevant files are:

 ```
 layer_i/
-  unitigs.bin     — packed 2-bit nucleotide sequences
-  mphf.bin        — ptr_hash index (phase-2, exact key count)
-  evidence.bin    — [(unitig_id, rank)] per MPHF slot  (see unitig_evidence.md)
-  counts.bin      — [u32] per MPHF slot
+  unitigs.bin      — packed 2-bit nucleotide sequences (kmer evidence)
+  mphf.bin         — ptr_hash phase-2 MPHF
+  evidence.bin     — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
 ```
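
One way to pack and unpack an `evidence.bin` entry is sketched below. The field widths come from the layout above; placing `chunk_id` in the high 25 bits and `rank` in the low 7 bits is an assumption, as are the names `pack_evidence` and `unpack_evidence`.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 127
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Packs (chunk_id, rank) into one u32, or returns None if either field
/// does not fit in its bit width. Bit order is an assumption of this sketch.
pub fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Splits a packed entry back into (chunk_id, rank).
pub fn unpack_evidence(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MASK)
}
```

With fixed-width `u32` entries, the entry for slot `s` sits at byte offset `4 * s`, which is what gives the flat file its O(1) random access.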

-Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B proceeds as follows:
+Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:

-1. For each kmer in B: query layer 0 — if found, accumulate count into `counts_0[MPHF_0(kmer)]`.
-2. Collect all kmers of B not present in any existing layer → set `B \ A`.
-3. Build layer 1 from `B \ A` using the standard two-phase pipeline (spectrum, filter, ptr_hash).
-
-Adding a third dataset C repeats the process: probe layer 0, then layer 1, then build layer 2 from `C \ A \ B`.
+1. For each kmer in B: probe existing layers. If found, the kmer is already indexed.
+2. Collect kmers of B not present in any layer → set `B \ A`.
+3. Build layer 1 from `B \ A` (dereplicate → count → De Bruijn → unitigs → `MphfLayer::build`).
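
Step 2 is a set difference against every existing layer. A minimal sketch, using a `HashSet<u64>` per layer as a stand-in for the real MPHF-plus-evidence probe; `novel_kmers` is a hypothetical name:

```rust
use std::collections::HashSet;

/// Returns the kmers of `incoming` that are absent from every existing layer.
/// Each HashSet stands in for one layer's membership test.
pub fn novel_kmers(layers: &[HashSet<u64>], incoming: &[u64]) -> HashSet<u64> {
    incoming
        .iter()
        .copied()
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        .collect()
}
```

The returned set is disjoint from all existing layers by construction, which preserves the invariant that a canonical kmer belongs to exactly one layer.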

 ### Membership verification

-ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(unitig_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
-
-This makes the evidence layer load-bearing for correctness, not only for locality.
+ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(chunk_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.

 ### Query algorithm

 ```
-fn query(kmer) → Option<count>:
-    for layer in layers:
-        slot = layer.mphf.query(kmer)
-        if layer.evidence.decode(slot) == kmer:
-            return Some(layer.counts[slot])
+fn query(kmer) → Option<(layer_index, slot)>:
+    for (i, layer) in layers.iter().enumerate():
+        slot = layer.mphf.index(kmer)
+        if layer.evidence.decode(slot) matches kmer:
+            return Some((i, slot))
    return None
 ```

-Expected probe depth: 1 for kmers present in layer 0, increasing for rare kmers added in later layers. In practice, the dominant dataset (largest A) should be layer 0 to minimise average probe depth.
-
-### Layer count and probe cost
-
-Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode (two array accesses). For L layers the worst case is L probes + 1 None. In practice L is small (2–5 for typical multi-species databases). No global data structure is needed to route queries; the layer chain is traversed in order.
+Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.

 ### Merging layers

-Two layer chains can be merged by re-indexing their union through the standard pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.
-
-## Open questions
-
- Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.
- **rkyv integration**: all flat arrays in a layer (evidence, counts, presence/absence matrix) map trivially to `rkyv::Archive` — fixed-size element types, no heap indirection. The presence/absence matrix is the strongest case: at 10 M kmers × 1 000 samples ≈ 1.25 GB per partition, zero-copy mmap via rkyv avoids loading the entire matrix at open time, letting the OS page cache serve only accessed pages. ptr_hash itself is internally a flat bit array and is structurally compatible with rkyv, but requires either native crate support or a wrapper. Assess the wrapper cost and whether ptr_hash is willing to adopt rkyv upstream.
- Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.
- Determine optimal layer ordering heuristic (by kmer count? by query frequency?) for multi-species databases.
+Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.