docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -134,28 +134,28 @@ Output: `unitigs.bin` — the permanent evidence structure for the partition. Ea

 ## Phase 6 — MPHF construction and index finalisation

-Built once on the definitive kmer set (all kmers in all unitigs of the partition):
+Built once on the definitive kmer set (all kmers in all unitigs of the partition). See [obilayeredmap](obilayeredmap.md) and [MPHF selection](mphf.md) for the current implementation.

 ```
 kmers from unitigs → MPHF → mphf.bin
-                   → counts.bin : packed n-bit array (or 1-bit for presence mode)
-                   → refs.bin   : u32 nucleotide offset into unitigs.bin per kmer
+                   → evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
+                   → payload      : counts/ (mode 2) or presence/ (mode 3)
 ```

-The MPHF is built once — no rebuild. The n-bit width for `counts.bin` is chosen from the observed count distribution (n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts exceeding 2ⁿ−1 go into `overflow.bin` as sorted `(mphf_index: u32, count: u32)` pairs.
+The index is built in two passes over `unitigs.bin`: a parallel pass that produces `mphf.bin`, then a sequential pass that writes `evidence.bin` and the payload. The exact kmer count is available from the unitig index (`unitigs.bin.idx`) before the passes begin.

 **Exact verification via unitig evidence:**

-`unitigs.bin` serves as the evidence structure: for any query kmer, the stored unitig provides the ground truth to confirm or deny its presence. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back is the only way to guarantee exactness.
+`unitigs.bin` serves as the evidence structure. The MPHF maps every input, including absent kmers, to a slot in `[0, N)`, so the unitig read-back (via `evidence.bin`) is the only correct membership test.

 ```
 query kmer q
-  → canonical_minimizer(q) → hash → PART → part_XXXX/
-  → MPHF(q) → index i
-  → refs[i] = (unitig_id, kmer_offset)
-  → read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
-  → match   : return counts[i]   ← exact hit
-  → no match: kmer absent        ← MPHF collision on absent kmer
+  → canonical_minimizer(q) → hash → PART → part_XXXXX/
+  → MPHF(q) → slot s
+  → evidence[s] = (chunk_id, rank)
+  → read k nucleotides at rank in unitigs[chunk_id] → compare with q
+  → match   : return payload[s]   ← exact hit
+  → no match: kmer absent         ← MPHF collision on absent kmer
 ```

-One random disk access into `unitigs.bin` per query; the unitig is the minimal, non-redundant evidence (each kmer stored once). `superkmers.bin.gz` is no longer needed at this point and can be deleted.
+`superkmers.bin.gz` is no longer needed at this point and can be deleted.
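
Reviewer note: the lookup path described in the new Phase 6 text can be sketched as follows. This is a minimal illustration, not the project's implementation. It assumes that `chunk_id` occupies the high 25 bits and `rank` the low 7 bits of each `u32`, that `rank` is the kmer's start offset within its chunk, and that chunks are plain strings rather than packed nucleotides. The names `pack_evidence`, `unpack_evidence`, `toy_mphf`, and `lookup` are hypothetical, and canonicalisation and partition routing are left out.

```python
RANK_BITS = 7                      # low bits: position of the kmer inside its chunk
CHUNK_BITS = 25                    # high bits: chunk identifier (assumed bit order)
RANK_MASK = (1 << RANK_BITS) - 1


def pack_evidence(chunk_id: int, rank: int) -> int:
    """Pack (chunk_id, rank) into one 32-bit evidence word."""
    if not 0 <= chunk_id < (1 << CHUNK_BITS):
        raise ValueError("chunk_id does not fit in 25 bits")
    if not 0 <= rank <= RANK_MASK:
        raise ValueError("rank does not fit in 7 bits")
    return (chunk_id << RANK_BITS) | rank


def unpack_evidence(word: int) -> tuple[int, int]:
    """Split a 32-bit evidence word back into (chunk_id, rank)."""
    return word >> RANK_BITS, word & RANK_MASK


# Toy partition: two chunks stored as plain strings, k = 4.
K = 4
UNITIGS = ["ACGTAC", "TTGCA"]
KMERS = [(c, r, seq[r:r + K])
         for c, seq in enumerate(UNITIGS)
         for r in range(len(seq) - K + 1)]
SLOT_OF = {kmer: slot for slot, (_, _, kmer) in enumerate(KMERS)}
EVIDENCE = [pack_evidence(c, r) for c, r, _ in KMERS]
PAYLOAD = [3, 1, 4, 1, 5]          # hypothetical per-slot counts


def toy_mphf(q: str) -> int:
    """Stand-in for a real MPHF: absent kmers still land on some slot in [0, N)."""
    return SLOT_OF.get(q, sum(map(ord, q)) % len(KMERS))


def lookup(q: str):
    """Return the payload for q, or None if the read-back does not match."""
    s = toy_mphf(q)
    chunk_id, rank = unpack_evidence(EVIDENCE[s])
    stored = UNITIGS[chunk_id][rank:rank + K]
    return PAYLOAD[s] if stored == q else None
```

An absent kmer such as `"AAAA"` still receives a slot from `toy_mphf`, but the comparison against the stored kmer fails and `lookup` returns `None`. That is the case the read-back is there to catch.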