docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
@@ -134,28 +134,28 @@ Output: `unitigs.bin` — the permanent evidence structure for the partition. Ea
## Phase 6 — MPHF construction and index finalisation

Built once on the definitive kmer set (all kmers in all unitigs of the partition):

Built once on the definitive kmer set (all kmers in all unitigs of the partition). See [obilayeredmap](obilayeredmap.md) and [MPHF selection](mphf.md) for the current implementation.

```
kmers from unitigs → MPHF → mphf.bin
→ counts.bin : packed n-bit array (or 1-bit for presence mode)
→ refs.bin : u32 nucleotide offset into unitigs.bin per kmer
→ evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
→ payload : counts/ (mode 2) or presence/ (mode 3)
```
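
The `evidence.bin` entry layout listed above (`chunk_id: 25 bits | rank: 7 bits` in one `u32`) can be illustrated with a small packing sketch. Two things here are assumptions, not taken from the documentation: that `chunk_id` occupies the high 25 bits and `rank` the low 7, and that entries are stored little-endian. The helper names are hypothetical.

```python
import struct

CHUNK_BITS = 25                      # from the documented layout
RANK_BITS = 7                        # from the documented layout
RANK_MASK = (1 << RANK_BITS) - 1
ENTRY = struct.Struct("<I")          # assumption: little-endian u32 on disk


def pack_evidence(chunk_id, rank):
    """Pack (chunk_id, rank) into one u32; chunk_id in the high bits (assumed)."""
    if not 0 <= chunk_id < (1 << CHUNK_BITS):
        raise ValueError("chunk_id does not fit in 25 bits")
    if not 0 <= rank <= RANK_MASK:
        raise ValueError("rank does not fit in 7 bits")
    return (chunk_id << RANK_BITS) | rank


def unpack_evidence(entry):
    """Inverse of pack_evidence: return (chunk_id, rank)."""
    return entry >> RANK_BITS, entry & RANK_MASK


def read_evidence(buf, slot):
    """O(1) random access: entry for MPHF slot `slot` sits at byte offset 4 * slot."""
    (entry,) = ENTRY.unpack_from(buf, slot * ENTRY.size)
    return unpack_evidence(entry)
```

Because every entry has the same width, the flat array needs no index: the MPHF slot alone determines the byte offset.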
The MPHF is built once — no rebuild. The n-bit width for `counts.bin` is chosen from the observed count distribution (n=5 covers ~97% of kmers at 15x; n=1 for presence mode). Counts exceeding 2ⁿ−1 go into `overflow.bin` as sorted `(mphf_index: u32, count: u32)` pairs.
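
The n-bit packing with an overflow table described in the paragraph above can be sketched as follows. This is a minimal in-memory illustration, not the on-disk format: the bit order inside each byte, the use of the saturated value 2ⁿ−1 as the "check overflow" marker, and the function names are all assumptions.

```python
import bisect


def build_counts(counts, n_bits):
    """Pack counts into an n-bit array; counts above 2**n_bits - 1 go to overflow."""
    cap = (1 << n_bits) - 1
    packed = bytearray((len(counts) * n_bits + 7) // 8)
    overflow = []                         # sorted (mphf_index, count) pairs
    for i, c in enumerate(counts):
        stored = min(c, cap)              # assumption: overflowed slots are saturated
        if c > cap:
            overflow.append((i, c))
        first_bit = i * n_bits
        for b in range(n_bits):
            if (stored >> b) & 1:
                pos = first_bit + b
                packed[pos >> 3] |= 1 << (pos & 7)
    return packed, overflow


def get_count(packed, overflow, n_bits, i):
    """Read slot i; fall back to a binary search in overflow when saturated."""
    cap = (1 << n_bits) - 1
    first_bit = i * n_bits
    value = 0
    for b in range(n_bits):
        pos = first_bit + b
        if (packed[pos >> 3] >> (pos & 7)) & 1:
            value |= 1 << b
    if value < cap:
        return value
    # A saturated slot is either exactly `cap` or an overflowed count.
    j = bisect.bisect_left(overflow, (i,))
    if j < len(overflow) and overflow[j][0] == i:
        return overflow[j][1]
    return cap
```

With n=5 the cap is 31, so only counts of 32 or more pay for an overflow lookup; all other slots are answered from the packed array alone.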

The MPHF is built in two passes over `unitigs.bin`: parallel pass for `mphf.bin`, sequential pass for `evidence.bin` and payload. The exact kmer count is available from the unitig index (`unitigs.bin.idx`) before the passes begin.

**Exact verification via unitig evidence:**

`unitigs.bin` serves as the evidence structure: for any query kmer, the stored unitig provides the ground truth to confirm or deny its presence. The MPHF maps every input to [0, N) including absent kmers — the unitig read-back is the only way to guarantee exactness.

`unitigs.bin` serves as the evidence structure. The MPHF maps every input to `[0, N)` including absent kmers — the unitig read-back (via `evidence.bin`) is the only correct membership test.

```
query kmer q
→ canonical_minimizer(q) → hash → PART → part_XXXX/
→ MPHF(q) → index i
→ refs[i] = (unitig_id, kmer_offset)
→ read unitig from unitigs.bin → extract kmer at kmer_offset → compare with q
→ match : return counts[i] ← exact hit
→ no match: kmer absent ← MPHF collision on absent kmer
→ canonical_minimizer(q) → hash → PART → part_XXXXX/
→ MPHF(q) → slot s
→ evidence[s] = (chunk_id, rank)
→ read k nucleotides at rank in unitigs[chunk_id] → compare with q
→ match : return payload[s] ← exact hit
→ no match: kmer absent ← MPHF collision on absent kmer
```
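
The `evidence`-based lookup in the block above can be sketched as a toy, in-memory example. Everything beyond the documented steps is an assumption made for illustration: the MPHF is replaced by a dictionary-backed stand-in that sends absent kmers to an arbitrary slot, unitigs are plain strings, `rank` is treated as a nucleotide offset inside the chunk, and both sides are canonicalised before comparison. Partition selection by minimizer is omitted.

```python
import zlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographic minimum of a kmer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def build_toy_index(unitigs, k):
    """Return (mphf, evidence) for every kmer of every unitig (stand-in MPHF)."""
    slots, evidence = {}, []
    for chunk_id, seq in enumerate(unitigs):
        for rank in range(len(seq) - k + 1):
            key = canonical(seq[rank:rank + k])
            if key not in slots:
                slots[key] = len(evidence)
                evidence.append((chunk_id, rank))
    n = len(evidence)

    def mphf(kmer):
        key = canonical(kmer)
        # Like a real MPHF, absent keys still land on some slot in [0, n).
        return slots.get(key, zlib.crc32(key.encode()) % n)

    return mphf, evidence


def lookup(q, mphf, evidence, unitigs, payload, k):
    """Return payload[s] on an exact hit, or None when the kmer is absent."""
    s = mphf(q)
    chunk_id, rank = evidence[s]
    stored = unitigs[chunk_id][rank:rank + k]   # read k nucleotides at rank
    if canonical(stored) == canonical(q):
        return payload[s]
    return None                                 # MPHF collision on absent kmer
```

The read-back comparison is what turns the MPHF from a slot assigner into an exact membership test: without it, an absent kmer would silently return some other kmer's payload.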

One random disk access into `unitigs.bin` per query; the unitig is the minimal, non-redundant evidence (each kmer stored once). `superkmers.bin.gz` is no longer needed at this point and can be deleted.

`superkmers.bin.gz` is no longer needed at this point and can be deleted.