docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 10:20:22 +08:00
parent cf693f17f2
commit f36b095ce2
17 changed files with 916 additions and 1031 deletions
@@ -191,35 +191,38 @@ Strategy B partially decouples evidence cost from P: `log₂(U) = log₂(P/m_u)`

 ## Implementation notes

-### Evidence file layout (strategy B)
+### Evidence file layout (strategy B — implemented)
+
+`evidence.bin` is a flat `[u32; n]` array with no header, where `n` is the number of MPHF slots (one per kmer):

 ```
-evidence.bin
-├── header    : k (u8), n_kmers (u64), n_unitigs (u64)
-├── id_array  : n_kmers × ⌈log₂ n_unitigs⌉ bits  — MPHF slot → unitig_id
-└── rank_array: n_kmers × 8 bits (u8[n_kmers])    — MPHF slot → rank within unitig
+evidence.bin: n × 4 bytes, little-endian
+  each u32:  bits [31:7] = chunk_id (25 bits)
+             bits [6:0]  = rank     (7 bits)
 ```

-`id_array` is a compact bit-packed vector (width = ⌈log₂ n_unitigs⌉; 19 bits for *B. nana* at 256 partitions). `rank_array` is a plain `u8` array — no bit-packing needed. Access is O(1) with a single multiplication and mask for `id_array`, and a direct byte index for `rank_array`.
+File size = `n × 4` bytes exactly. Decoding a slot: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
+
+The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not achieved: each slot occupies an aligned u32, i.e. 32 bits, which is 5 bits above the theoretical cost. The u32 layout is chosen for simplicity and alignment: one word per slot, no bit-addressing arithmetic.
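A minimal sketch of the slot packing described above, for illustration only. The function names (`encode_slot`, `decode_slot`, `read_slot`) are hypothetical and not taken from the codebase; only the bit layout (25-bit `chunk_id` in the high bits, 7-bit `rank` in the low bits, little-endian u32) comes from the text.

```rust
/// Number of low bits holding the rank within a chunk.
const RANK_BITS: u32 = 7;
/// Mask selecting the rank bits (0x7F).
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;
/// Largest chunk_id that fits in the remaining 25 bits.
const MAX_CHUNK_ID: u32 = (1 << (32 - RANK_BITS)) - 1;

/// Pack (chunk_id, rank) into one u32; returns None if either field overflows.
fn encode_slot(chunk_id: u32, rank: u8) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || u32::from(rank) > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | u32::from(rank))
}

/// Unpack a raw u32 into (chunk_id, rank).
fn decode_slot(raw: u32) -> (u32, u8) {
    (raw >> RANK_BITS, (raw & RANK_MASK) as u8)
}

/// Read slot `s` from the raw bytes of evidence.bin (headerless little-endian u32 array).
/// Returns None if the slot lies outside the buffer.
fn read_slot(evidence: &[u8], s: usize) -> Option<(u32, u8)> {
    let start = s.checked_mul(4)?;
    let b = evidence.get(start..start.checked_add(4)?)?;
    let raw = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    Some(decode_slot(raw))
}
```

Because the file has no header, the byte offset of slot `s` is simply `4 * s`, which is what makes the lookup a single indexed read.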

 ### Unitig file layout

-FASTA with JSON annotation header (xxHash-64 ID, seq_length, kmer_size, n_kmers). The nucleotide sequence is stored in ASCII uppercase; a 2-bit packed version is derived at query time or stored as a parallel `.2bit` file for speed.
-
-```
->c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}
-ACGTGGCTA...
-```
+Binary packed 2-bit nucleotide file (`unitigs.bin`) with a companion index (`unitigs.bin.idx`). The index stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
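As a rough illustration of how the index fields listed above could be read, here is a sketch of a parser over an in-memory copy of `unitigs.bin.idx`. It assumes the fields are stored back to back in the listed order, little-endian, with no padding; the text does not state the index's endianness or alignment, so both are assumptions. The `UnitigIndex` type and its method names are hypothetical.

```rust
/// In-memory view of unitigs.bin.idx (hypothetical type, for illustration).
struct UnitigIndex {
    n_kmers: u64,
    seqls: Vec<u8>,           // kmer count - 1 per chunk
    packed_offsets: Vec<u32>, // n_unitigs + 1 byte offsets into unitigs.bin
}

/// Read a little-endian u32 from the first four bytes of `b`.
fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Parse the index; returns None on a bad magic or a truncated buffer.
/// ASSUMPTION: little-endian fields, tightly packed in the documented order.
fn parse_index(buf: &[u8]) -> Option<UnitigIndex> {
    if buf.get(0..4)? != &b"UIDX"[..] {
        return None;
    }
    let n_unitigs = le_u32(buf.get(4..8)?) as usize;
    let mut k8 = [0u8; 8];
    k8.copy_from_slice(buf.get(8..16)?);
    let n_kmers = u64::from_le_bytes(k8);

    let seqls_end = 16usize.checked_add(n_unitigs)?;
    let seqls = buf.get(16..seqls_end)?.to_vec();

    let offs_len = n_unitigs.checked_add(1)?.checked_mul(4)?;
    let offs_end = seqls_end.checked_add(offs_len)?;
    let packed_offsets: Vec<u32> = buf
        .get(seqls_end..offs_end)?
        .chunks_exact(4)
        .map(le_u32)
        .collect();

    Some(UnitigIndex { n_kmers, seqls, packed_offsets })
}

impl UnitigIndex {
    /// Byte range [start, end) of unitig `i` inside unitigs.bin.
    fn byte_range(&self, i: usize) -> Option<(usize, usize)> {
        let start = *self.packed_offsets.get(i)? as usize;
        let end = *self.packed_offsets.get(i.checked_add(1)?)? as usize;
        Some((start, end))
    }

    /// Number of kmers in unitig `i` (the stored value is count - 1).
    fn kmers_in(&self, i: usize) -> Option<u32> {
        self.seqls.get(i).map(|&l| u32::from(l) + 1)
    }
}
```

The sentinel entry at the end of `packed_offsets` is what lets `byte_range` use the same two-lookup path for the last unitig as for any other.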

 ### Decoding a kmer from slot s

 ```
-unitig_id = id_array[s]
-rank      = rank_array[s]
-kmer      = nucleotides(unitig_id)[rank .. rank + k]   // 2-bit packed slice
+(chunk_id, rank) = evidence.decode(s)          // u32 → (u25, u7)
+kmer = unitigs.raw_kmer(chunk_id, rank)        // 2-bit packed slice, k nucleotides
 ```

-One array lookup per field, then a packed slice extraction. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the graph).
+Two memory accesses: one into `evidence.bin`, one into `unitigs.bin`. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).
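The packed slice extraction in the second step could look roughly like the sketch below. The text does not specify the 2-bit encoding, so this assumes A=0, C=1, G=2, T=3 with four nucleotides per byte and the first nucleotide in the most significant bits. The signature of `raw_kmer` here is a hypothetical free-function form, not the actual method, and `kmer_to_string` is an illustrative helper.

```rust
/// Extract the k-mer starting at nucleotide `rank` of the unitig whose packed
/// bytes begin at `unitig_start` in `packed` (the contents of unitigs.bin).
/// ASSUMPTION: A=0, C=1, G=2, T=3; 4 nucleotides per byte, MSB first.
/// The k-mer is returned as a u64 with the first nucleotide in the high bits,
/// so k must be at most 32.
/// Note: this only checks the buffer bounds; a caller would still need to
/// check `rank` against the unitig's kmer count from the index.
fn raw_kmer(packed: &[u8], unitig_start: usize, rank: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for i in rank..rank.checked_add(k)? {
        let byte = *packed.get(unitig_start.checked_add(i / 4)?)?;
        let shift = 6 - 2 * (i % 4); // position of nucleotide i within its byte
        kmer = (kmer << 2) | u64::from((byte >> shift) & 0b11);
    }
    Some(kmer)
}

/// Render a packed k-mer as an ASCII string (debugging helper).
fn kmer_to_string(kmer: u64, k: usize) -> String {
    (0..k)
        .rev()
        .map(|i| b"ACGT"[((kmer >> (2 * i)) & 3) as usize] as char)
        .collect()
}
```

For example, under the assumed encoding the sequence `ACGTGGCTA` packs to the bytes `0x1B, 0xA7, 0x00`, and extracting rank 2 with k = 4 reads the nucleotides `GTGG` across the first byte boundary.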
+
+### Field widths in practice
+
+Rank is stored in 7 bits (0–127), so a chunk can hold up to 128 kmers. On *B. nana* (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk, well within that limit. A single superkmer contributes at most k − m + 1 = 21 kmers; longer unitigs arise when a path spans multiple superkmers. u7 is sufficient.
+
+chunk_id is stored in 25 bits (0–33 554 431). For *B. nana* at 256 partitions, avg U ≈ 275 k — well within the 25-bit capacity.

 ### Forward vs reverse complement

@@ -264,6 +267,4 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv

 ## Open questions

- **Rank field width**: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On *B. nana* (k=31), m_u ≈ 38 — well within u8 range on average, but the maximum unitig length has not been measured yet. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.
- **Packed nucleotide cache**: storing a 2-bit packed nucleotide array alongside the FASTA avoids re-encoding at query time; negligible space overhead ($N_{nuc} / 4$ bytes per partition).
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m.
+- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.