docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
@@ -191,35 +191,38 @@ Strategy B partially decouples evidence cost from P: `log₂(U) = log₂(P/m_u)`
 
 ## Implementation notes
 
-### Evidence file layout (strategy B)
+### Evidence file layout (strategy B — implemented)
 
+`evidence.bin` is a flat `[u32; n]` array with no header:
+
 ```
-evidence.bin
-├── header : k (u8), n_kmers (u64), n_unitigs (u64)
-├── id_array : n_kmers × ⌈log₂ n_unitigs⌉ bits — MPHF slot → unitig_id
-└── rank_array: n_kmers × 8 bits (u8[n_kmers]) — MPHF slot → rank within unitig
+evidence.bin: n × 4 bytes, little-endian
+each u32: bits [31:7] = chunk_id (25 bits)
+bits [6:0] = rank (7 bits)
 ```
 
-`id_array` is a compact bit-packed vector (width = ⌈log₂ n_unitigs⌉; 19 bits for *B. nana* at 256 partitions). `rank_array` is a plain `u8` array — no bit-packing needed. Access is O(1) with a single multiplication and mask for `id_array`, and a direct byte index for `rank_array`.
+File size = `n × 4` bytes exactly. Decoding a slot: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
+
+The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not recovered: packing into aligned u32 costs 32 bits per slot. The u32 layout is chosen for simplicity and alignment — one word per slot, no bit-addressing arithmetic.
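The slot layout described above (25-bit `chunk_id` in the high bits, 7-bit `rank` in the low bits, little-endian u32 words, no header) can be sketched as follows. This is a minimal illustration, not the project's code: the function names `encode_slot`, `decode_slot`, and `read_slot` are hypothetical, and the file is assumed to be already loaded into a byte slice.

```rust
/// Number of low bits holding the rank within a chunk (from the layout above).
const RANK_BITS: u32 = 7;
/// Mask selecting the rank field: 0x7F.
const RANK_MASK: u32 = (1 << RANK_BITS) - 1;
/// Largest chunk_id that fits in the remaining 25 bits.
const MAX_CHUNK_ID: u32 = (1 << (32 - RANK_BITS)) - 1;

/// Pack (chunk_id, rank) into one u32 word.
/// Returns None if either field does not fit its bit width.
fn encode_slot(chunk_id: u32, rank: u8) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || u32::from(rank) > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | u32::from(rank))
}

/// Unpack one u32 word into (chunk_id, rank).
fn decode_slot(raw: u32) -> (u32, u8) {
    (raw >> RANK_BITS, (raw & RANK_MASK) as u8)
}

/// Read slot `s` from the raw bytes of evidence.bin.
/// Assumes the whole file is in `bytes`; returns None if `s` is out of range.
fn read_slot(bytes: &[u8], s: usize) -> Option<(u32, u8)> {
    let start = s.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let w = bytes.get(start..end)?;
    let raw = u32::from_le_bytes([w[0], w[1], w[2], w[3]]);
    Some(decode_slot(raw))
}
```

Because every slot is one aligned word, slot `s` always lives at byte offset `4 * s`, which is what makes the lookup O(1) without any bit-addressing arithmetic.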
 
 ### Unitig file layout
 
-FASTA with JSON annotation header (xxHash-64 ID, seq_length, kmer_size, n_kmers). The nucleotide sequence is stored in ASCII uppercase; a 2-bit packed version is derived at query time or stored as a parallel `.2bit` file for speed.
-
-```
->c4a1e7f2 {"seq_length":87,"kmer_size":31,"n_kmers":57}
-ACGTGGCTA...
-```
+Binary packed 2-bit nucleotide file (`unitigs.bin`) with a companion index (`unitigs.bin.idx`). The index stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
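To make the index layout concrete, here is a hedged sketch of a parser for `unitigs.bin.idx`. The field order follows the description above; the byte order (little-endian) and the absence of padding between fields are assumptions, since the text does not state them. The names `UnitigIndex`, `take`, `parse_index`, and `locate` are hypothetical.

```rust
use std::ops::Range;

/// In-memory view of unitigs.bin.idx (field names follow the description above).
#[derive(Debug, PartialEq)]
struct UnitigIndex {
    n_kmers: u64,
    seqls: Vec<u8>,           // kmer count - 1 per chunk
    packed_offsets: Vec<u32>, // n_unitigs + 1 byte offsets; the last one is the sentinel
}

/// Split `n` bytes off the front of `buf`, or return None if it is too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let data: &'a [u8] = *buf;
    if data.len() < n {
        return None;
    }
    let (head, tail) = data.split_at(n);
    *buf = tail;
    Some(head)
}

/// Parse the index. Assumption: little-endian integers, fields stored back to back.
fn parse_index(mut buf: &[u8]) -> Option<UnitigIndex> {
    if take(&mut buf, 4)? != b"UIDX" {
        return None;
    }
    let n = take(&mut buf, 4)?;
    let n_unitigs = u32::from_le_bytes([n[0], n[1], n[2], n[3]]) as usize;
    let mut k8 = [0u8; 8];
    k8.copy_from_slice(take(&mut buf, 8)?);
    let n_kmers = u64::from_le_bytes(k8);
    let seqls = take(&mut buf, n_unitigs)?.to_vec();
    let raw = take(&mut buf, n_unitigs.checked_add(1)?.checked_mul(4)?)?;
    let packed_offsets = raw
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Some(UnitigIndex { n_kmers, seqls, packed_offsets })
}

impl UnitigIndex {
    /// Byte range of chunk `id` inside unitigs.bin, plus its kmer count.
    fn locate(&self, id: usize) -> Option<(Range<usize>, usize)> {
        let start = *self.packed_offsets.get(id)? as usize;
        let end = *self.packed_offsets.get(id + 1)? as usize;
        let n_kmers = *self.seqls.get(id)? as usize + 1;
        Some((start..end, n_kmers))
    }
}
```

The sentinel entry is what lets `locate` compute the end of the last chunk with the same two lookups as any other chunk.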
 
 ### Decoding a kmer from slot s
 
 ```
-unitig_id = id_array[s]
-rank = rank_array[s]
-kmer = nucleotides(unitig_id)[rank .. rank + k] // 2-bit packed slice
+(chunk_id, rank) = evidence.decode(s) // u32 → (u25, u7)
+kmer = unitigs.raw_kmer(chunk_id, rank) // 2-bit packed slice, k nucleotides
 ```
 
-One array lookup per field, then a packed slice extraction. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the graph).
+Two memory accesses: one into `evidence.bin`, one into `unitigs.bin`. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).
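The second access, extracting k nucleotides starting at offset `rank` from a chunk's 2-bit packed bytes, might look like the sketch below. The packing order (four nucleotides per byte, first nucleotide in the two highest bits) and the code assignment A=0, C=1, G=2, T=3 are assumptions for illustration; the text does not specify them. The signature differs from the pseudocode `unitigs.raw_kmer(chunk_id, rank)` in that it takes the chunk's bytes directly.

```rust
/// Read nucleotide `i` (2 bits) from a packed chunk.
/// Assumption: 4 nucleotides per byte, first nucleotide in the two highest bits.
fn nucleotide_at(packed: &[u8], i: usize) -> Option<u8> {
    let byte = *packed.get(i / 4)?;
    let shift = 6 - 2 * (i % 4);
    Some((byte >> shift) & 0b11)
}

/// Extract the kmer starting at nucleotide offset `rank` as a 2-bit packed u64.
/// Only supports k <= 32; returns None if the slice runs past the chunk bytes.
/// The caller is expected to bound `rank` by the chunk's kmer count, because
/// padding bits in the last byte are not distinguishable from real nucleotides.
fn raw_kmer(packed: &[u8], rank: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for i in rank..rank + k {
        kmer = (kmer << 2) | u64::from(nucleotide_at(packed, i)?);
    }
    Some(kmer)
}

/// Render a packed kmer as text. Assumption: A=0, C=1, G=2, T=3.
fn kmer_to_string(kmer: u64, k: usize) -> String {
    (0..k)
        .rev()
        .map(|j| b"ACGT"[((kmer >> (2 * j)) & 3) as usize] as char)
        .collect()
}
```

Consecutive kmers of a chunk overlap by k − 1 nucleotides, so the kmer of rank r is simply the window of k nucleotides starting at nucleotide r; no per-kmer storage is needed.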
 
+### Field widths in practice
+
+Rank is stored in 7 bits (0–127). On *B. nana* (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers per unitig; longer paths arise across multiple superkmers. u7 is sufficient.
+
+chunk_id is stored in 25 bits (0–33 554 431). For *B. nana* at 256 partitions, avg U ≈ 275 k — well within the 25-bit capacity.
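The width arithmetic above can be checked mechanically. The sketch below computes the number of bits needed to index `n` distinct values and tests a partition against the 25 + 7 bit split; `bits_needed` and `fits_u32_layout` are hypothetical helper names, not part of the described implementation.

```rust
/// Smallest number of bits able to represent every value in 0..n.
/// For n <= 1 no bits are needed.
fn bits_needed(n: u64) -> u32 {
    if n <= 1 {
        0
    } else {
        64 - (n - 1).leading_zeros()
    }
}

/// True if a partition with `n_chunks` chunks and at most
/// `max_kmers_per_chunk` kmers in any chunk fits the 25-bit chunk_id
/// and 7-bit rank fields of one u32 slot.
fn fits_u32_layout(n_chunks: u64, max_kmers_per_chunk: u64) -> bool {
    bits_needed(n_chunks) <= 25 && bits_needed(max_kmers_per_chunk) <= 7
}
```

With the figures quoted above, about 275 000 chunks need 19 bits and 46 kmers per chunk need 6 bits, so both fields have headroom. Since rank values run from 0 to 127, a 7-bit rank can address a chunk of up to 128 kmers.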
+
 ### Forward vs reverse complement
 
@@ -264,6 +267,4 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv
 
 ## Open questions
 
-- **Rank field width**: u8 covers 255 kmers; storing lengths and ranks in kmer units (not nucleotides) buys k−1 extra units of headroom at no cost. On *B. nana* (k=31), m_u ≈ 38 — well within u8 range on average, but the maximum unitig length has not been measured yet. For genomes with very long unitigs, u16 may be needed; the header could record the actual width if portability is required.
-- **Packed nucleotide cache**: storing a 2-bit packed nucleotide array alongside the FASTA avoids re-encoding at query time; negligible space overhead ($N_{nuc} / 4$ bytes per partition).
-- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m.
+- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.