da56c3e290
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
181 lines
8.8 KiB
Markdown
181 lines
8.8 KiB
Markdown
# Unitig-based MPHF evidence encoding
|
||
|
||
## Role of unitigs in the index
|
||
|
||
The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The **evidence file** supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.
|
||
|
||
Unitigs are the natural compact representation: a run of L nucleotides encodes L − k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.
|
||
|
||
---
|
||
|
||
## Binary file formats
|
||
|
||
### `unitigs.bin` — sequence chunks
|
||
|
||
A sequence of binary records. Each record:
|
||
|
||
```
|
||
[u8: seql − k] [ceil(seql / 4) bytes: 2-bit packed nucleotides]
|
||
```
|
||
|
||
- `seql − k` (0–255): nucleotide length minus k, so `seql = byte[0] + k` and `n_kmers = byte[0] + 1`.
|
||
- Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
|
||
- Byte count for packed sequence: `ceil(seql / 4)`.
|
||
|
||
Unitigs with more than `MAX_KMERS_PER_CHUNK = 256` k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= `seql − k + 1 ≤ 256`); consecutive chunks overlap by k−1 nucleotides so no kmer is lost:
|
||
|
||
```
|
||
chunk 1: nucleotides [0, MAX_KMERS_PER_CHUNK + k − 2] (256 kmers)
|
||
chunk 2: nucleotides [256, end] (remaining kmers)
|
||
overlap: k−1 nucleotides shared between the two chunks
|
||
```
|
||
|
||
### `unitigs.bin.idx` — block-sampled offset index
|
||
|
||
```
|
||
magic : 4 bytes = "UIX3"
|
||
block_bits: u32 LE — granularity parameter (0–31)
|
||
n_unitigs : u32 LE — total number of chunks in unitigs.bin
|
||
n_kmers : u64 LE — total number of kmers across all chunks
|
||
offsets : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel
|
||
```
|
||
|
||
One offset entry is stored every `2^block_bits` chunks; the array is sentinel-terminated (last entry = file size). `DEFAULT_BLOCK_BITS = 0` stores one offset per chunk (exact table, no scan).
|
||
|
||
### `evidence.bin` — per-slot MPHF evidence
|
||
|
||
A flat array of u32 values, one per MPHF slot, no header:
|
||
|
||
```
|
||
bits [31:7] = chunk_id (25 bits)
|
||
bits [6:0] = rank (7 bits, 0–127)
|
||
```
|
||
|
||
File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record in `unitigs.bin`; `rank` is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: `raw = (chunk_id << 7) | (rank & 0x7F)`. Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
|
||
|
||
---
|
||
|
||
## Building and reading the index
|
||
|
||
### `build_unitig_idx(path, block_bits)`
|
||
|
||
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) − 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
|
||
|
||
### `open()` vs `open_sequential()`
|
||
|
||
`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
|
||
|
||
`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
|
||
|
||
### `chunk_start(i)` — random access
|
||
|
||
```rust
|
||
fn chunk_start(&self, i: usize) -> usize {
|
||
// block_bits=0: single table lookup, O(1) — hot path
|
||
if self.block_bits == 0 {
|
||
return self.block_offsets[i] as usize;
|
||
}
|
||
// block_bits>0: lookup block, then scan at most 2^block_bits − 1 records
|
||
let block = i >> self.block_bits;
|
||
let rem = i & self.mask;
|
||
let mut offset = self.block_offsets[block] as usize;
|
||
for _ in 0..rem {
|
||
let seql_minus_k = self.mmap[offset] as usize;
|
||
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
||
}
|
||
offset
|
||
}
|
||
```
|
||
|
||
With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
|
||
|
||
With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
|
||
|
||
### Decoding a kmer from slot `s`
|
||
|
||
```rust
|
||
let (chunk_id, rank) = evidence.decode(s); // u32 → (chunk_id: u32, rank: u8)
|
||
let kmer = unitigs.raw_kmer(chunk_id, rank); // 2-bit packed slice → left-aligned u64
|
||
```
|
||
|
||
Two memory accesses: one 4-byte read from `evidence.bin`, one packed-bit extraction from `unitigs.bin` via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).
|
||
|
||
---
|
||
|
||
## Field widths and capacity
|
||
|
||
| field | bits | range | capacity check (*B. nana*, 256 partitions) |
|
||
|------------|------|---------------|---------------------------------------------|
|
||
| `seql − k` | 8 | 0–255 | max `n_kmers` per chunk = 256 = `MAX_KMERS_PER_CHUNK` |
|
||
| `rank` | 7 | 0–127 | observed max ~46 kmers/chunk; structural max k−m+1 = 21 |
|
||
| `chunk_id` | 25 | 0–33 554 431 | avg U ≈ 275 k chunks/partition |
|
||
|
||
The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.
|
||
|
||
---
|
||
|
||
## Evidence bit-cost
|
||
|
||
Strategy B (chunk_id + rank) is the implemented strategy. For *B. nana* (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):
|
||
|
||
| field | theoretical cost | value |
|
||
|------------|-------------------------|---------|
|
||
| chunk_id | ⌈log₂ U⌉ | 19 bits |
|
||
| rank | ⌈log₂ m_u⌉ (≈ fixed) | 6 bits |
|
||
| **stored** | aligned u32 | **32 bits/slot** |
|
||
|
||
The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.
|
||
|
||
Comparison with strategy A (global nucleotide offset): `⌈log₂(P · (1 + (k−1)/m_u))⌉ = 25 bits`. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is **locality** (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.
|
||
|
||
---
|
||
|
||
## Unitig decomposition non-determinism
|
||
|
||
The unitig extraction from `GraphDeBruijn` is **not deterministic**: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.
|
||
|
||
The hash map (`hashbrown::HashMap` with `Xxh3Builder`) has run-dependent iteration order. The `start_iter` first pass emits every node where `can_extend_left` is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.
|
||
|
||
**Example** — fork topology (k = 31):
|
||
|
||
```
|
||
A → B ← C
|
||
↓
|
||
D
|
||
```
|
||
|
||
B has two left neighbours, so `can_extend_left = false`. Two valid tilings:
|
||
|
||
| iteration order | unitigs | count |
|
||
|---|---|---|
|
||
| A first | ABD, C | 2 |
|
||
| B first | BD, A, C | 3 |
|
||
|
||
Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).
|
||
|
||
This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.
|
||
|
||
---
|
||
|
||
## Partition-size tradeoff
|
||
|
||
Measured on *B. nana* (k=31, m=11), summing across all partitions:
|
||
|
||
| N partitions | m_u |
|
||
|---|---|
|
||
| 1 | 41.89 |
|
||
| 16 | 38.19 |
|
||
| 256 | 37.90 |
|
||
| 1 024 | 37.89 |
|
||
|
||
`m_u` is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 16–1024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.
|
||
|
||
Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via `log₂ U = log₂(P/m_u)`). The sequence storage term `2 · (1 + (k−1)/m_u) ≈ 3.6 bits/kmer` is approximately constant.
|
||
|
||
---
|
||
|
||
## Open questions
|
||
|
||
- **Cross-partition set operations**: strategy B allows unitig-level operations (mark entire chunks present/absent) rather than kmer-level, reducing cost by a factor of m_u.
|
||
- **Eliminating evidence.bin**: at ~66% of per-layer lookup footprint, `evidence.bin` dominates index size. See [Evidence elimination design discussion](evidence_elimination.md).
|