Files
obikmer/docmd/implementation/evidence_elimination.md
T
Eric Coissac 4a5ab0b8c2 feat: optimize unitig index and document evidence elimination
Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.
2026-05-23 12:53:42 +02:00

322 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Evidence elimination — design discussion
## Problem statement
`evidence.bin` maps each MPHF slot to a position in the unitig store so that
query verification is possible: given a slot `s` returned by `mphf.index(kmer)`,
retrieve the k-mer stored at that position and compare with the query.
On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition):
| file | size/partition | total (2048 parts) | fraction of lookup layer |
|---|---|---|---|
| evidence.bin | 132 MB | ~270 GB | **66 %** |
| unitigs.bin | 58 MB | ~118 GB | 29 % |
| mphf.bin | 10 MB | ~20 GB | 5 % |
Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage
optimisation available for index size.
---
## Why evidence exists
PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`.
For a query k-mer not in the indexed set, the returned slot is meaningless but
indistinguishable from a real hit without external information. Evidence provides
that information: `evidence[s]` encodes the location of the k-mer that legitimately
occupies slot `s`, allowing the verification:
```
slot = mphf.index(query)
(chunk_id, rank) = evidence.decode(slot)
stored_kmer = unitigs.kmer_at(chunk_id, rank)
return canonical(stored_kmer) == canonical(query)
```
Evidence is a **permutation** from MPHF-space to unitig-position-space.
Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding.
---
## Information-theoretic lower bound
For a partition with P k-mers and U unitigs of average length m_u k-mers:
- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits
- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits
Current implementation: 25 + 7 = 32 bits (aligned u32).
Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M.
**Packing headroom: ~22 %.** Not a path to elimination.
---
## Two-index architecture
The exact index is mandatory for set operations (union, intersection, diff) and
exact k-mer retrieval. A separate approximate index, built for query operations,
can tolerate a controlled false positive rate in exchange for a much smaller
footprint.
| component | exact index | approximate index |
|---|---|---|
| `mphf.bin` | ✓ | ✓ (same structure) |
| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ |
| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) |
| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) |
| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) |
The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps
`unitigs.bin` for sequential enumeration of K-mers.
---
## MPHF as a perfect Bloom filter
A standard Bloom filter with a single hash function and N bits storing M keys has
occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of
landing on a set bit. The empty space (fraction 1 M/N of bits at 0) is what
rejects foreign k-mers.
An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer
occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any**
input to a slot in [0, M) — there is no empty space. Every query lands on an
occupied slot. The MPHF alone cannot reject foreign k-mers at all.
Adding a B-bit fingerprint restores the discrimination:
```
slot = mphf.index(query)
fingerprint = hash(query) & mask_B
present = fingerprint_table[slot] == fingerprint
```
The fingerprint plays the role of the sparse space in the Bloom filter: it provides
the B bits of information needed to reject foreign k-mers.
Both structures reach the same fundamental cost for a given FP rate. For 1% FP:
- Bloom filter (optimal, k hash functions): ~9.6 bits/key
- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key
This is a fundamental bound, not an implementation detail.
---
## Approach A — MPHF + fingerprint (approximate index)
### Size
| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) |
|---|---|---|
| 8 | 33 MB | 4× smaller |
| 12 | 49 MB | 2.7× smaller |
| 16 | 66 MB | 2× smaller |
Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer).
### False positive rate — per k-mer query
For a specific non-indexed query k-mer q:
1. MPHF(q) → slot s, some value in [0, M)
2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s
3. FP event: hash(q) & mask_B == fingerprint_table[s]
Since q is not the legitimate k-mer at s, its fingerprint is independent of
fingerprint_table[s], giving:
```
P(FP per k-mer) = 1 / 2^B
```
This is the probability of error **for one specific query k-mer**. It is not the
fraction of the k-mer universe that would be misclassified: querying all 4^k
possible k-mers would yield (4^k M)/2^B false positives in absolute terms, but
that is not the relevant quantity for practical use.
### Equivalence classes
The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B
equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true
indexed k-mer and 4^k/(M·2^B) 1 false positives. A larger M (fewer partitions)
produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B
remains constant.
### Read-level use case
The relevant decision unit is the **read**, not the individual k-mer. For a read
of ~100 nucleotides and k=31, there are ~70 k-mers.
- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are
true positives → high coverage fraction.
- A plant read queried against a bacterial index: k-mers are foreign; each has
P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B.
A coverage threshold separates the two cases decisively. This is the same
principle as Findere: local coverage continuity distinguishes true hits from noise.
---
## Approach B — z-consecutive k-mer matching
A query for a K-mer of size K = k + z 1 decomposes into z overlapping k-mers.
Declaring a match only when **all z are present** reduces the per-window FP rate:
```
P(FP per window of z) = (1/2^B)^z = 1/2^(B·z)
```
For a read with ~70 k-mers, there are ~70 z + 1 independent windows of size z.
The probability that at least one window is a false positive:
```
P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z)
```
For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸.
A plant read is misclassified as bacterial roughly once in 60 million reads —
negligible for any practical dataset.
### Choosing B from (z, L, P_target)
z is a query-time parameter and does not affect the index structure. However,
knowing z at build time allows computing the minimum B required to reach a target
FP rate P_target for reads of length L (giving W = L k z + 2 independent
windows):
```
P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z )
```
Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**.
(B, z) are co-determined at build time to minimise fingerprint size while
guaranteeing the target read-level FP rate.
### Combined sizing
| B | z | K = k+z1 | P(FP_read) | fingerprint.bin/partition |
|---|---|---|---|---|
| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB |
| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB |
| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB |
| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB |
Smaller B → smaller fingerprint table; larger z → longer minimum match length K
and fewer independent windows per read.
---
## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index)
Build the MPHF to output the global k-mer position directly:
```
mphf: kmer → global_pos ∈ [0, P)
```
Verification becomes:
```
global_pos = mphf.index(query)
stored_kmer = unitigs.kmer_at_global_pos(global_pos)
return canonical(stored_kmer) == canonical(query)
```
No evidence array. The unitig block index (see below) provides
`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time.
### What is required
A **retrieval data structure** (also called a value-based or function-based MPHF):
given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`,
build a compact structure that maps each key to its assigned value.
Known constructions:
- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph +
XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic
minimum. Construction: O(n). Query: O(1).
- **SSHash approach**: builds the MPHF to map k-mers to their positions in a
concatenated unitig string. Achieves elimination of external evidence using a
"skew" construction that aligns the MPHF output with the sequential unitig layout.
### Rust availability
No Rust crate implements a retrieval data structure suitable for this use case as
of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain
MPHFs. **This is the key blocking factor.**
### SSHash construction (reference)
SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over
(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential
positions in the concatenated unitig string. A port to obikmer would require:
1. Concatenating all unitig sequences into a single flat buffer per partition.
2. Assigning each k-mer a global position (its offset in that buffer).
3. Building the MPHF to output that position directly (retrieval step).
4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`.
---
## Approach 2 — block index prefix sums (reduces evidence to negligible)
A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled
offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset
table.
### Extension: k-mer prefix sums per block
Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before
block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**.
This enables `kmer_at_global_pos(p)`:
1. Binary search in `kmer_prefix[]` to find block `b`.
2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`.
3. Extract the k-mer at the remaining rank within the found chunk.
Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses.
### Combined with Approach 1
- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions)
- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible)
---
## Recommended path
1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose
(B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep
`unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8
vs ~142 MB for the exact lookup layer.
2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`.
Zero cost if evidence.bin is kept; enables Approach 1 when ready.
3. **Medium term**: implement GOV retrieval data structure in Rust, or port
SSHash construction.
4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected
index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset.
---
## Open questions
- Is a GOV construction compatible with the parallel MPHF build currently used
(PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential
(hypergraph peeling); parallelisation is non-trivial.
- Can the SSHash "skew" insight be reused without the minimizer-bucket structure?
The obikmer partitioning already uses minimizers — there may be natural alignment.
- What is the query latency impact of replacing O(1) evidence lookup with
O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale.
- What is the optimal (B, z) trade-off for the approximate index given the target
read length and acceptable P(FP_read)?