4a5ab0b8c2
Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.
322 lines
12 KiB
Markdown
322 lines
12 KiB
Markdown
# Evidence elimination — design discussion
|
||
|
||
## Problem statement
|
||
|
||
`evidence.bin` maps each MPHF slot to a position in the unitig store so that
|
||
query verification is possible: given a slot `s` returned by `mphf.index(kmer)`,
|
||
retrieve the k-mer stored at that position and compare with the query.
|
||
|
||
On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition):
|
||
|
||
| file | size/partition | total (2048 parts) | fraction of lookup layer |
|
||
|---|---|---|---|
|
||
| evidence.bin | 132 MB | ~270 GB | **66 %** |
|
||
| unitigs.bin | 58 MB | ~118 GB | 29 % |
|
||
| mphf.bin | 10 MB | ~20 GB | 5 % |
|
||
|
||
Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage
|
||
optimisation available for index size.
|
||
|
||
---
|
||
|
||
## Why evidence exists
|
||
|
||
PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`.
|
||
For a query k-mer not in the indexed set, the returned slot is meaningless but
|
||
indistinguishable from a real hit without external information. Evidence provides
|
||
that information: `evidence[s]` encodes the location of the k-mer that legitimately
|
||
occupies slot `s`, allowing the verification:
|
||
|
||
```
|
||
slot = mphf.index(query)
|
||
(chunk_id, rank) = evidence.decode(slot)
|
||
stored_kmer = unitigs.kmer_at(chunk_id, rank)
|
||
return canonical(stored_kmer) == canonical(query)
|
||
```
|
||
|
||
Evidence is a **permutation** from MPHF-space to unitig-position-space.
|
||
Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding.
|
||
|
||
---
|
||
|
||
## Information-theoretic lower bound
|
||
|
||
For a partition with P k-mers and U unitigs of average length m_u k-mers:
|
||
|
||
- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits
|
||
- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits
|
||
|
||
Current implementation: 25 + 7 = 32 bits (aligned u32).
|
||
Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M.
|
||
|
||
**Packing headroom: ~22 %.** Not a path to elimination.
|
||
|
||
---
|
||
|
||
## Two-index architecture
|
||
|
||
The exact index is mandatory for set operations (union, intersection, diff) and
|
||
exact k-mer retrieval. A separate approximate index, built for query operations,
|
||
can tolerate a controlled false positive rate in exchange for a much smaller
|
||
footprint.
|
||
|
||
| component | exact index | approximate index |
|
||
|---|---|---|
|
||
| `mphf.bin` | ✓ | ✓ (same structure) |
|
||
| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ |
|
||
| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) |
|
||
| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) |
|
||
| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) |
|
||
|
||
The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps
|
||
`unitigs.bin` for sequential enumeration of K-mers.
|
||
|
||
---
|
||
|
||
## MPHF as a perfect Bloom filter
|
||
|
||
A standard Bloom filter with a single hash function and N bits storing M keys has
|
||
occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of
|
||
landing on a set bit. The empty space (fraction 1 − M/N of bits at 0) is what
|
||
rejects foreign k-mers.
|
||
|
||
An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer
|
||
occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any**
|
||
input to a slot in [0, M) — there is no empty space. Every query lands on an
|
||
occupied slot. The MPHF alone cannot reject foreign k-mers at all.
|
||
|
||
Adding a B-bit fingerprint restores the discrimination:
|
||
|
||
```
|
||
slot = mphf.index(query)
|
||
fingerprint = hash(query) & mask_B
|
||
present = fingerprint_table[slot] == fingerprint
|
||
```
|
||
|
||
The fingerprint plays the role of the sparse space in the Bloom filter: it provides
|
||
the B bits of information needed to reject foreign k-mers.
|
||
|
||
Both structures reach the same fundamental cost for a given FP rate. For 1% FP:
|
||
|
||
- Bloom filter (optimal, k hash functions): ~9.6 bits/key
|
||
- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key
|
||
|
||
This is a fundamental bound, not an implementation detail.
|
||
|
||
---
|
||
|
||
## Approach A — MPHF + fingerprint (approximate index)
|
||
|
||
### Size
|
||
|
||
| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) |
|
||
|---|---|---|
|
||
| 8 | 33 MB | 4× smaller |
|
||
| 12 | 49 MB | 2.7× smaller |
|
||
| 16 | 66 MB | 2× smaller |
|
||
|
||
Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer).
|
||
|
||
### False positive rate — per k-mer query
|
||
|
||
For a specific non-indexed query k-mer q:
|
||
|
||
1. MPHF(q) → slot s, some value in [0, M)
|
||
2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s
|
||
3. FP event: hash(q) & mask_B == fingerprint_table[s]
|
||
|
||
Since q is not the legitimate k-mer at s, its fingerprint is independent of
|
||
fingerprint_table[s], giving:
|
||
|
||
```
|
||
P(FP per k-mer) = 1 / 2^B
|
||
```
|
||
|
||
This is the probability of error **for one specific query k-mer**. It is not the
|
||
fraction of the k-mer universe that would be misclassified: querying all 4^k
|
||
possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but
|
||
that is not the relevant quantity for practical use.
|
||
|
||
### Equivalence classes
|
||
|
||
The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B
|
||
equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true
|
||
indexed k-mer and 4^k/(M·2^B) − 1 false positives. A larger M (fewer partitions)
|
||
produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B
|
||
remains constant.
|
||
|
||
### Read-level use case
|
||
|
||
The relevant decision unit is the **read**, not the individual k-mer. For a read
|
||
of ~100 nucleotides and k=31, there are ~70 k-mers.
|
||
|
||
- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are
|
||
true positives → high coverage fraction.
|
||
- A plant read queried against a bacterial index: k-mers are foreign; each has
|
||
P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B.
|
||
|
||
A coverage threshold separates the two cases decisively. This is the same
|
||
principle as Findere: local coverage continuity distinguishes true hits from noise.
|
||
|
||
---
|
||
|
||
## Approach B — z-consecutive k-mer matching
|
||
|
||
A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers.
|
||
Declaring a match only when **all z are present** reduces the per-window FP rate:
|
||
|
||
```
|
||
P(FP per window of z) = (1/2^B)^z = 1/2^(B·z)
|
||
```
|
||
|
||
For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z.
|
||
The probability that at least one window is a false positive:
|
||
|
||
```
|
||
P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z)
|
||
```
|
||
|
||
For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸.
|
||
|
||
A plant read is misclassified as bacterial roughly once in 60 million reads —
|
||
negligible for any practical dataset.
|
||
|
||
### Choosing B from (z, L, P_target)
|
||
|
||
z is a query-time parameter and does not affect the index structure. However,
|
||
knowing z at build time allows computing the minimum B required to reach a target
|
||
FP rate P_target for reads of length L (giving W = L − k − z + 2 independent
|
||
windows):
|
||
|
||
```
|
||
P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z )
|
||
```
|
||
|
||
Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**.
|
||
|
||
(B, z) are co-determined at build time to minimise fingerprint size while
|
||
guaranteeing the target read-level FP rate.
|
||
|
||
### Combined sizing
|
||
|
||
| B | z | K = k+z−1 | P(FP_read) | fingerprint.bin/partition |
|
||
|---|---|---|---|---|
|
||
| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB |
|
||
| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB |
|
||
| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB |
|
||
| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB |
|
||
|
||
Smaller B → smaller fingerprint table; larger z → longer minimum match length K
|
||
and fewer independent windows per read.
|
||
|
||
---
|
||
|
||
## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index)
|
||
|
||
Build the MPHF to output the global k-mer position directly:
|
||
|
||
```
|
||
mphf: kmer → global_pos ∈ [0, P)
|
||
```
|
||
|
||
Verification becomes:
|
||
|
||
```
|
||
global_pos = mphf.index(query)
|
||
stored_kmer = unitigs.kmer_at_global_pos(global_pos)
|
||
return canonical(stored_kmer) == canonical(query)
|
||
```
|
||
|
||
No evidence array. The unitig block index (see below) provides
|
||
`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time.
|
||
|
||
### What is required
|
||
|
||
A **retrieval data structure** (also called a value-based or function-based MPHF):
|
||
given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`,
|
||
build a compact structure that maps each key to its assigned value.
|
||
|
||
Known constructions:
|
||
|
||
- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph +
|
||
XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic
|
||
minimum. Construction: O(n). Query: O(1).
|
||
- **SSHash approach**: builds the MPHF to map k-mers to their positions in a
|
||
concatenated unitig string. Achieves elimination of external evidence using a
|
||
"skew" construction that aligns the MPHF output with the sequential unitig layout.
|
||
|
||
### Rust availability
|
||
|
||
No Rust crate implements a retrieval data structure suitable for this use case as
|
||
of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain
|
||
MPHFs. **This is the key blocking factor.**
|
||
|
||
### SSHash construction (reference)
|
||
|
||
SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over
|
||
(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential
|
||
positions in the concatenated unitig string. A port to obikmer would require:
|
||
|
||
1. Concatenating all unitig sequences into a single flat buffer per partition.
|
||
2. Assigning each k-mer a global position (its offset in that buffer).
|
||
3. Building the MPHF to output that position directly (retrieval step).
|
||
4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`.
|
||
|
||
---
|
||
|
||
## Approach 2 — block index prefix sums (reduces evidence to negligible)
|
||
|
||
A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled
|
||
offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset
|
||
table.
|
||
|
||
### Extension: k-mer prefix sums per block
|
||
|
||
Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before
|
||
block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**.
|
||
|
||
This enables `kmer_at_global_pos(p)`:
|
||
|
||
1. Binary search in `kmer_prefix[]` to find block `b`.
|
||
2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`.
|
||
3. Extract the k-mer at the remaining rank within the found chunk.
|
||
|
||
Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses.
|
||
|
||
### Combined with Approach 1
|
||
|
||
- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions)
|
||
- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible)
|
||
|
||
---
|
||
|
||
## Recommended path
|
||
|
||
1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose
|
||
(B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep
|
||
`unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8
|
||
vs ~142 MB for the exact lookup layer.
|
||
|
||
2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`.
|
||
Zero cost if evidence.bin is kept; enables Approach 1 when ready.
|
||
|
||
3. **Medium term**: implement GOV retrieval data structure in Rust, or port
|
||
SSHash construction.
|
||
|
||
4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected
|
||
index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset.
|
||
|
||
---
|
||
|
||
## Open questions
|
||
|
||
- Is a GOV construction compatible with the parallel MPHF build currently used
|
||
(PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential
|
||
(hypergraph peeling); parallelisation is non-trivial.
|
||
- Can the SSHash "skew" insight be reused without the minimizer-bucket structure?
|
||
The obikmer partitioning already uses minimizers — there may be natural alignment.
|
||
- What is the query latency impact of replacing O(1) evidence lookup with
|
||
O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale.
|
||
- What is the optimal (B, z) trade-off for the approximate index given the target
|
||
read length and acceptable P(FP_read)?
|