Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00
# Approximate evidence: fingerprint-based index
## Motivation
`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it,
enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:
| file | size/partition | fraction |
|---|---|---|
| evidence.bin | 132 MB | 66 % |
| unitigs.bin | 58 MB | 29 % |
| mphf.bin | 10 MB | 5 % |
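`evidence.bin` is a bijection from MPHF-space to unitig-position-space and
costs at minimum ⌈log₂ N⌉ bits per slot, an information-theoretic floor. For
N ≈ 33 M that is about 25 bits, against the roughly 32 bits per slot implied by
132 MB per partition, so tighter packing offers only ~22 % headroom.
Compression is not a path to elimination.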
The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a
`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set
operations still require an exact index; the approximate index targets query
workloads that can tolerate a bounded false-positive rate.
---
## The Findere model
A b-bit fingerprint stored per MPHF slot provides the discrimination that
`evidence.bin` would otherwise provide through full k-mer reconstruction.
For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint
stored at `s` belongs to the legitimate k-mer at that slot. A false positive
occurs when the query's fingerprint happens to equal the stored one, which for
a uniformly distributed hash has probability:
```
P(FP per k-mer) = 1 / 2^b
```
The Findere trick raises the effective window to z consecutive k-mers. A query
succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:
```
P(FP per z-window) = 1 / 2^(b·z)
```
The effective indexed k-mer length is `k − z + 1`: a query for a (k+z−1)-mer
decomposes into z overlapping k-mers, all of which must match.
Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
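
The two probabilities and the all-z-must-match rule can be sketched as follows. This is a minimal illustration, not the project's code: the helper names `fp_per_kmer`, `fp_per_window` and `window_hits` are hypothetical, and the formulas assume uniformly distributed, independent fingerprints.

```rust
/// FP probability of one fingerprint check with b-bit fingerprints: 1 / 2^b.
fn fp_per_kmer(b: u32) -> f64 {
    0.5f64.powi(b as i32)
}

/// FP probability of a window of z consecutive checks: 1 / 2^(b*z).
/// Assumes the z checks are independent.
fn fp_per_window(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}

/// Given the per-k-mer fingerprint outcomes along a query sequence,
/// report for each window start whether all z consecutive checks passed.
fn window_hits(kmer_hits: &[bool], z: usize) -> Vec<bool> {
    assert!(z >= 1, "z must be at least 1");
    kmer_hits
        .windows(z)
        .map(|w| w.iter().all(|&hit| hit))
        .collect()
}
```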
---
## `FingerprintVec` on disk
`fingerprint.bin` layout:
```
magic: b"FPVF" (4 bytes)
b: u8 (bits per slot, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n·b/8) bytes, Lsb0 order
```
`FingerprintVec` is memory-mapped. The match check against a query k-mer:
```rust
fn matches(&self, slot: usize, fingerprint: u64) -> bool {
    self.get(slot) == (fingerprint & self.mask)
}
```
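
The layout and the `matches` check above can be exercised with a small in-memory sketch. It is an illustration under stated assumptions rather than the actual `FingerprintVec`: it works on a plain byte slice instead of a memory map, the free functions `encode`, `get` and `matches` are hypothetical, and the mask is assumed to keep the low b bits of the hash.

```rust
const MAGIC: &[u8; 4] = b"FPVF";
const HEADER_LEN: usize = 16; // 4 magic + 1 b + 3 padding + 8 n

/// Mask keeping the low b bits (assumed truncation rule).
fn mask(b: u8) -> u64 {
    if b >= 64 { u64::MAX } else { (1u64 << b) - 1 }
}

/// Serialize fingerprints: header, then ceil(n*b/8) bytes packed Lsb0.
fn encode(b: u8, values: &[u64]) -> Vec<u8> {
    assert!((1..=64).contains(&b), "b must be in 1..=64");
    let n = values.len();
    let data_len = (n * b as usize + 7) / 8;
    let mut out = Vec::with_capacity(HEADER_LEN + data_len);
    out.extend_from_slice(MAGIC);
    out.push(b);
    out.extend_from_slice(&[0u8; 3]);
    out.extend_from_slice(&(n as u64).to_le_bytes());
    out.resize(HEADER_LEN + data_len, 0);
    for (slot, &value) in values.iter().enumerate() {
        let value = value & mask(b);
        for i in 0..b as usize {
            if (value >> i) & 1 == 1 {
                let bit = slot * b as usize + i;
                // Lsb0: bit index 0 is the least significant bit of byte 0.
                out[HEADER_LEN + bit / 8] |= 1 << (bit % 8);
            }
        }
    }
    out
}

/// Read the fingerprint stored at `slot`, or None on a bad header or range.
fn get(bytes: &[u8], slot: usize) -> Option<u64> {
    if bytes.len() < HEADER_LEN || bytes[..4] != MAGIC[..] {
        return None;
    }
    let b = bytes[4] as usize;
    let mut n_le = [0u8; 8];
    n_le.copy_from_slice(&bytes[8..16]);
    if slot as u64 >= u64::from_le_bytes(n_le) {
        return None;
    }
    let mut value = 0u64;
    for i in 0..b {
        let bit = slot * b + i;
        let byte = *bytes.get(HEADER_LEN + bit / 8)?;
        value |= (((byte >> (bit % 8)) & 1) as u64) << i;
    }
    Some(value)
}

/// Same comparison as the method above: stored bits vs masked query hash.
fn matches(bytes: &[u8], slot: usize, fingerprint: u64) -> bool {
    get(bytes, slot) == Some(fingerprint & mask(bytes[4]))
}
```

The bit-by-bit loops favour clarity over speed; a production reader would more likely load whole words and shift.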
`build_approx_evidence` iterates `unitigs.bin` sequentially, writes
`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin`
and `layer_meta.json`. No `.idx` file is produced; random access into
`unitigs.bin` is not needed.
At query time, `find_approx` in `MphfLayer` reduces to:
```rust
let slot = self.mphf.index(&kmer.raw());
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
```
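
The build pass and the lookup fit together roughly as in the toy model below. Everything here is a stand-in chosen for illustration: `ToyMphf` (rank in a sorted key set, with foreign keys folded onto an arbitrary slot) replaces the real MPHF, k-mers are plain `u64` values, and `seq_hash` is a SplitMix64-style mixer, not the hash the project uses.

```rust
/// Stand-in 64-bit mixer playing the role of `kmer.seq_hash()`.
fn seq_hash(kmer: u64) -> u64 {
    let mut x = kmer;
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Toy MPHF: members map to their rank; foreign keys still land on
/// some slot, as they would with a real MPHF.
struct ToyMphf {
    keys: Vec<u64>,
}

impl ToyMphf {
    fn new(mut keys: Vec<u64>) -> Self {
        assert!(!keys.is_empty(), "need at least one k-mer");
        keys.sort_unstable();
        keys.dedup();
        ToyMphf { keys }
    }

    fn index(&self, kmer: u64) -> usize {
        match self.keys.binary_search(&kmer) {
            Ok(rank) => rank,
            Err(_) => (kmer % self.keys.len() as u64) as usize,
        }
    }
}

struct ApproxLayer {
    mphf: ToyMphf,
    mask: u64,
    fingerprints: Vec<u64>,
}

impl ApproxLayer {
    /// One sequential sweep over the k-mers; no random access is needed.
    fn build(kmers: &[u64], b: u8) -> Self {
        assert!((1..=64).contains(&b), "b must be in 1..=64");
        let mask = if b == 64 { u64::MAX } else { (1u64 << b) - 1 };
        let mphf = ToyMphf::new(kmers.to_vec());
        let mut fingerprints = vec![0u64; mphf.keys.len()];
        for &kmer in kmers {
            fingerprints[mphf.index(kmer)] = seq_hash(kmer) & mask;
        }
        ApproxLayer { mphf, mask, fingerprints }
    }

    /// Some(slot) when the stored fingerprint agrees with the query's.
    fn find_approx(&self, kmer: u64) -> Option<usize> {
        let slot = self.mphf.index(kmer);
        if self.fingerprints[slot] == seq_hash(kmer) & self.mask {
            Some(slot)
        } else {
            None
        }
    }
}
```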
---
## `EvidenceKind` and metadata
`layer_meta.json` records which evidence bundle is present:
```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```
`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or
`find_approx` transparently. `find_exact` panics on an approximate layer;
`find_approx` panics on an exact layer — mode mixing is a programming error.
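
The dispatch and the panic-on-mismatch contract might look like the sketch below. The `Layer` type and the string return values are hypothetical placeholders for `MphfLayer` and its real lookup results; only the control flow is the point.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Hypothetical stand-in for `MphfLayer`, reduced to its evidence tag.
pub struct Layer {
    kind: EvidenceKind,
}

impl Layer {
    /// Route the lookup according to the tag read from `layer_meta.json`.
    pub fn find(&self, kmer: u64) -> &'static str {
        match self.kind {
            EvidenceKind::Exact => self.find_exact(kmer),
            EvidenceKind::Approx { .. } => self.find_approx(kmer),
        }
    }

    /// Panics when called on an approximate layer.
    pub fn find_exact(&self, _kmer: u64) -> &'static str {
        assert!(
            matches!(self.kind, EvidenceKind::Exact),
            "find_exact called on an approximate layer"
        );
        "exact"
    }

    /// Panics when called on an exact layer.
    pub fn find_approx(&self, _kmer: u64) -> &'static str {
        assert!(
            matches!(self.kind, EvidenceKind::Approx { .. }),
            "find_approx called on an exact layer"
        );
        "approx"
    }
}
```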
---
## Parameter resolution (`resolve_approx_params`)
The relation `fp = 1/2^(b·z)`, i.e. `b·z = −log₂(fp)`, lets any two of (b, z, fp)
derive the third. `resolve_approx_params` implements a 2-of-3 rule with
conservative ceiling rounding, so a derived b or z never yields an FP rate
above the requested one:
| given | derived |
|---|---|
| b, z | fp = 1/2^(b·z) |
| z, fp | b = ⌈−log₂(fp) / z⌉ |
| b, fp | z = ⌈−log₂(fp) / b⌉ |
| z only | b = 8 (default), fp derived |
| b only | z = 1 (default), fp derived |
| fp only | b = 8 (default), z derived |
| none | b = 8, z = 1, fp = 1/256 |
When all three are given, b and z are authoritative and fp is recomputed.
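
A possible shape for the rule in the table, written as a sketch: the signature, the `ApproxParams` struct and the lack of input validation are assumptions made here, not the project's API. Callers are assumed to pass `fp` in (0, 1) and non-zero b and z.

```rust
#[derive(Debug, PartialEq)]
pub struct ApproxParams {
    pub b: u8,
    pub z: u8,
    pub fp: f64,
}

const DEFAULT_B: u8 = 8;
const DEFAULT_Z: u8 = 1;

/// ceil(-log2(fp) / divisor), never below 1.
fn ceil_bits(fp: f64, divisor: u8) -> u8 {
    (-fp.log2() / divisor as f64).ceil().max(1.0) as u8
}

/// 2-of-3 resolution; fp is always recomputed from the final (b, z).
pub fn resolve_approx_params(b: Option<u8>, z: Option<u8>, fp: Option<f64>) -> ApproxParams {
    let (b, z) = match (b, z, fp) {
        // b and z are authoritative, whether or not fp was given.
        (Some(b), Some(z), _) => (b, z),
        (None, Some(z), Some(fp)) => (ceil_bits(fp, z), z),
        (Some(b), None, Some(fp)) => (b, ceil_bits(fp, b)),
        (None, Some(z), None) => (DEFAULT_B, z),
        (Some(b), None, None) => (b, DEFAULT_Z),
        (None, None, Some(fp)) => (DEFAULT_B, ceil_bits(fp, DEFAULT_B)),
        (None, None, None) => (DEFAULT_B, DEFAULT_Z),
    };
    let fp = 0.5f64.powi(b as i32 * z as i32);
    ApproxParams { b, z, fp }
}
```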
---
## CLI flags
Both `index` and `reindex` accept the same flags:
| flag | type | meaning |
|---|---|---|
| `--approx` | bool | enable fingerprint evidence |
| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot |
| `-z` | u8 | Findere z parameter |
| `--fp` | f64 | target FP rate per z-window |
| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode |
`--approx` must be set explicitly; `--evidence-bits`, `-z`, and `--fp` are
optional and resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.
---
## `reindex` command
`reindex` converts an existing index between exact and approximate evidence
in-place across all partitions and layers, running partitions in parallel via
Rayon.
Conversion to approximate (`--approx`):
- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
- Removes `evidence.bin` and `unitigs.bin.idx`.
- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`.
Conversion to exact (default, no `--approx`):
- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
- Removes `fingerprint.bin`.
- Updates `layer_meta.json` with `EvidenceKind::Exact`.
The root `index.meta` is updated with the new evidence kind on success.
`mphf.bin` and `unitigs.bin` are never modified.
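
The per-layer file changes listed above can be summarised as a small planning function. This is only a sketch of the bookkeeping: `ReindexPlan` and `reindex_plan` are hypothetical names, and the actual building, deletion and Rayon parallelism are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Files to create and files to delete in one layer directory.
#[derive(Debug, PartialEq)]
struct ReindexPlan {
    build: Vec<&'static str>,
    remove: Vec<&'static str>,
}

/// Plan the conversion towards `target`; inputs are always
/// `unitigs.bin` + `mphf.bin`, which appear in neither list.
fn reindex_plan(target: EvidenceKind) -> ReindexPlan {
    match target {
        EvidenceKind::Approx { .. } => ReindexPlan {
            build: vec!["fingerprint.bin"],
            remove: vec!["evidence.bin", "unitigs.bin.idx"],
        },
        EvidenceKind::Exact => ReindexPlan {
            build: vec!["evidence.bin", "unitigs.bin.idx"],
            remove: vec!["fingerprint.bin"],
        },
    }
}
```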
---
## `estimate` command
`estimate` is a dry-run that resolves and prints (b, z, fp) without touching
any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
additionally accepts `-k` to display the effective indexed k-mer length:
```
k (query): 31
k (indexed): 31
z: 1
evidence bits (b): 8
FP per k-mer: 3.906e-3 (1/2^8)
FP per z-window: 3.906e-3 (1/2^8)
```
This is useful for choosing parameters before committing to an index build.
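
A report in this shape could be produced along the following lines. The function `estimate_report` is hypothetical, and the indexed length is computed as k − z + 1 on the assumption that `-k` gives the query length; column alignment of the real output is not reproduced.

```rust
/// Format the resolved parameters; assumes 1 <= z <= k.
fn estimate_report(k: u32, b: u32, z: u32) -> String {
    assert!(z >= 1 && z <= k, "z must be in 1..=k");
    let fp_kmer = 0.5f64.powi(b as i32);
    let fp_window = 0.5f64.powi((b * z) as i32);
    format!(
        "k (query): {}\n\
         k (indexed): {}\n\
         z: {}\n\
         evidence bits (b): {}\n\
         FP per k-mer: {:.3e} (1/2^{})\n\
         FP per z-window: {:.3e} (1/2^{})\n",
        k,
        k - z + 1,
        z,
        b,
        fp_kmer,
        b,
        fp_window,
        b * z
    )
}
```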