# Approximate evidence: fingerprint-based index

## Motivation

`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it,
enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:

| file | size/partition | fraction |
|---|---|---|
| evidence.bin | 132 MB | 66 % |
| unitigs.bin | 58 MB | 29 % |
| mphf.bin | 10 MB | 5 % |

`evidence.bin` is a bijection from MPHF-space to unitig-position-space and
costs at minimum ⌈log₂ N⌉ bits per slot, an information-theoretic floor with
only ~22 % packing headroom. Compression is not a path to elimination.
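
As a rough check of that floor, the sketch below computes ⌈log₂ N⌉ for the ~33 M slots quoted above and compares it with the slot width implied by 132 MB per partition. The 32-bit width is inferred from those two figures (taking 1 MB as 10⁶ bytes), not read from the code, and the function names are illustrative only.

```rust
/// Minimum number of bits needed to address one of `n` positions: ceil(log2(n)).
fn floor_bits(n: u64) -> u32 {
    assert!(n > 0);
    // For n >= 2, ceil(log2(n)) equals the bit length of n - 1.
    if n == 1 { 0 } else { 64 - (n - 1).leading_zeros() }
}

/// Bits per slot implied by a file size in bytes, rounded to the nearest bit.
fn implied_bits_per_slot(bytes: u64, n: u64) -> u32 {
    ((bytes * 8) as f64 / n as f64).round() as u32
}

/// Fraction of the current slot width that ideal packing could reclaim.
fn packing_headroom(current_bits: u32, floor_bits: u32) -> f64 {
    (current_bits - floor_bits) as f64 / current_bits as f64
}
```

With N = 33 000 000 the floor is 25 bits, the implied width is 32 bits, and the headroom is 7/32, which is about 22 %.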

The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a
`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set
operations still require an exact index; the approximate index targets query
workloads that can tolerate a bounded false-positive rate.

---

## The Findere model

A b-bit fingerprint stored per MPHF slot provides the discrimination that
`evidence.bin` would otherwise provide through full k-mer reconstruction.

For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint
stored at `s` belongs to the legitimate k-mer at that slot, so a false positive
occurs only when the two fingerprints collide:

```
P(FP per k-mer) = 1 / 2^b
```

The Findere trick widens the query unit to a window of z consecutive k-mers. A query
succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:

```
P(FP per z-window) = 1 / 2^(b·z)
```

The effective indexed k-mer length is `k − z + 1`: a query k-mer
decomposes into z overlapping (k−z+1)-mers, all of which must match.
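
To make the two rates and the length relation concrete, here is a minimal sketch. It assumes the z fingerprint checks behave as independent events, which is what the per-window formula implies; the function names are hypothetical and not part of the codebase.

```rust
/// False-positive probability for one k-mer with a b-bit fingerprint: 2^-b.
fn fp_per_kmer(b: u8) -> f64 {
    2f64.powi(-(b as i32))
}

/// False-positive probability for a window of z consecutive k-mers: 2^-(b*z),
/// assuming the z fingerprint checks are independent.
fn fp_per_window(b: u8, z: u8) -> f64 {
    2f64.powi(-(b as i32 * z as i32))
}

/// Indexed k-mer length for queries of length `k_query`: k - z + 1.
/// Returns None when z is zero or exceeds the query length.
fn indexed_k(k_query: u32, z: u32) -> Option<u32> {
    if z == 0 || z > k_query { None } else { Some(k_query - z + 1) }
}
```

For example, b = 8 and z = 3 give a per-window rate of 2⁻²⁴ (about 6 × 10⁻⁸), and a 31-base query unit with z = 3 corresponds to indexed 29-mers.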

Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).

---

## `FingerprintVec` on disk

`fingerprint.bin` layout:

```
magic:   b"FPVF"  (4 bytes)
b:       u8       (bits per slot, 1..=64)
padding: [0u8; 3]
n:       u64 LE   (number of slots)
data:    packed bits, ceil(n·b/8) bytes, Lsb0 order
```
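
The layout above can be read back with a few lines of code. The sketch below is an illustration only, not the actual `FingerprintVec` implementation: it validates the 16-byte header and extracts one slot bit by bit. It assumes "Lsb0" means that stream bit i is bit (i mod 8) of byte (i div 8) and that a slot's first stream bit is the least significant bit of its value; `Header`, `parse_header`, and `get` are hypothetical names.

```rust
/// Parsed header of a `fingerprint.bin`-style buffer.
struct Header {
    b: u8,  // bits per slot
    n: u64, // number of slots
}

/// 4 (magic) + 1 (b) + 3 (padding) + 8 (n)
const HEADER_LEN: usize = 16;

fn parse_header(buf: &[u8]) -> Result<Header, String> {
    if buf.len() < HEADER_LEN {
        return Err("buffer shorter than header".into());
    }
    if &buf[0..4] != b"FPVF" {
        return Err("bad magic".into());
    }
    let b = buf[4];
    if !(1..=64).contains(&b) {
        return Err(format!("b out of range: {}", b));
    }
    let mut n_bytes = [0u8; 8];
    n_bytes.copy_from_slice(&buf[8..16]);
    let n = u64::from_le_bytes(n_bytes);
    // The payload must hold ceil(n * b / 8) bytes.
    let data_len = (n as u128 * b as u128 + 7) / 8;
    if ((buf.len() - HEADER_LEN) as u128) < data_len {
        return Err("truncated data".into());
    }
    Ok(Header { b, n })
}

/// Reads the b-bit value stored for `slot`, one bit at a time (assumed Lsb0).
fn get(data: &[u8], b: u8, slot: u64) -> u64 {
    let start = slot * b as u64;
    let mut value = 0u64;
    for j in 0..b as u64 {
        let bit = start + j;
        let byte = data[(bit / 8) as usize];
        value |= (((byte >> (bit % 8)) & 1) as u64) << j;
    }
    value
}
```

A production reader would extract whole words rather than single bits; the bitwise loop is used here only because it makes the assumed bit order explicit.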

`FingerprintVec` is memory-mapped. The match check against a query k-mer:

```rust
fn matches(&self, slot: usize, fingerprint: u64) -> bool {
    // Mask the query hash to the stored width before comparing.
    self.get(slot) == (fingerprint & self.mask)
}
```

`build_approx_evidence` iterates `unitigs.bin` sequentially, writes
`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin`
and `layer_meta.json`. No `.idx` file is produced; random access into
`unitigs.bin` is not needed.

At query time, `find_approx` in `MphfLayer`:

```rust
let slot = self.mphf.index(&kmer.raw());
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
```
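
The per-k-mer check combines with the z-window rule from the Findere section roughly as follows. This is a sketch under stated assumptions, not the project's query path: `window_hits` is a hypothetical helper, `hit` stands in for `find_approx(kmer).is_some()`, and `k` is the indexed k-mer length, so each window spans `k + z − 1` bases.

```rust
/// For each window of `z` consecutive k-mers in `seq`, reports whether every
/// k-mer in that window passed `hit`. An empty vector means the sequence is
/// too short to hold a single window.
fn window_hits<F: Fn(&[u8]) -> bool>(seq: &[u8], k: usize, z: usize, hit: F) -> Vec<bool> {
    if k == 0 || z == 0 || seq.len() < k + z - 1 {
        return Vec::new();
    }
    // One membership answer per k-mer start position.
    let per_kmer: Vec<bool> = seq.windows(k).map(|kmer| hit(kmer)).collect();
    // A window is accepted only if all z consecutive answers are true.
    per_kmer.windows(z).map(|w| w.iter().all(|&h| h)).collect()
}
```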

---

## `EvidenceKind` and metadata

`layer_meta.json` records which evidence bundle is present:

```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```

`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or
`find_approx` transparently. `find_exact` panics on an approximate layer;
`find_approx` panics on an exact layer: mode mixing is a programming error.

---

## Parameter resolution (`resolve_approx_params`)

The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third.
`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling
rounding:

| given | derived |
|---|---|
| b, z | fp = 1/2^(b·z) |
| z, fp | b = ⌈−log₂(fp) / z⌉ |
| b, fp | z = ⌈−log₂(fp) / b⌉ |
| z only | b = 8 (default), fp derived |
| b only | z = 1 (default), fp derived |
| fp only | b = 8 (default), z derived |
| none | b = 8, z = 1, fp = 1/256 |

When all three are given, b and z are authoritative and fp is recomputed.
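
The table can be expressed as a single match over the three optional inputs. The following is a sketch of the rule as described here, not the actual `resolve_approx_params`: the signature, the error type, and the range checks (b in 1..=64, z in 1..=255, fp strictly between 0 and 1) are assumptions.

```rust
const DEFAULT_B: u32 = 8;
const DEFAULT_Z: u32 = 1;

/// ceil(-log2(fp)): total fingerprint bits needed for a target rate.
fn bits_for(fp: f64) -> u32 {
    (-fp.log2()).ceil().max(1.0) as u32
}

/// ceil(a / d) for positive integers.
fn div_ceil(a: u32, d: u32) -> u32 {
    (a + d - 1) / d
}

/// Resolves (b, z, fp) from whichever inputs are present.
/// When both b and z are given they win and fp is recomputed.
fn resolve(b: Option<u8>, z: Option<u8>, fp: Option<f64>) -> Result<(u8, u8, f64), String> {
    if let Some(p) = fp {
        if !(p > 0.0 && p < 1.0) {
            return Err("fp must lie strictly between 0 and 1".into());
        }
    }
    if b == Some(0) || z == Some(0) {
        return Err("b and z must be at least 1".into());
    }
    let need = fp.map(bits_for);
    let (b, z) = match (b.map(u32::from), z.map(u32::from), need) {
        (Some(b), Some(z), _) => (b, z),
        (None, Some(z), Some(t)) => (div_ceil(t, z), z),
        (Some(b), None, Some(t)) => (b, div_ceil(t, b)),
        (None, Some(z), None) => (DEFAULT_B, z),
        (Some(b), None, None) => (b, DEFAULT_Z),
        (None, None, Some(t)) => (DEFAULT_B, div_ceil(t, DEFAULT_B)),
        (None, None, None) => (DEFAULT_B, DEFAULT_Z),
    };
    // Reject derived values that do not fit the on-disk field widths.
    if !(1..=64).contains(&b) || !(1..=255).contains(&z) {
        return Err(format!("resolved parameters out of range: b={}, z={}", b, z));
    }
    // fp is always recomputed from the final (b, z), so it may be lower than requested.
    Ok((b as u8, z as u8, 2f64.powi(-((b * z) as i32))))
}
```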

---

## CLI flags

Both `index` and `reindex` accept the same flags:

| flag | type | meaning |
|---|---|---|
| `--approx` | bool | enable fingerprint evidence |
| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot |
| `-z` | u8 | Findere z parameter |
| `--fp` | f64 | target FP rate per z-window |
| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode |

`--approx` must be set explicitly; the three parameter flags
(`--evidence-bits`, `-z`, `--fp`) are optional and resolved by the 2-of-3
rule. Omitting all three produces b=8, z=1.

---

## `reindex` command

`reindex` converts an existing index between exact and approximate evidence
in place across all partitions and layers, running partitions in parallel via
Rayon.

Conversion to approximate (`--approx`):

- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
- Removes `evidence.bin` and `unitigs.bin.idx`.
- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`.

Conversion to exact (default, no `--approx`):

- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
- Removes `fingerprint.bin`.
- Updates `layer_meta.json` with `EvidenceKind::Exact`.

The root `index.meta` is updated with the new evidence kind on success.
`mphf.bin` and `unitigs.bin` are never modified.
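
The two conversion directions can be summarised as a per-layer file plan. This is only an illustration of the lists above: `ReindexPlan`, `reindex_plan`, and `touches_inputs` are hypothetical names, and the real command also rewrites `layer_meta.json` and the root `index.meta`, which the sketch leaves out.

```rust
/// Evidence kinds, mirroring the enum in the metadata section.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Files to build and to remove when converting one layer to a target kind.
#[derive(Debug, PartialEq)]
struct ReindexPlan {
    build: Vec<&'static str>,
    remove: Vec<&'static str>,
}

fn reindex_plan(target: EvidenceKind) -> ReindexPlan {
    match target {
        EvidenceKind::Approx { .. } => ReindexPlan {
            build: vec!["fingerprint.bin"],
            remove: vec!["evidence.bin", "unitigs.bin.idx"],
        },
        EvidenceKind::Exact => ReindexPlan {
            build: vec!["evidence.bin", "unitigs.bin.idx"],
            remove: vec!["fingerprint.bin"],
        },
    }
}

/// `mphf.bin` and `unitigs.bin` are inputs only and must never appear in a plan.
fn touches_inputs(plan: &ReindexPlan) -> bool {
    plan.build
        .iter()
        .chain(plan.remove.iter())
        .any(|f| *f == "mphf.bin" || *f == "unitigs.bin")
}
```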

---

## `estimate` command

`estimate` is a dry run that resolves and prints (b, z, fp) without touching
any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
additionally accepts `-k` to display the effective indexed k-mer length:

```
k (query): 31
k (indexed): 31
z: 1
evidence bits (b): 8
FP per k-mer: 3.906e-3 (1/2^8)
FP per z-window: 3.906e-3 (1/2^8)
```

Useful for choosing parameters before committing to an index build.