Files

182 lines
5.6 KiB
Markdown
Raw Permalink Normal View History

# Approximate evidence: fingerprint-based index
## Motivation
`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it,
enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:
| file | size/partition | fraction |
|---|---|---|
| evidence.bin | 132 MB | 66 % |
| unitigs.bin | 58 MB | 29 % |
| mphf.bin | 10 MB | 5 % |
`evidence.bin` is a bijection from MPHF-space to unitig-position-space and
costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with
only ~22 % packing headroom. Compression is not a path to elimination.
The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a
`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set
operations still require an exact index; the approximate index targets query
workloads that can tolerate a bounded false-positive rate.
---
## The Findere model
A B-bit fingerprint stored per MPHF slot provides the discrimination that
`evidence.bin` would otherwise provide through full k-mer reconstruction.
For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint
stored at `s` belongs to the legitimate k-mer at that slot. The FP event is:
```
P(FP per k-mer) = 1 / 2^b
```
The Findere trick reduces the indexed k-mer size. When the user specifies k_user
and z, the index physically stores k-mers of size `s = k_user z + 1`. At query
time, the same s-mer size is used. After collecting per-position s-mer results
over the full query sequence, a sliding window of size z aggregates z consecutive
s-mer hits into one confirmed k_user-mer hit, reducing the per-window FP rate:
```
P(FP per k_user-mer) = 1 / 2^(b·z)
```
`IndexConfig::kmer_size` stores `s = k_user z + 1`, not k_user. Both indexing
and querying use this stored size via `set_k(idx.kmer_size())`.
Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
---
## `FingerprintVec` on disk
`fingerprint.bin` layout:
```
magic: b"FPVF" (4 bytes)
b: u8 (bits per slot, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n·b/8) bytes, Lsb0 order
```
`FingerprintVec` is memory-mapped. The match check against a query k-mer:
```rust
fn matches(&self, slot: usize, fingerprint: u64) -> bool {
self.get(slot) == (fingerprint & self.mask)
}
```
`build_approx_evidence` iterates `unitigs.bin` sequentially, writes
`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin`
and `layer_meta.json`. No `.idx` file is produced; random access into
`unitigs.bin` is not needed.
At build time, `find_approx` in `MphfLayer`:
```rust
let slot = self.mphf.index(&kmer.raw());
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
```
---
## `EvidenceKind` and metadata
`layer_meta.json` records which evidence bundle is present:
```rust
pub enum EvidenceKind {
Exact,
Approx { b: u8, z: u8 },
}
```
`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or
`find_approx` transparently. `find_exact` panics on an approximate layer;
`find_approx` panics on an exact layer — mode mixing is a programming error.
---
## Parameter resolution (`resolve_approx_params`)
The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third.
`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling
rounding:
| given | derived |
|---|---|
| b, z | fp = 1/2^(b·z) |
| z, fp | b = ⌈−log₂(fp) / z⌉ |
| b, fp | z = ⌈−log₂(fp) / b⌉ |
| z only | b = 8 (default), fp derived |
| b only | z = 1 (default), fp derived |
| fp only | b = 8 (default), z derived |
| none | b = 8, z = 1, fp = 1/256 |
When all three are given, b and z are authoritative and fp is recomputed.
---
## CLI flags
Both `index` and `reindex` accept the same flags:
| flag | type | meaning |
|---|---|---|
| `--approx` | bool | enable fingerprint evidence |
| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot |
| `-z` | u8 | Findere z parameter |
| `--fp` | f64 | target FP rate per z-window |
| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode |
`--approx` must be set explicitly; the other three flags are optional and
resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.
---
## `reindex` command
`reindex` converts an existing index between exact and approximate evidence
in-place across all partitions and layers, running partitions in parallel via
Rayon.
Conversion to approximate (`--approx`):
- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
- Removes `evidence.bin` and `unitigs.bin.idx`.
- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`.
Conversion to exact (default, no `--approx`):
- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
- Removes `fingerprint.bin`.
- Updates `layer_meta.json` with `EvidenceKind::Exact`.
The root `index.meta` is updated with the new evidence kind on success.
`mphf.bin` and `unitigs.bin` are never modified.
---
## `estimate` command
`estimate` is a dry-run that resolves and prints (b, z, fp) without touching
any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
additionally accepts `-k` to display the effective indexed k-mer length:
```
k (user): 31
k (indexed, s=k-z+1): 27
z: 5
evidence bits (b): 8
FP per s-mer: 3.906e-3 (1/2^8)
FP per k-mer window: 9.537e-7 (1/2^(8·5))
```
Useful for choosing parameters before committing to an index build.