# Approximate evidence: fingerprint-based index ## Motivation `evidence.bin` maps each MPHF slot to the position of the k-mer that owns it, enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint: | file | size/partition | fraction | |---|---|---| | evidence.bin | 132 MB | 66 % | | unitigs.bin | 58 MB | 29 % | | mphf.bin | 10 MB | 5 % | `evidence.bin` is a bijection from MPHF-space to unitig-position-space and costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with only ~22 % packing headroom. Compression is not a path to elimination. The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a `fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set operations still require an exact index; the approximate index targets query workloads that can tolerate a bounded false-positive rate. --- ## The Findere model A B-bit fingerprint stored per MPHF slot provides the discrimination that `evidence.bin` would otherwise provide through full k-mer reconstruction. For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint stored at `s` belongs to the legitimate k-mer at that slot. The FP event is: ``` P(FP per k-mer) = 1 / 2^b ``` The Findere trick raises the effective window to z consecutive k-mers. A query succeeds only when all z fingerprint checks pass, reducing the per-window FP rate: ``` P(FP per z-window) = 1 / 2^(b·z) ``` The effective indexed k-mer length is `k − z + 1`: a query for a (k+z−1)-mer decomposes into z overlapping k-mers, all of which must match. Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`). --- ## `FingerprintVec` on disk `fingerprint.bin` layout: ``` magic: b"FPVF" (4 bytes) b: u8 (bits per slot, 1..=64) padding: [0u8; 3] n: u64 LE (number of slots) data: packed bits, ceil(n·b/8) bytes, Lsb0 order ``` `FingerprintVec` is memory-mapped. The match check against a query k-mer: ```rust fn matches(&self, slot: usize, fingerprint: u64) -> bool { self.get(slot) == (fingerprint & self.mask) } ``` `build_approx_evidence` iterates `unitigs.bin` sequentially, writes `kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin` and `layer_meta.json`. No `.idx` file is produced; random access into `unitigs.bin` is not needed. At build time, `find_approx` in `MphfLayer`: ```rust let slot = self.mphf.index(&kmer.raw()); if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None } ``` --- ## `EvidenceKind` and metadata `layer_meta.json` records which evidence bundle is present: ```rust pub enum EvidenceKind { Exact, Approx { b: u8, z: u8 }, } ``` `MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or `find_approx` transparently. `find_exact` panics on an approximate layer; `find_approx` panics on an exact layer — mode mixing is a programming error. --- ## Parameter resolution (`resolve_approx_params`) The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third. `resolve_approx_params` implements a 2-of-3 rule with conservative ceiling rounding: | given | derived | |---|---| | b, z | fp = 1/2^(b·z) | | z, fp | b = ⌈−log₂(fp) / z⌉ | | b, fp | z = ⌈−log₂(fp) / b⌉ | | z only | b = 8 (default), fp derived | | b only | z = 1 (default), fp derived | | fp only | b = 8 (default), z derived | | none | b = 8, z = 1, fp = 1/256 | When all three are given, b and z are authoritative and fp is recomputed. --- ## CLI flags Both `index` and `reindex` accept the same flags: | flag | type | meaning | |---|---|---| | `--approx` | bool | enable fingerprint evidence | | `--evidence-bits` (`b`) | u8 | fingerprint bits per slot | | `-z` | u8 | Findere z parameter | | `--fp` | f64 | target FP rate per z-window | | `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode | `--approx` must be set explicitly; the other three flags are optional and resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1. --- ## `reindex` command `reindex` converts an existing index between exact and approximate evidence in-place across all partitions and layers, running partitions in parallel via Rayon. Conversion to approximate (`--approx`): - Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`. - Removes `evidence.bin` and `unitigs.bin.idx`. - Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`. Conversion to exact (default, no `--approx`): - Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`. - Removes `fingerprint.bin`. - Updates `layer_meta.json` with `EvidenceKind::Exact`. The root `index.meta` is updated with the new evidence kind on success. `mphf.bin` and `unitigs.bin` are never modified. --- ## `estimate` command `estimate` is a dry-run that resolves and prints (b, z, fp) without touching any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and additionally accepts `-k` to display the effective indexed k-mer length: ``` k (query): 31 k (indexed): 31 z: 1 evidence bits (b): 8 FP per k-mer: 3.906e-3 (1/2^8) FP per z-window: 3.906e-3 (1/2^8) ``` Useful for choosing parameters before committing to an index build.