obikmer/docmd/implementation/evidence_elimination.md
Eric Coissac da56c3e290 docs: update architecture and storage specs for approximate index
Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:54:31 +02:00


Approximate evidence: fingerprint-based index

Motivation

evidence.bin maps each MPHF slot to the position of the k-mer that owns it, enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:

| file | size per partition | fraction |
|---|---|---|
| evidence.bin | 132 MB | 66 % |
| unitigs.bin | 58 MB | 29 % |
| mphf.bin | 10 MB | 5 % |

evidence.bin is a bijection from MPHF-space to unitig-position-space and costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with only ~22 % packing headroom. Compression is not a path to elimination.
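
The floor and the headroom can be checked with a few lines of arithmetic. The sketch below assumes 32-bit slots, which is only inferred from the figures above (132 MB / 33 M slots ≈ 4 bytes); the function names are illustrative and not part of the codebase.

```rust
/// Minimum number of bits needed to address one of `n` distinct positions,
/// i.e. ceil(log2(n)).
fn min_bits_per_slot(n: u64) -> u32 {
    assert!(n > 0, "n must be positive");
    if n == 1 {
        0
    } else {
        // Bit length of (n - 1) equals ceil(log2(n)) for n >= 2.
        64 - (n - 1).leading_zeros()
    }
}

/// Fraction of a `stored_bits`-wide slot that tight packing could save.
/// `stored_bits` = 32 is an assumption inferred from the table above.
fn packing_headroom(n: u64, stored_bits: u32) -> f64 {
    1.0 - f64::from(min_bits_per_slot(n)) / f64::from(stored_bits)
}
```

With N ≈ 33 M the floor is 25 bits per slot, so packing 32-bit slots down to 25 bits would save about 22 % and nothing more.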

The approximate index replaces evidence.bin + unitigs.bin.idx with a fingerprint.bin file. The MPHF and unitigs.bin are kept unchanged. Set operations still require an exact index; the approximate index targets query workloads that can tolerate a bounded false-positive rate.


The Findere model

A b-bit fingerprint stored per MPHF slot provides the discrimination that evidence.bin would otherwise provide through full k-mer reconstruction.

For a foreign k-mer query, the MPHF still maps it to some slot s, but the fingerprint stored at s belongs to the legitimate k-mer that owns that slot. A false positive occurs when the two fingerprints happen to coincide. Assuming uniformly distributed fingerprints, this has probability:

P(FP per k-mer) = 1 / 2^b

The Findere trick raises the effective window to z consecutive k-mers. A query succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:

P(FP per z-window) = 1 / 2^(b·z)

A query of length k + z − 1 decomposes into z overlapping k-mers, all of which must match. Equivalently, for a fixed query length k, the effective indexed k-mer length is k − z + 1.
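
As a rough numerical check of the two formulas above, the following sketch computes both rates and the query length covered by a z-window. It assumes the z fingerprint checks are independent, which is what the product formula implies; the function names are hypothetical.

```rust
/// Probability that a foreign k-mer passes a single b-bit fingerprint check.
fn fp_per_kmer(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Probability that z checks all pass, assuming they are independent.
fn fp_per_window(b: u8, z: u8) -> f64 {
    0.5f64.powi(i32::from(b) * i32::from(z))
}

/// Length of the query that decomposes into z overlapping k-mers.
fn query_len(k: usize, z: usize) -> usize {
    assert!(z >= 1, "z must be at least 1");
    k + z - 1
}
```

For example, b = 8 and z = 3 give a per-window rate of 1/2^24, roughly 6e-8, while each slot still stores only one byte.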

Parameters b and z are stored in layer_meta.json (EvidenceKind::Approx { b, z }).


FingerprintVec on disk

fingerprint.bin layout:

```text
magic:   b"FPVF"  (4 bytes)
b:       u8       (bits per slot, 1..=64)
padding: [0u8; 3]
n:       u64 LE   (number of slots)
data:    packed bits, ceil(n·b/8) bytes, Lsb0 order
```
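
A minimal sketch of encoding and validating this header, assuming the fields are laid out back to back exactly as listed (16 bytes in total, with data starting at offset 16). `Header`, `encode_header`, `decode_header` and `data_len` are illustrative names, not the project's API.

```rust
const MAGIC: &[u8; 4] = b"FPVF";
const HEADER_LEN: usize = 16; // 4 (magic) + 1 (b) + 3 (padding) + 8 (n)

#[derive(Debug, PartialEq)]
struct Header {
    b: u8,  // bits per slot
    n: u64, // number of slots
}

fn encode_header(h: &Header) -> [u8; HEADER_LEN] {
    let mut out = [0u8; HEADER_LEN];
    out[0..4].copy_from_slice(MAGIC);
    out[4] = h.b;
    // Bytes 5..8 are padding and stay zero.
    out[8..16].copy_from_slice(&h.n.to_le_bytes());
    out
}

fn decode_header(bytes: &[u8]) -> Result<Header, String> {
    if bytes.len() < HEADER_LEN {
        return Err("truncated header".to_string());
    }
    if bytes[0..4] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let b = bytes[4];
    if !(1..=64).contains(&b) {
        return Err(format!("bits per slot out of range: {}", b));
    }
    let mut n_le = [0u8; 8];
    n_le.copy_from_slice(&bytes[8..16]);
    Ok(Header { b, n: u64::from_le_bytes(n_le) })
}

/// Size of the packed data section in bytes: ceil(n * b / 8).
/// Computed in u128 so that very large n cannot overflow.
fn data_len(h: &Header) -> u128 {
    (u128::from(h.n) * u128::from(h.b) + 7) / 8
}
```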

FingerprintVec is memory-mapped. The match check against a query k-mer:

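```rust
/// Returns true when the fingerprint stored at `slot` equals the query's fingerprint.
fn matches(&self, slot: usize, fingerprint: u64) -> bool {
    // The mask truncates the query hash to the stored width before comparing.
    self.get(slot) == (fingerprint & self.mask)
}
```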
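
The `get` call hides the bit unpacking. One way it could work is sketched below, assuming byte-granular Lsb0 packing in which bit i of the value at `slot` sits at bit position slot·b + i of the data section. The real FingerprintVec may use wider words or a bit-vector crate; `get_packed`, `set_packed` and `low_mask` are hypothetical helpers.

```rust
/// Mask selecting the low `b` bits of a u64.
fn low_mask(b: u32) -> u64 {
    assert!((1..=64).contains(&b));
    if b == 64 { u64::MAX } else { (1u64 << b) - 1 }
}

/// Read the b-bit value stored at `slot` (least-significant bit first).
fn get_packed(data: &[u8], b: u32, slot: usize) -> u64 {
    assert!((1..=64).contains(&b));
    let start = slot * b as usize; // index of the value's first bit
    let mut value = 0u64;
    for i in 0..b as usize {
        let bit = start + i;
        let set = (data[bit / 8] >> (bit % 8)) & 1;
        value |= u64::from(set) << i;
    }
    value
}

/// Write the low b bits of `value` into `slot`.
fn set_packed(data: &mut [u8], b: u32, slot: usize, value: u64) {
    assert!((1..=64).contains(&b));
    let start = slot * b as usize;
    for i in 0..b as usize {
        let bit = start + i;
        let mask = 1u8 << (bit % 8);
        if ((value >> i) & 1) == 1 {
            data[bit / 8] |= mask;
        } else {
            data[bit / 8] &= !mask;
        }
    }
}

/// Same comparison as `matches` above, on a raw byte slice.
fn matches_packed(data: &[u8], b: u32, slot: usize, fingerprint: u64) -> bool {
    get_packed(data, b, slot) == (fingerprint & low_mask(b))
}
```

The bit-by-bit loop favours clarity over speed; a production reader would more likely load one or two machine words and shift.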

build_approx_evidence iterates unitigs.bin sequentially, writes kmer.seq_hash() into the slot assigned by the MPHF, then saves fingerprint.bin and layer_meta.json. No .idx file is produced; random access into unitigs.bin is not needed.

At query time, find_approx in MphfLayer:

```rust
let slot = self.mphf.index(&kmer.raw());
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
```
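
The snippet above checks a single k-mer. How the z checks of a Findere window are combined is not shown in this document; the sketch below is one plausible reading, in which a query matches only if every one of its overlapping k-mers passes. `window_matches` and the closure standing in for `find_approx` are hypothetical.

```rust
/// Returns true only if every k-mer of `query` passes its fingerprint check.
/// For a query of length k + z - 1 this performs exactly z lookups.
fn window_matches<F>(query: &[u8], k: usize, find_approx: F) -> bool
where
    F: Fn(&[u8]) -> Option<usize>,
{
    assert!(k > 0 && query.len() >= k, "query must hold at least one k-mer");
    // `windows(k)` yields the overlapping k-mers in order; `all` stops at
    // the first failed check.
    query.windows(k).all(|kmer| find_approx(kmer).is_some())
}
```

Because `all` short-circuits, a foreign query is usually rejected after its first lookup, so the extra cost of z > 1 falls mostly on true matches.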

EvidenceKind and metadata

layer_meta.json records which evidence bundle is present:

```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```

MphfLayer::open reads this tag and dispatches find to find_exact or find_approx transparently. find_exact panics on an approximate layer; find_approx panics on an exact layer — mode mixing is a programming error.
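
The dispatch and the panic-on-mismatch contract can be illustrated with a toy layer. This is a sketch only: `Layer` is a stand-in for MphfLayer, k-mers are reduced to a `u64`, and the actual lookups are elided.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Toy stand-in for MphfLayer; only the evidence tag is modelled.
struct Layer {
    kind: EvidenceKind,
}

impl Layer {
    /// Route the query to the lookup that matches the layer's evidence kind.
    fn find(&self, kmer: u64) -> Option<usize> {
        match self.kind {
            EvidenceKind::Exact => self.find_exact(kmer),
            EvidenceKind::Approx { .. } => self.find_approx(kmer),
        }
    }

    fn find_exact(&self, _kmer: u64) -> Option<usize> {
        assert!(
            matches!(self.kind, EvidenceKind::Exact),
            "find_exact called on an approximate layer"
        );
        None // real lookup elided in this sketch
    }

    fn find_approx(&self, _kmer: u64) -> Option<usize> {
        assert!(
            matches!(self.kind, EvidenceKind::Approx { .. }),
            "find_approx called on an exact layer"
        );
        None // real lookup elided in this sketch
    }
}
```

Callers that go through `find` can never trigger the panics; only code that names a specific mode has to know which evidence the layer carries.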


Parameter resolution (resolve_approx_params)

The identity b·z = ⌈−log₂(fp)⌉ lets any two of (b, z, fp) derive the third. resolve_approx_params implements a 2-of-3 rule with conservative ceiling rounding:

| given | derived |
|---|---|
| b, z | fp = 1/2^(b·z) |
| z, fp | b = ⌈−log₂(fp) / z⌉ |
| b, fp | z = ⌈−log₂(fp) / b⌉ |
| z only | b = 8 (default), fp derived |
| b only | z = 1 (default), fp derived |
| fp only | b = 8 (default), z derived |
| none | b = 8, z = 1, fp = 1/256 |

When all three are given, b and z are authoritative and fp is recomputed.
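
The table can be transcribed almost directly into a match over the three optional inputs. The sketch below is not the actual resolve_approx_params; the signature, the `ApproxParams` struct, and the choice to report the achieved rate 1/2^(b·z) rather than the requested one are assumptions.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
struct ApproxParams {
    b: u8,
    z: u8,
    fp: f64, // achieved rate, always 1 / 2^(b*z)
}

const DEFAULT_B: u8 = 8;
const DEFAULT_Z: u8 = 1;

/// ceil(-log2(fp)): fingerprint bits needed across one z-window.
fn total_bits(fp: f64) -> u32 {
    assert!(fp > 0.0 && fp < 1.0, "fp must lie strictly between 0 and 1");
    (-fp.log2()).ceil().max(1.0) as u32
}

/// Ceiling division, narrowed to u8 (panics if the result does not fit).
fn ceil_div_u8(bits: u32, by: u8) -> u8 {
    assert!(by > 0, "divisor must be positive");
    let by = u32::from(by);
    u8::try_from((bits + by - 1) / by).expect("derived parameter exceeds u8")
}

fn resolve(b: Option<u8>, z: Option<u8>, fp: Option<f64>) -> ApproxParams {
    let (b, z) = match (b, z, fp) {
        // b and z are authoritative, whether or not fp was also given.
        (Some(b), Some(z), _) => (b, z),
        (None, Some(z), Some(fp)) => (ceil_div_u8(total_bits(fp), z), z),
        (Some(b), None, Some(fp)) => (b, ceil_div_u8(total_bits(fp), b)),
        (None, Some(z), None) => (DEFAULT_B, z),
        (Some(b), None, None) => (b, DEFAULT_Z),
        (None, None, Some(fp)) => (DEFAULT_B, ceil_div_u8(total_bits(fp), DEFAULT_B)),
        (None, None, None) => (DEFAULT_B, DEFAULT_Z),
    };
    assert!((1..=64).contains(&b) && z >= 1, "b must be in 1..=64 and z >= 1");
    // Ceiling rounding means the achieved rate never exceeds the requested one.
    let fp = 0.5f64.powi(i32::from(b) * i32::from(z));
    ApproxParams { b, z, fp }
}
```

For instance, requesting fp = 1e-6 with z = 2 needs 20 bits in total, so b resolves to 10 and the achieved rate is 1/2^20 ≈ 9.5e-7.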


CLI flags

Both index and reindex accept the same flags:

| flag | type | meaning |
|---|---|---|
| --approx | bool | enable fingerprint evidence |
| --evidence-bits (b) | u8 | fingerprint bits per slot |
| -z | u8 | Findere z parameter |
| --fp | f64 | target FP rate per z-window |
| --block-size | usize | unitig block size for exact .idx; ignored in approx mode |

--approx must be set explicitly. The three parameter flags (--evidence-bits, -z, --fp) are optional and resolved by the 2-of-3 rule; omitting all three produces b=8, z=1.


reindex command

reindex converts an existing index between exact and approximate evidence in place, across all partitions and layers, processing partitions in parallel via Rayon.

Conversion to approximate (--approx):

  • Builds fingerprint.bin from unitigs.bin + mphf.bin.
  • Removes evidence.bin and unitigs.bin.idx.
  • Updates layer_meta.json with EvidenceKind::Approx { b, z }.

Conversion to exact (default, no --approx):

  • Builds evidence.bin + unitigs.bin.idx from unitigs.bin + mphf.bin.
  • Removes fingerprint.bin.
  • Updates layer_meta.json with EvidenceKind::Exact.

The root index.meta is updated with the new evidence kind on success. mphf.bin and unitigs.bin are never modified.
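
The two conversions are mirror images of each other, which a small per-layer plan makes explicit. This is a sketch: `Target`, `Plan` and `reindex_plan` are hypothetical names, and the parallel traversal and metadata updates are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Target {
    Exact,
    Approx,
}

/// Files to build and files to remove in one layer directory.
struct Plan {
    build: &'static [&'static str],
    remove: &'static [&'static str],
}

fn reindex_plan(target: Target) -> Plan {
    match target {
        Target::Approx => Plan {
            build: &["fingerprint.bin"],
            remove: &["evidence.bin", "unitigs.bin.idx"],
        },
        Target::Exact => Plan {
            build: &["evidence.bin", "unitigs.bin.idx"],
            remove: &["fingerprint.bin"],
        },
    }
}
```

Neither plan mentions mphf.bin or unitigs.bin, matching the guarantee that both files are only read during a conversion.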


estimate command

estimate is a dry run that resolves and prints (b, z, fp) without touching any index. It accepts the same --evidence-bits, -z, and --fp flags, plus -k to display the effective indexed k-mer length:

```text
k (query):             31
k (indexed):           31
z:                     1
evidence bits (b):     8
FP per k-mer:          3.906e-3  (1/2^8)
FP per z-window:       3.906e-3  (1/2^8)
```

Useful for choosing parameters before committing to an index build.