Files
Eric Coissac 8a0b898b4b docs: clarify query pipeline, Findere trick, and input formats
Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.
2026-05-30 15:59:12 +02:00

5.6 KiB
Raw Permalink Blame History

Approximate evidence: fingerprint-based index

Motivation

evidence.bin maps each MPHF slot to the position of the k-mer that owns it, enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:

file size/partition fraction
evidence.bin 132 MB 66 %
unitigs.bin 58 MB 29 %
mphf.bin 10 MB 5 %

evidence.bin is a bijection from MPHF-space to unitig-position-space and costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with only ~22 % packing headroom. Compression is not a path to elimination.

The approximate index replaces evidence.bin + unitigs.bin.idx with a fingerprint.bin file. The MPHF and unitigs.bin are kept unchanged. Set operations still require an exact index; the approximate index targets query workloads that can tolerate a bounded false-positive rate.


The Findere model

A B-bit fingerprint stored per MPHF slot provides the discrimination that evidence.bin would otherwise provide through full k-mer reconstruction.

For a foreign k-mer query, the MPHF maps it to some slot s. The fingerprint stored at s belongs to the legitimate k-mer at that slot. The FP event is:

P(FP per k-mer) = 1 / 2^b

The Findere trick reduces the indexed k-mer size. When the user specifies k_user and z, the index physically stores k-mers of size s = k_user z + 1. At query time, the same s-mer size is used. After collecting per-position s-mer results over the full query sequence, a sliding window of size z aggregates z consecutive s-mer hits into one confirmed k_user-mer hit, reducing the per-window FP rate:

P(FP per k_user-mer) = 1 / 2^(b·z)

IndexConfig::kmer_size stores s = k_user z + 1, not k_user. Both indexing and querying use this stored size via set_k(idx.kmer_size()).

Parameters b and z are stored in layer_meta.json (EvidenceKind::Approx { b, z }).


FingerprintVec on disk

fingerprint.bin layout:

magic:   b"FPVF"  (4 bytes)
b:       u8       (bits per slot, 1..=64)
padding: [0u8; 3]
n:       u64 LE   (number of slots)
data:    packed bits, ceil(n·b/8) bytes, Lsb0 order

FingerprintVec is memory-mapped. The match check against a query k-mer:

fn matches(&self, slot: usize, fingerprint: u64) -> bool {
    self.get(slot) == (fingerprint & self.mask)
}

build_approx_evidence iterates unitigs.bin sequentially, writes kmer.seq_hash() into the slot assigned by the MPHF, then saves fingerprint.bin and layer_meta.json. No .idx file is produced; random access into unitigs.bin is not needed.

At build time, find_approx in MphfLayer:

let slot = self.mphf.index(&kmer.raw());
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }

EvidenceKind and metadata

layer_meta.json records which evidence bundle is present:

pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

MphfLayer::open reads this tag and dispatches find to find_exact or find_approx transparently. find_exact panics on an approximate layer; find_approx panics on an exact layer — mode mixing is a programming error.


Parameter resolution (resolve_approx_params)

The identity b·z = ⌈−log₂(fp)⌉ lets any two of (b, z, fp) derive the third. resolve_approx_params implements a 2-of-3 rule with conservative ceiling rounding:

given derived
b, z fp = 1/2^(b·z)
z, fp b = ⌈−log₂(fp) / z⌉
b, fp z = ⌈−log₂(fp) / b⌉
z only b = 8 (default), fp derived
b only z = 1 (default), fp derived
fp only b = 8 (default), z derived
none b = 8, z = 1, fp = 1/256

When all three are given, b and z are authoritative and fp is recomputed.


CLI flags

Both index and reindex accept the same flags:

flag type meaning
--approx bool enable fingerprint evidence
--evidence-bits (b) u8 fingerprint bits per slot
-z u8 Findere z parameter
--fp f64 target FP rate per z-window
--block-size usize unitig block size for exact .idx; ignored in approx mode

--approx must be set explicitly; the other three flags are optional and resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.


reindex command

reindex converts an existing index between exact and approximate evidence in-place across all partitions and layers, running partitions in parallel via Rayon.

Conversion to approximate (--approx):

  • Builds fingerprint.bin from unitigs.bin + mphf.bin.
  • Removes evidence.bin and unitigs.bin.idx.
  • Updates layer_meta.json with EvidenceKind::Approx { b, z }.

Conversion to exact (default, no --approx):

  • Builds evidence.bin + unitigs.bin.idx from unitigs.bin + mphf.bin.
  • Removes fingerprint.bin.
  • Updates layer_meta.json with EvidenceKind::Exact.

The root index.meta is updated with the new evidence kind on success. mphf.bin and unitigs.bin are never modified.


estimate command

estimate is a dry-run that resolves and prints (b, z, fp) without touching any index. It accepts the same --evidence-bits, -z, and --fp flags and additionally accepts -k to display the effective indexed k-mer length:

k (user):              31
k (indexed, s=k-z+1):  27
z:                     5
evidence bits (b):     8
FP per s-mer:          3.906e-3  (1/2^8)
FP per k-mer window:   9.537e-7  (1/2^(8·5))

Useful for choosing parameters before committing to an index build.