Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.
5.6 KiB
Approximate evidence: fingerprint-based index
Motivation
evidence.bin maps each MPHF slot to the position of the k-mer that owns it,
enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:
| file | size/partition | fraction |
|---|---|---|
| evidence.bin | 132 MB | 66 % |
| unitigs.bin | 58 MB | 29 % |
| mphf.bin | 10 MB | 5 % |
evidence.bin is a bijection from MPHF-space to unitig-position-space and
costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with
only ~22 % packing headroom. Compression is not a path to elimination.
The approximate index replaces evidence.bin + unitigs.bin.idx with a
fingerprint.bin file. The MPHF and unitigs.bin are kept unchanged. Set
operations still require an exact index; the approximate index targets query
workloads that can tolerate a bounded false-positive rate.
The Findere model
A B-bit fingerprint stored per MPHF slot provides the discrimination that
evidence.bin would otherwise provide through full k-mer reconstruction.
For a foreign k-mer query, the MPHF maps it to some slot s. The fingerprint
stored at s belongs to the legitimate k-mer at that slot. The FP event is:
P(FP per k-mer) = 1 / 2^b
The Findere trick reduces the indexed k-mer size. When the user specifies k_user
and z, the index physically stores k-mers of size s = k_user − z + 1. At query
time, the same s-mer size is used. After collecting per-position s-mer results
over the full query sequence, a sliding window of size z aggregates z consecutive
s-mer hits into one confirmed k_user-mer hit, reducing the per-window FP rate:
P(FP per k_user-mer) = 1 / 2^(b·z)
IndexConfig::kmer_size stores s = k_user − z + 1, not k_user. Both indexing
and querying use this stored size via set_k(idx.kmer_size()).
Parameters b and z are stored in layer_meta.json (EvidenceKind::Approx { b, z }).
FingerprintVec on disk
fingerprint.bin layout:
magic: b"FPVF" (4 bytes)
b: u8 (bits per slot, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n·b/8) bytes, Lsb0 order
FingerprintVec is memory-mapped. The match check against a query k-mer:
fn matches(&self, slot: usize, fingerprint: u64) -> bool {
self.get(slot) == (fingerprint & self.mask)
}
build_approx_evidence iterates unitigs.bin sequentially, writes
kmer.seq_hash() into the slot assigned by the MPHF, then saves fingerprint.bin
and layer_meta.json. No .idx file is produced; random access into
unitigs.bin is not needed.
At build time, find_approx in MphfLayer:
let slot = self.mphf.index(&kmer.raw());
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
EvidenceKind and metadata
layer_meta.json records which evidence bundle is present:
pub enum EvidenceKind {
Exact,
Approx { b: u8, z: u8 },
}
MphfLayer::open reads this tag and dispatches find to find_exact or
find_approx transparently. find_exact panics on an approximate layer;
find_approx panics on an exact layer — mode mixing is a programming error.
Parameter resolution (resolve_approx_params)
The identity b·z = ⌈−log₂(fp)⌉ lets any two of (b, z, fp) derive the third.
resolve_approx_params implements a 2-of-3 rule with conservative ceiling
rounding:
| given | derived |
|---|---|
| b, z | fp = 1/2^(b·z) |
| z, fp | b = ⌈−log₂(fp) / z⌉ |
| b, fp | z = ⌈−log₂(fp) / b⌉ |
| z only | b = 8 (default), fp derived |
| b only | z = 1 (default), fp derived |
| fp only | b = 8 (default), z derived |
| none | b = 8, z = 1, fp = 1/256 |
When all three are given, b and z are authoritative and fp is recomputed.
CLI flags
Both index and reindex accept the same flags:
| flag | type | meaning |
|---|---|---|
--approx |
bool | enable fingerprint evidence |
--evidence-bits (b) |
u8 | fingerprint bits per slot |
-z |
u8 | Findere z parameter |
--fp |
f64 | target FP rate per z-window |
--block-size |
usize | unitig block size for exact .idx; ignored in approx mode |
--approx must be set explicitly; the other three flags are optional and
resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.
reindex command
reindex converts an existing index between exact and approximate evidence
in-place across all partitions and layers, running partitions in parallel via
Rayon.
Conversion to approximate (--approx):
- Builds
fingerprint.binfromunitigs.bin+mphf.bin. - Removes
evidence.binandunitigs.bin.idx. - Updates
layer_meta.jsonwithEvidenceKind::Approx { b, z }.
Conversion to exact (default, no --approx):
- Builds
evidence.bin+unitigs.bin.idxfromunitigs.bin+mphf.bin. - Removes
fingerprint.bin. - Updates
layer_meta.jsonwithEvidenceKind::Exact.
The root index.meta is updated with the new evidence kind on success.
mphf.bin and unitigs.bin are never modified.
estimate command
estimate is a dry-run that resolves and prints (b, z, fp) without touching
any index. It accepts the same --evidence-bits, -z, and --fp flags and
additionally accepts -k to display the effective indexed k-mer length:
k (user): 31
k (indexed, s=k-z+1): 27
z: 5
evidence bits (b): 8
FP per s-mer: 3.906e-3 (1/2^8)
FP per k-mer window: 9.537e-7 (1/2^(8·5))
Useful for choosing parameters before committing to an index build.