Files

T

Eric Coissac 4a5ab0b8c2 feat: optimize unitig index and document evidence elimination

Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.

2026-05-23 12:53:42 +02:00

12 KiB

Raw Blame History

Evidence elimination — design discussion

Problem statement

evidence.bin maps each MPHF slot to a position in the unitig store so that query verification is possible: given a slot s returned by mphf.index(kmer), retrieve the k-mer stored at that position and compare with the query.

On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition):

file	size/partition	total (2048 parts)	fraction of lookup layer
evidence.bin	132 MB	~270 GB	66 %
unitigs.bin	58 MB	~118 GB	29 %
mphf.bin	10 MB	~20 GB	5 %

Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage optimisation available for index size.

Why evidence exists

PtrHash (like all standard MPHFs) maps any input to a valid slot in [0, n). For a query k-mer not in the indexed set, the returned slot is meaningless but indistinguishable from a real hit without external information. Evidence provides that information: evidence[s] encodes the location of the k-mer that legitimately occupies slot s, allowing the verification:

slot = mphf.index(query)
(chunk_id, rank) = evidence.decode(slot)
stored_kmer = unitigs.kmer_at(chunk_id, rank)
return canonical(stored_kmer) == canonical(query)

Evidence is a permutation from MPHF-space to unitig-position-space. Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding.

Information-theoretic lower bound

For a partition with P k-mers and U unitigs of average length m_u k-mers:

global k-mer index range: [0, P) → ⌈log₂ P⌉ bits
(chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits

Current implementation: 25 + 7 = 32 bits (aligned u32).
Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M.

Packing headroom: ~22 %. Not a path to elimination.

Two-index architecture

The exact index is mandatory for set operations (union, intersection, diff) and exact k-mer retrieval. A separate approximate index, built for query operations, can tolerate a controlled false positive rate in exchange for a much smaller footprint.

component	exact index	approximate index
`mphf.bin`	✓	✓ (same structure)
`evidence.bin`	✓ (32 bits/k-mer)	✗
`fingerprint.bin`	✗	✓ (B bits/k-mer)
`unitigs.bin`	✓	✓ (K-mer enumeration)
`unitigs.bin.idx`	✓	✗ (random access not needed)

The approximate index drops evidence.bin and unitigs.bin.idx; it keeps unitigs.bin for sequential enumeration of K-mers.

MPHF as a perfect Bloom filter

A standard Bloom filter with a single hash function and N bits storing M keys has occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of landing on a set bit. The empty space (fraction 1 − M/N of bits at 0) is what rejects foreign k-mers.

An MPHF is a Bloom filter with zero internal collisions: every indexed k-mer occupies its own unique slot. But unlike a Bloom filter, the MPHF maps any input to a slot in [0, M) — there is no empty space. Every query lands on an occupied slot. The MPHF alone cannot reject foreign k-mers at all.

Adding a B-bit fingerprint restores the discrimination:

slot        = mphf.index(query)
fingerprint = hash(query) & mask_B
present     = fingerprint_table[slot] == fingerprint

The fingerprint plays the role of the sparse space in the Bloom filter: it provides the B bits of information needed to reject foreign k-mers.

Both structures reach the same fundamental cost for a given FP rate. For 1% FP:

Bloom filter (optimal, k hash functions): ~9.6 bits/key
MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key

This is a fundamental bound, not an implementation detail.

Approach A — MPHF + fingerprint (approximate index)

Size

B (bits)	fingerprint.bin/partition	vs evidence.bin (32 bits)
8	33 MB	4× smaller
12	49 MB	2.7× smaller
16	66 MB	2× smaller

Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer).

False positive rate — per k-mer query

For a specific non-indexed query k-mer q:

MPHF(q) → slot s, some value in [0, M)
fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s
FP event: hash(q) & mask_B == fingerprint_table[s]

Since q is not the legitimate k-mer at s, its fingerprint is independent of fingerprint_table[s], giving:

P(FP per k-mer) = 1 / 2^B

This is the probability of error for one specific query k-mer. It is not the fraction of the k-mer universe that would be misclassified: querying all 4^k possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but that is not the relevant quantity for practical use.

Equivalence classes

The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true indexed k-mer and 4^k/(M·2^B) − 1 false positives. A larger M (fewer partitions) produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B remains constant.

Read-level use case

The relevant decision unit is the read, not the individual k-mer. For a read of ~100 nucleotides and k=31, there are ~70 k-mers.

A bacterial read queried against a bacterial index: nearly all ~70 k-mers are true positives → high coverage fraction.
A plant read queried against a bacterial index: k-mers are foreign; each has P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B.

A coverage threshold separates the two cases decisively. This is the same principle as Findere: local coverage continuity distinguishes true hits from noise.

Approach B — z-consecutive k-mer matching

A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers. Declaring a match only when all z are present reduces the per-window FP rate:

P(FP per window of z) = (1/2^B)^z = 1/2^(B·z)

For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z. The probability that at least one window is a false positive:

P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z)

For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸.

A plant read is misclassified as bacterial roughly once in 60 million reads — negligible for any practical dataset.

Choosing B from (z, L, P_target)

z is a query-time parameter and does not affect the index structure. However, knowing z at build time allows computing the minimum B required to reach a target FP rate P_target for reads of length L (giving W = L − k − z + 2 independent windows):

P_target ≈ W / 2^(B·z)  →  B = ceil( (log2(W) - log2(P_target)) / z )

Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = 9 bits.

(B, z) are co-determined at build time to minimise fingerprint size while guaranteeing the target read-level FP rate.

Combined sizing

B	z	K = k+z−1	P(FP_read)	fingerprint.bin/partition
8	2	32	~67/2^16 ≈ 10⁻³	33 MB
8	4	34	~67/2^32 ≈ 10⁻⁸	33 MB
4	4	34	~67/2^16 ≈ 10⁻³	16 MB
4	8	38	~63/2^32 ≈ 10⁻⁸	16 MB

Smaller B → smaller fingerprint table; larger z → longer minimum match length K and fewer independent windows per read.

Approach 1 — value-based MPHF (eliminates evidence.bin from exact index)

Build the MPHF to output the global k-mer position directly:

mphf: kmer → global_pos ∈ [0, P)

Verification becomes:

global_pos = mphf.index(query)
stored_kmer = unitigs.kmer_at_global_pos(global_pos)
return canonical(stored_kmer) == canonical(query)

No evidence array. The unitig block index (see below) provides kmer_at_global_pos in O(log(n_blocks) + BLOCK_SIZE) time.

What is required

A retrieval data structure (also called a value-based or function-based MPHF): given a set of (key, value) pairs with distinct keys and bijective values in [0, n), build a compact structure that maps each key to its assigned value.

Known constructions:

GOV / GBF (Generalized Bloomier Filter): random 3-uniform hypergraph + XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic minimum. Construction: O(n). Query: O(1).
SSHash approach: builds the MPHF to map k-mers to their positions in a concatenated unitig string. Achieves elimination of external evidence using a "skew" construction that aligns the MPHF output with the sequential unitig layout.

Rust availability

No Rust crate implements a retrieval data structure suitable for this use case as of 2025. The ph, boomphf, fmphf, and ptr_hash crates all build plain MPHFs. This is the key blocking factor.

SSHash construction (reference)

SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over (minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential positions in the concatenated unitig string. A port to obikmer would require:

Concatenating all unitig sequences into a single flat buffer per partition.
Assigning each k-mer a global position (its offset in that buffer).
Building the MPHF to output that position directly (retrieval step).
Replacing evidence.bin with a small prefix-sum index for kmer_at_global_pos.

Approach 2 — block index prefix sums (reduces evidence to negligible)

A prerequisite already implemented: unitigs.bin.idx now uses a block-sampled offset index (one u32 per BLOCK_SIZE=64 chunks) instead of a per-chunk offset table.

Extension: k-mer prefix sums per block

Add a second array to unitigs.bin.idx: kmer_prefix[b] = total k-mers before block b. For 33 M k-mers: ~73 600 blocks × 4 bytes = 295 KB/partition.

This enables kmer_at_global_pos(p):

Binary search in kmer_prefix[] to find block b.
Sequential scan from block_offsets[b] until cumulative k-mer count reaches p.
Extract the k-mer at the remaining rank within the found chunk.

Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses.

Combined with Approach 1

evidence.bin: eliminated (~270 GB saved across 2048 partitions)
kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible)

Recommended path

Short term (approximate index): implement MPHF + fingerprint.bin. Choose (B, z) as index parameters. Drop evidence.bin and unitigs.bin.idx; keep unitigs.bin for K-mer enumeration. Expected size: ~43 MB/partition at B=8 vs ~142 MB for the exact lookup layer.
Short term (exact index): add kmer_prefix[] to unitigs.bin.idx. Zero cost if evidence.bin is kept; enables Approach 1 when ready.
Medium term: implement GOV retrieval data structure in Rust, or port SSHash construction.
Long term: replace evidence.bin with the value-based MPHF. Expected index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset.

Open questions

Is a GOV construction compatible with the parallel MPHF build currently used (PtrHash's new_from_par_iter)? GOV construction is inherently sequential (hypergraph peeling); parallelisation is non-trivial.
Can the SSHash "skew" insight be reused without the minimizer-bucket structure? The obikmer partitioning already uses minimizers — there may be natural alignment.
What is the query latency impact of replacing O(1) evidence lookup with O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale.
What is the optimal (B, z) trade-off for the approximate index given the target read length and acceptable P(FP_read)?

12 KiB Raw Blame History Unescape Escape