docs: update architecture and storage specs for approximate index

Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
2026-05-23 13:24:25 +02:00
parent b7db3a33ed
commit da56c3e290
5 changed files with 814 additions and 882 deletions
@@ -1,321 +1,178 @@
-# Evidence elimination — design discussion
+# Approximate evidence: fingerprint-based index

-## Problem statement
+## Motivation

-`evidence.bin` maps each MPHF slot to a position in the unitig store so that
-query verification is possible: given a slot `s` returned by `mphf.index(kmer)`,
-retrieve the k-mer stored at that position and compare with the query.
+`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it,
+enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
+k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:

-On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition):
-
-| file | size/partition | total (2048 parts) | fraction of lookup layer |
-|---|---|---|---|
-| evidence.bin | 132 MB | ~270 GB | **66 %** |
-| unitigs.bin | 58 MB | ~118 GB | 29 % |
-| mphf.bin | 10 MB | ~20 GB | 5 % |
-
-Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage
-optimisation available for index size.
-
---
-
-## Why evidence exists
-
-PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`.
-For a query k-mer not in the indexed set, the returned slot is meaningless but
-indistinguishable from a real hit without external information.  Evidence provides
-that information: `evidence[s]` encodes the location of the k-mer that legitimately
-occupies slot `s`, allowing the verification:
-
-```
-slot = mphf.index(query)
-(chunk_id, rank) = evidence.decode(slot)
-stored_kmer = unitigs.kmer_at(chunk_id, rank)
-return canonical(stored_kmer) == canonical(query)
-```
-
-Evidence is a **permutation** from MPHF-space to unitig-position-space.
-Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding.
-
---
-
-## Information-theoretic lower bound
-
-For a partition with P k-mers and U unitigs of average length m_u k-mers:
-
- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits
- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits
-
-Current implementation: 25 + 7 = 32 bits (aligned u32).  
-Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M.
-
-**Packing headroom: ~22 %.** Not a path to elimination.
-
---
-
-## Two-index architecture
-
-The exact index is mandatory for set operations (union, intersection, diff) and
-exact k-mer retrieval.  A separate approximate index, built for query operations,
-can tolerate a controlled false positive rate in exchange for a much smaller
-footprint.
-
-| component | exact index | approximate index |
+| file | size/partition | fraction |
 |---|---|---|
-| `mphf.bin` | ✓ | ✓ (same structure) |
-| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ |
-| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) |
-| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) |
-| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) |
+| evidence.bin | 132 MB | 66 % |
+| unitigs.bin | 58 MB | 29 % |
+| mphf.bin | 10 MB | 5 % |

-The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps
-`unitigs.bin` for sequential enumeration of K-mers.
+`evidence.bin` is a bijection from MPHF-space to unitig-position-space and
+costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with
+only ~22 % packing headroom. Compression is not a path to elimination.
+
+The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a
+`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set
+operations still require an exact index; the approximate index targets query
+workloads that can tolerate a bounded false-positive rate.

 ---

-## MPHF as a perfect Bloom filter
+## The Findere model

-A standard Bloom filter with a single hash function and N bits storing M keys has
-occupancy M/N.  For a foreign query k-mer, P(FP) = M/N — the probability of
-landing on a set bit.  The empty space (fraction 1 − M/N of bits at 0) is what
-rejects foreign k-mers.
+A b-bit fingerprint stored per MPHF slot provides the discrimination that
+`evidence.bin` would otherwise provide through full k-mer reconstruction.

-An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer
-occupies its own unique slot.  But unlike a Bloom filter, the MPHF maps **any**
-input to a slot in [0, M) — there is no empty space.  Every query lands on an
-occupied slot.  The MPHF alone cannot reject foreign k-mers at all.
-
-Adding a B-bit fingerprint restores the discrimination:
+For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint
+stored at `s` belongs to the legitimate k-mer at that slot, so a false positive
+occurs only when the two b-bit fingerprints collide by chance:

 ```
-slot        = mphf.index(query)
-fingerprint = hash(query) & mask_B
-present     = fingerprint_table[slot] == fingerprint
+P(FP per k-mer) = 1 / 2^b
 ```

-The fingerprint plays the role of the sparse space in the Bloom filter: it provides
-the B bits of information needed to reject foreign k-mers.
+The Findere trick raises the effective window to z consecutive k-mers. A query
+succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:

-Both structures reach the same fundamental cost for a given FP rate.  For 1% FP:
+```
+P(FP per z-window) = 1 / 2^(b·z)
+```

- Bloom filter (optimal, k hash functions): ~9.6 bits/key
- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key
+For a query k-mer of length k, the effective indexed k-mer length is
+`k − z + 1`: the query decomposes into z overlapping (k−z+1)-mers, all of
+which must match.

-This is a fundamental bound, not an implementation detail.
+Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
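As a rough numeric check of the two rates above, the following sketch evaluates them for given b and z. The function names are illustrative and not taken from the codebase, and it assumes k denotes the query length, so the indexed length is k − z + 1.

```rust
/// FP probability for one foreign k-mer checked against a b-bit fingerprint.
fn fp_per_kmer(b: u32) -> f64 {
    0.5f64.powi(b as i32)
}

/// FP probability for a window of z consecutive k-mers that must all match.
fn fp_per_window(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}

/// Indexed k-mer length for a query of length `k_query`.
/// Returns None when z is 0 or larger than the query length.
fn indexed_len(k_query: u32, z: u32) -> Option<u32> {
    if z == 0 || z > k_query {
        return None;
    }
    Some(k_query - z + 1)
}
```

With b = 8 and z = 4 the per-window rate is 1/2^32, roughly 2.3e-10, while a 31-mer query is checked through four overlapping 28-mers.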

 ---

-## Approach A — MPHF + fingerprint (approximate index)
+## `FingerprintVec` on disk

-### Size
+`fingerprint.bin` layout:

-| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) |
+```
+magic:   b"FPVF"  (4 bytes)
+b:       u8       (bits per slot, 1..=64)
+padding: [0u8; 3]
+n:       u64 LE   (number of slots)
+data:    packed bits, ceil(n·b/8) bytes, Lsb0 order
+```
+
+`FingerprintVec` is memory-mapped. The match check against a query k-mer:
+
+```rust
+fn matches(&self, slot: usize, fingerprint: u64) -> bool {
+    self.get(slot) == (fingerprint & self.mask)
+}
+```
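The layout above can be read back with a few lines of code. This is a minimal sketch, not the actual `FingerprintVec`: it works on a plain byte slice instead of a memory map, the names `FingerprintView` and `parse_fingerprint` are hypothetical, and it assumes that bit j of a slot's value sits at stream bit `slot·b + j`, which is one natural reading of "Lsb0 order".

```rust
const MAGIC: &[u8; 4] = b"FPVF";
const HEADER_LEN: usize = 16; // 4 (magic) + 1 (b) + 3 (padding) + 8 (n)

struct FingerprintView<'a> {
    b: u32,
    n: u64,
    data: &'a [u8],
}

/// Validate the header and return a view over the packed data, or None if malformed.
fn parse_fingerprint<'a>(buf: &'a [u8]) -> Option<FingerprintView<'a>> {
    if buf.len() < HEADER_LEN || buf[0..4] != MAGIC[..] {
        return None;
    }
    let b = buf[4] as u32;
    if b == 0 || b > 64 {
        return None;
    }
    let mut n_bytes = [0u8; 8];
    n_bytes.copy_from_slice(&buf[8..16]);
    let n = u64::from_le_bytes(n_bytes);
    let data = &buf[HEADER_LEN..];
    // ceil(n * b / 8) bytes of payload are required.
    let needed = (n as u128 * b as u128 + 7) / 8;
    if (data.len() as u128) < needed {
        return None;
    }
    Some(FingerprintView { b, n, data })
}

impl<'a> FingerprintView<'a> {
    fn mask(&self) -> u64 {
        if self.b == 64 { u64::MAX } else { (1u64 << self.b) - 1 }
    }

    /// Read the b-bit value of `slot`, one bit at a time (assumed Lsb0 layout).
    fn get(&self, slot: u64) -> u64 {
        assert!(slot < self.n, "slot out of range");
        let start = slot * self.b as u64;
        let mut value = 0u64;
        for j in 0..self.b as u64 {
            let bit = start + j;
            let byte = self.data[(bit / 8) as usize];
            value |= (((byte >> (bit % 8)) & 1) as u64) << j;
        }
        value
    }

    fn matches(&self, slot: u64, fingerprint: u64) -> bool {
        self.get(slot) == (fingerprint & self.mask())
    }
}
```

The bit-by-bit loop favours clarity over speed; a production reader would more likely load one or two words and shift.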
+
+`build_approx_evidence` iterates `unitigs.bin` sequentially, writes
+`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin`
+and `layer_meta.json`. No `.idx` file is produced; random access into
+`unitigs.bin` is not needed.
+
+At query time, `find_approx` in `MphfLayer` performs the lookup:
+
+```rust
+let slot = self.mphf.index(&kmer.raw());
+if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
+```
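The per-slot check composes into a z-window decision: a window is reported only if all z consecutive k-mers pass. The sketch below shows that step on a precomputed list of per-k-mer results; the function name `z_window_hits` and the boolean-slice interface are assumptions for illustration, not part of the described API.

```rust
/// For each window of `z` consecutive k-mers, report whether all of them matched.
/// `hits[i]` is the outcome of the fingerprint check for the i-th k-mer of the query.
fn z_window_hits(hits: &[bool], z: usize) -> Vec<bool> {
    assert!(z >= 1, "z must be at least 1");
    if hits.len() < z {
        return Vec::new();
    }
    let mut out = Vec::with_capacity(hits.len() - z + 1);
    let mut run = 0usize; // length of the run of consecutive hits ending at i
    for (i, &hit) in hits.iter().enumerate() {
        run = if hit { run + 1 } else { 0 };
        if i + 1 >= z {
            // The window ending at i is a match when the current run covers it.
            out.push(run >= z);
        }
    }
    out
}
```

Tracking the current run length keeps the scan linear in the number of k-mers, independent of z.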
+
+---
+
+## `EvidenceKind` and metadata
+
+`layer_meta.json` records which evidence bundle is present:
+
+```rust
+pub enum EvidenceKind {
+    Exact,
+    Approx { b: u8, z: u8 },
+}
+```
+
+`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or
+`find_approx` transparently. `find_exact` panics on an approximate layer;
+`find_approx` panics on an exact layer — mode mixing is a programming error.
+
+---
+
+## Parameter resolution (`resolve_approx_params`)
+
+The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third.
+`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling
+rounding:
+
+| given | derived |
+|---|---|
+| b, z | fp = 1/2^(b·z) |
+| z, fp | b = ⌈−log₂(fp) / z⌉ |
+| b, fp | z = ⌈−log₂(fp) / b⌉ |
+| z only | b = 8 (default), fp derived |
+| b only | z = 1 (default), fp derived |
+| fp only | b = 8 (default), z derived |
+| none | b = 8, z = 1, fp = 1/256 |
+
+When all three are given, b and z are authoritative and fp is recomputed.
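The table can be sketched as a single `match` over the three optional inputs. This is an illustration under stated assumptions, not the actual `resolve_approx_params`: the signature, the `ApproxParams` struct, and the choice to always return the fp achieved after rounding (rather than the requested target) are hypothetical, and range validation of b is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct ApproxParams {
    b: u8,
    z: u8,
    fp: f64, // rate actually achieved: 1 / 2^(b*z)
}

const DEFAULT_B: u8 = 8;
const DEFAULT_Z: u8 = 1;

/// ceil(-log2(fp) / divisor), at least 1: conservative rounding of the missing factor.
fn ceil_bits(fp: f64, divisor: u8) -> u8 {
    assert!(fp > 0.0 && fp < 1.0, "fp must be in (0, 1)");
    assert!(divisor > 0, "divisor must be positive");
    (-fp.log2() / divisor as f64).ceil().max(1.0) as u8
}

fn resolve_approx_params(b: Option<u8>, z: Option<u8>, fp: Option<f64>) -> ApproxParams {
    let (b, z) = match (b, z, fp) {
        // b and z are authoritative, even if fp was also given.
        (Some(b), Some(z), _) => (b, z),
        (None, Some(z), Some(fp)) => (ceil_bits(fp, z), z),
        (Some(b), None, Some(fp)) => (b, ceil_bits(fp, b)),
        (None, Some(z), None) => (DEFAULT_B, z),
        (Some(b), None, None) => (b, DEFAULT_Z),
        (None, None, Some(fp)) => (DEFAULT_B, ceil_bits(fp, DEFAULT_B)),
        (None, None, None) => (DEFAULT_B, DEFAULT_Z),
    };
    ApproxParams { b, z, fp: 0.5f64.powi(b as i32 * z as i32) }
}
```

For example, a target of 1e-3 with z = 2 resolves to b = 5, because −log₂(1e-3) ≈ 9.97 and ⌈9.97 / 2⌉ = 5; the achieved rate is then 1/2^10, slightly better than requested.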
+
+---
+
+## CLI flags
+
+Both `index` and `reindex` accept the same flags:
+
+| flag | type | meaning |
 |---|---|---|
-| 8 | 33 MB | 4× smaller |
-| 12 | 49 MB | 2.7× smaller |
-| 16 | 66 MB | 2× smaller |
+| `--approx` | bool | enable fingerprint evidence |
+| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot |
+| `-z` | u8 | Findere z parameter |
+| `--fp` | f64 | target FP rate per z-window |
+| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode |

-Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer).
-
-### False positive rate — per k-mer query
-
-For a specific non-indexed query k-mer q:
-
-1. MPHF(q) → slot s, some value in [0, M)
-2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s
-3. FP event: hash(q) & mask_B == fingerprint_table[s]
-
-Since q is not the legitimate k-mer at s, its fingerprint is independent of
-fingerprint_table[s], giving:
-
-```
-P(FP per k-mer) = 1 / 2^B
-```
-
-This is the probability of error **for one specific query k-mer**.  It is not the
-fraction of the k-mer universe that would be misclassified: querying all 4^k
-possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but
-that is not the relevant quantity for practical use.
-
-### Equivalence classes
-
-The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B
-equivalence classes of average size 4^k/(M·2^B).  Each class contains 1 true
-indexed k-mer and 4^k/(M·2^B) − 1 false positives.  A larger M (fewer partitions)
-produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B
-remains constant.
-
-### Read-level use case
-
-The relevant decision unit is the **read**, not the individual k-mer.  For a read
-of ~100 nucleotides and k=31, there are ~70 k-mers.
-
- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are
-  true positives → high coverage fraction.
- A plant read queried against a bacterial index: k-mers are foreign; each has
-  P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B.
-
-A coverage threshold separates the two cases decisively.  This is the same
-principle as Findere: local coverage continuity distinguishes true hits from noise.
+`--approx` must be set explicitly; `--evidence-bits`, `-z`, and `--fp` are
+optional and resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.

 ---

-## Approach B — z-consecutive k-mer matching
+## `reindex` command

-A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers.
-Declaring a match only when **all z are present** reduces the per-window FP rate:
+`reindex` converts an existing index between exact and approximate evidence
+in-place across all partitions and layers, running partitions in parallel via
+Rayon.

-```
-P(FP per window of z) = (1/2^B)^z = 1/2^(B·z)
-```
+Conversion to approximate (`--approx`):

-For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z.
-The probability that at least one window is a false positive:
+- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
+- Removes `evidence.bin` and `unitigs.bin.idx`.
+- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`.

-```
-P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z)
-```
+Conversion to exact (default, no `--approx`):

-For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸.
+- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
+- Removes `fingerprint.bin`.
+- Updates `layer_meta.json` with `EvidenceKind::Exact`.

-A plant read is misclassified as bacterial roughly once in 60 million reads —
-negligible for any practical dataset.
-
-### Choosing B from (z, L, P_target)
-
-z is a query-time parameter and does not affect the index structure.  However,
-knowing z at build time allows computing the minimum B required to reach a target
-FP rate P_target for reads of length L (giving W = L − k − z + 2 independent
-windows):
-
-```
-P_target ≈ W / 2^(B·z)  →  B = ceil( (log2(W) - log2(P_target)) / z )
-```
-
-Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**.
-
-(B, z) are co-determined at build time to minimise fingerprint size while
-guaranteeing the target read-level FP rate.
-
-### Combined sizing
-
-| B | z | K = k+z−1 | P(FP_read) | fingerprint.bin/partition |
-|---|---|---|---|---|
-| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB |
-| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB |
-| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB |
-| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB |
-
-Smaller B → smaller fingerprint table; larger z → longer minimum match length K
-and fewer independent windows per read.
+The root `index.meta` is updated with the new evidence kind on success.
+`mphf.bin` and `unitigs.bin` are never modified.

 ---

-## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index)
+## `estimate` command

-Build the MPHF to output the global k-mer position directly:
+`estimate` is a dry run that resolves and prints (b, z, fp) without touching
+any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
+additionally accepts `-k` to display the effective indexed k-mer length:

 ```
-mphf: kmer → global_pos ∈ [0, P)
+k (query):             31
+k (indexed):           31
+z:                     1
+evidence bits (b):     8
+FP per k-mer:          3.906e-3  (1/2^8)
+FP per z-window:       3.906e-3  (1/2^8)
 ```

-Verification becomes:
-
-```
-global_pos = mphf.index(query)
-stored_kmer = unitigs.kmer_at_global_pos(global_pos)
-return canonical(stored_kmer) == canonical(query)
-```
-
-No evidence array.  The unitig block index (see below) provides
-`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time.
-
-### What is required
-
-A **retrieval data structure** (also called a value-based or function-based MPHF):
-given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`,
-build a compact structure that maps each key to its assigned value.
-
-Known constructions:
-
- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph +
-  XOR-based assignment.  ~2.3 bits/key overhead over the information-theoretic
-  minimum.  Construction: O(n).  Query: O(1).
- **SSHash approach**: builds the MPHF to map k-mers to their positions in a
-  concatenated unitig string.  Achieves elimination of external evidence using a
-  "skew" construction that aligns the MPHF output with the sequential unitig layout.
-
-### Rust availability
-
-No Rust crate implements a retrieval data structure suitable for this use case as
-of 2025.  The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain
-MPHFs.  **This is the key blocking factor.**
-
-### SSHash construction (reference)
-
-SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over
-(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential
-positions in the concatenated unitig string.  A port to obikmer would require:
-
-1. Concatenating all unitig sequences into a single flat buffer per partition.
-2. Assigning each k-mer a global position (its offset in that buffer).
-3. Building the MPHF to output that position directly (retrieval step).
-4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`.
-
---
-
-## Approach 2 — block index prefix sums (reduces evidence to negligible)
-
-A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled
-offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset
-table.
-
-### Extension: k-mer prefix sums per block
-
-Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before
-block `b`.  For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**.
-
-This enables `kmer_at_global_pos(p)`:
-
-1. Binary search in `kmer_prefix[]` to find block `b`.
-2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`.
-3. Extract the k-mer at the remaining rank within the found chunk.
-
-Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses.
-
-### Combined with Approach 1
-
- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions)
- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible)
-
---
-
-## Recommended path
-
-1. **Short term (approximate index)**: implement MPHF + fingerprint.bin.  Choose
-   (B, z) as index parameters.  Drop `evidence.bin` and `unitigs.bin.idx`; keep
-   `unitigs.bin` for K-mer enumeration.  Expected size: ~43 MB/partition at B=8
-   vs ~142 MB for the exact lookup layer.
-
-2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`.
-   Zero cost if evidence.bin is kept; enables Approach 1 when ready.
-
-3. **Medium term**: implement GOV retrieval data structure in Rust, or port
-   SSHash construction.
-
-4. **Long term**: replace `evidence.bin` with the value-based MPHF.  Expected
-   index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset.
-
---
-
-## Open questions
-
- Is a GOV construction compatible with the parallel MPHF build currently used
-  (PtrHash's `new_from_par_iter`)?  GOV construction is inherently sequential
-  (hypergraph peeling); parallelisation is non-trivial.
- Can the SSHash "skew" insight be reused without the minimizer-bucket structure?
-  The obikmer partitioning already uses minimizers — there may be natural alignment.
- What is the query latency impact of replacing O(1) evidence lookup with
-  O(log n_blocks + BLOCK_SIZE) scan?  Needs benchmarking at realistic BCT scale.
- What is the optimal (B, z) trade-off for the approximate index given the target
-  read length and acceptable P(FP_read)?
+Useful for choosing parameters before committing to an index build.