docs: update architecture and storage specs for approximate index

Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore<S>` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details.
This commit is contained in:
Eric Coissac
2026-05-23 13:24:25 +02:00
parent b7db3a33ed
commit da56c3e290
5 changed files with 814 additions and 882 deletions
+224 -267
View File
@@ -2,169 +2,155 @@
## Fundamental invariant ## Fundamental invariant
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination. A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination.
--- ---
## Three-level hierarchy ## Three-level hierarchy
``` ```
PartitionedIndex KmerIndex (index.meta + KmerPartition)
├── LayeredPartition (one per minimiser bucket) ├── partition_0/index/ one directory per minimiser bucket
│ ├── MphfLayer 0 kmer → slot (immutable bijection) │ ├── meta.json PartitionMeta { n_layers }
│ ├── DataStore A slot → T (e.g. counts) │ ├── layer_0/
│ │ ── DataStore B slot → T (e.g. presence/absence, derived) │ │ ── layer_meta.json LayerMeta { evidence: EvidenceKind }
│ ├── MphfLayer 1 │ ├── mphf.bin PtrHash MPHF
│ │ ── DataStore A │ │ ── unitigs.bin unitig spine (never overwritten)
└── ... │ ├── evidence.bin exact evidence (Exact only)
├── LayeredPartition │ │ ├── unitigs.bin.idx block index (Exact only)
└── ... │ ├── fingerprint.bin fingerprints (Approx only)
│ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true)
│ │ └── presence/ PersistentBitMatrix
│ └── layer_1/
│ └── ...
└── partition_1/index/
└── ...
``` ```
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel. **KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon.
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint. **Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets.
**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it. **Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`.
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
--- ---
## MphfLayer — autonomous mapping layer ## IndexConfig and IndexMeta
```rust ```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent pub struct IndexConfig {
MphfLayer::n() -> usize // number of slots pub kmer_size: usize,
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin pub minimizer_size: usize,
MphfLayer::open(dir: &Path) -> OLMResult<Self> pub n_bits: usize, // log2(n_partitions)
pub with_counts: bool,
pub evidence: EvidenceKind,
pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig
}
pub struct IndexMeta {
pub version: u32,
pub config: IndexConfig,
pub genomes: Vec<GenomeInfo>, // ordered; index = genome column number
}
pub struct GenomeInfo {
pub label: String,
pub meta: HashMap<String, String>, // arbitrary categorical metadata
}
``` ```
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index. `IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index.
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`. ---
## EvidenceKind
```rust
pub enum EvidenceKind {
Exact,
Approx { b: u8, z: u8 },
}
```
Controls which files are written per layer and which query path is taken:
| Variant | Files written | False-positive rate |
|---|---|---|
| `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 |
| `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) |
`EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built).
---
## MphfLayer — autonomous kmer → slot mapping
```rust
pub struct MphfLayer {
mphf: PtrHash<>,
ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint }
n: usize,
}
```
`MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). Returns `Some(slot)` only if the kmer is confirmed present; `None` for absent or out-of-range.
```
find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs
find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer)
```
`block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`.
The MPHF and `unitigs.bin` are never rebuilt by any post-build operation.
---
## Layer\<D\> — MPHF + data payload
```rust
pub struct Layer<D: LayerData = ()> {
mphf: MphfLayer,
data: D,
}
```
`D` selects the attached data payload:
| `D` | Data directory | `Item` returned by `query` |
|---|---|---|
| `()` | — | `()` (set membership only) |
| `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) |
| `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) |
`Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option<Hit<D::Item>>`.
Build-time entry points:
```rust
Layer<()>::build(out_dir, block_bits) // set membership
Layer<PersistentCompactIntMatrix>::build(out_dir, block_bits, count_of)
Layer<PersistentBitMatrix>::build_presence(out_dir, block_bits, n_genomes, present_in)
Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path)
```
--- ---
## DataStore — slot-indexed data ## DataStore — slot-indexed data
```rust `PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs.
trait DataStore {
type Item;
fn get(&self, slot: usize) -> Self::Item;
fn n(&self) -> usize;
}
```
Concrete types from `obicompactvec`: | Type | `Item` | Aggregation method | Use |
| Type | `Item` | Column stats | Use |
|---|---|---|---| |---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot | | `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() Array1<u64>` | counts per genome per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot | | `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() Array1<u64>` | presence per genome per slot |
`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
--- ---
## Distance matrix API on DataStore types ## Aggregation traits — `obicompactvec::traits`
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods. Three traits unify the aggregation API across all hierarchy levels.
### Full distance matrices
Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2<f64>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
fn euclidean_dist_matrix(&self) -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self) -> Array2<f64>
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
```
These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.
### Partial distance matrices
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
**Category 1 — self-contained partials**: additive without any external parameter.
```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
-> (Array2<u64>, // sum_min[i,j]
Array1<u64>) // col_sums[k]
fn partial_euclidean_dist_matrix(&self) -> Array2<f64> // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
-> (Array2<u64>, // inter[i,j]
Array2<u64>) // union[i,j]
// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
-> (Array2<u64>, // inter[i,j]
Array2<u64>) // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2<u64> // differing bits
```
**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot (√(a/sum_i) - √(b/sum_j))²
```
The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
---
## Progressive aggregation principle
Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
```
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights() — global column sums
```
The same cascade applies to every partial:
```
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
↓ element-wise Σ across layers
LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
↓ element-wise Σ across partitions
LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
```
Each level presents a stable trait surface to the level above; no level reaches two levels down.
---
## Traits — `obicompactvec::traits`
Three traits unify the aggregation API across all levels of the hierarchy.
```rust ```rust
trait ColumnWeights: Send + Sync { trait ColumnWeights: Send + Sync {
@@ -172,81 +158,130 @@ trait ColumnWeights: Send + Sync {
} }
trait CountPartials: ColumnWeights { trait CountPartials: ColumnWeights {
// self-contained partials (additive, no parameter) fn partial_bray(&self) -> Array2<u64>;
fn partial_bray(&self) -> Array2<u64>; fn partial_euclidean(&self) -> Array2<f64>;
fn partial_euclidean(&self) -> Array2<f64>; fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>); fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
// normalised partials (global col_weights passed in cascade) fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>; fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>; // provided finalisation methods with default impls
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>; fn bray_dist_matrix(&self) -> Array2<f64> { }
// provided finalisation methods (default implementations) fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { }
fn bray_dist_matrix(&self) -> Array2<f64> { } // …
fn euclidean_dist_matrix(&self) -> Array2<f64> { }
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64> { }
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { }
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64> { }
fn hellinger_dist_matrix(&self) -> Array2<f64> { }
} }
trait BitPartials: ColumnWeights { trait BitPartials: ColumnWeights {
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>); fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
fn partial_hamming(&self) -> Array2<u64>; fn partial_hamming(&self) -> Array2<u64>;
// provided // provided
fn jaccard_dist_matrix(&self) -> Array2<f64> { } fn jaccard_dist_matrix(&self) -> Array2<f64> { }
fn hamming_dist_matrix(&self) -> Array2<u64> { } fn hamming_dist_matrix(&self) -> Array2<u64> { }
} }
``` ```
**Leaf implementors** (in `obicompactvec`): Leaf implementors:
| Type | Traits | | Type | Traits |
|---|---| |---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` | | `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` | | `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` |
`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators.
--- ---
## `LayeredStore<S>``obilayeredmap` ## LayeredStore\<S\> — recursive aggregation wrapper
A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types:
```rust ```rust
pub struct LayeredStore<S>(Vec<S>); pub struct LayeredStore<S>(Vec<S>);
``` ```
Three blanket impls propagate the traits up the hierarchy: Three blanket impls propagate all traits up the hierarchy:
```rust ```rust
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { } // Σ across inner stores impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { }
impl<S: CountPartials> CountPartials for LayeredStore<S> { } // same pattern impl<S: CountPartials> CountPartials for LayeredStore<S> { }
impl<S: BitPartials> BitPartials for LayeredStore<S> { } // same pattern impl<S: BitPartials> BitPartials for LayeredStore<S> { }
``` ```
Because the blanket impl is recursive, **`LayeredStore<LayeredStore<S>>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed: This makes `LayeredStore<LayeredStore<PersistentCompactIntMatrix>>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed:
``` ```
PersistentCompactIntMatrix implements CountPartials PersistentCompactIntMatrix leaf (one layer)
LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition) LayeredStore<PersistentCompactIntMatrix> one partition (layers are disjoint)
LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index) LayeredStore<LayeredStore<…>> whole index (partitions are independent)
``` ```
### Normalised metrics two-pass cascade Normalised metrics require global column sums — computed in a two-pass cascade:
The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
```rust ```rust
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>> // on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
let global = self.col_weights(); // pass 1 — progressive sum at every level let global = self.col_weights(); // pass 1 — sums up hierarchy
let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only
p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately) p.mapv(|v| 1.0 - v)
} }
``` ```
`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy. Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy.
---
## Progressive aggregation principle
No level reaches two levels down. Each level sums contributions from the level immediately below:
```
PersistentCompactIntMatrix::col_weights() — one (partition, layer)
↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights() — one partition
↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights() — global
```
The same cascade applies to every partial method.
---
## Multi-genome column invariant
After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
Maintained by three coordinated operations:
**Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`.
**New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers.
**First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1.
This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape.
---
## Query model
### Point query
```
minimiser(kmer) → partition p
for each layer l in p:
if let Some(slot) = MphfLayer_l.find(kmer):
return data_l.read(slot)
return None
```
O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`.
### Aggregation
```
result = reduce(
for p in partitions: // parallel
for l in layers(p): // parallel
partial(data_p_l)
)
```
For normalised metrics, replace with the two-pass cascade.
--- ---
@@ -254,103 +289,25 @@ fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
| Level | Unit | Coordination | | Level | Unit | Coordination |
|---|---|---| |---|---|---|
| Across partitions | `LayeredStore<LayeredStore<S>>` inner stores | none — fully independent | | Across partitions | inner stores of `LayeredStore<LayeredStore<S>>` | none |
| Across layers within a partition | `LayeredStore<S>` inner stores | none — disjoint kmer sets | | Across layers within a partition | inner stores of `LayeredStore<S>` | none — disjoint kmer sets |
| Normalised pass 1 (`col_weights`) | per inner store | none — additive | | Normalised pass 1 (`col_weights`) | per inner store | none — additive |
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only | | Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` | | Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |
All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction. ---
## reindex — evidence conversion in place
`KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`:
- `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin`
- `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx`
On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`.
--- ---
## Query model ## estimate — parameter dry-run
### Point query — `kmer → Option<Item>` `estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate `Approx { b, z }` before building or reindexing.
```
minimiser(kmer) → partition p
for each layer l in p:
slot = MphfLayer_l.find(kmer)
if slot is Some:
return DataStore_l.get(slot)
return None
```
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
### Aggregation — `→ Result`
```
result = reduce(
for p in partitions: // parallel
for l in layers(p): // parallel
partial(DataStore_p_l)
)
```
For normalised metrics replace with the two-pass scheme above.
---
## DataStore derivation
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
count_store = open PersistentCompactIntMatrix at (p, l)
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
```
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
---
## Relationship to current implementation
### What is implemented
- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
- **`obilayeredmap::LayeredStore<S>`**: generic wrapper with blanket impls for all three traits. `LayeredStore<LayeredStore<S>>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
### What is not yet implemented
- `Layer<D: LayerData>` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported.
- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`.
- No `PartitionedIndex` type for point queries with parallel partition dispatch.
### Planned refactoring
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
---
## Multi-genome column invariant
### The invariant
After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode).
### How it is maintained
The invariant is established and preserved by three coordinated operations:
**1. Existing layers — column append.**
When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns.
**2. New layers — absent columns prepended.**
If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation.
**3. First merge, Presence mode — `init_presence_matrix`.**
The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2.
### Why the invariant is required
The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics.
The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape.
+137 -280
View File
@@ -1,321 +1,178 @@
# Evidence elimination — design discussion # Approximate evidence: fingerprint-based index
## Problem statement ## Motivation
`evidence.bin` maps each MPHF slot to a position in the unitig store so that `evidence.bin` maps each MPHF slot to the position of the k-mer that owns it,
query verification is possible: given a slot `s` returned by `mphf.index(kmer)`, enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions,
retrieve the k-mer stored at that position and compare with the query. k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint:
On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition): | file | size/partition | fraction |
| file | size/partition | total (2048 parts) | fraction of lookup layer |
|---|---|---|---|
| evidence.bin | 132 MB | ~270 GB | **66 %** |
| unitigs.bin | 58 MB | ~118 GB | 29 % |
| mphf.bin | 10 MB | ~20 GB | 5 % |
Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage
optimisation available for index size.
---
## Why evidence exists
PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`.
For a query k-mer not in the indexed set, the returned slot is meaningless but
indistinguishable from a real hit without external information. Evidence provides
that information: `evidence[s]` encodes the location of the k-mer that legitimately
occupies slot `s`, allowing the verification:
```
slot = mphf.index(query)
(chunk_id, rank) = evidence.decode(slot)
stored_kmer = unitigs.kmer_at(chunk_id, rank)
return canonical(stored_kmer) == canonical(query)
```
Evidence is a **permutation** from MPHF-space to unitig-position-space.
Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding.
---
## Information-theoretic lower bound
For a partition with P k-mers and U unitigs of average length m_u k-mers:
- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits
- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits
Current implementation: 25 + 7 = 32 bits (aligned u32).
Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M.
**Packing headroom: ~22 %.** Not a path to elimination.
---
## Two-index architecture
The exact index is mandatory for set operations (union, intersection, diff) and
exact k-mer retrieval. A separate approximate index, built for query operations,
can tolerate a controlled false positive rate in exchange for a much smaller
footprint.
| component | exact index | approximate index |
|---|---|---| |---|---|---|
| `mphf.bin` | ✓ | ✓ (same structure) | | evidence.bin | 132 MB | 66 % |
| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ | | unitigs.bin | 58 MB | 29 % |
| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) | | mphf.bin | 10 MB | 5 % |
| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) |
| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) |
The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps `evidence.bin` is a bijection from MPHF-space to unitig-position-space and
`unitigs.bin` for sequential enumeration of K-mers. costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with
only ~22 % packing headroom. Compression is not a path to elimination.
The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a
`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set
operations still require an exact index; the approximate index targets query
workloads that can tolerate a bounded false-positive rate.
--- ---
## MPHF as a perfect Bloom filter ## The Findere model
A standard Bloom filter with a single hash function and N bits storing M keys has A B-bit fingerprint stored per MPHF slot provides the discrimination that
occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of `evidence.bin` would otherwise provide through full k-mer reconstruction.
landing on a set bit. The empty space (fraction 1 M/N of bits at 0) is what
rejects foreign k-mers.
An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint
occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any** stored at `s` belongs to the legitimate k-mer at that slot. The FP event is:
input to a slot in [0, M) — there is no empty space. Every query lands on an
occupied slot. The MPHF alone cannot reject foreign k-mers at all.
Adding a B-bit fingerprint restores the discrimination:
``` ```
slot = mphf.index(query) P(FP per k-mer) = 1 / 2^b
fingerprint = hash(query) & mask_B
present = fingerprint_table[slot] == fingerprint
``` ```
The fingerprint plays the role of the sparse space in the Bloom filter: it provides The Findere trick raises the effective window to z consecutive k-mers. A query
the B bits of information needed to reject foreign k-mers. succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:
Both structures reach the same fundamental cost for a given FP rate. For 1% FP: ```
P(FP per z-window) = 1 / 2^(b·z)
```
- Bloom filter (optimal, k hash functions): ~9.6 bits/key The effective indexed k-mer length is `k z + 1`: a query for a (k+z1)-mer
- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key decomposes into z overlapping k-mers, all of which must match.
This is a fundamental bound, not an implementation detail. Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
--- ---
## Approach A — MPHF + fingerprint (approximate index) ## `FingerprintVec` on disk
### Size `fingerprint.bin` layout:
| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) | ```
magic: b"FPVF" (4 bytes)
b: u8 (bits per slot, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n·b/8) bytes, Lsb0 order
```
`FingerprintVec` is memory-mapped. The match check against a query k-mer:
```rust
fn matches(&self, slot: usize, fingerprint: u64) -> bool {
self.get(slot) == (fingerprint & self.mask)
}
```
`build_approx_evidence` iterates `unitigs.bin` sequentially, writes
`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin`
and `layer_meta.json`. No `.idx` file is produced; random access into
`unitigs.bin` is not needed.
At build time, `find_approx` in `MphfLayer`:
```rust
let slot = self.mphf.index(&kmer.raw());
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
```
---
## `EvidenceKind` and metadata
`layer_meta.json` records which evidence bundle is present:
```rust
pub enum EvidenceKind {
Exact,
Approx { b: u8, z: u8 },
}
```
`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or
`find_approx` transparently. `find_exact` panics on an approximate layer;
`find_approx` panics on an exact layer — mode mixing is a programming error.
---
## Parameter resolution (`resolve_approx_params`)
The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third.
`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling
rounding:
| given | derived |
|---|---|
| b, z | fp = 1/2^(b·z) |
| z, fp | b = ⌈−log₂(fp) / z⌉ |
| b, fp | z = ⌈−log₂(fp) / b⌉ |
| z only | b = 8 (default), fp derived |
| b only | z = 1 (default), fp derived |
| fp only | b = 8 (default), z derived |
| none | b = 8, z = 1, fp = 1/256 |
When all three are given, b and z are authoritative and fp is recomputed.
---
## CLI flags
Both `index` and `reindex` accept the same flags:
| flag | type | meaning |
|---|---|---| |---|---|---|
| 8 | 33 MB | 4× smaller | | `--approx` | bool | enable fingerprint evidence |
| 12 | 49 MB | 2.7× smaller | | `--evidence-bits` (`b`) | u8 | fingerprint bits per slot |
| 16 | 66 MB | 2× smaller | | `-z` | u8 | Findere z parameter |
| `--fp` | f64 | target FP rate per z-window |
| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode |
Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer). `--approx` must be set explicitly; the other three flags are optional and
resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1.
### False positive rate — per k-mer query
For a specific non-indexed query k-mer q:
1. MPHF(q) → slot s, some value in [0, M)
2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s
3. FP event: hash(q) & mask_B == fingerprint_table[s]
Since q is not the legitimate k-mer at s, its fingerprint is independent of
fingerprint_table[s], giving:
```
P(FP per k-mer) = 1 / 2^B
```
This is the probability of error **for one specific query k-mer**. It is not the
fraction of the k-mer universe that would be misclassified: querying all 4^k
possible k-mers would yield (4^k M)/2^B false positives in absolute terms, but
that is not the relevant quantity for practical use.
### Equivalence classes
The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B
equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true
indexed k-mer and 4^k/(M·2^B) 1 false positives. A larger M (fewer partitions)
produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B
remains constant.
### Read-level use case
The relevant decision unit is the **read**, not the individual k-mer. For a read
of ~100 nucleotides and k=31, there are ~70 k-mers.
- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are
true positives → high coverage fraction.
- A plant read queried against a bacterial index: k-mers are foreign; each has
P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B.
A coverage threshold separates the two cases decisively. This is the same
principle as Findere: local coverage continuity distinguishes true hits from noise.
--- ---
## Approach B — z-consecutive k-mer matching ## `reindex` command
A query for a K-mer of size K = k + z 1 decomposes into z overlapping k-mers. `reindex` converts an existing index between exact and approximate evidence
Declaring a match only when **all z are present** reduces the per-window FP rate: in-place across all partitions and layers, running partitions in parallel via
Rayon.
``` Conversion to approximate (`--approx`):
P(FP per window of z) = (1/2^B)^z = 1/2^(B·z)
```
For a read with ~70 k-mers, there are ~70 z + 1 independent windows of size z. - Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
The probability that at least one window is a false positive: - Removes `evidence.bin` and `unitigs.bin.idx`.
- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`.
``` Conversion to exact (default, no `--approx`):
P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z)
```
For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸. - Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
- Removes `fingerprint.bin`.
- Updates `layer_meta.json` with `EvidenceKind::Exact`.
A plant read is misclassified as bacterial roughly once in 60 million reads — The root `index.meta` is updated with the new evidence kind on success.
negligible for any practical dataset. `mphf.bin` and `unitigs.bin` are never modified.
### Choosing B from (z, L, P_target)
z is a query-time parameter and does not affect the index structure. However,
knowing z at build time allows computing the minimum B required to reach a target
FP rate P_target for reads of length L (giving W = L k z + 2 independent
windows):
```
P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z )
```
Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**.
(B, z) are co-determined at build time to minimise fingerprint size while
guaranteeing the target read-level FP rate.
### Combined sizing
| B | z | K = k+z1 | P(FP_read) | fingerprint.bin/partition |
|---|---|---|---|---|
| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB |
| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB |
| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB |
| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB |
Smaller B → smaller fingerprint table; larger z → longer minimum match length K
and fewer independent windows per read.
--- ---
## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index) ## `estimate` command
Build the MPHF to output the global k-mer position directly: `estimate` is a dry-run that resolves and prints (b, z, fp) without touching
any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
additionally accepts `-k` to display the effective indexed k-mer length:
``` ```
mphf: kmer → global_pos ∈ [0, P) k (query): 31
k (indexed): 31
z: 1
evidence bits (b): 8
FP per k-mer: 3.906e-3 (1/2^8)
FP per z-window: 3.906e-3 (1/2^8)
``` ```
Verification becomes: Useful for choosing parameters before committing to an index build.
```
global_pos = mphf.index(query)
stored_kmer = unitigs.kmer_at_global_pos(global_pos)
return canonical(stored_kmer) == canonical(query)
```
No evidence array. The unitig block index (see below) provides
`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time.
### What is required
A **retrieval data structure** (also called a value-based or function-based MPHF):
given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`,
build a compact structure that maps each key to its assigned value.
Known constructions:
- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph +
XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic
minimum. Construction: O(n). Query: O(1).
- **SSHash approach**: builds the MPHF to map k-mers to their positions in a
concatenated unitig string. Achieves elimination of external evidence using a
"skew" construction that aligns the MPHF output with the sequential unitig layout.
### Rust availability
No Rust crate implements a retrieval data structure suitable for this use case as
of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain
MPHFs. **This is the key blocking factor.**
### SSHash construction (reference)
SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over
(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential
positions in the concatenated unitig string. A port to obikmer would require:
1. Concatenating all unitig sequences into a single flat buffer per partition.
2. Assigning each k-mer a global position (its offset in that buffer).
3. Building the MPHF to output that position directly (retrieval step).
4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`.
---
## Approach 2 — block index prefix sums (reduces evidence to negligible)
A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled
offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset
table.
### Extension: k-mer prefix sums per block
Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before
block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**.
This enables `kmer_at_global_pos(p)`:
1. Binary search in `kmer_prefix[]` to find block `b`.
2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`.
3. Extract the k-mer at the remaining rank within the found chunk.
Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses.
### Combined with Approach 1
- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions)
- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible)
---
## Recommended path
1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose
(B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep
`unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8
vs ~142 MB for the exact lookup layer.
2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`.
Zero cost if evidence.bin is kept; enables Approach 1 when ready.
3. **Medium term**: implement GOV retrieval data structure in Rust, or port
SSHash construction.
4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected
index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset.
---
## Open questions
- Is a GOV construction compatible with the parallel MPHF build currently used
(PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential
(hypergraph peeling); parallelisation is non-trivial.
- Can the SSHash "skew" insight be reused without the minimizer-bucket structure?
The obikmer partitioning already uses minimizers — there may be natural alignment.
- What is the query latency impact of replacing O(1) evidence lookup with
O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale.
- What is the optimal (B, z) trade-off for the approximate index given the target
read length and acceptable P(FP_read)?
+212 -133
View File
@@ -2,7 +2,7 @@
## Purpose ## Purpose
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. `obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
--- ---
@@ -20,42 +20,81 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico
--- ---
## Evidence kinds
Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:
```rust
pub enum EvidenceKind {
Exact,
Approx { b: u8, z: u8 },
}
```
`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L k z + 2 is the number of windows in a read of length L. No `.idx` written or required.
---
## MphfLayer — autonomous kmer → slot mapping ## MphfLayer — autonomous kmer → slot mapping
`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data. `MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
```rust ```rust
pub struct MphfLayer { pub struct MphfLayer {
mphf: Mphf, mphf: Mphf,
evidence: Evidence, ev: LayerEvidence, // loaded at open() time
unitigs: UnitigFileReader, n: usize,
n: usize, // number of indexed kmers = number of MPHF slots
} }
``` ```
Public API: `LayerEvidence` is an internal enum, not public:
```rust ```rust
impl MphfLayer { enum LayerEvidence {
pub fn open(dir: &Path) -> OLMResult<Self> Exact { evidence: Evidence, unitigs: UnitigFileReader },
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None Approx { fingerprint: FingerprintVec },
pub fn n(&self) -> usize
pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
pub(crate) fn build(
dir: &Path,
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
) -> OLMResult<usize>
} }
``` ```
`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test). ### Query API
`build` runs two sequential passes over `unitigs.bin`: Three public query methods, all returning `Option<usize>` (slot index):
1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. ```rust
2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
```
For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`. - `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.
### Build surface
```rust
// Full MPHF + exact evidence build (two-pass, parallel)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>
// Evidence-only builds (MPHF already present in dir)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize> // dispatch
```
`MphfLayer::build` runs two sequential passes over `unitigs.bin`:
1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
--- ---
@@ -81,7 +120,7 @@ pub struct Hit<T = ()> {
} }
``` ```
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait. `LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.
| Type | `Item` | Description | | Type | `Item` | Description |
|---|---|---| |---|---|---|
@@ -89,31 +128,118 @@ pub struct Hit<T = ()> {
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) | | `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) | | `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) |
**Build signatures:** ### Build signatures
```rust ```rust
// mode 1 // mode 1
impl Layer<()> { impl Layer<()> {
pub fn build(out_dir: &Path) -> OLMResult<usize> pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
} }
// mode 2 // mode 2
impl Layer<PersistentCompactIntMatrix> { impl Layer<PersistentCompactIntMatrix> {
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> pub fn build(out_dir: &Path, block_bits: u8,
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize> count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn build_from_map(out_dir: &Path, block_bits: u8,
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
} }
// mode 3 // mode 3
impl Layer<PersistentBitMatrix> { impl Layer<PersistentBitMatrix> {
pub fn build_presence( pub fn build_presence(out_dir: &Path, block_bits: u8,
out_dir: &Path, n_genomes: usize,
n_genomes: usize, present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
present_in: impl Fn(CanonicalKmer, usize) -> bool,
) -> OLMResult<usize>
} }
``` ```
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`. All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.
### Evidence build helpers on Layer
```rust
impl<D: LayerData> Layer<D> {
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
}
```
These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.
---
## FingerprintVec and FingerprintVecWriter
Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.
```
fingerprint.bin format:
magic: b"FPVF" (4 bytes)
b: u8 (bits per fingerprint, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
```
```rust
impl FingerprintVec {
pub fn open(path: &Path) -> OLMResult<Self>
pub fn get(&self, slot: usize) -> u64
pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
pub fn n(&self) -> usize
pub fn b(&self) -> u8
}
```
`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.
---
## LayeredMap\<D\> — collection of layers
`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.
```rust
pub struct LayeredMap<D: LayerData = ()> {
root: PathBuf,
meta: PartitionMeta,
layers: Vec<Layer<D>>,
}
```
`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.
### Common methods
```rust
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
```
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
### push_layer
`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:
```rust
// mode 1
impl LayeredMap<()> {
pub fn push_layer(&mut self) -> OLMResult<usize>
}
// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```
Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.
--- ---
@@ -131,14 +257,6 @@ impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wi
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type. Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
**Aggregation hierarchy:**
```
PersistentCompactIntMatrix implements CountPartials
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
```
**Leaf implementors** (in `obicompactvec`): **Leaf implementors** (in `obicompactvec`):
| Type | Traits | | Type | Traits |
@@ -146,8 +264,6 @@ LayeredStore<LayeredStore<…>> via blanket impl (partitione
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` | | `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` | | `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |
`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators.
See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern. See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.
--- ---
@@ -155,34 +271,30 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful
## On-disk structure ## On-disk structure
``` ```
index_root/ ← LayeredMap (collection) partition_root/ ← LayeredMap (one partition)
meta.json meta.json — {"n_layers": N}
part_00000/ ← Partition layer_0/ ← Layer
layer_0/ ← Layer layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
mphf.bin — ptr_hash MPHF (epserde format) mphf.bin — ptr_hash MPHF (epserde format)
unitigs.bin — packed 2-bit nucleotide sequences unitigs.bin — packed 2-bit nucleotide sequences
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[] unitigs.bin.idx — UIDX index (exact evidence only)
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE evidence.bin — [u32; n], LE (exact evidence only)
counts/ [mode 2] PersistentCompactIntMatrix fingerprint.bin — packed b-bit array (approx evidence only)
meta.json {"n": N, "n_cols": 1} counts/ [mode 2] PersistentCompactIntMatrix
col_000000.pciv meta.json
presence/ [mode 3] PersistentBitMatrix col_000000.pciv
meta.json {"n": N, "n_cols": G} presence/ [mode 3] PersistentBitMatrix
col_000000.pbiv meta.json
col_000000.pbiv
layer_1/ layer_1/
part_00001/
``` ```
**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel. `unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.
**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
--- ---
## Evidence encoding ## Evidence encoding (exact)
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot: `evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
@@ -191,9 +303,9 @@ bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based) bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
``` ```
Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`. `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer. For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.
--- ---
@@ -203,7 +315,7 @@ For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the
type Mphf = PtrHash< type Mphf = PtrHash<
u64, // key type: canonical kmer raw encoding u64, // key type: canonical kmer raw encoding
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99 CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano) CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
Xx64, // hasher: XXH3-64 with seed Xx64, // hasher: XXH3-64 with seed
Vec<u8>, // pilots Vec<u8>, // pilots
>; >;
@@ -211,21 +323,41 @@ type Mphf = PtrHash<
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly. `Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. `CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space.
--- ---
## Query path ## Column append and merge support
These methods extend existing layers with new genome columns without touching the MPHF.
### Layer-level genome column append
```rust ```rust
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> { impl Layer<PersistentBitMatrix> {
self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) }) pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
impl Layer<PersistentCompactIntMatrix> {
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
} }
``` ```
`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit. Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).
In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0. ### Presence matrix initialisation
```rust
impl Layer<()> {
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```
Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
### Why the MPHF is never rebuilt
The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.
--- ---
@@ -235,9 +367,9 @@ When adding dataset B to an existing index:
1. For each partition, probe existing layers for kmers of B routed to that partition. 1. For each partition, probe existing layers for kmers of B routed to that partition.
2. Collect kmers absent from all layers → `B \ index`. 2. Collect kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`. 3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
4. Call `Layer<D>::build` on the new directory. 4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
5. Update `meta.json`. 5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.
Each partition's new layer is built independently; the operation is fully parallel across partitions. Each partition's new layer is built independently; the operation is fully parallel across partitions.
@@ -250,62 +382,9 @@ Each partition's new layer is built independently; the operation is fully parall
| `ptr_hash 1.1` | MPHF per layer | | `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash | | `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation | | `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and payload files | | `memmap2 0.9` | mmap of evidence and fingerprint files |
| `obiskio` | unitig file writer/reader | | `bitvec` | packed b-bit fingerprint storage |
| `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits | | `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass | | `rayon 1` | parallel MPHF construction pass |
| `ndarray 0.16` | aggregation output arrays | | `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |
---
## Column append and merge support
These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.
### Matrix column append
```rust
impl PersistentCompactIntMatrix {
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
impl PersistentBitMatrix {
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```
Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change.
### Presence matrix initialisation
```rust
impl Layer<()> {
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```
Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.
### Layer-level genome column append
```rust
impl Layer<PersistentBitMatrix> {
pub fn append_genome_column(
layer_dir: &Path,
value_of: impl Fn(usize) -> bool,
) -> OLMResult<()>
}
impl Layer<PersistentCompactIntMatrix> {
pub fn append_genome_column(
layer_dir: &Path,
value_of: impl Fn(usize) -> u32,
) -> OLMResult<()>
}
```
These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.
### Why the MPHF is never rebuilt
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.
+134 -3
View File
@@ -1,5 +1,136 @@
# On-disk collection structure # On-disk index layout
See [obilayeredmap crate](obilayeredmap.md) for the current on-disk layout. ## Directory tree
The index root contains one `part_XXXXX/` directory per partition, each holding one or more `layer_N/` directories. Each layer directory contains `mphf.bin`, `unitigs.bin`, `unitigs.bin.idx`, `evidence.bin`, and optionally a `counts/` or `presence/` payload directory. ```
<index_root>/
index.meta ← JSON: IndexMeta
scatter.done ← sentinel: scatter phase complete
count.done ← sentinel: dereplicate + count complete
index.done ← sentinel: MPHF index fully built
spectrums/
<label>.json ← kmer frequency spectrum per genome
partitions/
part_00000/ ← one dir per partition (zero-padded 5 digits, 0..2^n_bits1)
index/
meta.json ← PartitionMeta { n_layers }
layer_0/
unitigs.bin ← binary unitig sequences (2-bit packed)
unitigs.bin.idx ← block-sampled offset index (exact evidence only)
mphf.bin ← serialised PtrHash MPHF
layer_meta.json ← LayerMeta { evidence: EvidenceKind }
evidence.bin ← chunk_id:rank per MPHF slot (Exact only)
fingerprint.bin ← b-bit fingerprints per MPHF slot (Approx only)
counts/ ← PersistentCompactIntMatrix (if with_counts=true)
presence/ ← PersistentBitMatrix (if presence mode, merge)
layer_1/ ← added by merge; same structure as layer_0
layer_2/ …
part_00001/ …
```
## State machine (sentinels)
The sentinels are touched atomically at the end of each pipeline stage.
A partial run (e.g. scatter interrupted) leaves no sentinel; the state is
detected as the lowest sentinel present.
| State | Sentinel present | Meaning |
|---|---|---|
| `Empty` | — | `index.meta` exists; scatter not started or interrupted |
| `Scattered` | `scatter.done` | All super-kmers routed to partition files |
| `Counted` | `count.done` | Partitions dereplicated; `spectrums/` written |
| `Indexed` | `index.done` | All MPHF layers built; index ready for queries |
## index.meta (IndexMeta)
```json
{
"version": 1,
"config": {
"kmer_size": 31,
"minimizer_size": 11,
"n_bits": 8,
"with_counts": false,
"evidence": "Exact",
"block_bits": 0
},
"genomes": [
{ "label": "genome_A", "meta": { "species": "Homo sapiens" } }
]
}
```
`n_bits` determines the partition count: `2^n_bits` directories under `partitions/`.
`evidence` is either the string `"Exact"` or `{"Approx": {"b": 8, "z": 1}}`.
`block_bits` controls the `.idx` granularity: one offset entry every `2^block_bits`
chunks. `block_bits=0` stores one entry per chunk (O(1) random access, largest `.idx`).
`GenomeInfo.meta` is a free-form string→string map for categorical metadata (e.g.
taxonomy, sample origin). It is optional; defaults to empty.
## Layer files
### unitigs.bin
2-bit packed binary unitig sequences. Each record: 1 byte `seql_minus_k`
(nucleotide length k), followed by `ceil((seql_minus_k + k) / 4)` bytes of
packed sequence. Long unitigs are transparently split into overlapping chunks
(k1 nucleotide overlap) so no k-mer crosses a chunk boundary.
### unitigs.bin.idx (Exact only)
Magic `UIX3`, little-endian header: `block_bits` (u32), `n_unitigs` (u32),
`n_kmers` (u64), then `ceil(n_unitigs / 2^block_bits) + 1` byte-offset entries
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.
### mphf.bin
PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
2-bit) to a slot index in `[0, n_kmers)`.
### layer_meta.json (LayerMeta)
```json
{ "evidence": { "type": "exact" } }
```
or
```json
{ "evidence": { "type": "approx", "b": 8, "z": 1 } }
```
### evidence.bin (Exact)
One `(chunk_id: u32, rank: u8)` record per MPHF slot, packed. Used to verify
that the kmer mapped to a slot is actually present: `unitigs.bin[chunk_id][rank]`
is re-read and compared against the query.
### fingerprint.bin (Approx)
`b`-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
False-positive rate per query ≈ `1/2^b`. With Findere parameter `z ≥ 2`,
`z` consecutive k-mers must all match, reducing the effective FP rate to
approximately `W / 2^(b·z)` per read of length `L`
(where `W = L k z + 2`).
### counts/ (PersistentCompactIntMatrix)
Present when `with_counts=true`. One column per genome; each row holds the
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
during indexing and merge.
### presence/ (PersistentBitMatrix)
Present when the layer was built in presence/absence mode (merge path).
One bit per genome per MPHF slot. Written during merge; never present on a
freshly indexed single-genome layer.
## meta.json (PartitionMeta)
```json
{ "n_layers": 2 }
```
Records how many `layer_N/` directories exist under `index/`. Incremented by
each merge that adds a layer.
+107 -199
View File
@@ -2,243 +2,139 @@
## Role of unitigs in the index ## Role of unitigs in the index
The MPHF maps each canonical kmer to an integer slot, but provides no way to reconstruct the kmer from its slot. A downstream operation (query, set operation) that receives a slot index and needs the kmer sequence must be able to retrieve it. The **evidence file** serves this purpose: it stores the kmer sequences in compact form and provides, for each MPHF slot, a pointer to where the corresponding kmer can be decoded. The MPHF maps each canonical kmer to an integer slot but provides no inverse: a slot index alone cannot reconstruct the kmer. The **evidence file** supplies this inverse: for each MPHF slot it stores a pointer into the unitig sequence file, from which k nucleotides can be extracted.
Unitigs are the natural compact representation: a run of L nucleotides encodes L k + 1 consecutive canonical kmers. The entire kmer set of a partition can be reconstructed from its unitig FASTA file. Unitigs are the natural compact representation: a run of L nucleotides encodes L k + 1 consecutive canonical kmers. The entire kmer set of a partition is reconstructible from its unitig binary file.
--- ---
## Two encoding strategies ## Binary file formats
### Strategy A — global nucleotide offset ### `unitigs.bin` — sequence chunks
Each MPHF slot stores a single integer: the byte offset of the kmer's first nucleotide within a packed 2-bit nucleotide array that concatenates all unitigs. A sequence of binary records. Each record:
``` ```
evidence[slot] = global_offset (bits: ⌈log₂ N_nuc⌉) [u8: seql k] [ceil(seql / 4) bytes: 2-bit packed nucleotides]
``` ```
where `N_nuc` is the total number of nucleotides across all unitigs in the partition. - `seql k` (0255): nucleotide length minus k, so `seql = byte[0] + k` and `n_kmers = byte[0] + 1`.
- Packed nucleotides: A=00, C=01, G=10, T=11, MSB-first within each byte; last byte zero-padded.
- Byte count for packed sequence: `ceil(seql / 4)`.
Decoding: read k nucleotides starting at `global_offset`. Unitigs with more than `MAX_KMERS_PER_CHUNK = 256` k-mers are transparently split into overlapping chunks. Each chunk has at most 256 k-mers (= `seql k + 1 ≤ 256`); consecutive chunks overlap by k1 nucleotides so no kmer is lost:
### Strategy B — (unitig_id, rank within unitig)
Each MPHF slot stores a pair:
``` ```
evidence[slot] = (unitig_id, rank) chunk 1: nucleotides [0, MAX_KMERS_PER_CHUNK + k 2] (256 kmers)
chunk 2: nucleotides [256, end] (remaining kmers)
overlap: k1 nucleotides shared between the two chunks
``` ```
- `unitig_id` : index of the unitig in the partition (0-based) ### `unitigs.bin.idx` — block-sampled offset index
- `rank` : kmer index within the unitig (0 ≤ rank < n_kmers); kmer i starts at nucleotide i, so the nucleotide offset is identical numerically but the kmer-unit interpretation is the natural one
Decoding: look up the unitig at `unitig_id`, then read k nucleotides starting at `rank`. ```
magic : 4 bytes = "UIX3"
block_bits: u32 LE — granularity parameter (031)
n_unitigs : u32 LE — total number of chunks in unitigs.bin
n_kmers : u64 LE — total number of kmers across all chunks
offsets : [u32 LE] — byte offsets into unitigs.bin, one per 2^block_bits chunks + sentinel
```
One offset entry is stored every `2^block_bits` chunks; the array is sentinel-terminated (last entry = file size). `DEFAULT_BLOCK_BITS = 0` stores one offset per chunk (exact table, no scan).
### `evidence.bin` — per-slot MPHF evidence
A flat array of u32 values, one per MPHF slot, no header:
```
bits [31:7] = chunk_id (25 bits)
bits [6:0] = rank (7 bits, 0127)
```
File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record in `unitigs.bin`; `rank` is the position of the canonical kmer within that chunk (counting only canonical kmers). Encoding: `raw = (chunk_id << 7) | (rank & 0x7F)`. Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`.
--- ---
## Bit-cost analysis ## Building and reading the index
Define for a partition of P kmers with average kmers-per-unitig m: ### `build_unitig_idx(path, block_bits)`
- total nucleotides: $N_{nuc} = P \cdot \left(1 + \dfrac{k-1}{m}\right)$ Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
- number of unitigs: $U = P / m$
**Strategy A** ### `open()` vs `open_sequential()`
$$ `UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
b_A = \left\lceil \log_2 N_{nuc} \right\rceil = \left\lceil \log_2 P + \log_2\!\left(1 + \frac{k-1}{m}\right) \right\rceil
$$
**Strategy B** `UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
$$ ### `chunk_start(i)` — random access
b_B = \left\lceil \log_2 U \right\rceil + \left\lceil \log_2 L_{max} \right\rceil
$$
where $L_{max}$ is the maximum unitig length (in nucleotides). In practice $L_{max} \ll P$, so the rank field is much cheaper than the full global offset. If unitig lengths are bounded (e.g. by partition structure), the rank field width is a small constant independent of P. ```rust
fn chunk_start(&self, i: usize) -> usize {
### Empirical bound on unitig length // block_bits=0: single table lookup, O(1) — hot path
if self.block_bits == 0 {
Lengths and ranks are expressed in **kmer units** (not nucleotides): the nucleotide length is `n_kmers + k 1`, so storing `n_kmers` instead of `seq_length` saves k1 = 30 units of headroom in the same field width. return self.block_offsets[i] as usize;
}
Consequence for `u8` capacity: // block_bits>0: lookup block, then scan at most 2^block_bits 1 records
let block = i >> self.block_bits;
| unit | max representable | max nucleotides | let rem = i & self.mask;
|---|---|---| let mut offset = self.block_offsets[block] as usize;
| nucleotides | 255 nuc | 225 kmers | for _ in 0..rem {
| **kmers** | **255 kmers** | **285 nuc** | let seql_minus_k = self.mmap[offset] as usize;
offset += 1 + (seql_minus_k + self.k + 3) / 4;
**Structural maximum from superkmer construction.** For k=31 and m=11, the maximum number of consecutive kmers sharing the same minimiser is k m + 1 = **21 kmers** (the minimiser traverses from position km to 0 as the window slides). A unitig that is a single full superkmer therefore has exactly 21 kmers. This is confirmed by a bimodal distribution in empirical data: a sharp peak at 21 kmers appears in all partitions, including the anomalous partition 145. The observed maximum is ~46 kmers (unitigs spanning more than one superkmer), well within u8 range. }
offset
On *Betula nana* (k=31, 256 partitions), m_u ≈ 37.9 kmers/unitig on average. The `rank` field (kmer index within the unitig) fits in a `u8` as long as no unitig exceeds 255 kmers — guaranteed by the split strategy below and amply satisfied by empirical maximums (~46 kmers observed). }
### Split strategy for long unitigs
For the rare cases where a unitig exceeds 255 kmers, the unitig is split into chunks of at most 255 kmers, with a **k1 nucleotide overlap** at each junction — identical to the way super-kmers are delimited at partition boundaries. Each chunk is self-contained and independently decodable.
```
original unitig: kmer_0 … kmer_254 | kmer_255 … kmer_N
↑ cut here
chunk 1: nucleotides 0 … 284 (255 kmers)
chunk 2: nucleotides 255 … N+k-1 (N-255+1 kmers)
shared: nucleotides 255 … 284 (k-1 = 30 nucleotides, stored in both)
``` ```
Cost of one split: k1 = 30 redundant nucleotides = 60 bits. This event is rare in practice (m_u ≈ 38 for *B. nana*, well below the 255-kmer cap). No kmer is lost: kmer i is in chunk 1 if i < 255, in chunk 2 (at rank i255) otherwise. With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
### Savings from u8 length fields With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
Because all chunks are guaranteed ≤ 255 kmers, the per-chunk length array in the binary index is a flat `u8` array — 1 byte per chunk instead of 8 bytes (usize) or 4 bytes (u32). For a partition with 4 M unitigs: ### Decoding a kmer from slot `s`
| length type | bytes/chunk | total (4 M chunks) | ```rust
|---|---|---| let (chunk_id, rank) = evidence.decode(s); // u32 → (chunk_id: u32, rank: u8)
| usize (u64) | 8 | 32 MB | let kmer = unitigs.raw_kmer(chunk_id, rank); // 2-bit packed slice → left-aligned u64
| u32 | 4 | 16 MB | ```
| **u8** | **1** | **4 MB** |
Random access to chunk i is recovered at load time by a single prefix-sum pass over the u8 array, computing a u32/u64 offset array in O(n_chunks) time and O(n_chunks × 4) bytes — paid once at open time, cached for the lifetime of the partition handle. Two memory accesses: one 4-byte read from `evidence.bin`, one packed-bit extraction from `unitigs.bin` via the mmap. The retrieved sequence is already canonical (only canonical kmers are inserted into the De Bruijn graph).
Bit costs for *Betula nana* (k=31, 256 partitions, P ≈ 10.4 M, U ≈ 275 k, m_u ≈ 37.9):
| field | strategy A | strategy B |
|---|---|---|
| offset / id | $\lceil\log_2(P \cdot (1 + 30/m_u))\rceil = 25$ bits | $\lceil\log_2(U)\rceil = 19$ bits |
| rank | — | 8 bits (u8, fixed) |
| **total** | **25 bits** | **27 bits** |
Strategy A is 2 bits cheaper. Strategy B's main advantage is **locality**: decoding a kmer touches one unitig's cache lines rather than an arbitrary offset in a large flat array, and the `rank` field doubles as a direct index into the packed nucleotide sequence without pointer arithmetic.
--- ---
## Partition-size tradeoff ## Field widths and capacity
The total bits/kmer for the index (sequence + evidence + MPHF) as a function of partition size is: | field | bits | range | capacity check (*B. nana*, 256 partitions) |
|------------|------|---------------|---------------------------------------------|
| `seql k` | 8 | 0255 | max `n_kmers` per chunk = 256 = `MAX_KMERS_PER_CHUNK` |
| `rank` | 7 | 0127 | observed max ~46 kmers/chunk; structural max km+1 = 21 |
| `chunk_id` | 25 | 033 554 431 | avg U ≈ 275 k chunks/partition |
$$ The rank field is 7 bits (max 127) even though chunks can contain up to 256 k-mers, because rank counts only canonical kmers within the chunk, and the canonical kmer count is at most half the total.
\text{total} = \underbrace{2\!\left(1 + \frac{k-1}{m}\right)}_{\text{sequence}} + \underbrace{\log_2 P + \log_2\!\left(1+\frac{k-1}{m}\right)}_{\text{evidence}} + \underbrace{c_{MPHF}}_{\approx 2\text{}4}
$$
### Empirical observation: m_u is set by De Bruijn graph topology, not partition count
Measured on *Betula nana* (k=31, m=11), summing n_kmers and sequence counts across all partition files:
| N partitions | m_sk | m_u | factor m_u/m_sk | nuc ratio (u/sk) |
|---|---|---|---|---|
| 1 | 12.13 | **41.89** | 3.45× | 0.273 |
| 16 | 12.13 | **38.19** | 3.15× | 0.376 |
| 256 | 12.13 | **37.90** | 3.12× | 0.388 |
| 1 024 | 12.13 | **37.89** | 3.12× | 0.389 |
- `m_sk` = avg kmers/super-kmer (invariant — same dataset regardless of partition scheme)
- `m_u` = avg kmers/unitig = total_n_kmers / total_unitigs, summed across all partitions
- `nuc ratio` = (u_symbols + 30·u_reads) / (sk_symbols + 30·sk_reads)
X-axis in both charts: partition bits (0 = 1 partition, 10 = 1024 partitions) — each step doubles the partition count.
```mermaid
xychart-beta
title "m_u (avg kmers/unitig) vs partition bits — B. nana k=31"
x-axis "partition bits" 0 --> 10
y-axis "m_u" 37 --> 43
line [41.89, 40.78, 39.22, 38.52, 38.19, 38.03, 37.96, 37.92, 37.90, 37.89, 37.89]
```
```mermaid
xychart-beta
title "Nucleotide storage: unitigs / super-kmers (%) vs partition bits — B. nana k=31"
x-axis "partition bits" 0 --> 10
y-axis "%" 25 --> 42
line [27.3, 29.7, 33.9, 36.3, 37.6, 38.3, 38.6, 38.7, 38.8, 38.9, 38.9]
```
Key observations:
1. **Partition boundaries have a small but non-zero effect on m_u.** Going from 1 to 1024 partitions reduces m_u by 10% (41.9 → 37.9). Within the practical range 161024, the variation is under 1% — m_u is effectively constant.
2. **m_u is a property of the De Bruijn graph, not the partition scheme.** The dominant factor is graph branching (heterozygosity, repeats, sequencing errors).
3. **Unitigs provide substantial compaction over super-kmers.** At 256 partitions, unitigs cover the same unique kmers using 39% of the raw nucleotide content of super-kmers (3.1× compaction factor).
#### Per-partition compaction ratio (sk_symbols / u_symbols)
The ratio measures how much super-kmer kmer-slots are "shared" across different super-kmer records: a ratio of 1.35 means each unique kmer (counted once in unitigs) appears in 1.35 super-kmer kmer-slots on average.
| bits | N partitions | median ratio | min ratio | min partition | min u_reads |
|---|---|---|---|---|---|
| 6 | 64 | 1.355 | 1.073 | — | 4.5 M |
| 7 | 128 | 1.352 | 1.037 | — | 4.1 M |
| 8 | 256 | **1.350** | **1.012** | **145** | **3.8 M** |
| 9 | 512 | 1.350 | 0.998 | 145 | 3.6 M |
| 10 | 1024 | 1.351 | 0.992 | 145 | 3.6 M |
The median stabilises at **1.35** from 64 partitions onward (stdev = 0.027 at 256 partitions). There is one persistent outlier: **partition 145** (at 256-partition resolution) is consistently anomalous across all partition depths — it contains 1014× more super-kmers and unitigs than the average partition, with a ratio near 1.0, meaning the unitig representation provides almost no kmer deduplication. This is consistent with a highly repetitive or organellar region where the dominant minimiser belongs to a sequence that appears in many reads without forming long overlapping paths in the De Bruijn graph.
Per-partition parameters at 256 partitions (*B. nana*):
| quantity | value |
|---|---|
| P (unique kmers/partition, avg) | ≈ 10.4 M |
| U (unitigs/partition, avg) | ≈ 275 k |
| m_u | ≈ 37.9 |
| Strategy A bits/kmer | ⌈log₂(P·(1+30/m_u))⌉ = 25 |
| Strategy B bits/kmer | ⌈log₂(U)⌉ + 8 = 27 |
Consequence: **the partition count should be as large as memory and parallelism allow.** Each doubling saves 1 bit/kmer in evidence (log₂ P decreases by 1). The sequence term 2·(1 + 30/m_u) ≈ 3.6 bits/kmer is approximately constant.
Strategy B partially decouples evidence cost from P: `log₂(U) = log₂(P/m_u)` grows more slowly than `log₂(P)` by a fixed log₂(m_u) ≈ 5 bits. Strategy B's main benefit remains locality and bounded rank width, not asymptotic compression.
--- ---
## Implementation notes ## Evidence bit-cost
### Evidence file layout (strategy B — implemented) Strategy B (chunk_id + rank) is the implemented strategy. For *B. nana* (k=31, 256 partitions, P ≈ 10.4 M unique kmers/partition, U ≈ 275 k chunks/partition, m_u ≈ 37.9 kmers/chunk):
`evidence.bin` is a flat `[u32; n]` array with no header: | field | theoretical cost | value |
|------------|-------------------------|---------|
| chunk_id | ⌈log₂ U⌉ | 19 bits |
| rank | ⌈log₂ m_u⌉ (≈ fixed) | 6 bits |
| **stored** | aligned u32 | **32 bits/slot** |
``` The u32 layout is chosen for alignment and simplicity; no bit-addressing arithmetic is needed.
evidence.bin: n × 4 bytes, little-endian
each u32: bits [31:7] = chunk_id (25 bits)
bits [6:0] = rank (7 bits)
```
File size = `n × 4` bytes exactly. Decoding a slot: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Comparison with strategy A (global nucleotide offset): `⌈log₂(P · (1 + (k1)/m_u))⌉ = 25 bits`. Strategy A is theoretically 2 bits cheaper; strategy B's advantage is **locality** (decoding touches one chunk's cache lines) and a bounded, constant-width rank field independent of partition size.
The theoretical bit cost of strategy B (19 bits id + 8 bits rank = 27 bits) is not recovered: packing into aligned u32 costs 32 bits per slot. The u32 layout is chosen for simplicity and alignment — one word per slot, no bit-addressing arithmetic.
### Unitig file layout
Binary packed 2-bit nucleotide file (`unitigs.bin`) with a companion index (`unitigs.bin.idx`). The index stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file.
### Decoding a kmer from slot s
```
(chunk_id, rank) = evidence.decode(s) // u32 → (u25, u7)
kmer = unitigs.raw_kmer(chunk_id, rank) // 2-bit packed slice, k nucleotides
```
Two memory accesses: one into `evidence.bin`, one into `unitigs.bin`. The canonical kmer is the stored sequence (by construction — only canonical kmers are inserted into the De Bruijn graph).
### Field widths in practice
Rank is stored in 7 bits (0127). On *B. nana* (k=31, m=11), the observed maximum unitig length is ~46 kmers/chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k m + 1 = 21 kmers per unitig; longer paths arise across multiple superkmers. u7 is sufficient.
chunk_id is stored in 25 bits (033 554 431). For *B. nana* at 256 partitions, avg U ≈ 275 k — well within the 25-bit capacity.
### Forward vs reverse complement
The De Bruijn graph stores only canonical kmers. The evidence encodes the canonical orientation. Callers that need the strand of the original kmer must compare the retrieved kmer with its revcomp at query time; this is a single 64-bit comparison.
--- ---
## Non-determinism of the unitig decomposition ## Unitig decomposition non-determinism
The unitig extraction is **not deterministic**: two runs on identical input can produce a different number of unitigs with different sequences, while covering exactly the same canonical k-mer set. The unitig extraction from `GraphDeBruijn` is **not deterministic**: two runs on identical input can produce different unitig counts and sequences while covering exactly the same canonical kmer set.
### Source of non-determinism The hash map (`hashbrown::HashMap` with `Xxh3Builder`) has run-dependent iteration order. The `start_iter` first pass emits every node where `can_extend_left` is false — this includes true dead-ends and branch points (nodes with ≥2 left neighbours). When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those upstream neighbours later produce length-k degenerate unitigs. When upstream neighbours appear first, they extend through the branch point.
The graph nodes are stored in a hash map whose iteration order depends on the hash seed (random per run with `ahash::RandomState::new()`). The `start_iter` first pass emits every node whose `can_extend_left` flag is false — which includes not only true dead-end nodes but also **branch points** (nodes with 2 or more left neighbours, for which `unique_neighbor` returns `None`).
When a branch point is encountered before its upstream neighbours, it claims the downstream chain and those neighbours later produce length-k degenerate unitigs. When upstream neighbours are encountered first, they extend through the branch point and consume it.
**Example** — fork topology (k = 31): **Example** — fork topology (k = 31):
@@ -248,25 +144,37 @@ A → B ← C
D D
``` ```
All four nodes are in the graph. B has two left neighbours (A and C), so `can_extend_left = false`; B also has one right neighbour D, so `can_extend_right = true`. B has two left neighbours, so `can_extend_left = false`. Two valid tilings:
| iteration order | unitigs produced | count | | iteration order | unitigs | count |
|---|---|---| |---|---|---|
| A first, then B, C | ABD · C | 2 | | A first | ABD, C | 2 |
| B first, then A, C | BD · A · C | 3 | | B first | BD, A, C | 3 |
Both tilings cover the same 4 canonical k-mers. Both cover the same 4 canonical kmers. Pure cycles are unaffected: all cycle nodes have both extensions present, so none are emitted in the first pass; each cycle produces exactly one unitig regardless of entry point (only the cut point varies).
Pure cycles (all nodes have both extensions present) are unaffected by this: they are never emitted in the first pass and each cycle produces exactly one unitig regardless of which node the second pass starts from. Only the cycle cut point (and therefore the sequence content) varies. This non-determinism is benign for MPHF construction: the MPHF is built from the kmer set, which is identical across tilings.
### Consequence for MPHF construction ---
The MPHF is built from the **k-mer set**, not from the unitig sequences themselves. Because both tilings contain the same canonical k-mers, the resulting MPHF is identical. The non-determinism is benign for this use case. ## Partition-size tradeoff
Measured on *B. nana* (k=31, m=11), summing across all partitions:
| N partitions | m_u |
|---|---|
| 1 | 41.89 |
| 16 | 38.19 |
| 256 | 37.90 |
| 1 024 | 37.89 |
`m_u` is set by De Bruijn graph topology (heterozygosity, repeats, sequencing errors), not partition count. The variation from 1 to 1024 partitions is under 10%; within 161024 it is under 1%. Unitigs provide ~3.1× nucleotide compaction over super-kmers at 256 partitions.
Evidence cost decreases by 1 bit/kmer with each doubling of partition count (via `log₂ U = log₂(P/m_u)`). The sequence storage term `2 · (1 + (k1)/m_u) ≈ 3.6 bits/kmer` is approximately constant.
--- ---
## Open questions ## Open questions
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u. - **Cross-partition set operations**: strategy B allows unitig-level operations (mark entire chunks present/absent) rather than kmer-level, reducing cost by a factor of m_u.
- **Eliminating evidence.bin**: at ~66% of per-layer lookup footprint, `evidence.bin` dominates index size. See [Evidence elimination design discussion](evidence_elimination.md).
- **Eliminating evidence.bin**: at ~66 % of the per-layer lookup footprint (132 MB vs 200 MB total per partition on the bacterial BCT dataset), evidence.bin dominates index size. A dedicated design investigation is open — see [Evidence elimination design discussion](evidence_elimination.md).