From da56c3e2907f6df1686c391b6e60b440338e10fd Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Sat, 23 May 2026 13:24:25 +0200 Subject: [PATCH] docs: update architecture and storage specs for approximate index Restructure architecture documentation to reflect the decoupled `MphfLayer` design wrapped by `LayeredStore` and enforce strict multi-genome column invariants. Introduce the approximate index architecture, replacing exact `evidence.bin` with compact `fingerprint.bin` using B-bit fingerprints and z-consecutive k-mer matching. Update CLI flags, add `reindex`/`estimate` workflows, and refactor APIs to support separate exact/approximate evidence handling. Finally, provide a comprehensive on-disk layout specification, including the pipeline state machine, JSON schemas, binary formats, and refined Strategy B unitig evidence details. --- docmd/architecture/index_architecture.md | 491 +++++++++---------- docmd/implementation/evidence_elimination.md | 417 ++++++---------- docmd/implementation/obilayeredmap.md | 345 ++++++++----- docmd/implementation/storage.md | 137 +++++- docmd/implementation/unitig_evidence.md | 306 ++++-------- 5 files changed, 814 insertions(+), 882 deletions(-) diff --git a/docmd/architecture/index_architecture.md b/docmd/architecture/index_architecture.md index 0915a9c..752334a 100644 --- a/docmd/architecture/index_architecture.md +++ b/docmd/architecture/index_architecture.md @@ -2,169 +2,155 @@ ## Fundamental invariant -A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination. +A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination. --- ## Three-level hierarchy ``` -PartitionedIndex -├── LayeredPartition (one per minimiser bucket) -│ ├── MphfLayer 0 kmer → slot (immutable bijection) -│ │ ├── DataStore A slot → T (e.g. counts) -│ │ └── DataStore B slot → T (e.g. presence/absence, derived) -│ ├── MphfLayer 1 -│ │ └── DataStore A -│ └── ... -├── LayeredPartition -│ └── ... +KmerIndex (index.meta + KmerPartition) +├── partition_0/index/ one directory per minimiser bucket +│ ├── meta.json PartitionMeta { n_layers } +│ ├── layer_0/ +│ │ ├── layer_meta.json LayerMeta { evidence: EvidenceKind } +│ │ ├── mphf.bin PtrHash MPHF +│ │ ├── unitigs.bin unitig spine (never overwritten) +│ │ ├── evidence.bin exact evidence (Exact only) +│ │ ├── unitigs.bin.idx block index (Exact only) +│ │ ├── fingerprint.bin fingerprints (Approx only) +│ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true) +│ │ └── presence/ PersistentBitMatrix +│ └── layer_1/ +│ └── ... +└── partition_1/index/ + └── ... ``` -**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel. +**KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon. -**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint. +**Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets. -**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it. - -**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`. +**Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`. --- -## MphfLayer — autonomous mapping layer +## IndexConfig and IndexMeta ```rust -MphfLayer::find(kmer: CanonicalKmer) -> Option // slot, or None if absent -MphfLayer::n() -> usize // number of slots -MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin -MphfLayer::open(dir: &Path) -> OLMResult +pub struct IndexConfig { + pub kmer_size: usize, + pub minimizer_size: usize, + pub n_bits: usize, // log2(n_partitions) + pub with_counts: bool, + pub evidence: EvidenceKind, + pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig +} + +pub struct IndexMeta { + pub version: u32, + pub config: IndexConfig, + pub genomes: Vec, // ordered; index = genome column number +} + +pub struct GenomeInfo { + pub label: String, + pub meta: HashMap, // arbitrary categorical metadata +} ``` -`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index. +`IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index. -The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`. +--- + +## EvidenceKind + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +Controls which files are written per layer and which query path is taken: + +| Variant | Files written | False-positive rate | +|---|---|---| +| `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 | +| `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) | + +`EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built). + +--- + +## MphfLayer — autonomous kmer → slot mapping + +```rust +pub struct MphfLayer { + mphf: PtrHash<…>, + ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint } + n: usize, +} +``` + +`MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). Returns `Some(slot)` only if the kmer is confirmed present; `None` for absent or out-of-range. + +``` +find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs +find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer) +``` + +`block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`. + +The MPHF and `unitigs.bin` are never rebuilt by any post-build operation. + +--- + +## Layer\ — MPHF + data payload + +```rust +pub struct Layer { + mphf: MphfLayer, + data: D, +} +``` + +`D` selects the attached data payload: + +| `D` | Data directory | `Item` returned by `query` | +|---|---|---| +| `()` | — | `()` (set membership only) | +| `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) | +| `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) | + +`Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option>`. + +Build-time entry points: + +```rust +Layer<()>::build(out_dir, block_bits) // set membership +Layer::build(out_dir, block_bits, count_of) +Layer::build_presence(out_dir, block_bits, n_genomes, present_in) + +Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path) +``` --- ## DataStore — slot-indexed data -```rust -trait DataStore { - type Item; - fn get(&self, slot: usize) -> Self::Item; - fn n(&self) -> usize; -} -``` +`PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs. -Concrete types from `obicompactvec`: - -| Type | `Item` | Column stats | Use | +| Type | `Item` | Aggregation method | Use | |---|---|---|---| -| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1` | count per sample per slot | -| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1` | presence per sample per slot | - -`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights. - -A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. +| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() → Array1` | counts per genome per slot | +| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() → Array1` | presence per genome per slot | --- -## Distance matrix API on DataStore types +## Aggregation traits — `obicompactvec::traits` -Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods. - -### Full distance matrices - -Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon. - -```rust -// PersistentCompactIntMatrix -fn bray_dist_matrix(&self) -> Array2 -fn relfreq_bray_dist_matrix(&self) -> Array2 -fn euclidean_dist_matrix(&self) -> Array2 -fn relfreq_euclidean_dist_matrix(&self) -> Array2 -fn hellinger_dist_matrix(&self) -> Array2 -fn jaccard_dist_matrix(&self) -> Array2 -fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 - -// PersistentBitMatrix -fn jaccard_dist_matrix(&self) -> Array2 -fn hamming_dist_matrix(&self) -> Array2 -``` - -These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required. - -### Partial distance matrices - -Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible. - -**Category 1 — self-contained partials**: additive without any external parameter. - -```rust -// PersistentCompactIntMatrix -fn partial_bray_dist_matrix(&self) - -> (Array2, // sum_min[i,j] - Array1) // col_sums[k] - -fn partial_euclidean_dist_matrix(&self) -> Array2 // sum of squared diffs -fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32) - -> (Array2, // inter[i,j] - Array2) // union[i,j] - -// PersistentBitMatrix -fn partial_jaccard_dist_matrix(&self) - -> (Array2, // inter[i,j] - Array2) // union[i,j] -fn partial_hamming_dist_matrix(&self) -> Array2 // differing bits -``` - -**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs. - -```rust -// PersistentCompactIntMatrix only -fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot min(a_slot/sum_i, b_slot/sum_j) - -fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot (a_slot/sum_i - b_slot/sum_j)² - -fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1) - -> Array2 // Σ_slot (√(a/sum_i) - √(b/sum_j))² -``` - -The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below. - ---- - -## Progressive aggregation principle - -Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down. - -``` -PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix - ↓ Σ across layers -LayeredStore::col_weights() — column sums for one partition - ↓ Σ across partitions -LayeredStore>::col_weights() — global column sums -``` - -The same cascade applies to every partial: - -``` -PersistentCompactIntMatrix::partial_bray() — one (partition, layer) - ↓ element-wise Σ across layers -LayeredStore::partial_bray() — one partition - ↓ element-wise Σ across partitions -LayeredStore>::partial_bray() — global partial → final dist -``` - -Each level presents a stable trait surface to the level above; no level reaches two levels down. - ---- - -## Traits — `obicompactvec::traits` - -Three traits unify the aggregation API across all levels of the hierarchy. +Three traits unify the aggregation API across all hierarchy levels. ```rust trait ColumnWeights: Send + Sync { @@ -172,81 +158,130 @@ trait ColumnWeights: Send + Sync { } trait CountPartials: ColumnWeights { - // self-contained partials (additive, no parameter) - fn partial_bray(&self) -> Array2; - fn partial_euclidean(&self) -> Array2; - fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); - // normalised partials (global col_weights passed in cascade) - fn partial_relfreq_bray(&self, global: &Array1) -> Array2; - fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; - fn partial_hellinger(&self, global: &Array1) -> Array2; - // provided finalisation methods (default implementations) - fn bray_dist_matrix(&self) -> Array2 { … } - fn euclidean_dist_matrix(&self) -> Array2 { … } - fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 { … } - fn relfreq_bray_dist_matrix(&self) -> Array2 { … } - fn relfreq_euclidean_dist_matrix(&self) -> Array2 { … } - fn hellinger_dist_matrix(&self) -> Array2 { … } + fn partial_bray(&self) -> Array2; + fn partial_euclidean(&self) -> Array2; + fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); + fn partial_relfreq_bray(&self, global: &Array1) -> Array2; + fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; + fn partial_hellinger(&self, global: &Array1) -> Array2; + // provided finalisation methods with default impls + fn bray_dist_matrix(&self) -> Array2 { … } + fn relfreq_bray_dist_matrix(&self) -> Array2 { … } + // … } trait BitPartials: ColumnWeights { - fn partial_jaccard(&self) -> (Array2, Array2); - fn partial_hamming(&self) -> Array2; + fn partial_jaccard(&self) -> (Array2, Array2); + fn partial_hamming(&self) -> Array2; // provided fn jaccard_dist_matrix(&self) -> Array2 { … } fn hamming_dist_matrix(&self) -> Array2 { … } } ``` -**Leaf implementors** (in `obicompactvec`): +Leaf implementors: | Type | Traits | |---|---| -| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` | -| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` | - -`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators. +| `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` | +| `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` | --- -## `LayeredStore` — `obilayeredmap` - -A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types: +## LayeredStore\ — recursive aggregation wrapper ```rust pub struct LayeredStore(Vec); ``` -Three blanket impls propagate the traits up the hierarchy: +Three blanket impls propagate all traits up the hierarchy: ```rust -impl ColumnWeights for LayeredStore { … } // Σ across inner stores -impl CountPartials for LayeredStore { … } // same pattern -impl BitPartials for LayeredStore { … } // same pattern +impl ColumnWeights for LayeredStore { … } +impl CountPartials for LayeredStore { … } +impl BitPartials for LayeredStore { … } ``` -Because the blanket impl is recursive, **`LayeredStore>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed: +This makes `LayeredStore>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed: ``` -PersistentCompactIntMatrix implements CountPartials -LayeredStore via blanket impl (= one partition) -LayeredStore> via blanket impl (= partitioned index) +PersistentCompactIntMatrix leaf (one layer) +LayeredStore one partition (layers are disjoint) +LayeredStore> whole index (partitions are independent) ``` -### Normalised metrics — two-pass cascade - -The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic: +Normalised metrics require global column sums — computed in a two-pass cascade: ```rust -// called on LayeredStore> +// on LayeredStore> fn relfreq_bray_dist_matrix(&self) -> Array2 { - let global = self.col_weights(); // pass 1 — progressive sum at every level - let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade - p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately) + let global = self.col_weights(); // pass 1 — sums up hierarchy + let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only + p.mapv(|v| 1.0 - v) } ``` -`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy. +Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy. + +--- + +## Progressive aggregation principle + +No level reaches two levels down. Each level sums contributions from the level immediately below: + +``` +PersistentCompactIntMatrix::col_weights() — one (partition, layer) + ↓ Σ across layers +LayeredStore::col_weights() — one partition + ↓ Σ across partitions +LayeredStore>::col_weights() — global +``` + +The same cascade applies to every partial method. + +--- + +## Multi-genome column invariant + +After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. + +Maintained by three coordinated operations: + +**Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`. + +**New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers. + +**First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1. + +This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape. + +--- + +## Query model + +### Point query + +``` +minimiser(kmer) → partition p +for each layer l in p: + if let Some(slot) = MphfLayer_l.find(kmer): + return data_l.read(slot) +return None +``` + +O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`. + +### Aggregation + +``` +result = reduce( + for p in partitions: // parallel + for l in layers(p): // parallel + partial(data_p_l) +) +``` + +For normalised metrics, replace with the two-pass cascade. --- @@ -254,103 +289,25 @@ fn relfreq_bray_dist_matrix(&self) -> Array2 { | Level | Unit | Coordination | |---|---|---| -| Across partitions | `LayeredStore>` inner stores | none — fully independent | -| Across layers within a partition | `LayeredStore` inner stores | none — disjoint kmer sets | +| Across partitions | inner stores of `LayeredStore>` | none | +| Across layers within a partition | inner stores of `LayeredStore` | none — disjoint kmer sets | | Normalised pass 1 (`col_weights`) | per inner store | none — additive | | Normalised pass 2 (partial) | per inner store | `global` broadcast read-only | | Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` | -All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction. +--- + +## reindex — evidence conversion in place + +`KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`: + +- `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin` +- `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx` + +On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`. --- -## Query model +## estimate — parameter dry-run -### Point query — `kmer → Option` - -``` -minimiser(kmer) → partition p -for each layer l in p: - slot = MphfLayer_l.find(kmer) - if slot is Some: - return DataStore_l.get(slot) -return None -``` - -O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer). - -### Aggregation — `→ Result` - -``` -result = reduce( - for p in partitions: // parallel - for l in layers(p): // parallel - partial(DataStore_p_l) -) -``` - -For normalised metrics replace with the two-pass scheme above. - ---- - -## DataStore derivation - -Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF: - -``` -// count → presence/absence, parallel across (partition, layer) -for (p, l) in all_partition_layer_pairs().par_iter(): - count_store = open PersistentCompactIntMatrix at (p, l) - presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir) -``` - -Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair. - ---- - -## Relationship to current implementation - -### What is implemented - -- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`. -- **`obilayeredmap::LayeredStore`**: generic wrapper with blanket impls for all three traits. `LayeredStore>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data. - -### What is not yet implemented - -- `Layer` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported. -- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`. -- No `PartitionedIndex` type for point queries with parallel partition dispatch. - -### Planned refactoring - -1. Extract `MphfLayer` from `Layer` as an autonomous type. -2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system. -3. Rewire `LayeredMap` to hold `LayeredStore` (or bit variant) alongside the MPHF layers. -4. Implement `PartitionedIndex` using `LayeredStore>` for data and parallel dispatch for queries. - ---- - -## Multi-genome column invariant - -### The invariant - -After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode). - -### How it is maintained - -The invariant is established and preserved by three coordinated operations: - -**1. Existing layers — column append.** -When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns. - -**2. New layers — absent columns prepended.** -If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation. - -**3. First merge, Presence mode — `init_presence_matrix`.** -The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2. - -### Why the invariant is required - -The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics. - -The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape. +`estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate `Approx { b, z }` before building or reindexing. diff --git a/docmd/implementation/evidence_elimination.md b/docmd/implementation/evidence_elimination.md index ecc2c6f..a7f3a18 100644 --- a/docmd/implementation/evidence_elimination.md +++ b/docmd/implementation/evidence_elimination.md @@ -1,321 +1,178 @@ -# Evidence elimination — design discussion +# Approximate evidence: fingerprint-based index -## Problem statement +## Motivation -`evidence.bin` maps each MPHF slot to a position in the unitig store so that -query verification is possible: given a slot `s` returned by `mphf.index(kmer)`, -retrieve the k-mer stored at that position and compare with the query. +`evidence.bin` maps each MPHF slot to the position of the k-mer that owns it, +enabling zero-FP verification. On the bacterial BCT dataset (2048 partitions, +k=31, ~33 M k-mers/partition) it accounts for 66 % of the lookup-layer footprint: -On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition): - -| file | size/partition | total (2048 parts) | fraction of lookup layer | -|---|---|---|---| -| evidence.bin | 132 MB | ~270 GB | **66 %** | -| unitigs.bin | 58 MB | ~118 GB | 29 % | -| mphf.bin | 10 MB | ~20 GB | 5 % | - -Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage -optimisation available for index size. - ---- - -## Why evidence exists - -PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`. -For a query k-mer not in the indexed set, the returned slot is meaningless but -indistinguishable from a real hit without external information. Evidence provides -that information: `evidence[s]` encodes the location of the k-mer that legitimately -occupies slot `s`, allowing the verification: - -``` -slot = mphf.index(query) -(chunk_id, rank) = evidence.decode(slot) -stored_kmer = unitigs.kmer_at(chunk_id, rank) -return canonical(stored_kmer) == canonical(query) -``` - -Evidence is a **permutation** from MPHF-space to unitig-position-space. -Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding. - ---- - -## Information-theoretic lower bound - -For a partition with P k-mers and U unitigs of average length m_u k-mers: - -- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits -- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits - -Current implementation: 25 + 7 = 32 bits (aligned u32). -Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M. - -**Packing headroom: ~22 %.** Not a path to elimination. - ---- - -## Two-index architecture - -The exact index is mandatory for set operations (union, intersection, diff) and -exact k-mer retrieval. A separate approximate index, built for query operations, -can tolerate a controlled false positive rate in exchange for a much smaller -footprint. - -| component | exact index | approximate index | +| file | size/partition | fraction | |---|---|---| -| `mphf.bin` | ✓ | ✓ (same structure) | -| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ | -| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) | -| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) | -| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) | +| evidence.bin | 132 MB | 66 % | +| unitigs.bin | 58 MB | 29 % | +| mphf.bin | 10 MB | 5 % | -The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps -`unitigs.bin` for sequential enumeration of K-mers. +`evidence.bin` is a bijection from MPHF-space to unitig-position-space and +costs at minimum ⌈log₂ N⌉ bits per slot — an information-theoretic floor with +only ~22 % packing headroom. Compression is not a path to elimination. + +The approximate index replaces `evidence.bin` + `unitigs.bin.idx` with a +`fingerprint.bin` file. The MPHF and `unitigs.bin` are kept unchanged. Set +operations still require an exact index; the approximate index targets query +workloads that can tolerate a bounded false-positive rate. --- -## MPHF as a perfect Bloom filter +## The Findere model -A standard Bloom filter with a single hash function and N bits storing M keys has -occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of -landing on a set bit. The empty space (fraction 1 − M/N of bits at 0) is what -rejects foreign k-mers. +A B-bit fingerprint stored per MPHF slot provides the discrimination that +`evidence.bin` would otherwise provide through full k-mer reconstruction. -An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer -occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any** -input to a slot in [0, M) — there is no empty space. Every query lands on an -occupied slot. The MPHF alone cannot reject foreign k-mers at all. - -Adding a B-bit fingerprint restores the discrimination: +For a foreign k-mer query, the MPHF maps it to some slot `s`. The fingerprint +stored at `s` belongs to the legitimate k-mer at that slot. The FP event is: ``` -slot = mphf.index(query) -fingerprint = hash(query) & mask_B -present = fingerprint_table[slot] == fingerprint +P(FP per k-mer) = 1 / 2^b ``` -The fingerprint plays the role of the sparse space in the Bloom filter: it provides -the B bits of information needed to reject foreign k-mers. +The Findere trick raises the effective window to z consecutive k-mers. A query +succeeds only when all z fingerprint checks pass, reducing the per-window FP rate: -Both structures reach the same fundamental cost for a given FP rate. For 1% FP: +``` +P(FP per z-window) = 1 / 2^(b·z) +``` -- Bloom filter (optimal, k hash functions): ~9.6 bits/key -- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key +The effective indexed k-mer length is `k − z + 1`: a query for a (k+z−1)-mer +decomposes into z overlapping k-mers, all of which must match. -This is a fundamental bound, not an implementation detail. +Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`). --- -## Approach A — MPHF + fingerprint (approximate index) +## `FingerprintVec` on disk -### Size +`fingerprint.bin` layout: -| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) | +``` +magic: b"FPVF" (4 bytes) +b: u8 (bits per slot, 1..=64) +padding: [0u8; 3] +n: u64 LE (number of slots) +data: packed bits, ceil(n·b/8) bytes, Lsb0 order +``` + +`FingerprintVec` is memory-mapped. The match check against a query k-mer: + +```rust +fn matches(&self, slot: usize, fingerprint: u64) -> bool { + self.get(slot) == (fingerprint & self.mask) +} +``` + +`build_approx_evidence` iterates `unitigs.bin` sequentially, writes +`kmer.seq_hash()` into the slot assigned by the MPHF, then saves `fingerprint.bin` +and `layer_meta.json`. No `.idx` file is produced; random access into +`unitigs.bin` is not needed. + +At build time, `find_approx` in `MphfLayer`: + +```rust +let slot = self.mphf.index(&kmer.raw()); +if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None } +``` + +--- + +## `EvidenceKind` and metadata + +`layer_meta.json` records which evidence bundle is present: + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +`MphfLayer::open` reads this tag and dispatches `find` to `find_exact` or +`find_approx` transparently. `find_exact` panics on an approximate layer; +`find_approx` panics on an exact layer — mode mixing is a programming error. + +--- + +## Parameter resolution (`resolve_approx_params`) + +The identity `b·z = ⌈−log₂(fp)⌉` lets any two of (b, z, fp) derive the third. +`resolve_approx_params` implements a 2-of-3 rule with conservative ceiling +rounding: + +| given | derived | +|---|---| +| b, z | fp = 1/2^(b·z) | +| z, fp | b = ⌈−log₂(fp) / z⌉ | +| b, fp | z = ⌈−log₂(fp) / b⌉ | +| z only | b = 8 (default), fp derived | +| b only | z = 1 (default), fp derived | +| fp only | b = 8 (default), z derived | +| none | b = 8, z = 1, fp = 1/256 | + +When all three are given, b and z are authoritative and fp is recomputed. + +--- + +## CLI flags + +Both `index` and `reindex` accept the same flags: + +| flag | type | meaning | |---|---|---| -| 8 | 33 MB | 4× smaller | -| 12 | 49 MB | 2.7× smaller | -| 16 | 66 MB | 2× smaller | +| `--approx` | bool | enable fingerprint evidence | +| `--evidence-bits` (`b`) | u8 | fingerprint bits per slot | +| `-z` | u8 | Findere z parameter | +| `--fp` | f64 | target FP rate per z-window | +| `--block-size` | usize | unitig block size for exact `.idx`; ignored in approx mode | -Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer). - -### False positive rate — per k-mer query - -For a specific non-indexed query k-mer q: - -1. MPHF(q) → slot s, some value in [0, M) -2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s -3. FP event: hash(q) & mask_B == fingerprint_table[s] - -Since q is not the legitimate k-mer at s, its fingerprint is independent of -fingerprint_table[s], giving: - -``` -P(FP per k-mer) = 1 / 2^B -``` - -This is the probability of error **for one specific query k-mer**. It is not the -fraction of the k-mer universe that would be misclassified: querying all 4^k -possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but -that is not the relevant quantity for practical use. - -### Equivalence classes - -The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B -equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true -indexed k-mer and 4^k/(M·2^B) − 1 false positives. A larger M (fewer partitions) -produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B -remains constant. - -### Read-level use case - -The relevant decision unit is the **read**, not the individual k-mer. For a read -of ~100 nucleotides and k=31, there are ~70 k-mers. - -- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are - true positives → high coverage fraction. -- A plant read queried against a bacterial index: k-mers are foreign; each has - P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B. - -A coverage threshold separates the two cases decisively. This is the same -principle as Findere: local coverage continuity distinguishes true hits from noise. +`--approx` must be set explicitly; the other three flags are optional and +resolved by the 2-of-3 rule. Omitting all three produces b=8, z=1. --- -## Approach B — z-consecutive k-mer matching +## `reindex` command -A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers. -Declaring a match only when **all z are present** reduces the per-window FP rate: +`reindex` converts an existing index between exact and approximate evidence +in-place across all partitions and layers, running partitions in parallel via +Rayon. -``` -P(FP per window of z) = (1/2^B)^z = 1/2^(B·z) -``` +Conversion to approximate (`--approx`): -For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z. -The probability that at least one window is a false positive: +- Builds `fingerprint.bin` from `unitigs.bin` + `mphf.bin`. +- Removes `evidence.bin` and `unitigs.bin.idx`. +- Updates `layer_meta.json` with `EvidenceKind::Approx { b, z }`. -``` -P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z) -``` +Conversion to exact (default, no `--approx`): -For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸. +- Builds `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`. +- Removes `fingerprint.bin`. +- Updates `layer_meta.json` with `EvidenceKind::Exact`. -A plant read is misclassified as bacterial roughly once in 60 million reads — -negligible for any practical dataset. - -### Choosing B from (z, L, P_target) - -z is a query-time parameter and does not affect the index structure. However, -knowing z at build time allows computing the minimum B required to reach a target -FP rate P_target for reads of length L (giving W = L − k − z + 2 independent -windows): - -``` -P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z ) -``` - -Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**. - -(B, z) are co-determined at build time to minimise fingerprint size while -guaranteeing the target read-level FP rate. - -### Combined sizing - -| B | z | K = k+z−1 | P(FP_read) | fingerprint.bin/partition | -|---|---|---|---|---| -| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB | -| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB | -| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB | -| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB | - -Smaller B → smaller fingerprint table; larger z → longer minimum match length K -and fewer independent windows per read. +The root `index.meta` is updated with the new evidence kind on success. +`mphf.bin` and `unitigs.bin` are never modified. --- -## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index) +## `estimate` command -Build the MPHF to output the global k-mer position directly: +`estimate` is a dry-run that resolves and prints (b, z, fp) without touching +any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and +additionally accepts `-k` to display the effective indexed k-mer length: ``` -mphf: kmer → global_pos ∈ [0, P) +k (query): 31 +k (indexed): 31 +z: 1 +evidence bits (b): 8 +FP per k-mer: 3.906e-3 (1/2^8) +FP per z-window: 3.906e-3 (1/2^8) ``` -Verification becomes: - -``` -global_pos = mphf.index(query) -stored_kmer = unitigs.kmer_at_global_pos(global_pos) -return canonical(stored_kmer) == canonical(query) -``` - -No evidence array. The unitig block index (see below) provides -`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time. - -### What is required - -A **retrieval data structure** (also called a value-based or function-based MPHF): -given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`, -build a compact structure that maps each key to its assigned value. - -Known constructions: - -- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph + - XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic - minimum. Construction: O(n). Query: O(1). -- **SSHash approach**: builds the MPHF to map k-mers to their positions in a - concatenated unitig string. Achieves elimination of external evidence using a - "skew" construction that aligns the MPHF output with the sequential unitig layout. - -### Rust availability - -No Rust crate implements a retrieval data structure suitable for this use case as -of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain -MPHFs. **This is the key blocking factor.** - -### SSHash construction (reference) - -SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over -(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential -positions in the concatenated unitig string. A port to obikmer would require: - -1. Concatenating all unitig sequences into a single flat buffer per partition. -2. Assigning each k-mer a global position (its offset in that buffer). -3. Building the MPHF to output that position directly (retrieval step). -4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`. - ---- - -## Approach 2 — block index prefix sums (reduces evidence to negligible) - -A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled -offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset -table. - -### Extension: k-mer prefix sums per block - -Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before -block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**. - -This enables `kmer_at_global_pos(p)`: - -1. Binary search in `kmer_prefix[]` to find block `b`. -2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`. -3. Extract the k-mer at the remaining rank within the found chunk. - -Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses. - -### Combined with Approach 1 - -- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions) -- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible) - ---- - -## Recommended path - -1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose - (B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep - `unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8 - vs ~142 MB for the exact lookup layer. - -2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`. - Zero cost if evidence.bin is kept; enables Approach 1 when ready. - -3. **Medium term**: implement GOV retrieval data structure in Rust, or port - SSHash construction. - -4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected - index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset. - ---- - -## Open questions - -- Is a GOV construction compatible with the parallel MPHF build currently used - (PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential - (hypergraph peeling); parallelisation is non-trivial. -- Can the SSHash "skew" insight be reused without the minimizer-bucket structure? - The obikmer partitioning already uses minimizers — there may be natural alignment. -- What is the query latency impact of replacing O(1) evidence lookup with - O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale. -- What is the optimal (B, z) trade-off for the approximate index given the target - read length and acceptable P(FP_read)? +Useful for choosing parameters before committing to an index build. diff --git a/docmd/implementation/obilayeredmap.md b/docmd/implementation/obilayeredmap.md index d5ad7d2..da2f42e 100644 --- a/docmd/implementation/obilayeredmap.md +++ b/docmd/implementation/obilayeredmap.md @@ -2,7 +2,7 @@ ## Purpose -`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. +`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. --- @@ -20,42 +20,81 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico --- +## Evidence kinds + +Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time: + +```rust +pub enum EvidenceKind { + Exact, + Approx { b: u8, z: u8 }, +} +``` + +`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load. + +- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time. +- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required. + +--- + ## MphfLayer — autonomous kmer → slot mapping -`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data. +`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload. ```rust pub struct MphfLayer { - mphf: Mphf, - evidence: Evidence, - unitigs: UnitigFileReader, - n: usize, // number of indexed kmers = number of MPHF slots + mphf: Mphf, + ev: LayerEvidence, // loaded at open() time + n: usize, } ``` -Public API: +`LayerEvidence` is an internal enum, not public: ```rust -impl MphfLayer { - pub fn open(dir: &Path) -> OLMResult - pub fn find(&self, kmer: CanonicalKmer) -> Option // Some(slot) or None - pub fn n(&self) -> usize - pub fn unitig_writer(dir: &Path) -> OLMResult - pub(crate) fn build( - dir: &Path, - fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>, - ) -> OLMResult +enum LayerEvidence { + Exact { evidence: Evidence, unitigs: UnitigFileReader }, + Approx { fingerprint: FingerprintVec }, } ``` -`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test). +### Query API -`build` runs two sequential passes over `unitigs.bin`: +Three public query methods, all returning `Option` (slot index): -1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. -2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. +```rust +pub fn find(&self, kmer: CanonicalKmer) -> Option +pub fn find_exact(&self, kmer: CanonicalKmer) -> Option +pub fn find_approx(&self, kmer: CanonicalKmer) -> Option +``` -For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`. +- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`. +- `find_exact` panics if the layer holds approximate evidence; zero false positives. +- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer. + +`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes. + +### Build surface + +```rust +// Full MPHF + exact evidence build (two-pass, parallel) +pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult + +// Evidence-only builds (MPHF already present in dir) +pub fn build_exact_evidence(dir, block_bits) -> OLMResult +pub fn build_approx_evidence(dir, b, z) -> OLMResult +pub fn build_evidence(dir, kind, block_bits) -> OLMResult // dispatch +``` + +`MphfLayer::build` runs two sequential passes over `unitigs.bin`: + +1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. +2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. + +`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction. + +For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files. --- @@ -81,7 +120,7 @@ pub struct Hit { } ``` -`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait. +`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait. | Type | `Item` | Description | |---|---|---| @@ -89,31 +128,118 @@ pub struct Hit { | `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) | | `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) | -**Build signatures:** +### Build signatures ```rust // mode 1 impl Layer<()> { - pub fn build(out_dir: &Path) -> OLMResult + pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult } // mode 2 impl Layer { - pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult - pub fn build_from_map(out_dir: &Path, counts: &HashMap) -> OLMResult + pub fn build(out_dir: &Path, block_bits: u8, + count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult + pub fn build_from_map(out_dir: &Path, block_bits: u8, + counts: &HashMap) -> OLMResult } // mode 3 impl Layer { - pub fn build_presence( - out_dir: &Path, - n_genomes: usize, - present_in: impl Fn(CanonicalKmer, usize) -> bool, - ) -> OLMResult + pub fn build_presence(out_dir: &Path, block_bits: u8, + n_genomes: usize, + present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult } ``` -All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`. +All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`. + +### Evidence build helpers on Layer + +```rust +impl Layer { + pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult + pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult + pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult +} +``` + +These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer` level. + +--- + +## FingerprintVec and FingerprintVecWriter + +Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot. + +``` +fingerprint.bin format: + magic: b"FPVF" (4 bytes) + b: u8 (bits per fingerprint, 1..=64) + padding: [0u8; 3] + n: u64 LE (number of slots) + data: packed bits, ceil(n*b/8) bytes, Lsb0 order +``` + +```rust +impl FingerprintVec { + pub fn open(path: &Path) -> OLMResult + pub fn get(&self, slot: usize) -> u64 + pub fn matches(&self, slot: usize, fingerprint: u64) -> bool + pub fn n(&self) -> usize + pub fn b(&self) -> u8 +} +``` + +`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`. + +--- + +## LayeredMap\ — collection of layers + +`LayeredMap` wraps `Vec>` for a single partition directory. + +```rust +pub struct LayeredMap { + root: PathBuf, + meta: PartitionMeta, + layers: Vec>, +} +``` + +`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`. + +### Common methods + +```rust +pub fn open(root: &Path) -> OLMResult +pub fn create(root: &Path) -> OLMResult +pub fn n_layers(&self) -> usize +pub fn layer(&self, i: usize) -> &Layer +pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit)> +pub fn next_layer_writer(&self) -> OLMResult +``` + +`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0. + +### push_layer + +`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`: + +```rust +// mode 1 +impl LayeredMap<()> { + pub fn push_layer(&mut self) -> OLMResult +} + +// mode 2 +impl LayeredMap { + pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult + pub fn push_layer_from_map(&mut self, counts: &HashMap) -> OLMResult +} +``` + +Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer::build_presence`. --- @@ -131,14 +257,6 @@ impl BitPartials for LayeredStore { … } // element-wi Because blanket impls compose, `LayeredStore>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type. -**Aggregation hierarchy:** - -``` -PersistentCompactIntMatrix implements CountPartials -LayeredStore via blanket impl (one partition) -LayeredStore> via blanket impl (partitioned index) -``` - **Leaf implementors** (in `obicompactvec`): | Type | Traits | @@ -146,8 +264,6 @@ LayeredStore> via blanket impl (partitione | `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` | | `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` | -`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators. - See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern. --- @@ -155,34 +271,30 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful ## On-disk structure ``` -index_root/ ← LayeredMap (collection) - meta.json - part_00000/ ← Partition - layer_0/ ← Layer - mphf.bin — ptr_hash MPHF (epserde format) - unitigs.bin — packed 2-bit nucleotide sequences - unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[] - evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE - counts/ [mode 2] PersistentCompactIntMatrix - meta.json {"n": N, "n_cols": 1} - col_000000.pciv - presence/ [mode 3] PersistentBitMatrix - meta.json {"n": N, "n_cols": G} - col_000000.pbiv - … - layer_1/ - … - part_00001/ +partition_root/ ← LayeredMap (one partition) + meta.json — {"n_layers": N} + layer_0/ ← Layer + layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z} + mphf.bin — ptr_hash MPHF (epserde format) + unitigs.bin — packed 2-bit nucleotide sequences + unitigs.bin.idx — UIDX index (exact evidence only) + evidence.bin — [u32; n], LE (exact evidence only) + fingerprint.bin — packed b-bit array (approx evidence only) + counts/ [mode 2] PersistentCompactIntMatrix + meta.json + col_000000.pciv + presence/ [mode 3] PersistentBitMatrix + meta.json + col_000000.pbiv … + layer_1/ … ``` -**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel. - -**Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint. +`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction. --- -## Evidence encoding +## Evidence encoding (exact) `evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot: @@ -191,9 +303,9 @@ bits [31:7] = chunk_id (25 bits) — index of the unitig chunk bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based) ``` -Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`. +`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access). -For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer. +For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. --- @@ -203,7 +315,7 @@ For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the type Mphf = PtrHash< u64, // key type: canonical kmer raw encoding CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99 - CachelineEfVec>, // remap: 11.6 bits/entry (Elias-Fano) + CachelineEfVec>, // remap: Elias-Fano Xx64, // hasher: XXH3-64 with seed Vec, // pilots >; @@ -211,21 +323,41 @@ type Mphf = PtrHash< `Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly. -`CubicEps` with `PtrHashParams::::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. +`CubicEps` with `PtrHashParams::::default()` (λ=3.5): 2× slower construction than `Linear/λ=3.0`, ~20% less space. --- -## Query path +## Column append and merge support + +These methods extend existing layers with new genome columns without touching the MPHF. + +### Layer-level genome column append ```rust -pub fn query(&self, kmer: CanonicalKmer) -> Option> { - self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) }) +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> +} + +impl Layer { + pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> } ``` -`MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit. +Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n). -In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0. +### Presence matrix initialisation + +```rust +impl Layer<()> { + pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> +} +``` + +Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended. + +### Why the MPHF is never rebuilt + +The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. --- @@ -235,9 +367,9 @@ When adding dataset B to an existing index: 1. For each partition, probe existing layers for kmers of B routed to that partition. 2. Collect kmers absent from all layers → `B \ index`. -3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`. -4. Call `Layer::build` on the new directory. -5. Update `meta.json`. +3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`. +4. Call `Layer::build` (or `build_presence`) on the new layer directory. +5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`. Each partition's new layer is built independently; the operation is fully parallel across partitions. @@ -250,62 +382,9 @@ Each partition's new layer is built independently; the operation is fully parall | `ptr_hash 1.1` | MPHF per layer | | `cacheline-ef 1.1` | compact remap inside ptr_hash | | `epserde 0.8` | zero-copy MPHF serialisation | -| `memmap2 0.9` | mmap of evidence and payload files | -| `obiskio` | unitig file writer/reader | +| `memmap2 0.9` | mmap of evidence and fingerprint files | +| `bitvec` | packed b-bit fingerprint storage | +| `obiskio` | unitig file writer/reader + `.idx` build | | `obicompactvec` | payload types + aggregation traits | | `rayon 1` | parallel MPHF construction pass | -| `ndarray 0.16` | aggregation output arrays | - ---- - -## Column append and merge support - -These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command. - -### Matrix column append - -```rust -impl PersistentCompactIntMatrix { - pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> -} - -impl PersistentBitMatrix { - pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> -} -``` - -Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change. - -### Presence matrix initialisation - -```rust -impl Layer<()> { - pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> -} -``` - -Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended. - -### Layer-level genome column append - -```rust -impl Layer { - pub fn append_genome_column( - layer_dir: &Path, - value_of: impl Fn(usize) -> bool, - ) -> OLMResult<()> -} - -impl Layer { - pub fn append_genome_column( - layer_dir: &Path, - value_of: impl Fn(usize) -> u32, - ) -> OLMResult<()> -} -``` - -These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction. - -### Why the MPHF is never rebuilt - -The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required. +| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation | diff --git a/docmd/implementation/storage.md b/docmd/implementation/storage.md index defabf4..baf3e9e 100644 --- a/docmd/implementation/storage.md +++ b/docmd/implementation/storage.md @@ -1,5 +1,136 @@ -# On-disk collection structure +# On-disk index layout -See [obilayeredmap crate](obilayeredmap.md) for the current on-disk layout. +## Directory tree -The index root contains one `part_XXXXX/` directory per partition, each holding one or more `layer_N/` directories. Each layer directory contains `mphf.bin`, `unitigs.bin`, `unitigs.bin.idx`, `evidence.bin`, and optionally a `counts/` or `presence/` payload directory. +``` +/ + index.meta ← JSON: IndexMeta + scatter.done ← sentinel: scatter phase complete + count.done ← sentinel: dereplicate + count complete + index.done ← sentinel: MPHF index fully built + spectrums/ +