Files
obikmer/docmd/architecture/index_architecture.md
T
Eric Coissac e1d59fde54 feat: add merge command to consolidate k-mer indexes
Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
2026-05-21 13:44:50 +02:00

357 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Kmer index architecture
## Fundamental invariant
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
---
## Three-level hierarchy
```
PartitionedIndex
├── LayeredPartition (one per minimiser bucket)
│ ├── MphfLayer 0 kmer → slot (immutable bijection)
│ │ ├── DataStore A slot → T (e.g. counts)
│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
│ ├── MphfLayer 1
│ │ └── DataStore A
│ └── ...
├── LayeredPartition
│ └── ...
```
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
---
## MphfLayer — autonomous mapping layer
```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
MphfLayer::n() -> usize // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index.
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.
---
## DataStore — slot-indexed data
```rust
trait DataStore {
type Item;
fn get(&self, slot: usize) -> Self::Item;
fn n(&self) -> usize;
}
```
Concrete types from `obicompactvec`:
| Type | `Item` | Column stats | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |
`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
---
## Distance matrix API on DataStore types
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.
### Full distance matrices
Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2<f64>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
fn euclidean_dist_matrix(&self) -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self) -> Array2<f64>
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
```
These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.
### Partial distance matrices
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
**Category 1 — self-contained partials**: additive without any external parameter.
```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
-> (Array2<u64>, // sum_min[i,j]
Array1<u64>) // col_sums[k]
fn partial_euclidean_dist_matrix(&self) -> Array2<f64> // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
-> (Array2<u64>, // inter[i,j]
Array2<u64>) // union[i,j]
// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
-> (Array2<u64>, // inter[i,j]
Array2<u64>) // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2<u64> // differing bits
```
**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot (√(a/sum_i) - √(b/sum_j))²
```
The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
---
## Progressive aggregation principle
Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
```
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights() — global column sums
```
The same cascade applies to every partial:
```
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
↓ element-wise Σ across layers
LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
↓ element-wise Σ across partitions
LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
```
Each level presents a stable trait surface to the level above; no level reaches two levels down.
---
## Traits — `obicompactvec::traits`
Three traits unify the aggregation API across all levels of the hierarchy.
```rust
trait ColumnWeights: Send + Sync {
fn col_weights(&self) -> Array1<u64>;
}
trait CountPartials: ColumnWeights {
// self-contained partials (additive, no parameter)
fn partial_bray(&self) -> Array2<u64>;
fn partial_euclidean(&self) -> Array2<f64>;
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
// normalised partials (global col_weights passed in cascade)
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
// provided finalisation methods (default implementations)
fn bray_dist_matrix(&self) -> Array2<f64> { }
fn euclidean_dist_matrix(&self) -> Array2<f64> { }
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64> { }
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { }
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64> { }
fn hellinger_dist_matrix(&self) -> Array2<f64> { }
}
trait BitPartials: ColumnWeights {
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
fn partial_hamming(&self) -> Array2<u64>;
// provided
fn jaccard_dist_matrix(&self) -> Array2<f64> { }
fn hamming_dist_matrix(&self) -> Array2<u64> { }
}
```
**Leaf implementors** (in `obicompactvec`):
| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` |
`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators.
---
## `LayeredStore<S>` — `obilayeredmap`
A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types:
```rust
pub struct LayeredStore<S>(Vec<S>);
```
Three blanket impls propagate the traits up the hierarchy:
```rust
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { } // Σ across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { } // same pattern
impl<S: BitPartials> BitPartials for LayeredStore<S> { } // same pattern
```
Because the blanket impl is recursive, **`LayeredStore<LayeredStore<S>>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed:
```
PersistentCompactIntMatrix implements CountPartials
LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition)
LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index)
```
### Normalised metrics — two-pass cascade
The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
```rust
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
let global = self.col_weights(); // pass 1 — progressive sum at every level
let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately)
}
```
`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy.
---
## Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredStore<LayeredStore<S>>` inner stores | none — fully independent |
| Across layers within a partition | `LayeredStore<S>` inner stores | none — disjoint kmer sets |
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |
All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction.
---
## Query model
### Point query — `kmer → Option<Item>`
```
minimiser(kmer) → partition p
for each layer l in p:
slot = MphfLayer_l.find(kmer)
if slot is Some:
return DataStore_l.get(slot)
return None
```
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
### Aggregation — `→ Result`
```
result = reduce(
for p in partitions: // parallel
for l in layers(p): // parallel
partial(DataStore_p_l)
)
```
For normalised metrics replace with the two-pass scheme above.
---
## DataStore derivation
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
count_store = open PersistentCompactIntMatrix at (p, l)
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
```
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
---
## Relationship to current implementation
### What is implemented
- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
- **`obilayeredmap::LayeredStore<S>`**: generic wrapper with blanket impls for all three traits. `LayeredStore<LayeredStore<S>>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
### What is not yet implemented
- `Layer<D: LayerData>` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported.
- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`.
- No `PartitionedIndex` type for point queries with parallel partition dispatch.
### Planned refactoring
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
---
## Multi-genome column invariant
### The invariant
After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode).
### How it is maintained
The invariant is established and preserved by three coordinated operations:
**1. Existing layers — column append.**
When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns.
**2. New layers — absent columns prepended.**
If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation.
**3. First merge, Presence mode — `init_presence_matrix`.**
The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2.
### Why the invariant is required
The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics.
The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape.