13e69e23c9
Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.
330 lines
14 KiB
Markdown
330 lines
14 KiB
Markdown
# Kmer index architecture
|
||
|
||
## Fundamental invariant
|
||
|
||
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
|
||
|
||
---
|
||
|
||
## Three-level hierarchy
|
||
|
||
```
|
||
PartitionedIndex
|
||
├── LayeredPartition (one per minimiser bucket)
|
||
│ ├── MphfLayer 0 kmer → slot (immutable bijection)
|
||
│ │ ├── DataStore A slot → T (e.g. counts)
|
||
│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
|
||
│ ├── MphfLayer 1
|
||
│ │ └── DataStore A
|
||
│ └── ...
|
||
├── LayeredPartition
|
||
│ └── ...
|
||
```
|
||
|
||
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
|
||
|
||
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
|
||
|
||
**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
|
||
|
||
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
|
||
|
||
---
|
||
|
||
## MphfLayer — autonomous mapping layer
|
||
|
||
```rust
|
||
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
|
||
MphfLayer::n() -> usize // number of slots
|
||
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
|
||
MphfLayer::open(dir: &Path) -> OLMResult<Self>
|
||
```
|
||
|
||
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index.
|
||
|
||
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.
|
||
|
||
---
|
||
|
||
## DataStore — slot-indexed data
|
||
|
||
```rust
|
||
trait DataStore {
|
||
type Item;
|
||
fn get(&self, slot: usize) -> Self::Item;
|
||
fn n(&self) -> usize;
|
||
}
|
||
```
|
||
|
||
Concrete types from `obicompactvec`:
|
||
|
||
| Type | `Item` | Column stats | Use |
|
||
|---|---|---|---|
|
||
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
|
||
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |
|
||
|
||
`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
|
||
|
||
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
|
||
|
||
---
|
||
|
||
## Distance matrix API on DataStore types
|
||
|
||
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.
|
||
|
||
### Full distance matrices
|
||
|
||
Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
|
||
|
||
```rust
|
||
// PersistentCompactIntMatrix
|
||
fn bray_dist_matrix(&self) -> Array2<f64>
|
||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
|
||
fn euclidean_dist_matrix(&self) -> Array2<f64>
|
||
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
|
||
fn hellinger_dist_matrix(&self) -> Array2<f64>
|
||
fn jaccard_dist_matrix(&self) -> Array2<f64>
|
||
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
|
||
|
||
// PersistentBitMatrix
|
||
fn jaccard_dist_matrix(&self) -> Array2<f64>
|
||
fn hamming_dist_matrix(&self) -> Array2<u64>
|
||
```
|
||
|
||
These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.
|
||
|
||
### Partial distance matrices
|
||
|
||
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
|
||
|
||
**Category 1 — self-contained partials**: additive without any external parameter.
|
||
|
||
```rust
|
||
// PersistentCompactIntMatrix
|
||
fn partial_bray_dist_matrix(&self)
|
||
-> (Array2<u64>, // sum_min[i,j]
|
||
Array1<u64>) // col_sums[k]
|
||
|
||
fn partial_euclidean_dist_matrix(&self) -> Array2<f64> // sum of squared diffs
|
||
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
|
||
-> (Array2<u64>, // inter[i,j]
|
||
Array2<u64>) // union[i,j]
|
||
|
||
// PersistentBitMatrix
|
||
fn partial_jaccard_dist_matrix(&self)
|
||
-> (Array2<u64>, // inter[i,j]
|
||
Array2<u64>) // union[i,j]
|
||
fn partial_hamming_dist_matrix(&self) -> Array2<u64> // differing bits
|
||
```
|
||
|
||
**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
|
||
|
||
```rust
|
||
// PersistentCompactIntMatrix only
|
||
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
|
||
-> Array2<f64> // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
|
||
|
||
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
|
||
-> Array2<f64> // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
|
||
|
||
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
|
||
-> Array2<f64> // Σ_slot (√(a/sum_i) - √(b/sum_j))²
|
||
```
|
||
|
||
The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
|
||
|
||
---
|
||
|
||
## Progressive aggregation principle
|
||
|
||
Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
|
||
|
||
```
|
||
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
|
||
↓ Σ across layers
|
||
LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
|
||
↓ Σ across partitions
|
||
LayeredStore<LayeredStore<…>>::col_weights() — global column sums
|
||
```
|
||
|
||
The same cascade applies to every partial:
|
||
|
||
```
|
||
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
|
||
↓ element-wise Σ across layers
|
||
LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
|
||
↓ element-wise Σ across partitions
|
||
LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
|
||
```
|
||
|
||
Each level presents a stable trait surface to the level above; no level reaches two levels down.
|
||
|
||
---
|
||
|
||
## Traits — `obicompactvec::traits`
|
||
|
||
Three traits unify the aggregation API across all levels of the hierarchy.
|
||
|
||
```rust
|
||
trait ColumnWeights: Send + Sync {
|
||
fn col_weights(&self) -> Array1<u64>;
|
||
}
|
||
|
||
trait CountPartials: ColumnWeights {
|
||
// self-contained partials (additive, no parameter)
|
||
fn partial_bray(&self) -> Array2<u64>;
|
||
fn partial_euclidean(&self) -> Array2<f64>;
|
||
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
|
||
// normalised partials (global col_weights passed in cascade)
|
||
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
|
||
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
|
||
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
|
||
// provided finalisation methods (default implementations)
|
||
fn bray_dist_matrix(&self) -> Array2<f64> { … }
|
||
fn euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
||
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64> { … }
|
||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
|
||
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
||
fn hellinger_dist_matrix(&self) -> Array2<f64> { … }
|
||
}
|
||
|
||
trait BitPartials: ColumnWeights {
|
||
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
|
||
fn partial_hamming(&self) -> Array2<u64>;
|
||
// provided
|
||
fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
|
||
fn hamming_dist_matrix(&self) -> Array2<u64> { … }
|
||
}
|
||
```
|
||
|
||
**Leaf implementors** (in `obicompactvec`):
|
||
|
||
| Type | Traits |
|
||
|---|---|
|
||
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` |
|
||
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` |
|
||
|
||
`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators.
|
||
|
||
---
|
||
|
||
## `LayeredStore<S>` — `obilayeredmap`
|
||
|
||
A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types:
|
||
|
||
```rust
|
||
pub struct LayeredStore<S>(Vec<S>);
|
||
```
|
||
|
||
Three blanket impls propagate the traits up the hierarchy:
|
||
|
||
```rust
|
||
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ across inner stores
|
||
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // same pattern
|
||
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // same pattern
|
||
```
|
||
|
||
Because the blanket impl is recursive, **`LayeredStore<LayeredStore<S>>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed:
|
||
|
||
```
|
||
PersistentCompactIntMatrix implements CountPartials
|
||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition)
|
||
LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index)
|
||
```
|
||
|
||
### Normalised metrics — two-pass cascade
|
||
|
||
The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
|
||
|
||
```rust
|
||
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
|
||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
||
let global = self.col_weights(); // pass 1 — progressive sum at every level
|
||
let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
|
||
p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately)
|
||
}
|
||
```
|
||
|
||
`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy.
|
||
|
||
---
|
||
|
||
## Parallelism model
|
||
|
||
| Level | Unit | Coordination |
|
||
|---|---|---|
|
||
| Across partitions | `LayeredStore<LayeredStore<S>>` inner stores | none — fully independent |
|
||
| Across layers within a partition | `LayeredStore<S>` inner stores | none — disjoint kmer sets |
|
||
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
|
||
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
|
||
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |
|
||
|
||
All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction.
|
||
|
||
---
|
||
|
||
## Query model
|
||
|
||
### Point query — `kmer → Option<Item>`
|
||
|
||
```
|
||
minimiser(kmer) → partition p
|
||
for each layer l in p:
|
||
slot = MphfLayer_l.find(kmer)
|
||
if slot is Some:
|
||
return DataStore_l.get(slot)
|
||
return None
|
||
```
|
||
|
||
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
|
||
|
||
### Aggregation — `→ Result`
|
||
|
||
```
|
||
result = reduce(
|
||
for p in partitions: // parallel
|
||
for l in layers(p): // parallel
|
||
partial(DataStore_p_l)
|
||
)
|
||
```
|
||
|
||
For normalised metrics replace with the two-pass scheme above.
|
||
|
||
---
|
||
|
||
## DataStore derivation
|
||
|
||
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
|
||
|
||
```
|
||
// count → presence/absence, parallel across (partition, layer)
|
||
for (p, l) in all_partition_layer_pairs().par_iter():
|
||
count_store = open PersistentCompactIntMatrix at (p, l)
|
||
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
|
||
```
|
||
|
||
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
|
||
|
||
---
|
||
|
||
## Relationship to current implementation
|
||
|
||
### What is implemented
|
||
|
||
- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
|
||
- **`obilayeredmap::LayeredStore<S>`**: generic wrapper with blanket impls for all three traits. `LayeredStore<LayeredStore<S>>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
|
||
|
||
### What is not yet implemented
|
||
|
||
- `Layer<D: LayerData>` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported.
|
||
- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`.
|
||
- No `PartitionedIndex` type for point queries with parallel partition dispatch.
|
||
|
||
### Planned refactoring
|
||
|
||
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
|
||
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
|
||
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
|
||
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
|