feat: introduce trait-based distance aggregation and layered store
Introduces ColumnWeights, CountPartials, and BitPartials traits to compute and finalize partial distance matrices. Implements these traits for PersistentBitMatrix, PersistentCompactIntMatrix, and a new LayeredStore<S> wrapper that aggregates metrics across layers via parallel reduction. Adds ndarray for numerical aggregation and updates architecture documentation to reflect the trait-driven design and pending refactoring roadmap.
This commit is contained in:
@@ -141,135 +141,112 @@ The `col_sums` parameter must reflect the GLOBAL count across all layers and all
|
||||
Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
|
||||
|
||||
```
|
||||
PersistentCompactIntMatrix::sum() — column sums for one (partition, layer) matrix
|
||||
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
|
||||
↓ Σ across layers
|
||||
LayeredCompactIntMatrix::sum() — column sums for one partition
|
||||
LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
|
||||
↓ Σ across partitions
|
||||
PartitionedCompactIntMatrix::sum() — global column sums
|
||||
LayeredStore<LayeredStore<…>>::col_weights() — global column sums
|
||||
```
|
||||
|
||||
The same cascade applies to every partial computation:
|
||||
The same cascade applies to every partial:
|
||||
|
||||
```
|
||||
PersistentCompactIntMatrix::partial_bray_dist_matrix() — one (partition, layer)
|
||||
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
|
||||
↓ element-wise Σ across layers
|
||||
LayeredCompactIntMatrix::partial_bray() — one partition
|
||||
LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
|
||||
↓ element-wise Σ across partitions
|
||||
PartitionedCompactIntMatrix::partial_bray() — global partial → final dist
|
||||
LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
|
||||
```
|
||||
|
||||
This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above.
|
||||
Each level presents a stable trait surface to the level above; no level reaches two levels down.
|
||||
|
||||
---
|
||||
|
||||
## LayeredDataStore — aggregation within one partition
|
||||
## Traits — `obicompactvec::traits`
|
||||
|
||||
A `LayeredDataStore` holds one `DataStore` per layer within a single partition:
|
||||
Three traits unify the aggregation API across all levels of the hierarchy.
|
||||
|
||||
```rust
|
||||
struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
|
||||
struct LayeredBitMatrix { layers: Vec<PersistentBitMatrix> }
|
||||
```
|
||||
trait ColumnWeights: Send + Sync {
|
||||
fn col_weights(&self) -> Array1<u64>;
|
||||
}
|
||||
|
||||
### Column statistics
|
||||
trait CountPartials: ColumnWeights {
|
||||
// self-contained partials (additive, no parameter)
|
||||
fn partial_bray(&self) -> Array2<u64>;
|
||||
fn partial_euclidean(&self) -> Array2<f64>;
|
||||
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
|
||||
// normalised partials (global col_weights passed in cascade)
|
||||
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
|
||||
// provided finalisation methods (default implementations)
|
||||
fn bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64> { … }
|
||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn hellinger_dist_matrix(&self) -> Array2<f64> { … }
|
||||
}
|
||||
|
||||
```rust
|
||||
// LayeredCompactIntMatrix
|
||||
fn sum(&self) -> Array1<u64>
|
||||
// = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)
|
||||
|
||||
// LayeredBitMatrix
|
||||
fn count_ones(&self) -> Array1<u64>
|
||||
// = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
|
||||
```
|
||||
|
||||
### Self-contained partials
|
||||
|
||||
Each method reduces across layers by element-wise addition of per-layer matrices:
|
||||
|
||||
```rust
|
||||
fn partial_bray(&self) -> (Array2<u64>, Array1<u64>)
|
||||
// Σ_l layer_l.partial_bray_dist_matrix()
|
||||
|
||||
fn partial_euclidean(&self) -> Array2<f64>
|
||||
// Σ_l layer_l.partial_euclidean_dist_matrix()
|
||||
|
||||
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)
|
||||
// Σ_l layer_l.partial_jaccard_dist_matrix() [bit matrix]
|
||||
// Σ_l layer_l.partial_threshold_jaccard_dist_matrix() [int matrix]
|
||||
|
||||
fn partial_hamming(&self) -> Array2<u64>
|
||||
// Σ_l layer_l.partial_hamming_dist_matrix() [bit matrix]
|
||||
```
|
||||
|
||||
### Normalised partials (require global sums from above)
|
||||
|
||||
```rust
|
||||
fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
|
||||
// Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)
|
||||
|
||||
fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
|
||||
// Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)
|
||||
|
||||
fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
|
||||
// Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
|
||||
```
|
||||
|
||||
`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it.
|
||||
|
||||
---
|
||||
|
||||
## PartitionedDataStore — aggregation across all partitions
|
||||
|
||||
A `PartitionedDataStore` holds one `LayeredDataStore` per partition:
|
||||
|
||||
```rust
|
||||
struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
|
||||
struct PartitionedBitMatrix { partitions: Vec<LayeredBitMatrix> }
|
||||
```
|
||||
|
||||
### Column statistics
|
||||
|
||||
```rust
|
||||
fn sum(&self) -> Array1<u64>
|
||||
// = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
|
||||
```
|
||||
|
||||
`p.sum()` is itself a reduction across layers (see above) — the cascade is preserved.
|
||||
|
||||
### Self-contained metrics — single pass
|
||||
|
||||
```rust
|
||||
fn bray_dist_matrix(&self) -> Array2<f64> {
|
||||
let (sum_min, col_sums) = partitions
|
||||
.par_iter()
|
||||
.map(|p| p.partial_bray())
|
||||
.reduce(element-wise +);
|
||||
// finalise
|
||||
for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
|
||||
trait BitPartials: ColumnWeights {
|
||||
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
|
||||
fn partial_hamming(&self) -> Array2<u64>;
|
||||
// provided
|
||||
fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
|
||||
fn hamming_dist_matrix(&self) -> Array2<u64> { … }
|
||||
}
|
||||
```
|
||||
|
||||
### Normalised metrics — two passes
|
||||
**Leaf implementors** (in `obicompactvec`):
|
||||
|
||||
| Type | Traits |
|
||||
|---|---|
|
||||
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` |
|
||||
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` |
|
||||
|
||||
`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators.
|
||||
|
||||
---
|
||||
|
||||
## `LayeredStore<S>` — `obilayeredmap`
|
||||
|
||||
A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types:
|
||||
|
||||
```rust
|
||||
pub struct LayeredStore<S>(Vec<S>);
|
||||
```
|
||||
|
||||
Three blanket impls propagate the traits up the hierarchy:
|
||||
|
||||
```rust
|
||||
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ across inner stores
|
||||
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // same pattern
|
||||
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // same pattern
|
||||
```
|
||||
|
||||
Because the blanket impl is recursive, **`LayeredStore<LayeredStore<S>>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed:
|
||||
|
||||
```
|
||||
PersistentCompactIntMatrix implements CountPartials
|
||||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition)
|
||||
LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index)
|
||||
```
|
||||
|
||||
### Normalised metrics — two-pass cascade
|
||||
|
||||
The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
|
||||
|
||||
```rust
|
||||
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
|
||||
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
||||
// pass 1 — progressive: PartitionedDataStore::sum()
|
||||
// calls LayeredDataStore::sum() per partition (parallel)
|
||||
// calls PersistentCompactIntMatrix::sum() per layer (parallel)
|
||||
let global_sums = self.sum();
|
||||
|
||||
// pass 2 — per-partition partial using global_sums (parallel)
|
||||
let matrix = partitions
|
||||
.par_iter()
|
||||
.map(|p| p.partial_relfreq_bray(&global_sums))
|
||||
.reduce(element-wise +);
|
||||
// finalise
|
||||
for (i,j): dist[i,j] = 1 - matrix[i,j]
|
||||
let global = self.col_weights(); // pass 1 — progressive sum at every level
|
||||
let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
|
||||
p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately)
|
||||
}
|
||||
```
|
||||
|
||||
`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.
|
||||
`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy.
|
||||
|
||||
---
|
||||
|
||||
@@ -277,11 +254,13 @@ fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
|
||||
|
||||
| Level | Unit | Coordination |
|
||||
|---|---|---|
|
||||
| Across partitions | `LayeredDataStore` | none — fully independent |
|
||||
| Across layers (self-contained) | `(partition, layer)` pair | none — disjoint kmer sets |
|
||||
| Across layers (normalised, pass 1) | `(partition, layer)` pair | none — sums are additive |
|
||||
| Across layers (normalised, pass 2) | `(partition, layer)` pair | global_sums broadcast read-only |
|
||||
| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none — rayon par_iter |
|
||||
| Across partitions | `LayeredStore<LayeredStore<S>>` inner stores | none — fully independent |
|
||||
| Across layers within a partition | `LayeredStore<S>` inner stores | none — disjoint kmer sets |
|
||||
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
|
||||
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
|
||||
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |
|
||||
|
||||
All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction.
|
||||
|
||||
---
|
||||
|
||||
@@ -331,16 +310,20 @@ Other derivations: threshold a count matrix → binary presence matrix; union tw
|
||||
|
||||
## Relationship to current implementation
|
||||
|
||||
The current `obilayeredmap` crate implements a subset of this architecture. Key divergences:
|
||||
### What is implemented
|
||||
|
||||
- `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
|
||||
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
|
||||
- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API.
|
||||
- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions.
|
||||
- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
|
||||
- **`obilayeredmap::LayeredStore<S>`**: generic wrapper with blanket impls for all three traits. `LayeredStore<LayeredStore<S>>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
|
||||
|
||||
### What is not yet implemented
|
||||
|
||||
- `Layer<D: LayerData>` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported.
|
||||
- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`.
|
||||
- No `PartitionedIndex` type for point queries with parallel partition dispatch.
|
||||
|
||||
### Planned refactoring
|
||||
|
||||
Planned refactoring:
|
||||
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
|
||||
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
|
||||
3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above.
|
||||
4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics.
|
||||
5. Implement `PartitionedIndex` for point queries with parallel dispatch.
|
||||
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
|
||||
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
|
||||
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
|
||||
|
||||
Reference in New Issue
Block a user