docs: document k-mer index architecture and refactor distance metrics
Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
@@ -58,12 +58,230 @@ trait DataStore {
|
||||
|
||||
Concrete types from `obicompactvec`:

| Type | `Item` | Use |
|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | presence per sample per slot |

| Type | `Item` | Column stats | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |

A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. The path to its on-disk files is managed by the `LayeredPartition`, not embedded in the store type.

`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.

A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
|
||||
|
||||
---

## Distance matrix API on DataStore types

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.

### Full distance matrices

Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.

```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2<f64>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
fn euclidean_dist_matrix(&self) -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self) -> Array2<f64>
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>

// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
```

These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.

### Partial distance matrices
|
||||
|
||||
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
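
The additivity can be checked on toy data. The following is a minimal sketch, assuming plain `Vec` rows in place of the crate's matrix types so that it stays self-contained; `Counts`, `Partial`, `bray_partial`, `add_partials` and `bray_finalise` are hypothetical names, not part of `obicompactvec`. It computes the Bray-Curtis components (`sum_min`, `col_sums`) for two layers, adds them element-wise, and only then computes the distance.

```rust
/// Toy stand-in for one (partition, layer) count matrix:
/// `m[slot][col]` is the count of that slot's kmer in sample `col`.
type Counts = Vec<Vec<u32>>;
/// Additive Bray-Curtis components: (sum_min[i][j], col_sums[k]).
type Partial = (Vec<Vec<u64>>, Vec<u64>);

/// Components for a single matrix. Assumes every row has `n_cols` entries.
fn bray_partial(m: &Counts, n_cols: usize) -> Partial {
    let mut sum_min = vec![vec![0u64; n_cols]; n_cols];
    let mut col_sums = vec![0u64; n_cols];
    for row in m {
        for i in 0..n_cols {
            col_sums[i] += u64::from(row[i]);
            for j in 0..n_cols {
                sum_min[i][j] += u64::from(row[i].min(row[j]));
            }
        }
    }
    (sum_min, col_sums)
}

/// Element-wise sum of two partials (the cross-layer reduction step).
fn add_partials(mut acc: Partial, other: &Partial) -> Partial {
    for (row_acc, row) in acc.0.iter_mut().zip(&other.0) {
        for (x, y) in row_acc.iter_mut().zip(row) {
            *x += *y;
        }
    }
    for (x, y) in acc.1.iter_mut().zip(&other.1) {
        *x += *y;
    }
    acc
}

/// Final distance: 1 - 2 * sum_min[i][j] / (col_sums[i] + col_sums[j]).
/// Returning 0.0 for two empty samples is an assumption of this sketch.
fn bray_finalise(sum_min: &[Vec<u64>], col_sums: &[u64], i: usize, j: usize) -> f64 {
    let denom = col_sums[i] + col_sums[j];
    if denom == 0 {
        return 0.0;
    }
    1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64
}
```

Summing the partials of two layers gives the same components as computing them on the concatenated rows, which is the property the layered and partitioned levels rely on.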
|
||||
|
||||
**Category 1 — self-contained partials**: additive without any external parameter.

```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
    -> (Array2<u64>,   // sum_min[i,j]
        Array1<u64>)   // col_sums[k]

fn partial_euclidean_dist_matrix(&self) -> Array2<f64>  // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]

// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2<u64>    // differing bits
```

**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.

```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)

fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²

fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²
```
|
||||
|
||||
The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
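
A small numeric check makes the constraint concrete. This sketch assumes plain `Vec` rows instead of the crate's types, and `col_sums` / `relfreq_bray_partial` are hypothetical helpers written for illustration only. It evaluates `Σ_slot min(a_slot/sum_i, b_slot/sum_j)` for one column pair, once with global sums and once with per-layer sums.

```rust
/// Toy stand-in for one (partition, layer) count matrix: `m[slot][col]`.
type Counts = Vec<Vec<u32>>;

/// Column sums of a single matrix.
fn col_sums(m: &Counts, n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for row in m {
        for k in 0..n_cols {
            sums[k] += u64::from(row[k]);
        }
    }
    sums
}

/// Σ_slot min(a_slot / sums[i], b_slot / sums[j]) for one matrix.
/// Assumes `sums[i]` and `sums[j]` are non-zero.
fn relfreq_bray_partial(m: &Counts, sums: &[u64], i: usize, j: usize) -> f64 {
    m.iter()
        .map(|row| {
            let p = f64::from(row[i]) / sums[i] as f64;
            let q = f64::from(row[j]) / sums[j] as f64;
            p.min(q)
        })
        .sum()
}
```

With layer A holding the single row `[2, 1]` and layer B the single row `[2, 3]`, the global sums are `[4, 4]` and the two partials add up to 0.25 + 0.5 = 0.75, giving a distance of 0.25. Normalising each layer by its own sums instead yields 1.0 + 1.0 = 2.0, which would give the meaningless distance 1 - 2 = -1.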
|
||||
|
||||
---

## Progressive aggregation principle

Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.

```
PersistentCompactIntMatrix::sum()       — column sums for one (partition, layer) matrix
        ↓  Σ across layers
LayeredCompactIntMatrix::sum()          — column sums for one partition
        ↓  Σ across partitions
PartitionedCompactIntMatrix::sum()      — global column sums
```

The same cascade applies to every partial computation:

```
PersistentCompactIntMatrix::partial_bray_dist_matrix()   — one (partition, layer)
        ↓  element-wise Σ across layers
LayeredCompactIntMatrix::partial_bray()                  — one partition
        ↓  element-wise Σ across partitions
PartitionedCompactIntMatrix::partial_bray()              — global partial → final dist
```
|
||||
|
||||
This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above.
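
The layering rule can be sketched with three toy types, where each `sum()` calls only the `sum()` of the level directly below. `ToyMatrix`, `ToyLayered`, `ToyPartitioned` and `add` are hypothetical names that mirror the roles of the real types, and the sketch uses sequential iterators rather than rayon to stay dependency-free.

```rust
/// Stand-in for one (partition, layer) matrix: `rows[slot][col]`.
struct ToyMatrix {
    rows: Vec<Vec<u32>>,
    n_cols: usize,
}
/// Stand-in for one partition: a stack of layers.
struct ToyLayered {
    layers: Vec<ToyMatrix>,
}
/// Stand-in for the whole index: a set of partitions.
struct ToyPartitioned {
    partitions: Vec<ToyLayered>,
}

/// Element-wise addition of two column-sum vectors.
fn add(mut acc: Vec<u64>, x: Vec<u64>) -> Vec<u64> {
    for (a, b) in acc.iter_mut().zip(x) {
        *a += b;
    }
    acc
}

impl ToyMatrix {
    /// Column sums of one matrix; assumes every row has `n_cols` entries.
    fn sum(&self) -> Vec<u64> {
        let mut s = vec![0u64; self.n_cols];
        for row in &self.rows {
            for (k, &c) in row.iter().enumerate() {
                s[k] += u64::from(c);
            }
        }
        s
    }
}

impl ToyLayered {
    /// Only talks to `ToyMatrix::sum`, never to raw rows.
    fn sum(&self) -> Vec<u64> {
        self.layers.iter().map(|m| m.sum()).reduce(add).unwrap_or_default()
    }
}

impl ToyPartitioned {
    /// Only talks to `ToyLayered::sum`, never to individual layers.
    fn sum(&self) -> Vec<u64> {
        self.partitions.iter().map(|p| p.sum()).reduce(add).unwrap_or_default()
    }
}
```

Swapping `iter()` for a parallel iterator at each level would not change the structure, since every reduction is a plain element-wise sum.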
|
||||
|
||||
---

## LayeredDataStore — aggregation within one partition

A `LayeredDataStore` holds one `DataStore` per layer within a single partition:

```rust
struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
struct LayeredBitMatrix        { layers: Vec<PersistentBitMatrix> }
```

### Column statistics

```rust
// LayeredCompactIntMatrix
fn sum(&self) -> Array1<u64>
// = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)

// LayeredBitMatrix
fn count_ones(&self) -> Array1<u64>
// = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
```

### Self-contained partials

Each method reduces across layers by element-wise addition of per-layer matrices:

```rust
fn partial_bray(&self) -> (Array2<u64>, Array1<u64>)
// Σ_l layer_l.partial_bray_dist_matrix()

fn partial_euclidean(&self) -> Array2<f64>
// Σ_l layer_l.partial_euclidean_dist_matrix()

fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)
// Σ_l layer_l.partial_jaccard_dist_matrix()             [bit matrix]
// Σ_l layer_l.partial_threshold_jaccard_dist_matrix()   [int matrix]

fn partial_hamming(&self) -> Array2<u64>
// Σ_l layer_l.partial_hamming_dist_matrix()             [bit matrix]
```

### Normalised partials (require global sums from above)

```rust
fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)

fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)

fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
```

`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it.
|
||||
|
||||
---

## PartitionedDataStore — aggregation across all partitions

A `PartitionedDataStore` holds one `LayeredDataStore` per partition:

```rust
struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
struct PartitionedBitMatrix        { partitions: Vec<LayeredBitMatrix> }
```

### Column statistics

```rust
fn sum(&self) -> Array1<u64>
// = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
```

`p.sum()` is itself a reduction across layers (see above) — the cascade is preserved.

### Self-contained metrics — single pass

```rust
fn bray_dist_matrix(&self) -> Array2<f64> {
    let (sum_min, col_sums) = partitions
        .par_iter()
        .map(|p| p.partial_bray())
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
}
```

### Normalised metrics — two passes

```rust
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    // pass 1 — progressive: PartitionedDataStore::sum()
    //   calls LayeredDataStore::sum() per partition (parallel)
    //     calls PersistentCompactIntMatrix::sum() per layer (parallel)
    let global_sums = self.sum();

    // pass 2 — per-partition partial using global_sums (parallel)
    let matrix = partitions
        .par_iter()
        .map(|p| p.partial_relfreq_bray(&global_sums))
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - matrix[i,j]
}
```
|
||||
|
||||
`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.
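
The two-pass scheme can be written out as compilable Rust on toy data. This is a sketch under stated assumptions: nested `Vec`s replace the partitioned and layered matrix types, the loops are sequential rather than rayon-parallel, and `Counts`, `Partition`, `global_sums` and `relfreq_bray` are hypothetical names. The finalisation `1 - matrix[i,j]` follows from the Bray-Curtis formula because each column of relative frequencies sums to 1, so the denominator is 2 and `1 - 2·Σmin / 2 = 1 - Σmin`.

```rust
/// One (partition, layer) matrix: `m[slot][col]`.
type Counts = Vec<Vec<u32>>;
/// One partition: its layers.
type Partition = Vec<Counts>;

/// Pass 1: global column sums over every (partition, layer) pair.
fn global_sums(parts: &[Partition], n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for partition in parts {
        for layer in partition {
            for row in layer {
                for k in 0..n_cols {
                    sums[k] += u64::from(row[k]);
                }
            }
        }
    }
    sums
}

/// Pass 2 and finalisation: relative-frequency Bray-Curtis distances.
/// Assumes every global column sum is non-zero.
fn relfreq_bray(parts: &[Partition], n_cols: usize) -> Vec<Vec<f64>> {
    let sums = global_sums(parts, n_cols);
    let mut acc = vec![vec![0.0f64; n_cols]; n_cols];
    for layer in parts.iter().flatten() {
        for row in layer {
            for i in 0..n_cols {
                for j in 0..n_cols {
                    let p = f64::from(row[i]) / sums[i] as f64;
                    let q = f64::from(row[j]) / sums[j] as f64;
                    acc[i][j] += p.min(q); // additive partial
                }
            }
        }
    }
    // finalise: dist[i][j] = 1 - Σ min(p, q)
    acc.iter()
        .map(|r| r.iter().map(|s| 1.0 - s).collect())
        .collect()
}
```

For two partitions holding the rows `[2, 1]` and `[2, 3]`, pass 1 gives sums `[4, 4]` and the off-diagonal distance is 1 - (0.25 + 0.5) = 0.25.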
|
||||
|
||||
---

## Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredDataStore` | none — fully independent |
| Across layers (self-contained) | `(partition, layer)` pair | none — disjoint kmer sets |
| Across layers (normalised, pass 1) | `(partition, layer)` pair | none — sums are additive |
| Across layers (normalised, pass 2) | `(partition, layer)` pair | global_sums broadcast read-only |
| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none — rayon par_iter |

---
|
||||
|
||||
@@ -80,65 +298,19 @@ for each layer l in p:
|
||||
return None
|
||||
```
|
||||
|
||||
O(n_layers) MPHF probes in the worst case; O(1) expected (kmer in layer 0). No cross-layer data fusion — the result comes from exactly one layer.
|
||||
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
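
As a rough illustration of the probe order, the sketch below walks the layers of one partition and returns the value from the first layer that knows the kmer. It is only a sketch: a `HashMap` stands in for the MPHF plus its membership check, and `ToyLayer` and `point_query` are hypothetical names, not the crate's API.

```rust
use std::collections::HashMap;

/// Toy layer: `slots` plays the role of the MPHF (kmer -> slot),
/// `data` plays the role of the attached data store (slot -> item).
struct ToyLayer {
    slots: HashMap<u64, usize>,
    data: Vec<u32>,
}

/// Probe layers in order; the first hit wins and no results are fused.
fn point_query(layers: &[ToyLayer], kmer: u64) -> Option<u32> {
    for layer in layers {
        if let Some(&slot) = layer.slots.get(&kmer) {
            return Some(layer.data[slot]);
        }
    }
    None
}
```

A kmer stored in the first layer costs one probe; an absent kmer costs one probe per layer before `None` is returned.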
|
||||
|
||||
### Sequence scan — `sequence → Vec<(kmer, Option<Item>)>`

Decompose into canonical kmers, group by partition, dispatch to each partition in parallel. Within a partition, probe layers in order per kmer. Collect results.

Parallelism: across partitions (independent). Within a partition: per-kmer probing is sequential across layers but different kmers are independent.

### Aggregation — `→ Accumulator`

For operations that traverse all kmers (distance, presence matrix, global counts):

### Aggregation — `→ Result`

```
result = reduce(
    for p in partitions:       // parallel
        for l in layers(p):    // parallel
            partial(DataStore_p_l)
)
```

Each `(partition, layer)` contributes an independent `Partial`. Global result = `reduce(all partials)`.
|
||||
|
||||
---

## Aggregator pattern

```rust
trait Aggregator<D: DataStore> {
    type Partial: Send;
    type Result;
    fn partial(&self, store: &D) -> Self::Partial;
    fn reduce(&self, parts: impl Iterator<Item=Self::Partial>) -> Self::Result;
}
```

Concrete aggregators:

| Aggregator | `Partial` | `Result` |
|---|---|---|
| `BrayCurtis(i, j)` | `(sum_min, sum_a, sum_b): (u64, u64, u64)` | `f64` |
| `Jaccard(i, j)` | `(inter, union): (u64, u64)` | `f64` |
| `Hellinger(i, j)` | `(sum_sqrt_prod, sum_a, sum_b): (f64, f64, f64)` | `f64` |
| `DistanceMatrix(metric)` | `n×n partial matrix` | `n×n f64 matrix` |
| `PresenceQuery(kmer)` | — | routed to point query |
|
||||
|
||||
The `partial` for `BrayCurtis(i, j)` on a `PersistentCompactIntMatrix` with columns i and j already exists as `PersistentCompactIntVec::partial_bray_dist` — it needs to be lifted to the column-pair level on the matrix.
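
One way the `BrayCurtis(i, j)` row of the table could look in code is sketched below. It is an illustration only: `DataStore` is reduced to an empty placeholder trait because its real definition is not reproduced here, and `ToyCounts` is a hypothetical store holding plain rows.

```rust
/// Placeholder for the real `DataStore` trait (assumption of this sketch).
trait DataStore {}

/// Hypothetical store: `rows[slot][col]` counts.
struct ToyCounts {
    rows: Vec<Vec<u32>>,
}
impl DataStore for ToyCounts {}

trait Aggregator<D: DataStore> {
    type Partial: Send;
    type Result;
    fn partial(&self, store: &D) -> Self::Partial;
    fn reduce(&self, parts: impl Iterator<Item = Self::Partial>) -> Self::Result;
}

/// Bray-Curtis distance between columns `i` and `j`.
struct BrayCurtis {
    i: usize,
    j: usize,
}

impl Aggregator<ToyCounts> for BrayCurtis {
    type Partial = (u64, u64, u64); // (sum_min, sum_a, sum_b)
    type Result = f64;

    /// Components contributed by one (partition, layer) store.
    fn partial(&self, store: &ToyCounts) -> Self::Partial {
        store.rows.iter().fold((0u64, 0u64, 0u64), |(m, a, b), row| {
            let (x, y) = (u64::from(row[self.i]), u64::from(row[self.j]));
            (m + x.min(y), a + x, b + y)
        })
    }

    /// Sum the components, then apply 1 - 2 * sum_min / (sum_a + sum_b).
    /// Returning 0.0 for two empty columns is an assumption of this sketch.
    fn reduce(&self, parts: impl Iterator<Item = Self::Partial>) -> f64 {
        let (m, a, b) = parts.fold((0u64, 0u64, 0u64), |acc, p| {
            (acc.0 + p.0, acc.1 + p.1, acc.2 + p.2)
        });
        if a + b == 0 {
            0.0
        } else {
            1.0 - 2.0 * m as f64 / (a + b) as f64
        }
    }
}
```

Calling `partial` on each store and feeding the results to `reduce` gives the same distance as a single pass over all rows, because the three components are plain sums.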
|
||||
|
||||
---

## Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredPartition` | none — fully independent |
| Across layers (aggregation) | `(partition, layer)` pair | none — disjoint kmer sets |
| Within a layer (point query) | n/a — single layer per kmer | n/a |
| DataStore derivation | one `(partition, layer)` per task | none |

The dispatch model: `PartitionedIndex::aggregate(aggregator)` fans out over partitions (rayon `par_iter`), each partition fans out over its layers, collects partials, then a top-level `reduce` combines.
For normalised metrics replace with the two-pass scheme above.

---
|
||||
|
||||
@@ -149,17 +321,11 @@ Because the `MphfLayer` is independent of its data stores, new stores can be der
|
||||
```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store    = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
    attach presence_store to MphfLayer(p, l)
```

Other derivations:
- Threshold a count matrix → binary presence matrix
- Union two presence matrices (same MPHF, different samples)
- Merge two count matrices (saturating add, column-wise)

All derivations are local to a `(partition, layer)` pair and fully parallelisable.
|
||||
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
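
Two of these derivations are easy to sketch on in-memory rows. The functions below are hypothetical illustrations, not the crate's `from_count_matrix`; in particular, treating a count as present when it is greater than or equal to the threshold is an assumption, since the comparison used by the real method is not stated here.

```rust
/// Threshold a count matrix into a presence matrix.
/// Assumption: a cell is present when `count >= threshold`.
fn threshold_presence(counts: &[Vec<u32>], threshold: u32) -> Vec<Vec<bool>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c >= threshold).collect())
        .collect()
}

/// Merge two count matrices of identical shape with a saturating add,
/// so that overflowing cells stay at `u32::MAX` instead of wrapping.
fn merge_counts(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| {
            ra.iter()
                .zip(rb)
                .map(|(&x, &y)| x.saturating_add(y))
                .collect()
        })
        .collect()
}
```

Both functions read only the rows of one matrix (or one pair of matrices sharing the same slots), which is why each `(partition, layer)` pair can be processed as an independent task.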
|
||||
|
||||
---
|
||||
|
||||
@@ -169,11 +335,12 @@ The current `obilayeredmap` crate implements a subset of this architecture. Key
|
||||
|
||||
- `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- The `Aggregator` pattern is not yet implemented; partial distance methods exist on `PersistentCompactIntVec` but are not composed across layers and partitions.
- No `PartitionedIndex` type exists; `LayeredMap` is a single-partition structure.
- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API.
- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions.

Planned refactoring:
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
3. Implement `LayeredPartition` that holds `Vec<MphfLayer>` and attaches data stores externally.
4. Implement `PartitionedIndex` with parallel dispatch and the `Aggregator` pattern.
3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above.
4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics.
5. Implement `PartitionedIndex` for point queries with parallel dispatch.
|
||||
|
||||