# Kmer index architecture
## Fundamental invariant
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
---
## Three-level hierarchy
```
PartitionedIndex
├── LayeredPartition  (one per minimiser bucket)
│   ├── MphfLayer 0       kmer → slot  (immutable bijection)
│   │   ├── DataStore A   slot → T     (e.g. counts)
│   │   └── DataStore B   slot → T     (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...
```
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
---
## MphfLayer — autonomous mapping layer
```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
MphfLayer::n() -> usize // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). It returns `None` for kmers present in other layers or absent from the index.
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.
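To illustrate why the evidence check matters: a minimal perfect hash function maps every input to some slot, including keys outside the set it was built from, so membership has to be confirmed against a stored copy of the key. The sketch below is a toy model and not the crate's implementation. `ToyMphfLayer`, its `evidence` vector and the `kmer % n` stand-in hash are hypothetical, and kmers are modelled as plain `u64` values.

```rust
/// Toy stand-in for an MPHF layer (hypothetical, not the real `MphfLayer`).
/// `evidence[slot]` stores the kmer that owns `slot`.
struct ToyMphfLayer {
    evidence: Vec<u64>,
}

impl ToyMphfLayer {
    /// Build from a key set on which `kmer % n` happens to be a bijection.
    /// A real MPHF construction achieves this for arbitrary key sets.
    /// `u64::MAX` is used as an "empty" sentinel, so it cannot be a key here.
    fn build(kmers: &[u64]) -> Self {
        let n = kmers.len();
        let mut evidence = vec![u64::MAX; n];
        for &k in kmers {
            let slot = (k % n as u64) as usize;
            assert_eq!(evidence[slot], u64::MAX, "toy hash collided");
            evidence[slot] = k;
        }
        ToyMphfLayer { evidence }
    }

    /// Number of slots.
    fn n(&self) -> usize {
        self.evidence.len()
    }

    /// Hash the kmer to a slot, then confirm ownership via the evidence.
    fn find(&self, kmer: u64) -> Option<usize> {
        if self.evidence.is_empty() {
            return None;
        }
        let slot = (kmer % self.n() as u64) as usize;
        (self.evidence[slot] == kmer).then_some(slot)
    }
}
```

With keys `[10, 21, 32]`, the kmer `13` hashes to the same slot as `10`, and only the evidence comparison turns that false hit into `None`.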
---
## DataStore — slot-indexed data
```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}
```
Concrete types from `obicompactvec`:
| Type | `Item` | Column stats | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |
`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
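As a minimal sketch of what "indexed by slot only" means in practice, the block below implements the trait for an in-memory store and computes column sums through the trait alone. `VecCountStore` and `column_sums` are hypothetical names introduced for illustration, and the sketch assumes every row holds exactly `n_cols` entries.

```rust
/// The trait as given in this document.
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical in-memory store: one row of per-sample counts per slot.
struct VecCountStore {
    rows: Vec<Vec<u32>>,
}

impl DataStore for VecCountStore {
    type Item = Box<[u32]>;

    fn get(&self, slot: usize) -> Box<[u32]> {
        self.rows[slot].clone().into_boxed_slice()
    }

    fn n(&self) -> usize {
        self.rows.len()
    }
}

/// Column sums computed through the trait only: no kmer, no MPHF involved.
fn column_sums<S: DataStore<Item = Box<[u32]>>>(store: &S, n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for slot in 0..store.n() {
        for (c, v) in store.get(slot).iter().enumerate() {
            sums[c] += u64::from(*v);
        }
    }
    sums
}
```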
---
## Distance matrix API on DataStore types
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.
### Full distance matrices
Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2<f64>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
fn euclidean_dist_matrix(&self) -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self) -> Array2<f64>
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
```
These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.
### Partial distance matrices
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
**Category 1 — self-contained partials**: additive without any external parameter.
```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
    -> (Array2<u64>,   // sum_min[i,j]
        Array1<u64>)   // col_sums[k]
fn partial_euclidean_dist_matrix(&self) -> Array2<f64>   // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]
// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2<u64>     // differing bits
```
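The additivity of self-contained partials can be checked on a small example. The sketch below uses plain `Vec` in place of `ndarray` so that it needs no external crate. The function names and the `rows[slot][sample]` layout are assumptions made for illustration, not the `obicompactvec` API.

```rust
/// Partial Bray-Curtis components for one block of slots.
/// `rows[slot][sample]` is a count. Returns (sum_min, col_sums).
fn partial_bray(rows: &[Vec<u32>], n_cols: usize) -> (Vec<Vec<u64>>, Vec<u64>) {
    let mut sum_min = vec![vec![0u64; n_cols]; n_cols];
    let mut col_sums = vec![0u64; n_cols];
    for row in rows {
        for i in 0..n_cols {
            col_sums[i] += u64::from(row[i]);
            for j in 0..n_cols {
                sum_min[i][j] += u64::from(row[i].min(row[j]));
            }
        }
    }
    (sum_min, col_sums)
}

/// Element-wise sum of two partials of identical shape.
fn add_partials(
    a: (Vec<Vec<u64>>, Vec<u64>),
    b: (Vec<Vec<u64>>, Vec<u64>),
) -> (Vec<Vec<u64>>, Vec<u64>) {
    let m: Vec<Vec<u64>> = a.0.iter().zip(&b.0)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(x, y)| x + y).collect())
        .collect();
    let s: Vec<u64> = a.1.iter().zip(&b.1).map(|(x, y)| x + y).collect();
    (m, s)
}

/// Finalise: dist[i][j] = 1 - 2 * sum_min[i][j] / (col_sums[i] + col_sums[j]).
/// Two all-zero columns give NaN here; the sketch does not special-case them.
fn finalise_bray(sum_min: &[Vec<u64>], col_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = col_sums.len();
    (0..n)
        .map(|i| {
            (0..n)
                .map(|j| {
                    let denom = (col_sums[i] + col_sums[j]) as f64;
                    1.0 - 2.0 * sum_min[i][j] as f64 / denom
                })
                .collect()
        })
        .collect()
}
```

Splitting the slots into two blocks, computing each partial separately and adding them yields the same `(sum_min, col_sums)` as a single pass over all slots, which is the property the layered and partitioned levels rely on.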
**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²
```
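The `col_sums` parameter must hold the global column sums, accumulated over all layers and all partitions. Passing a per-layer sum would normalise each layer by a different denominator, so the per-layer partials would no longer add up to the correct global value. This constraint drives the two-pass algorithm described below.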
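The effect of the denominator can be seen on two tiny layers. The following is a sketch on plain `Vec` data, with a hypothetical free function standing in for `partial_relfreq_bray_dist_matrix`. It assumes every entry of `col_sums` is non-zero.

```rust
/// Partial for relative-frequency Bray-Curtis on one block of slots:
/// out[i][j] = Σ_slot min(row[i] / col_sums[i], row[j] / col_sums[j]).
/// `col_sums` is expected to hold the global sums, not the sums of `rows`.
fn partial_relfreq_bray(rows: &[Vec<u32>], col_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = col_sums.len();
    let mut out = vec![vec![0.0f64; n]; n];
    for row in rows {
        // Relative frequency of each sample in this slot.
        let freq: Vec<f64> = (0..n)
            .map(|c| f64::from(row[c]) / col_sums[c] as f64)
            .collect();
        for i in 0..n {
            for j in 0..n {
                out[i][j] += freq[i].min(freq[j]);
            }
        }
    }
    out
}
```

For layers `[[2, 1], [0, 3]]` and `[[4, 4]]` the global sums are `[6, 8]`, and the two partials add up to 0.625 for the pair (0, 1). Normalising each layer by its own sums (`[2, 4]` and `[4, 4]`) gives 1.25 instead, which is not even a valid similarity.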
---
## Progressive aggregation principle
Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
```
PersistentCompactIntMatrix::sum() — column sums for one (partition, layer) matrix
↓ Σ across layers
LayeredCompactIntMatrix::sum() — column sums for one partition
↓ Σ across partitions
PartitionedCompactIntMatrix::sum() — global column sums
```
The same cascade applies to every partial computation:
```
PersistentCompactIntMatrix::partial_bray_dist_matrix() — one (partition, layer)
↓ element-wise Σ across layers
LayeredCompactIntMatrix::partial_bray() — one partition
↓ element-wise Σ across partitions
PartitionedCompactIntMatrix::partial_bray() — global partial → final dist
```
This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above.
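The cascade can be sketched with nested `Vec`s, each level calling only the level directly below it. The type aliases and function names here are hypothetical stand-ins for the three matrix types, and the reduction is sequential for brevity.

```rust
type Matrix = Vec<Vec<u32>>;     // rows = slots, columns = samples
type Layered = Vec<Matrix>;      // one matrix per layer
type Partitioned = Vec<Layered>; // one layered store per partition

/// Element-wise addition of two column-sum vectors.
fn add(mut acc: Vec<u64>, x: Vec<u64>) -> Vec<u64> {
    for (a, b) in acc.iter_mut().zip(x) {
        *a += b;
    }
    acc
}

/// Column sums for one (partition, layer) matrix.
fn matrix_sum(m: &Matrix, n_cols: usize) -> Vec<u64> {
    m.iter().fold(vec![0; n_cols], |mut acc, row| {
        for (a, v) in acc.iter_mut().zip(row) {
            *a += u64::from(*v);
        }
        acc
    })
}

/// Column sums for one partition: only calls `matrix_sum`.
fn layered_sum(l: &Layered, n_cols: usize) -> Vec<u64> {
    l.iter().map(|m| matrix_sum(m, n_cols)).fold(vec![0; n_cols], add)
}

/// Global column sums: only calls `layered_sum`, never touches a matrix.
fn partitioned_sum(p: &Partitioned, n_cols: usize) -> Vec<u64> {
    p.iter().map(|l| layered_sum(l, n_cols)).fold(vec![0; n_cols], add)
}
```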
---
## LayeredDataStore — aggregation within one partition
A `LayeredDataStore` holds one `DataStore` per layer within a single partition:
```rust
struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
struct LayeredBitMatrix { layers: Vec<PersistentBitMatrix> }
```
### Column statistics
```rust
// LayeredCompactIntMatrix
fn sum(&self) -> Array1<u64>
// = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)
// LayeredBitMatrix
fn count_ones(&self) -> Array1<u64>
// = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
```
### Self-contained partials
Each method reduces across layers by element-wise addition of per-layer matrices:
```rust
fn partial_bray(&self) -> (Array2<u64>, Array1<u64>)
// Σ_l layer_l.partial_bray_dist_matrix()
fn partial_euclidean(&self) -> Array2<f64>
// Σ_l layer_l.partial_euclidean_dist_matrix()
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)
// Σ_l layer_l.partial_jaccard_dist_matrix() [bit matrix]
// Σ_l layer_l.partial_threshold_jaccard_dist_matrix() [int matrix]
fn partial_hamming(&self) -> Array2<u64>
// Σ_l layer_l.partial_hamming_dist_matrix() [bit matrix]
```
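As an illustration of the layer-level reduction, here is a sketch of the Jaccard case on bit data. It uses `Vec<Vec<bool>>` in place of `PersistentBitMatrix`, the function names are hypothetical, and returning 0.0 for an empty union is a convention chosen for this sketch only.

```rust
type Counts = Vec<Vec<u64>>;

/// Per-layer Jaccard partial: `rows[slot][sample]` is presence.
/// inter[i][j] counts slots set in both samples, uni[i][j] slots set in either.
fn partial_jaccard(rows: &[Vec<bool>], n: usize) -> (Counts, Counts) {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for row in rows {
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += u64::from(row[i] && row[j]);
                uni[i][j] += u64::from(row[i] || row[j]);
            }
        }
    }
    (inter, uni)
}

/// Layer-level reduction: element-wise sum of the per-layer partials.
fn layered_partial_jaccard(layers: &[Vec<Vec<bool>>], n: usize) -> (Counts, Counts) {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for layer in layers {
        let (li, lu) = partial_jaccard(layer, n);
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += li[i][j];
                uni[i][j] += lu[i][j];
            }
        }
    }
    (inter, uni)
}

/// Finalise one pair once all partials have been summed.
fn jaccard_dist(inter: u64, uni: u64) -> f64 {
    if uni == 0 { 0.0 } else { 1.0 - inter as f64 / uni as f64 }
}
```

The division happens only once, after the sums. Computing a distance per layer and averaging would not give the same result.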
### Normalised partials (require global sums from above)
```rust
fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)
fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)
fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
```
`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it.
---
## PartitionedDataStore — aggregation across all partitions
A `PartitionedDataStore` holds one `LayeredDataStore` per partition:
```rust
struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
struct PartitionedBitMatrix { partitions: Vec<LayeredBitMatrix> }
```
### Column statistics
```rust
fn sum(&self) -> Array1<u64>
// = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
```
`p.sum()` is itself a reduction across layers (see above) — the cascade is preserved.
### Self-contained metrics — single pass
```rust
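fn bray_dist_matrix(&self) -> Array2<f64> {
    // Sum the per-partition partials element-wise.
    let (sum_min, col_sums) = self.partitions
        .par_iter()
        .map(|p| p.partial_bray())
        .reduce_with(|(m1, c1), (m2, c2)| (m1 + m2, c1 + c2))
        .expect("at least one partition");
    // finalise: dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
    Array2::from_shape_fn(sum_min.dim(), |(i, j)| {
        1.0 - 2.0 * sum_min[[i, j]] as f64 / (col_sums[i] + col_sums[j]) as f64
    })
}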
```
### Normalised metrics — two passes
```rust
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    // pass 1 (progressive): PartitionedDataStore::sum()
    //   calls LayeredDataStore::sum() per partition (parallel)
    //   calls PersistentCompactIntMatrix::sum() per layer (parallel)
    let global_sums = self.sum();
    // pass 2: per-partition partial using global_sums (parallel)
    let matrix = self.partitions
        .par_iter()
        .map(|p| p.partial_relfreq_bray(&global_sums))
        .reduce_with(|a, b| a + b)
        .expect("at least one partition");
    // finalise: dist[i,j] = 1 - matrix[i,j]
    matrix.mapv(|s| 1.0 - s)
}
```
`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.
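The finalisation `dist[i,j] = 1 - matrix[i,j]` follows from the Bray formula once counts are replaced by relative frequencies. The short derivation below uses symbols introduced here for illustration: $a_s$ and $b_s$ are the counts of samples $i$ and $j$ in slot $s$, the sums run over every slot of every (partition, layer) pair, and $S_i$, $S_j$ are the global column sums.

```latex
\begin{aligned}
p_s &= \frac{a_s}{S_i}, \qquad q_s = \frac{b_s}{S_j}, \qquad
\sum_s p_s = \sum_s q_s = 1 \\
d_{ij} &= 1 - \frac{2 \sum_s \min(p_s, q_s)}{\sum_s p_s + \sum_s q_s}
        = 1 - \sum_s \min\!\left(\frac{a_s}{S_i}, \frac{b_s}{S_j}\right)
\end{aligned}
```

The denominator collapses to 2 only because $S_i$ and $S_j$ are global sums. With per-layer sums the frequencies would not sum to 1 across the whole index and the simplification would not hold.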
---
## Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredDataStore` | none — fully independent |
| Across layers (self-contained) | `(partition, layer)` pair | none — disjoint kmer sets |
| Across layers (normalised, pass 1) | `(partition, layer)` pair | none — sums are additive |
| Across layers (normalised, pass 2) | `(partition, layer)` pair | global_sums broadcast read-only |
| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none — rayon par_iter |
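A minimal sketch of the "no coordination" rows of this table: each partition is reduced in its own thread, and results meet only at the final join. `std::thread::scope` is used here in place of rayon so the example needs no external crate. The function names and the `Vec`-based partition type are hypothetical.

```rust
/// Column sums for one partition (rows = slots, columns = samples).
fn partition_sum(rows: &[Vec<u32>], n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for row in rows {
        for (s, v) in sums.iter_mut().zip(row) {
            *s += u64::from(*v);
        }
    }
    sums
}

/// One scoped thread per partition. Workers share nothing mutable,
/// so no lock or channel is needed; partial sums are added after join.
fn parallel_sum(partitions: &[Vec<Vec<u32>>], n_cols: usize) -> Vec<u64> {
    std::thread::scope(|s| {
        let handles: Vec<_> = partitions
            .iter()
            .map(|p| s.spawn(move || partition_sum(p, n_cols)))
            .collect();
        let mut total = vec![0u64; n_cols];
        for h in handles {
            let part = h.join().expect("worker panicked");
            for (t, v) in total.iter_mut().zip(part) {
                *t += v;
            }
        }
        total
    })
}
```

Spawning one OS thread per partition is only reasonable for a toy example. A work-stealing pool such as the rayon iterators used elsewhere in this document is the better fit for many small partitions.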
---
## Query model
### Point query — `kmer → Option<Item>`
```
minimiser(kmer) → partition p
for each layer l in p:
slot = MphfLayer_l.find(kmer)
if slot is Some:
return DataStore_l.get(slot)
return None
```
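The worst case is O(n_layers) MPHF probes, reached when the kmer sits in the last layer or is absent; a kmer found in layer 0 costs a single probe. There is no cross-layer fusion: the result comes from exactly one (partition, layer) pair.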
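The point query above can be sketched as follows. Everything in this block is a toy model: a `HashMap` stands in for `MphfLayer::find`, a `Vec<u32>` for the data store, and routing uses the kmer value modulo the partition count rather than a canonical minimiser hash. All type and function names are hypothetical.

```rust
/// Hypothetical layer: `slots` plays the role of `MphfLayer::find`
/// (MPHF + evidence check) and `store[slot]` the role of `DataStore::get`.
struct ToyLayer {
    slots: std::collections::HashMap<u64, usize>,
    store: Vec<u32>,
}

/// Build a toy layer from (kmer, value) pairs; slot = position in the list.
fn layer(entries: &[(u64, u32)]) -> ToyLayer {
    let slots = entries.iter().enumerate().map(|(s, &(k, _))| (k, s)).collect();
    let store = entries.iter().map(|&(_, v)| v).collect();
    ToyLayer { slots, store }
}

struct ToyIndex {
    partitions: Vec<Vec<ToyLayer>>,
}

impl ToyIndex {
    /// Stand-in routing (assumes at least one partition).
    fn route(&self, kmer: u64) -> usize {
        (kmer % self.partitions.len() as u64) as usize
    }

    /// Probe the layers of one partition in order and stop at the first hit.
    fn get(&self, kmer: u64) -> Option<u32> {
        let p = self.route(kmer);
        self.partitions[p]
            .iter()
            .find_map(|l| l.slots.get(&kmer).map(|&slot| l.store[slot]))
    }
}
```

Because layers within a partition are disjoint, stopping at the first hit loses nothing: no later layer can hold the same kmer.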
### Aggregation — `→ Result`
```
result = reduce(
for p in partitions: // parallel
for l in layers(p): // parallel
partial(DataStore_p_l)
)
```
For normalised metrics, replace this single reduction with the two-pass scheme above.
---
## DataStore derivation
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
count_store = open PersistentCompactIntMatrix at (p, l)
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
```
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
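Two of these derivations are sketched below on plain `Vec` data. The `>=` comparison for thresholding and the element-wise reading of the saturating merge (two matrices of identical shape over the same slots) are assumptions of this sketch, and the function names are hypothetical.

```rust
/// Count → presence: a sample is present in a slot when its count
/// reaches `threshold` (the `>=` comparison is an assumption).
fn presence_from_counts(counts: &[Vec<u32>], threshold: u32) -> Vec<Vec<bool>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c >= threshold).collect())
        .collect()
}

/// Merge two count matrices of identical shape with saturating addition,
/// so that an overflowing cell stays at `u32::MAX` instead of wrapping.
fn merge_counts(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| {
            ra.iter()
                .zip(rb)
                .map(|(&x, &y)| x.saturating_add(y))
                .collect()
        })
        .collect()
}
```

Neither function looks at a kmer or an MPHF: both operate slot by slot, which is why the derived store can be attached to the existing `MphfLayer` unchanged.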
---
## Relationship to current implementation
The current `obilayeredmap` crate implements a subset of this architecture. Key divergences:
- `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API.
- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions.
Planned refactoring:
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above.
4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics.
5. Implement `PartitionedIndex` for point queries with parallel dispatch.