# Kmer index architecture

## Fundamental invariant

A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.

---

## Three-level hierarchy

```
PartitionedIndex
├── LayeredPartition  (one per minimiser bucket)
│   ├── MphfLayer 0   kmer → slot  (immutable bijection)
│   │   ├── DataStore A   slot → T  (e.g. counts)
│   │   └── DataStore B   slot → T  (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...
```

**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set: layer 0 is built from dataset A, layer 1 covers kmers in B absent from layer 0, and so on. Layers within a partition are always disjoint.

**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.

---
## MphfLayer: autonomous mapping layer

```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>      // slot, or None if absent
MphfLayer::n() -> usize                                     // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)>   // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```

`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). It returns `None` for kmers present in other layers or absent from the index.

The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.

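
The evidence check is what lets `find` tell a kmer stored in this layer apart from one that merely lands on an occupied slot. The following is a minimal sketch under stated assumptions, not the crate's implementation: kmers are encoded as plain `u64` values, the evidence is assumed to store the owning kmer for each slot (the real layout of `evidence.bin` is not described here), and `ToyLayer`, `raw_slot` and the `HashMap` standing in for the hash function are hypothetical names. With a layer built from `[10, 42, 7]`, `find(42)` gives `Some(1)` while `find(99)` gives `None`, because the slot it lands on is owned by kmer 10.

```rust
use std::collections::HashMap;

/// Toy stand-in for one MPHF layer (all names here are hypothetical).
struct ToyLayer {
    /// Stand-in for the hash function: slot of each kmer it was built on.
    slot_of: HashMap<u64, usize>,
    /// evidence[slot] = the kmer that legitimately owns that slot.
    evidence: Vec<u64>,
}

impl ToyLayer {
    /// Build from distinct kmers, here encoded as plain `u64` values.
    fn build(kmers: &[u64]) -> Self {
        let slot_of = kmers.iter().enumerate().map(|(slot, &k)| (k, slot)).collect();
        ToyLayer { slot_of, evidence: kmers.to_vec() }
    }

    /// Raw MPHF-like behaviour: a slot comes back even for a kmer that was
    /// never inserted (an arbitrary one, as is typical for an MPHF).
    fn raw_slot(&self, kmer: u64) -> usize {
        match self.slot_of.get(&kmer) {
            Some(&slot) => slot,
            None => (kmer % self.evidence.len() as u64) as usize,
        }
    }

    /// `find` = raw lookup followed by the evidence check.
    fn find(&self, kmer: u64) -> Option<usize> {
        if self.evidence.is_empty() {
            return None;
        }
        let slot = self.raw_slot(kmer);
        (self.evidence[slot] == kmer).then_some(slot)
    }

    /// Number of slots.
    fn n(&self) -> usize {
        self.evidence.len()
    }
}
```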
---
## DataStore: slot-indexed data

```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}
```

Concrete types from `obicompactvec`:

| Type | `Item` | Column stats | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |

`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.

A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.

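
To make the trait concrete, here is a small sketch of an implementation for an in-memory count matrix. It repeats the trait from the text; `InMemoryCounts` and its fields are hypothetical stand-ins for the persistent types, and its `sum` only imitates the column statistic listed in the table, returning a `Vec<u64>` instead of an `Array1<u64>` so the example needs nothing beyond the standard library. For rows `[3, 0]` and `[1, 4]`, `get(1)` yields `[1, 4]` and `sum()` yields `[4, 4]`.

```rust
/// The trait as given in the text.
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical in-memory stand-in for a count matrix:
/// one row per slot, one column per sample.
struct InMemoryCounts {
    n_cols: usize,
    rows: Vec<Vec<u32>>,
}

impl DataStore for InMemoryCounts {
    type Item = Box<[u32]>;

    /// All sample counts stored at one slot.
    fn get(&self, slot: usize) -> Box<[u32]> {
        self.rows[slot].clone().into_boxed_slice()
    }

    /// Number of slots.
    fn n(&self) -> usize {
        self.rows.len()
    }
}

impl InMemoryCounts {
    /// Column sums: total count per sample over every slot of this store.
    fn sum(&self) -> Vec<u64> {
        let mut sums = vec![0u64; self.n_cols];
        for row in &self.rows {
            for (s, &c) in sums.iter_mut().zip(row) {
                *s += u64::from(c);
            }
        }
        sums
    }
}
```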
---
## Distance matrix API on DataStore types

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.

### Full distance matrices

Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.

```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2<f64>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
fn euclidean_dist_matrix(&self) -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self) -> Array2<f64>
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>

// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
```

These are convenience methods for a single matrix. They cannot be used directly on a `LayeredDataStore` or `PartitionedDataStore`: a final distance is generally a non-linear function (a ratio or a square root) of sums over slots, so the distance matrices of individual layers cannot simply be added. The partial API is required instead.
### Partial distance matrices

Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.

**Category 1, self-contained partials**: additive without any external parameter.

```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
    -> (Array2<u64>,   // sum_min[i,j]
        Array1<u64>)   // col_sums[k]

fn partial_euclidean_dist_matrix(&self) -> Array2<f64>   // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]

// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2<u64>     // differing bits
```
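
The additivity of a self-contained partial can be checked on a toy example. The sketch below is an illustration under assumptions, not the `obicompactvec` code: matrices are plain `Vec<Vec<u32>>` with `rows[slot][col]`, results are nested `Vec`s rather than `ndarray` arrays, and `partial_bray` and `add_partials` are hypothetical free functions. Splitting the rows `[3, 1]`, `[0, 2]`, `[5, 5]` into two layers and adding the two partials gives the same `(sum_min, col_sums)` as computing the partial on all three rows at once: `sum_min[0][1]` is 6 and both column sums are 8.

```rust
/// (sum_min[i][j], col_sums[k]), mirroring the tuple returned by
/// `partial_bray_dist_matrix` in the listing above.
type BrayPartial = (Vec<Vec<u64>>, Vec<u64>);

/// Self-contained Bray-Curtis partial for one matrix, where
/// `rows[slot][col]` holds one count (hypothetical in-memory layout).
fn partial_bray(rows: &[Vec<u32>], n_cols: usize) -> BrayPartial {
    let mut sum_min = vec![vec![0u64; n_cols]; n_cols];
    let mut col_sums = vec![0u64; n_cols];
    for row in rows {
        for i in 0..n_cols {
            col_sums[i] += u64::from(row[i]);
            for j in 0..n_cols {
                // Per-slot minimum of the two sample counts.
                sum_min[i][j] += u64::from(row[i].min(row[j]));
            }
        }
    }
    (sum_min, col_sums)
}

/// Element-wise sum of two partials of the same shape.
fn add_partials(a: BrayPartial, b: &BrayPartial) -> BrayPartial {
    let (mut sum_min, mut col_sums) = a;
    for (row_a, row_b) in sum_min.iter_mut().zip(&b.0) {
        for (x, y) in row_a.iter_mut().zip(row_b) {
            *x += *y;
        }
    }
    for (x, y) in col_sums.iter_mut().zip(&b.1) {
        *x += *y;
    }
    (sum_min, col_sums)
}
```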

**Category 2, normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.

```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)

fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²

fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²
```

The `col_sums` parameter must hold the **global** column sums across all layers and all partitions. Passing per-layer sums would normalise each layer by a different denominator and give a wrong result. This constraint drives the two-pass algorithm described below.

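
A small numeric case shows why the sums must be global. The sketch below is an assumption-laden illustration of the Hellinger-style partial from the listing above: `partial_hellinger` is a hypothetical free function over `Vec<Vec<u32>>` rows, it returns nested `Vec`s instead of `Array2<f64>`, and it assumes every column sum is non-zero. Take two layers holding the single rows `[1, 4]` and `[3, 4]`, so the global sums are `[4, 8]`. With global sums the two partials add up to the value computed on the whole matrix (about 0.068 for the pair of columns 0 and 1). With per-layer sums every relative frequency becomes 1, and the summed partial collapses to 0.

```rust
/// Hellinger-style partial for one matrix:
/// out[i][j] = Σ_slot (√(a/sum_i) - √(b/sum_j))²,
/// normalised by caller-supplied column sums.
/// Assumes every entry of `col_sums` is non-zero.
fn partial_hellinger(rows: &[Vec<u32>], col_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = col_sums.len();
    let mut out = vec![vec![0.0; n]; n];
    for row in rows {
        // Square root of the relative frequency of this slot in each column.
        let r: Vec<f64> = (0..n)
            .map(|k| (row[k] as f64 / col_sums[k] as f64).sqrt())
            .collect();
        for i in 0..n {
            for j in 0..n {
                let d = r[i] - r[j];
                out[i][j] += d * d;
            }
        }
    }
    out
}
```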
---
## Progressive aggregation principle

Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.

```
PersistentCompactIntMatrix::sum()      // column sums for one (partition, layer) matrix
        ↓  Σ across layers
LayeredCompactIntMatrix::sum()         // column sums for one partition
        ↓  Σ across partitions
PartitionedCompactIntMatrix::sum()     // global column sums
```

The same cascade applies to every partial computation:

```
PersistentCompactIntMatrix::partial_bray_dist_matrix()   // one (partition, layer)
        ↓  element-wise Σ across layers
LayeredCompactIntMatrix::partial_bray()                  // one partition
        ↓  element-wise Σ across partitions
PartitionedCompactIntMatrix::partial_bray()              // global partial → final dist
```

This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above.

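
The cascade for column sums can be sketched with three functions, each calling only the one directly below it. This is a sequential, std-only illustration: the type aliases `Matrix`, `Layered` and `Partitioned` and the function names are hypothetical, and the real types would presumably use `ndarray` and rayon as described in the text. For one partition with layers `[[1, 2]]` and `[[3, 4], [5, 6]]` plus a second partition with the single layer `[[10, 20]]`, the partition-level sum of the first is `[9, 12]` and the global sum is `[19, 32]`.

```rust
type Matrix = Vec<Vec<u32>>;      // rows[slot][col], one (partition, layer)
type Layered = Vec<Matrix>;       // all layers of one partition
type Partitioned = Vec<Layered>;  // all partitions of the index

/// Element-wise `acc += other`.
fn add_into(acc: &mut [u64], other: &[u64]) {
    for (a, o) in acc.iter_mut().zip(other) {
        *a += *o;
    }
}

/// Level 1: column sums of a single matrix.
fn matrix_sum(m: &Matrix, n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for row in m {
        for (s, &c) in sums.iter_mut().zip(row) {
            *s += u64::from(c);
        }
    }
    sums
}

/// Level 2: reduces over layers, using only `matrix_sum`.
fn layered_sum(p: &Layered, n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for layer in p {
        add_into(&mut sums, &matrix_sum(layer, n_cols));
    }
    sums
}

/// Level 3: reduces over partitions, using only `layered_sum`.
fn partitioned_sum(index: &Partitioned, n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for p in index {
        add_into(&mut sums, &layered_sum(p, n_cols));
    }
    sums
}
```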
---
## LayeredDataStore: aggregation within one partition

A `LayeredDataStore` holds one `DataStore` per layer within a single partition:

```rust
struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
struct LayeredBitMatrix        { layers: Vec<PersistentBitMatrix> }
```

### Column statistics

```rust
// LayeredCompactIntMatrix
fn sum(&self) -> Array1<u64>
// = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)

// LayeredBitMatrix
fn count_ones(&self) -> Array1<u64>
// = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
```
### Self-contained partials

Each method reduces across layers by element-wise addition of per-layer matrices:

```rust
fn partial_bray(&self) -> (Array2<u64>, Array1<u64>)
// Σ_l layer_l.partial_bray_dist_matrix()

fn partial_euclidean(&self) -> Array2<f64>
// Σ_l layer_l.partial_euclidean_dist_matrix()

fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)
// Σ_l layer_l.partial_jaccard_dist_matrix()             [bit matrix]
// Σ_l layer_l.partial_threshold_jaccard_dist_matrix()   [int matrix]

fn partial_hamming(&self) -> Array2<u64>
// Σ_l layer_l.partial_hamming_dist_matrix()             [bit matrix]
```
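
As an illustration of the layer-level reduction, the sketch below sums Jaccard partials over the layers of one partition. It is a sequential, std-only approximation under stated assumptions: bit matrices are `Vec<Vec<bool>>` with `rows[slot][col]`, the intersection and union counts are nested `Vec`s rather than `Array2<u64>`, and `partial_jaccard` and `layered_partial_jaccard` are hypothetical free functions standing in for the methods listed above.

```rust
type Bits = Vec<Vec<bool>>; // rows[slot][col], one layer
/// (inter[i][j], union[i][j]) as plain nested vectors.
type JaccardPartial = (Vec<Vec<u64>>, Vec<Vec<u64>>);

/// Per-layer partial: slots set in both columns, and slots set in either.
fn partial_jaccard(rows: &[Vec<bool>], n: usize) -> JaccardPartial {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for row in rows {
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += u64::from(row[i] && row[j]);
                uni[i][j] += u64::from(row[i] || row[j]);
            }
        }
    }
    (inter, uni)
}

/// Partition-level partial: element-wise sum of the per-layer partials.
fn layered_partial_jaccard(layers: &[Bits], n: usize) -> JaccardPartial {
    let mut acc = (vec![vec![0u64; n]; n], vec![vec![0u64; n]; n]);
    for layer in layers {
        let (inter, uni) = partial_jaccard(layer, n);
        for i in 0..n {
            for j in 0..n {
                acc.0[i][j] += inter[i][j];
                acc.1[i][j] += uni[i][j];
            }
        }
    }
    acc
}
```

With a first layer holding rows `[true, true]` and `[true, false]` and a second layer holding `[false, true]` and `[true, true]`, the summed partial for columns 0 and 1 is an intersection of 2 and a union of 4, so the usual Jaccard distance `1 - inter/union` comes out at 0.5. The division happens once, after the reduction, which is why the counts rather than the distances are what travels up the hierarchy.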
### Normalised partials (require global sums from above)

```rust
fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)

fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)

fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
// Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
```

`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it.

---
## PartitionedDataStore: aggregation across all partitions

A `PartitionedDataStore` holds one `LayeredDataStore` per partition:

```rust
struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
struct PartitionedBitMatrix        { partitions: Vec<LayeredBitMatrix> }
```

### Column statistics

```rust
fn sum(&self) -> Array1<u64>
// = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
```

`p.sum()` is itself a reduction across layers (see above), so the cascade is preserved.
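### Self-contained metrics: single pass

```rust
// Assumes `use ndarray::{Array1, Array2};` and `use rayon::prelude::*;`.
// `self.n_cols()` is a hypothetical accessor for the number of samples.
fn bray_dist_matrix(&self) -> Array2<f64> {
    let n = self.n_cols();
    let (sum_min, col_sums) = self.partitions
        .par_iter()
        .map(|p| p.partial_bray())
        .reduce(
            || (Array2::<u64>::zeros((n, n)), Array1::<u64>::zeros(n)),
            |(m1, s1), (m2, s2)| (m1 + m2, s1 + s2), // element-wise +
        );
    // finalise: dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
    Array2::from_shape_fn((n, n), |(i, j)| {
        1.0 - 2.0 * sum_min[[i, j]] as f64 / (col_sums[i] + col_sums[j]) as f64
    })
}
```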
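
The reduce-then-finalise pattern for Bray-Curtis can be tried without `ndarray` or rayon. The sketch below is a sequential stand-in under assumptions: partials are `(Vec<Vec<u64>>, Vec<u64>)` tuples, `reduce_partials` and `finalise_bray` are hypothetical helper names, and the choice to return 0 when both column sums are zero is an assumption of this example, not something the text specifies.

```rust
/// (sum_min, col_sums) for one partition, as plain nested vectors.
type BrayPartial = (Vec<Vec<u64>>, Vec<u64>);

/// Element-wise sum of all per-partition partials (sequential fold;
/// the text describes a parallel reduction instead).
fn reduce_partials(parts: &[BrayPartial], n: usize) -> BrayPartial {
    let zero = (vec![vec![0u64; n]; n], vec![0u64; n]);
    parts.iter().fold(zero, |(mut m, mut s), (pm, ps)| {
        for i in 0..n {
            s[i] += ps[i];
            for j in 0..n {
                m[i][j] += pm[i][j];
            }
        }
        (m, s)
    })
}

/// dist[i][j] = 1 - 2·sum_min[i][j] / (col_sums[i] + col_sums[j]).
fn finalise_bray(sum_min: &[Vec<u64>], col_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = col_sums.len();
    let mut dist = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            let denom = col_sums[i] + col_sums[j];
            // Assumption of this sketch: two empty columns are at distance 0.
            dist[i][j] = if denom == 0 {
                0.0
            } else {
                1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64
            };
        }
    }
    dist
}
```

Given the two partials `([[3, 1], [1, 3]], [3, 3])` and `([[5, 5], [5, 5]], [5, 5])`, the reduced `sum_min[0][1]` is 6 and both column sums are 8, so the distance between columns 0 and 1 is `1 - 12/16 = 0.25`, and the diagonal is 0.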
### Normalised metrics: two passes

```rust
// Assumes `use ndarray::Array2;` and `use rayon::prelude::*;`.
// `self.n_cols()` is a hypothetical accessor for the number of samples.
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    // pass 1, progressive: PartitionedDataStore::sum()
    //   calls LayeredDataStore::sum() per partition (parallel)
    //     calls PersistentCompactIntMatrix::sum() per layer (parallel)
    let global_sums = self.sum();

    // pass 2: per-partition partial using global_sums (parallel)
    let n = self.n_cols();
    let matrix = self.partitions
        .par_iter()
        .map(|p| p.partial_relfreq_bray(&global_sums))
        .reduce(|| Array2::<f64>::zeros((n, n)), |a, b| a + b); // element-wise +

    // finalise: dist[i,j] = 1 - matrix[i,j]
    matrix.mapv(|v| 1.0 - v)
}
```

`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair, so nothing is counted twice. Pass 1 is itself fully parallel at every level of the hierarchy.

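
The two passes can be followed end to end on a toy index. The sketch below is sequential and std-only, and makes several simplifying assumptions: the index is a nested `Vec` (partitions, then layers, then `rows[slot][col]`), all function names are hypothetical, every column has a non-zero global sum, and pass 1 walks the raw rows directly for brevity instead of cascading through per-level `sum()` calls as the text prescribes. For two partitions holding the single rows `[1, 4]` and `[3, 4]`, the global sums are `[4, 8]`, the summed partial for columns 0 and 1 is `min(0.25, 0.5) + min(0.75, 0.5) = 0.75`, and the distance is 0.25.

```rust
type Matrix = Vec<Vec<u32>>;   // rows[slot][col], one (partition, layer)
type Index = Vec<Vec<Matrix>>; // partitions -> layers -> matrix

/// Pass 1: global column sums (flattened here for brevity).
fn global_sums(index: &Index, n: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n];
    for partition in index {
        for layer in partition {
            for row in layer {
                for (s, &c) in sums.iter_mut().zip(row) {
                    *s += u64::from(c);
                }
            }
        }
    }
    sums
}

/// Normalised partial for one matrix: Σ_slot min(a/sum_i, b/sum_j).
fn partial_relfreq_bray(m: &Matrix, sums: &[u64]) -> Vec<Vec<f64>> {
    let n = sums.len();
    let mut out = vec![vec![0.0; n]; n];
    for row in m {
        let freq: Vec<f64> = (0..n).map(|k| row[k] as f64 / sums[k] as f64).collect();
        for i in 0..n {
            for j in 0..n {
                out[i][j] += freq[i].min(freq[j]);
            }
        }
    }
    out
}

/// Two-pass distance: global sums first, then summed partials, then 1 - x.
fn relfreq_bray_dist_matrix(index: &Index, n: usize) -> Vec<Vec<f64>> {
    let sums = global_sums(index, n); // pass 1
    let mut acc = vec![vec![0.0; n]; n];
    for partition in index {          // pass 2
        for layer in partition {
            let part = partial_relfreq_bray(layer, &sums);
            for i in 0..n {
                for j in 0..n {
                    acc[i][j] += part[i][j];
                }
            }
        }
    }
    for row in acc.iter_mut() {
        for v in row.iter_mut() {
            *v = 1.0 - *v; // finalise
        }
    }
    acc
}
```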
---
## Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredDataStore` | none (fully independent) |
| Across layers (self-contained) | `(partition, layer)` pair | none (disjoint kmer sets) |
| Across layers (normalised, pass 1) | `(partition, layer)` pair | none (sums are additive) |
| Across layers (normalised, pass 2) | `(partition, layer)` pair | `global_sums` broadcast read-only |
| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none (rayon `par_iter`) |

---
## Query model

### Point query: `kmer → Option<Item>`

```
minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some:
        return DataStore_l.get(slot)
return None
```

At most `n_layers` MPHF probes, each O(1): the loop stops at the first layer that contains the kmer, and a kmer that is absent from the partition or stored in its last layer probes every layer. This is effectively O(1) when the number of layers is small. No cross-layer fusion is needed, because the result comes from exactly one (partition, layer).
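
The lookup loop translates almost line for line into Rust. The sketch below is a toy model under stated assumptions: each layer bundles a `HashMap` (standing in for `MphfLayer::find`) with a single count per slot, the caller supplies the minimiser hash, and routing by `hash % n_partitions` is an assumed scheme, since the text only says that routing uses the canonical minimiser hash. `Layer`, `Partition`, `Index` and `query` are hypothetical names.

```rust
use std::collections::HashMap;

/// Toy layer: a map standing in for the MPHF, plus one count per slot.
struct Layer {
    slots: HashMap<u64, usize>,
    counts: Vec<u32>,
}

impl Layer {
    /// Build from (kmer, count) pairs; slot = position in the input.
    fn new(pairs: &[(u64, u32)]) -> Self {
        Layer {
            slots: pairs.iter().enumerate().map(|(slot, &(kmer, _))| (kmer, slot)).collect(),
            counts: pairs.iter().map(|&(_, count)| count).collect(),
        }
    }

    /// Stand-in for `MphfLayer::find`.
    fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

struct Partition {
    layers: Vec<Layer>,
}

struct Index {
    partitions: Vec<Partition>,
}

impl Index {
    /// Route to one partition, then probe its layers in order and
    /// return the data of the first layer that holds the kmer.
    fn query(&self, kmer: u64, minimiser_hash: u64) -> Option<u32> {
        // Assumed routing scheme: hash modulo the partition count.
        let p = (minimiser_hash % self.partitions.len() as u64) as usize;
        self.partitions[p]
            .layers
            .iter()
            .find_map(|layer| layer.find(kmer).map(|slot| layer.counts[slot]))
    }
}
```

In an index whose partition 0 has layers `{11 → 5}` and `{12 → 7}` and whose partition 1 has the single layer `{21 → 9}`, `query(12, 0)` probes two layers and returns `Some(7)`, while `query(21, 0)` returns `None` because the kmer is only looked up in the partition its minimiser routes to.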
### Aggregation: `→ Result`

```
result = reduce(
    for p in partitions:          // parallel
        for l in layers(p):       // parallel
            partial(DataStore_p_l)
)
```

For normalised metrics, replace this with the two-pass scheme above.

---
## DataStore derivation

Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store    = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
```

Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.

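
Two of these derivations are simple enough to sketch on in-memory data. The code below is an illustration under assumptions, not the persistent implementation: matrices are `Vec<Vec<_>>` with `rows[slot][col]`, both function names are hypothetical, a slot counts as present when its count is at least `threshold` (the exact comparison used by `from_count_matrix` is not stated), and the merge is read as an element-wise saturating add between two matrices that share the same `MphfLayer` and therefore the same slots. Thresholding `[[0, 3], [2, 1]]` at 2 gives `[[false, true], [true, false]]`.

```rust
/// Count → presence: a cell is present when its count reaches `threshold`
/// (assumed to be an inclusive comparison).
fn presence_from_counts(counts: &[Vec<u32>], threshold: u32) -> Vec<Vec<bool>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c >= threshold).collect())
        .collect()
}

/// Merge two count matrices indexed by the same slots, cell by cell,
/// clamping at `u32::MAX` instead of overflowing.
fn merge_counts(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    a.iter()
        .zip(b)
        .map(|(row_a, row_b)| {
            row_a
                .iter()
                .zip(row_b)
                .map(|(&x, &y)| x.saturating_add(y))
                .collect()
        })
        .collect()
}
```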
---
## Relationship to current implementation

The current `obilayeredmap` crate implements a subset of this architecture. Key divergences:

- `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API.
- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions.

Planned refactoring:

1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above.
4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics.
5. Implement `PartitionedIndex` for point queries with parallel dispatch.