Eric Coissac 45d49ed501 docs: document k-mer index architecture and refactor distance metrics
Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
2026-05-15 21:24:30 +08:00

# Kmer index architecture
## Fundamental invariant
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
---
## Three-level hierarchy
```
PartitionedIndex
├── LayeredPartition            (one per minimiser bucket)
│   ├── MphfLayer 0             kmer → slot (immutable bijection)
│   │   ├── DataStore A         slot → T (e.g. counts)
│   │   └── DataStore B         slot → T (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...
```
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Layers within a partition always cover disjoint kmer sets: layer 0 is built from dataset A, layer 1 covers the kmers of dataset B that are absent from layer 0, and so on.
**MphfLayer**: the MPHF (minimal perfect hash function) + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
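As a rough illustration of the routing step, the sketch below maps a minimiser hash to a partition index. The function name `route` and the modulo scheme are assumptions made for this example; the text above only states that routing uses the canonical minimiser hash and is fixed at creation.

```rust
/// Hypothetical routing rule: partition index from a canonical minimiser hash.
/// The real scheme is not specified in this document; modulo is illustrative.
fn route(minimiser_hash: u64, n_partitions: usize) -> usize {
    assert!(n_partitions > 0, "an index needs at least one partition");
    (minimiser_hash % n_partitions as u64) as usize
}
```

Because the result depends on `n_partitions`, changing the partition count would reassign kmers to other partitions, which is consistent with the routing scheme being fixed at creation.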
---
## MphfLayer — autonomous mapping layer
```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
MphfLayer::n() -> usize // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index.
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.
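The role of the evidence check in `find` can be pictured with a toy layer. Everything here is a stand-in: `ToyMphfLayer` is not the crate's type, `kmer % n` replaces a real MPHF, and kmers are plain `u64` values. The point is only that a hash alone maps any kmer to some slot, so the layer must compare against stored evidence before answering.

```rust
/// Toy stand-in for an MPHF layer (illustration only, not the crate's type).
/// `kmer % n` plays the role of the perfect hash; `evidence[slot]` stores the
/// kmer that owns the slot, so foreign kmers can be rejected.
struct ToyMphfLayer {
    evidence: Vec<u64>,
}

impl ToyMphfLayer {
    /// Returns None when the toy hash is not a bijection on `kmers`.
    fn build(kmers: &[u64]) -> Option<Self> {
        let n = kmers.len();
        let mut owner: Vec<Option<u64>> = vec![None; n];
        for &k in kmers {
            let slot = (k % n as u64) as usize;
            if owner[slot].is_some() {
                return None; // collision: a real MPHF is built to avoid this
            }
            owner[slot] = Some(k);
        }
        // n kmers placed in n slots without collision: every slot is owned.
        Some(Self { evidence: owner.into_iter().flatten().collect() })
    }

    /// Slot of `kmer`, or None if the evidence check fails.
    fn find(&self, kmer: u64) -> Option<usize> {
        let n = self.evidence.len();
        if n == 0 {
            return None;
        }
        let slot = (kmer % n as u64) as usize;
        (self.evidence[slot] == kmer).then_some(slot)
    }
}
```

A kmer that belongs to another layer still hashes to a valid slot index, but its evidence does not match, so `find` returns `None` and the caller moves on to the next layer.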
---
## DataStore — slot-indexed data
```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}
```
Concrete types from `obicompactvec`:
| Type | `Item` | Column stats | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |
`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
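A minimal in-memory implementation shows how little the trait needs. `VecCountStore` and its row-major layout are hypothetical and exist only for this sketch; the persistent types in `obicompactvec` are not modelled here.

```rust
/// Trait as given in the text above.
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical in-memory store: one row of per-sample counts per slot.
struct VecCountStore {
    n_cols: usize,
    data: Vec<u32>, // row-major: n_slots * n_cols values
}

impl DataStore for VecCountStore {
    type Item = Box<[u32]>;

    /// Counts of every sample at `slot`; panics if `slot` is out of range.
    fn get(&self, slot: usize) -> Box<[u32]> {
        let start = slot * self.n_cols;
        self.data[start..start + self.n_cols].to_vec().into_boxed_slice()
    }

    /// Number of slots.
    fn n(&self) -> usize {
        if self.n_cols == 0 { 0 } else { self.data.len() / self.n_cols }
    }
}
```

The store is addressed by slot only; nothing in it refers to kmers, hashing, or paths.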
---
## Distance matrix API on DataStore types
Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.
### Full distance matrices
Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2<f64>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
fn euclidean_dist_matrix(&self) -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self) -> Array2<f64>
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
```
These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.
### Partial distance matrices
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
**Category 1 — self-contained partials**: additive without any external parameter.
```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
    -> (Array2<u64>,   // sum_min[i,j]
        Array1<u64>)   // col_sums[k]
fn partial_euclidean_dist_matrix(&self) -> Array2<f64>   // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]
// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
    -> (Array2<u64>,   // inter[i,j]
        Array2<u64>)   // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2<u64>   // differing bits
```
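The additivity of a self-contained partial can be checked on a toy example. The sketch below is not the `obicompactvec` implementation: plain `Vec`s stand in for `Array2`/`Array1`, the column-major layout `cols[k][slot]` is an assumption, and the function names are chosen for this example.

```rust
/// (sum_min[i][j], col_sums[k]) for one matrix; plain Vecs stand in for ndarray.
type BrayPartial = (Vec<Vec<u64>>, Vec<u64>);

/// `cols[k][slot]` is the count of sample k at a slot (toy column-major layout).
fn partial_bray(cols: &[Vec<u32>]) -> BrayPartial {
    let n = cols.len();
    let col_sums: Vec<u64> = cols
        .iter()
        .map(|c| c.iter().map(|&v| u64::from(v)).sum::<u64>())
        .collect();
    let mut sum_min = vec![vec![0u64; n]; n];
    for i in 0..n {
        for j in 0..n {
            sum_min[i][j] = cols[i]
                .iter()
                .zip(&cols[j])
                .map(|(&a, &b)| u64::from(a.min(b)))
                .sum::<u64>();
        }
    }
    (sum_min, col_sums)
}

/// Element-wise sum of two partials with the same number of columns.
fn add_partials(a: &BrayPartial, b: &BrayPartial) -> BrayPartial {
    let sum_min: Vec<Vec<u64>> = a.0
        .iter()
        .zip(&b.0)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(x, y)| x + y).collect::<Vec<u64>>())
        .collect();
    let col_sums: Vec<u64> = a.1.iter().zip(&b.1).map(|(x, y)| x + y).collect();
    (sum_min, col_sums)
}

/// Final distance; a pair of all-zero columns is given distance 0 (assumption).
fn bray_from_partial(p: &BrayPartial) -> Vec<Vec<f64>> {
    let n = p.1.len();
    let mut d = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in 0..n {
            let denom = p.1[i] + p.1[j];
            if denom > 0 {
                d[i][j] = 1.0 - 2.0 * p.0[i][j] as f64 / denom as f64;
            }
        }
    }
    d
}
```

Splitting the slots of a matrix into two disjoint groups, computing `partial_bray` on each, and combining with `add_partials` gives the same partial as the undivided matrix. That is the property the cross-layer and cross-partition reductions rely on.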
**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²
```
The `col_sums` parameter must hold the **global** column sums across all layers and all partitions. Passing per-layer sums gives a wrong result, because each layer would then normalise its counts by a different denominator. This constraint drives the two-pass algorithm described below.
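The effect of the normalising sums can be seen on a two-layer toy example. This is a sketch under stated assumptions: `Vec`s replace ndarray, `cols[k][slot]` is a hypothetical column-major layout, and all sums are assumed non-zero.

```rust
/// Σ_slot min(a_slot / sum_i, b_slot / sum_j) for one (partition, layer) matrix.
/// `cols[k][slot]` is a toy column-major layout; `sums` must be non-zero.
fn partial_relfreq_bray(cols: &[Vec<u32>], sums: &[u64]) -> Vec<Vec<f64>> {
    let n = cols.len();
    let mut out = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in 0..n {
            out[i][j] = cols[i]
                .iter()
                .zip(&cols[j])
                .map(|(&a, &b)| {
                    let fa = f64::from(a) / sums[i] as f64;
                    let fb = f64::from(b) / sums[j] as f64;
                    fa.min(fb)
                })
                .sum::<f64>();
        }
    }
    out
}

/// Column sums of one matrix (the per-layer quantity that pass 1 adds up).
fn col_sums(cols: &[Vec<u32>]) -> Vec<u64> {
    cols.iter()
        .map(|c| c.iter().map(|&v| u64::from(v)).sum::<u64>())
        .collect()
}
```

With layers `[[1, 2], [3, 0]]` and `[[4], [1]]`, the global sums are `[7, 4]` and the summed partial for the pair (0, 1) is 1/7 + 1/4 = 11/28. Using each layer's own sums instead gives 1/3 + 1 = 4/3, which exceeds 1 and cannot be a sum of minima of two relative-frequency profiles.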
---
## Progressive aggregation principle
Aggregation is **hierarchical**: each level computes its contribution by aggregating only the results of the level immediately below it. No level reaches past that level to collect raw data from two levels down.
```
PersistentCompactIntMatrix::sum() — column sums for one (partition, layer) matrix
↓ Σ across layers
LayeredCompactIntMatrix::sum() — column sums for one partition
↓ Σ across partitions
PartitionedCompactIntMatrix::sum() — global column sums
```
The same cascade applies to every partial computation:
```
PersistentCompactIntMatrix::partial_bray_dist_matrix() — one (partition, layer)
↓ element-wise Σ across layers
LayeredCompactIntMatrix::partial_bray() — one partition
↓ element-wise Σ across partitions
PartitionedCompactIntMatrix::partial_bray() — global partial → final dist
```
This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above.
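The cascade for column sums can be written as three small functions, each calling only the one below it. The nested `Vec` types and function names are stand-ins chosen for this sketch. It assumes every matrix has the same number of columns and every partition has at least one layer.

```rust
/// Toy column-major count matrix for one (partition, layer): matrix[col][slot].
type Matrix = Vec<Vec<u32>>;

/// Element-wise addition of two column-sum vectors.
fn add(mut acc: Vec<u64>, x: Vec<u64>) -> Vec<u64> {
    for (a, b) in acc.iter_mut().zip(x) {
        *a += b;
    }
    acc
}

/// Level 1: column sums for one (partition, layer) matrix.
fn matrix_sum(m: &Matrix) -> Vec<u64> {
    m.iter()
        .map(|c| c.iter().map(|&v| u64::from(v)).sum::<u64>())
        .collect()
}

/// Level 2: column sums for one partition, built only from `matrix_sum`.
fn layered_sum(layers: &[Matrix]) -> Vec<u64> {
    layers.iter().map(matrix_sum).reduce(add).unwrap_or_default()
}

/// Level 3: global column sums, built only from `layered_sum`.
fn partitioned_sum(partitions: &[Vec<Matrix>]) -> Vec<u64> {
    partitions
        .iter()
        .map(|p| layered_sum(p))
        .reduce(add)
        .unwrap_or_default()
}
```

`partitioned_sum` never touches a `Matrix` directly, which mirrors the rule that each level sees only the API of the level below.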
---
## LayeredDataStore — aggregation within one partition
A `LayeredDataStore` holds one `DataStore` per layer within a single partition:
```rust
struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
struct LayeredBitMatrix { layers: Vec<PersistentBitMatrix> }
```
### Column statistics
```rust
// LayeredCompactIntMatrix
fn sum(&self) -> Array1<u64>
    // = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)
// LayeredBitMatrix
fn count_ones(&self) -> Array1<u64>
    // = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
```
### Self-contained partials
Each method reduces across layers by element-wise addition of per-layer matrices:
```rust
fn partial_bray(&self) -> (Array2<u64>, Array1<u64>)
    // Σ_l layer_l.partial_bray_dist_matrix()
fn partial_euclidean(&self) -> Array2<f64>
    // Σ_l layer_l.partial_euclidean_dist_matrix()
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)
    // Σ_l layer_l.partial_jaccard_dist_matrix()            [bit matrix]
    // Σ_l layer_l.partial_threshold_jaccard_dist_matrix()  [int matrix]
fn partial_hamming(&self) -> Array2<u64>
    // Σ_l layer_l.partial_hamming_dist_matrix()            [bit matrix]
```
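As a sketch of the layer-level reduction, the example below accumulates Jaccard partials over the layers of one partition. The `Vec`-based `BitMatrix`, the explicit `n_cols` argument, and the function names are assumptions for illustration; every layer is assumed to have `n_cols` columns.

```rust
/// Toy column-major bit matrix for one layer: matrix[col][slot].
type BitMatrix = Vec<Vec<bool>>;
/// (inter[i][j], union[i][j]) counts.
type JaccardPartial = (Vec<Vec<u64>>, Vec<Vec<u64>>);

/// Intersection and union sizes for every column pair of one layer.
fn partial_jaccard(m: &BitMatrix) -> JaccardPartial {
    let n = m.len();
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for i in 0..n {
        for j in 0..n {
            for (&a, &b) in m[i].iter().zip(&m[j]) {
                inter[i][j] += u64::from(a && b);
                uni[i][j] += u64::from(a || b);
            }
        }
    }
    (inter, uni)
}

/// Σ_l partial_jaccard(layer_l), accumulated element-wise.
fn layered_partial_jaccard(layers: &[BitMatrix], n_cols: usize) -> JaccardPartial {
    let mut inter = vec![vec![0u64; n_cols]; n_cols];
    let mut uni = vec![vec![0u64; n_cols]; n_cols];
    for layer in layers {
        let (li, lu) = partial_jaccard(layer);
        for i in 0..n_cols {
            for j in 0..n_cols {
                inter[i][j] += li[i][j];
                uni[i][j] += lu[i][j];
            }
        }
    }
    (inter, uni)
}

/// Final distance 1 - inter/union; an empty union is given distance 0 (assumption).
fn jaccard_dist(inter: u64, uni: u64) -> f64 {
    if uni == 0 { 0.0 } else { 1.0 - inter as f64 / uni as f64 }
}
```

The division happens once, after all layers have been added. Averaging per-layer distances would weight layers incorrectly, which is why the partials carry raw intersection and union counts.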
### Normalised partials (require global sums from above)
```rust
fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
    // Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)
fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
    // Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)
fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
    // Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
```
`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it.
---
## PartitionedDataStore — aggregation across all partitions
A `PartitionedDataStore` holds one `LayeredDataStore` per partition:
```rust
struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
struct PartitionedBitMatrix { partitions: Vec<LayeredBitMatrix> }
```
### Column statistics
```rust
fn sum(&self) -> Array1<u64>
    // = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
```
`p.sum()` is itself a reduction across layers (see above) — the cascade is preserved.
### Self-contained metrics — single pass
```rust
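fn bray_dist_matrix(&self) -> Array2<f64> {
    // reduce per-partition partials by element-wise addition
    let (sum_min, col_sums) = self.partitions
        .par_iter()
        .map(|p| p.partial_bray())
        .reduce_with(|(m1, s1), (m2, s2)| (m1 + m2, s1 + s2))
        .expect("at least one partition"); // assumption: the index is non-empty
    // finalise
    let n = col_sums.len();
    Array2::from_shape_fn((n, n), |(i, j)| {
        1.0 - 2.0 * sum_min[[i, j]] as f64 / (col_sums[i] + col_sums[j]) as f64
    })
}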
```
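The single-pass map-reduce over partitions can be tried without external crates. In the sketch below, scoped threads from the standard library replace rayon, each partition is reduced to the Bray-Curtis partial for a single sample pair, and the names `par_map_reduce` and `bray_pair` are hypothetical. One thread per partition is used for brevity; a real implementation would use a thread pool.

```rust
use std::thread;

/// (sum_min, sum_a, sum_b) for one pair of samples.
type Acc = (u64, u64, u64);

/// Map every item on its own scoped thread, then reduce the results.
/// Scoped threads from std replace rayon so the sketch has no dependency.
fn par_map_reduce<T, R, M, F>(items: &[T], map: M, reduce: F) -> Option<R>
where
    T: Sync,
    R: Send,
    M: Fn(&T) -> R + Sync,
    F: Fn(R, R) -> R,
{
    let partials: Vec<R> = thread::scope(|s| {
        let handles: Vec<_> = items
            .iter()
            .map(|item| {
                let map = &map;
                s.spawn(move || map(item))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });
    partials.into_iter().reduce(reduce)
}

/// Bray-Curtis distance between two samples whose per-slot counts `(a, b)`
/// are spread over several partitions. None if there is nothing to compare.
fn bray_pair(partitions: &[Vec<(u32, u32)>]) -> Option<f64> {
    let (sum_min, sa, sb) = par_map_reduce(
        partitions,
        |p| {
            p.iter().fold((0u64, 0u64, 0u64), |(m, a, b), &(x, y)| {
                (m + u64::from(x.min(y)), a + u64::from(x), b + u64::from(y))
            })
        },
        |l: Acc, r: Acc| (l.0 + r.0, l.1 + r.1, l.2 + r.2),
    )?;
    let denom = sa + sb;
    (denom > 0).then(|| 1.0 - 2.0 * sum_min as f64 / denom as f64)
}
```

The workers share nothing but read-only input, so no locking is needed; the only sequential step is the final reduction of one small tuple per partition.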
### Normalised metrics — two passes
```rust
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    // pass 1 (progressive): PartitionedDataStore::sum()
    //   calls LayeredDataStore::sum() per partition (parallel),
    //   which calls PersistentCompactIntMatrix::sum() per layer (parallel)
    let global_sums = self.sum();
    // pass 2: per-partition partial using global_sums (parallel),
    // reduced by element-wise addition
    let matrix = self.partitions
        .par_iter()
        .map(|p| p.partial_relfreq_bray(&global_sums))
        .reduce_with(|a, b| a + b)
        .expect("at least one partition"); // assumption: the index is non-empty
    // finalise: dist[i,j] = 1 - matrix[i,j]
    matrix.mapv(|m| 1.0 - m)
}
```
`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.
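The finalisation `dist[i,j] = 1 - matrix[i,j]` can be checked against the Bray-Curtis formula used for raw counts. The symbols are introduced for this note only: $S_i$ is the global column sum of sample $i$ (assumed non-zero), $a_s$ and $b_s$ are the counts of samples $i$ and $j$ at slot $s$, and $M_{ij}$ is the reduced matrix entry.

$$
d_{ij}
= 1 - \frac{2 \sum_{s} \min\!\left(\frac{a_s}{S_i}, \frac{b_s}{S_j}\right)}
           {\sum_{s} \frac{a_s}{S_i} + \sum_{s} \frac{b_s}{S_j}}
= 1 - \frac{2 M_{ij}}{1 + 1}
= 1 - M_{ij}
$$

The denominator collapses to 2 only because each relative-frequency column sums to 1 over all slots, which holds when $S_i$ is the global sum and not a per-layer one.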
---
## Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredDataStore` | none — fully independent |
| Across layers (self-contained) | `(partition, layer)` pair | none — disjoint kmer sets |
| Across layers (normalised, pass 1) | `(partition, layer)` pair | none — sums are additive |
| Across layers (normalised, pass 2) | `(partition, layer)` pair | `global_sums` broadcast read-only |
| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none — rayon par_iter |
---
## Query model
### Point query — `kmer → Option<Item>`
```
minimiser(kmer) → partition p
for each layer l in p:
slot = MphfLayer_l.find(kmer)
if slot is Some:
return DataStore_l.get(slot)
return None
```
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
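The layer probe loop can be sketched as follows. `QueryLayer` is a hypothetical stand-in in which a `HashMap` plays the role of `MphfLayer::find` (MPHF plus evidence check) and a `Vec<u32>` plays the role of the data store. Routing to the partition is assumed to have happened already.

```rust
use std::collections::HashMap;

/// Toy layer: `slots` stands in for MphfLayer::find, `data` for a
/// slot-indexed DataStore. Names are illustrative only.
struct QueryLayer {
    slots: HashMap<u64, usize>,
    data: Vec<u32>,
}

/// Probe layers in order and return the value from the first layer that
/// contains the kmer. Layers are disjoint, so at most one can match.
fn point_query(layers: &[QueryLayer], kmer: u64) -> Option<u32> {
    layers
        .iter()
        .find_map(|l| l.slots.get(&kmer).map(|&slot| l.data[slot]))
}
```

`find_map` stops at the first hit, so a kmer stored in layer 0 costs a single probe, while an absent kmer costs one probe per layer.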
### Aggregation — `→ Result`
```
result = reduce(
    for p in partitions:        // parallel
        for l in layers(p):     // parallel
            partial(DataStore_p_l)
)
```
For normalised metrics, replace this with the two-pass scheme above.
---
## DataStore derivation
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
```
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.
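The derivations listed above are all slot-wise operations on columns that share one `MphfLayer`. The sketch below shows them on plain slices; the function names are hypothetical, and the inclusive comparison `count >= threshold` is an assumption, since the text does not say whether the threshold is inclusive.

```rust
/// Count to presence. The `>=` comparison is an assumption; the text does
/// not say whether the threshold is inclusive.
fn presence_from_counts(counts: &[u32], threshold: u32) -> Vec<bool> {
    counts.iter().map(|&c| c >= threshold).collect()
}

/// Union of two presence columns indexed by the same slots.
fn union_presence(a: &[bool], b: &[bool]) -> Vec<bool> {
    a.iter().zip(b).map(|(&x, &y)| x || y).collect()
}

/// Merge two count columns indexed by the same slots, saturating at u32::MAX.
fn merge_counts(a: &[u32], b: &[u32]) -> Vec<u32> {
    a.iter().zip(b).map(|(&x, &y)| x.saturating_add(y)).collect()
}
```

None of these functions needs the kmer or the hash: slot `s` of the output depends only on slot `s` of the inputs, which is why each derivation stays local to one `(partition, layer)` pair.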
---
## Relationship to current implementation
The current `obilayeredmap` crate implements a subset of this architecture. Key divergences:
- `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API.
- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions.
Planned refactoring:
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above.
4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics.
5. Implement `PartitionedIndex` for point queries with parallel dispatch.