# Kmer index architecture

## Fundamental invariant

A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.

---

## Three-level hierarchy

```
PartitionedIndex
├── LayeredPartition          (one per minimiser bucket)
│   ├── MphfLayer 0           kmer → slot  (immutable bijection)
│   │   ├── DataStore A       slot → T  (e.g. counts)
│   │   └── DataStore B       slot → T  (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...
```

**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set: layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.

**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.

---

## MphfLayer: autonomous mapping layer

```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>      // slot, or None if absent
MphfLayer::n() -> usize                                     // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)>    // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```

`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). It returns `None` for kmers present in other layers or absent from the index.

The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt.
All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.

---

## DataStore: slot-indexed data

```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}
```

Concrete types from `obicompactvec`:

| Type | `Item` | Column stats | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1` | presence per sample per slot |

`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.

A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.

---

## Distance matrix API on DataStore types

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.

### Full distance matrices

Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.

```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2
fn relfreq_bray_dist_matrix(&self) -> Array2
fn euclidean_dist_matrix(&self) -> Array2
fn relfreq_euclidean_dist_matrix(&self) -> Array2
fn hellinger_dist_matrix(&self) -> Array2
fn jaccard_dist_matrix(&self) -> Array2
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2

// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2
fn hamming_dist_matrix(&self) -> Array2
```

These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly; the partial API is required.

### Partial distance matrices

Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance.
This is what makes cross-layer and cross-partition aggregation possible.

**Category 1: self-contained partials**: additive without any external parameter.

```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self) -> (Array2,   // sum_min[i,j]
                                       Array1)   // col_sums[k]
fn partial_euclidean_dist_matrix(&self) -> Array2   // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32) -> (Array2,   // inter[i,j]
                                                                    Array2)   // union[i,j]

// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self) -> (Array2,   // inter[i,j]
                                          Array2)   // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2     // differing bits
```

**Category 2: normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.

```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1) -> Array2
// Σ_slot min(a_slot/sum_i, b_slot/sum_j)

fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1) -> Array2
// Σ_slot (a_slot/sum_i - b_slot/sum_j)²

fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1) -> Array2
// Σ_slot (√(a/sum_i) - √(b/sum_j))²
```

The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions; passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.

---

## Progressive aggregation principle

Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
```
PersistentCompactIntMatrix::sum()       column sums for one (partition, layer) matrix
        ↓  Σ across layers
LayeredCompactIntMatrix::sum()          column sums for one partition
        ↓  Σ across partitions
PartitionedCompactIntMatrix::sum()      global column sums
```

The same cascade applies to every partial computation:

```
PersistentCompactIntMatrix::partial_bray_dist_matrix()   one (partition, layer)
        ↓  element-wise Σ across layers
LayeredCompactIntMatrix::partial_bray()                  one partition
        ↓  element-wise Σ across partitions
PartitionedCompactIntMatrix::partial_bray()              global partial → final dist
```

This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above.

---

## LayeredDataStore: aggregation within one partition

A `LayeredDataStore` holds one `DataStore` per layer within a single partition:

```rust
struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
struct LayeredBitMatrix        { layers: Vec<PersistentBitMatrix> }
```

### Column statistics

```rust
// LayeredCompactIntMatrix
fn sum(&self) -> Array1
// = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)

// LayeredBitMatrix
fn count_ones(&self) -> Array1
// = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
```

### Self-contained partials

Each method reduces across layers by element-wise addition of per-layer matrices:

```rust
fn partial_bray(&self) -> (Array2, Array1)
// Σ_l layer_l.partial_bray_dist_matrix()

fn partial_euclidean(&self) -> Array2
// Σ_l layer_l.partial_euclidean_dist_matrix()

fn partial_jaccard(&self) -> (Array2, Array2)
// Σ_l layer_l.partial_jaccard_dist_matrix()            [bit matrix]
// Σ_l layer_l.partial_threshold_jaccard_dist_matrix()  [int matrix]

fn partial_hamming(&self) -> Array2
// Σ_l layer_l.partial_hamming_dist_matrix()            [bit matrix]
```

### Normalised partials (require global sums from above)

```rust
fn partial_relfreq_bray(&self, global_sums: &Array1) -> Array2
// Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)

fn partial_relfreq_euclidean(&self, global_sums: &Array1) -> Array2
// Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)

fn partial_hellinger(&self, global_sums: &Array1) -> Array2
// Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
```

`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it.

---

## PartitionedDataStore: aggregation across all partitions

A `PartitionedDataStore` holds one `LayeredDataStore` per partition:

```rust
struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
struct PartitionedBitMatrix        { partitions: Vec<LayeredBitMatrix> }
```

### Column statistics

```rust
fn sum(&self) -> Array1
// = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
```

`p.sum()` is itself a reduction across layers (see above); the cascade is preserved.

### Self-contained metrics: single pass

```rust
fn bray_dist_matrix(&self) -> Array2 {
    let (sum_min, col_sums) = partitions
        .par_iter()
        .map(|p| p.partial_bray())
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
}
```

### Normalised metrics: two passes

```rust
fn relfreq_bray_dist_matrix(&self) -> Array2 {
    // pass 1 (progressive): PartitionedDataStore::sum()
    //   calls LayeredDataStore::sum() per partition (parallel)
    //     calls PersistentCompactIntMatrix::sum() per layer (parallel)
    let global_sums = self.sum();

    // pass 2: per-partition partial using global_sums (parallel)
    let matrix = partitions
        .par_iter()
        .map(|p| p.partial_relfreq_bray(&global_sums))
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - matrix[i,j]
}
```

`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair, so nothing is double-counted. Pass 1 is itself fully parallel at every level of the hierarchy.
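To make the two-pass scheme concrete, here is a minimal, sequential sketch using only the standard library. It is not the crate's implementation: `Shard`, `global_col_sums` and the free functions below are hypothetical stand-ins for a (partition, layer) matrix and its methods, plain `Vec`s replace `ndarray` arrays, and rayon parallelism is left out. It also assumes every column has a non-zero global sum.

```rust
/// Hypothetical stand-in for one (partition, layer) count matrix:
/// one row per slot, one column per sample.
type Shard = Vec<Vec<u32>>;

/// Pass 1: global column sums, accumulated over every shard.
fn global_col_sums(shards: &[Shard], n_cols: usize) -> Vec<f64> {
    let mut sums = vec![0.0; n_cols];
    for shard in shards {
        for row in shard {
            for (k, &count) in row.iter().enumerate() {
                sums[k] += count as f64;
            }
        }
    }
    sums
}

/// Pass 2 partial for one shard: Σ_slot min(a_slot/sum_i, b_slot/sum_j).
/// `sums` must be the GLOBAL column sums (assumed non-zero here).
fn partial_relfreq_bray(shard: &[Vec<u32>], sums: &[f64]) -> Vec<Vec<f64>> {
    let n = sums.len();
    let mut partial = vec![vec![0.0; n]; n];
    for row in shard {
        for i in 0..n {
            for j in 0..n {
                let a = row[i] as f64 / sums[i];
                let b = row[j] as f64 / sums[j];
                partial[i][j] += a.min(b);
            }
        }
    }
    partial
}

/// Two passes: global sums first, then element-wise sum of the partials,
/// then finalise with dist[i,j] = 1 - matrix[i,j].
fn relfreq_bray_dist_matrix(shards: &[Shard], n_cols: usize) -> Vec<Vec<f64>> {
    let sums = global_col_sums(shards, n_cols);
    let mut acc = vec![vec![0.0; n_cols]; n_cols];
    for shard in shards {
        let partial = partial_relfreq_bray(shard, &sums);
        for i in 0..n_cols {
            for j in 0..n_cols {
                acc[i][j] += partial[i][j];
            }
        }
    }
    for row in acc.iter_mut() {
        for v in row.iter_mut() {
            *v = 1.0 - *v;
        }
    }
    acc
}
```

Because the partials are additive, splitting the same rows across several shards or keeping them in one shard should give the same distance matrix, provided the sums passed to pass 2 are global rather than per-shard.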

---

## Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredDataStore` | none (fully independent) |
| Across layers (self-contained) | `(partition, layer)` pair | none (disjoint kmer sets) |
| Across layers (normalised, pass 1) | `(partition, layer)` pair | none (sums are additive) |
| Across layers (normalised, pass 2) | `(partition, layer)` pair | `global_sums` broadcast read-only |
| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none (rayon `par_iter`) |

---

## Query model

### Point query: `kmer → Option`

```
minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some: return DataStore_l.get(slot)
return None
```

O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion: the result comes from exactly one (partition, layer).

### Aggregation: `→ Result`

```
result = reduce(
    for p in partitions:          // parallel
        for l in layers(p):       // parallel
            partial(DataStore_p_l)
)
```

For normalised metrics replace with the two-pass scheme above.

---

## DataStore derivation

Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store    = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
```

Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.

---

## Relationship to current implementation

The current `obilayeredmap` crate implements a subset of this architecture. Key divergences:

- `Layer` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API.
- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions.

Planned refactoring:

1. Extract `MphfLayer` from `Layer` as an autonomous type.
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above.
4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics.
5. Implement `PartitionedIndex` for point queries with parallel dispatch.
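
As a rough illustration of the separation targeted by steps 1 and 2, the sketch below keeps the mapping layer and the data store as independent types and runs a point query over the layers of one partition. Only the `DataStore` trait shape comes from this document; `ToyLayer`, `ToyCounts` and `point_query` are hypothetical names, kmers are reduced to `u64` keys, and a `HashMap` stands in for the MPHF plus evidence check.

```rust
use std::collections::HashMap;

/// Trait shape as described in this document: indexed by slot only,
/// with no knowledge of kmers, MPHFs or paths.
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical stand-in for `MphfLayer`: the HashMap plays the role of
/// the MPHF plus evidence check, so absent kmers yield `None`.
struct ToyLayer {
    slots: HashMap<u64, usize>,
}

impl ToyLayer {
    fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

/// Hypothetical in-memory count store: one row of per-sample counts per slot.
struct ToyCounts {
    rows: Vec<Vec<u32>>,
}

impl DataStore for ToyCounts {
    type Item = Box<[u32]>;

    fn get(&self, slot: usize) -> Self::Item {
        self.rows[slot].clone().into_boxed_slice()
    }

    fn n(&self) -> usize {
        self.rows.len()
    }
}

/// Point query within one partition: probe layers in order and read the
/// store attached to the first layer that contains the kmer.
fn point_query<S: DataStore>(layers: &[(ToyLayer, S)], kmer: u64) -> Option<S::Item> {
    layers
        .iter()
        .find_map(|(layer, store)| layer.find(kmer).map(|slot| store.get(slot)))
}
```

Since layers are disjoint, at most one layer can answer a given kmer, so stopping at the first hit loses nothing. Pairing each layer with its store in a tuple is just one way to attach a store externally; the document does not prescribe how the pairing is represented.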