# Kmer index architecture ## Fundamental invariant A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination. --- ## Three-level hierarchy ``` PartitionedIndex ├── LayeredPartition (one per minimiser bucket) │ ├── MphfLayer 0 kmer → slot (immutable bijection) │ │ ├── DataStore A slot → T (e.g. counts) │ │ └── DataStore B slot → T (e.g. presence/absence, derived) │ ├── MphfLayer 1 │ │ └── DataStore A │ └── ... ├── LayeredPartition │ └── ... ``` **PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel. **LayeredPartition**: one directory per minimiser bucket. Holds a `Vec`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint. **MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it. **DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`. --- ## MphfLayer — autonomous mapping layer ```rust MphfLayer::find(kmer: CanonicalKmer) -> Option // slot, or None if absent MphfLayer::n() -> usize // number of slots MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin MphfLayer::open(dir: &Path) -> OLMResult ``` `find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index. The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`. --- ## DataStore — slot-indexed data ```rust trait DataStore { type Item; fn get(&self, slot: usize) -> Self::Item; fn n(&self) -> usize; } ``` Concrete types from `obicompactvec`: | Type | `Item` | Column stats | Use | |---|---|---|---| | `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1` | count per sample per slot | | `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1` | presence per sample per slot | `sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights. A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. --- ## Distance matrix API on DataStore types Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods. ### Full distance matrices Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon. ```rust // PersistentCompactIntMatrix fn bray_dist_matrix(&self) -> Array2 fn relfreq_bray_dist_matrix(&self) -> Array2 fn euclidean_dist_matrix(&self) -> Array2 fn relfreq_euclidean_dist_matrix(&self) -> Array2 fn hellinger_dist_matrix(&self) -> Array2 fn jaccard_dist_matrix(&self) -> Array2 fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 // PersistentBitMatrix fn jaccard_dist_matrix(&self) -> Array2 fn hamming_dist_matrix(&self) -> Array2 ``` These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required. ### Partial distance matrices Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible. **Category 1 — self-contained partials**: additive without any external parameter. ```rust // PersistentCompactIntMatrix fn partial_bray_dist_matrix(&self) -> (Array2, // sum_min[i,j] Array1) // col_sums[k] fn partial_euclidean_dist_matrix(&self) -> Array2 // sum of squared diffs fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32) -> (Array2, // inter[i,j] Array2) // union[i,j] // PersistentBitMatrix fn partial_jaccard_dist_matrix(&self) -> (Array2, // inter[i,j] Array2) // union[i,j] fn partial_hamming_dist_matrix(&self) -> Array2 // differing bits ``` **Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs. ```rust // PersistentCompactIntMatrix only fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1) -> Array2 // Σ_slot min(a_slot/sum_i, b_slot/sum_j) fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1) -> Array2 // Σ_slot (a_slot/sum_i - b_slot/sum_j)² fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1) -> Array2 // Σ_slot (√(a/sum_i) - √(b/sum_j))² ``` The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below. --- ## Progressive aggregation principle Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down. ``` PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix ↓ Σ across layers LayeredStore::col_weights() — column sums for one partition ↓ Σ across partitions LayeredStore>::col_weights() — global column sums ``` The same cascade applies to every partial: ``` PersistentCompactIntMatrix::partial_bray() — one (partition, layer) ↓ element-wise Σ across layers LayeredStore::partial_bray() — one partition ↓ element-wise Σ across partitions LayeredStore>::partial_bray() — global partial → final dist ``` Each level presents a stable trait surface to the level above; no level reaches two levels down. --- ## Traits — `obicompactvec::traits` Three traits unify the aggregation API across all levels of the hierarchy. ```rust trait ColumnWeights: Send + Sync { fn col_weights(&self) -> Array1; } trait CountPartials: ColumnWeights { // self-contained partials (additive, no parameter) fn partial_bray(&self) -> Array2; fn partial_euclidean(&self) -> Array2; fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); // normalised partials (global col_weights passed in cascade) fn partial_relfreq_bray(&self, global: &Array1) -> Array2; fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; fn partial_hellinger(&self, global: &Array1) -> Array2; // provided finalisation methods (default implementations) fn bray_dist_matrix(&self) -> Array2 { … } fn euclidean_dist_matrix(&self) -> Array2 { … } fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2 { … } fn relfreq_bray_dist_matrix(&self) -> Array2 { … } fn relfreq_euclidean_dist_matrix(&self) -> Array2 { … } fn hellinger_dist_matrix(&self) -> Array2 { … } } trait BitPartials: ColumnWeights { fn partial_jaccard(&self) -> (Array2, Array2); fn partial_hamming(&self) -> Array2; // provided fn jaccard_dist_matrix(&self) -> Array2 { … } fn hamming_dist_matrix(&self) -> Array2 { … } } ``` **Leaf implementors** (in `obicompactvec`): | Type | Traits | |---|---| | `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` | | `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` | `PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators. --- ## `LayeredStore` — `obilayeredmap` A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types: ```rust pub struct LayeredStore(Vec); ``` Three blanket impls propagate the traits up the hierarchy: ```rust impl ColumnWeights for LayeredStore { … } // Σ across inner stores impl CountPartials for LayeredStore { … } // same pattern impl BitPartials for LayeredStore { … } // same pattern ``` Because the blanket impl is recursive, **`LayeredStore>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed: ``` PersistentCompactIntMatrix implements CountPartials LayeredStore via blanket impl (= one partition) LayeredStore> via blanket impl (= partitioned index) ``` ### Normalised metrics — two-pass cascade The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic: ```rust // called on LayeredStore> fn relfreq_bray_dist_matrix(&self) -> Array2 { let global = self.col_weights(); // pass 1 — progressive sum at every level let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately) } ``` `global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy. --- ## Parallelism model | Level | Unit | Coordination | |---|---|---| | Across partitions | `LayeredStore>` inner stores | none — fully independent | | Across layers within a partition | `LayeredStore` inner stores | none — disjoint kmer sets | | Normalised pass 1 (`col_weights`) | per inner store | none — additive | | Normalised pass 2 (partial) | per inner store | `global` broadcast read-only | | Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` | All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction. --- ## Query model ### Point query — `kmer → Option` ``` minimiser(kmer) → partition p for each layer l in p: slot = MphfLayer_l.find(kmer) if slot is Some: return DataStore_l.get(slot) return None ``` O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer). ### Aggregation — `→ Result` ``` result = reduce( for p in partitions: // parallel for l in layers(p): // parallel partial(DataStore_p_l) ) ``` For normalised metrics replace with the two-pass scheme above. --- ## DataStore derivation Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF: ``` // count → presence/absence, parallel across (partition, layer) for (p, l) in all_partition_layer_pairs().par_iter(): count_store = open PersistentCompactIntMatrix at (p, l) presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir) ``` Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair. --- ## Relationship to current implementation ### What is implemented - **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`. - **`obilayeredmap::LayeredStore`**: generic wrapper with blanket impls for all three traits. `LayeredStore>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data. ### What is not yet implemented - `Layer` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported. - `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`. - No `PartitionedIndex` type for point queries with parallel partition dispatch. ### Planned refactoring 1. Extract `MphfLayer` from `Layer` as an autonomous type. 2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system. 3. Rewire `LayeredMap` to hold `LayeredStore` (or bit variant) alongside the MPHF layers. 4. Implement `PartitionedIndex` using `LayeredStore>` for data and parallel dispatch for queries.