Kmer index architecture
Fundamental invariant
A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
Three-level hierarchy
PartitionedIndex
├── LayeredPartition (one per minimiser bucket)
│ ├── MphfLayer 0 kmer → slot (immutable bijection)
│ │ ├── DataStore A slot → T (e.g. counts)
│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
│ ├── MphfLayer 1
│ │ └── DataStore A
│ └── ...
├── LayeredPartition
│ └── ...
PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
MphfLayer: the MPHF + evidence + unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.
MphfLayer — autonomous mapping layer
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
MphfLayer::n() -> usize // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
find returns Some(slot) only if the kmer is actually in this layer (evidence check included). Returns None for kmers present in other layers or absent from the index.
The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.
DataStore — slot-indexed data
trait DataStore {
type Item;
fn get(&self, slot: usize) -> Self::Item;
fn n(&self) -> usize;
}
Concrete types from obicompactvec:
| Type | Item |
Column stats | Use |
|---|---|---|---|
PersistentCompactIntMatrix |
Box<[u32]> |
sum() -> Array1<u64> |
count per sample per slot |
PersistentBitMatrix |
Box<[bool]> |
count_ones() -> Array1<u64> |
presence per sample per slot |
sum() and count_ones() are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only.
Distance matrix API on DataStore types
Both PersistentCompactIntMatrix and PersistentBitMatrix expose two families of distance matrix methods.
Full distance matrices
Compute the final n_cols × n_cols distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self) -> Array2<f64>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
fn euclidean_dist_matrix(&self) -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self) -> Array2<f64>
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
// PersistentBitMatrix
fn jaccard_dist_matrix(&self) -> Array2<f64>
fn hamming_dist_matrix(&self) -> Array2<u64>
These are convenience methods. For a LayeredDataStore or PartitionedDataStore they cannot be used directly — the partial API is required.
Partial distance matrices
Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
Category 1 — self-contained partials: additive without any external parameter.
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
-> (Array2<u64>, // sum_min[i,j]
Array1<u64>) // col_sums[k]
fn partial_euclidean_dist_matrix(&self) -> Array2<f64> // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
-> (Array2<u64>, // inter[i,j]
Array2<u64>) // union[i,j]
// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
-> (Array2<u64>, // inter[i,j]
Array2<u64>) // union[i,j]
fn partial_hamming_dist_matrix(&self) -> Array2<u64> // differing bits
Category 2 — normalised partials: require global column sums as input, computed beforehand across all (partition, layer) pairs.
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
-> Array2<f64> // Σ_slot (√(a/sum_i) - √(b/sum_j))²
The col_sums parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
Progressive aggregation principle
Aggregation is hierarchical: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
PersistentCompactIntMatrix::col_weights() — column sums for one (partition, layer) matrix
↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights() — column sums for one partition
↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights() — global column sums
The same cascade applies to every partial:
PersistentCompactIntMatrix::partial_bray() — one (partition, layer)
↓ element-wise Σ across layers
LayeredStore<PersistentCompactIntMatrix>::partial_bray() — one partition
↓ element-wise Σ across partitions
LayeredStore<LayeredStore<…>>::partial_bray() — global partial → final dist
Each level presents a stable trait surface to the level above; no level reaches two levels down.
Traits — obicompactvec::traits
Three traits unify the aggregation API across all levels of the hierarchy.
trait ColumnWeights: Send + Sync {
fn col_weights(&self) -> Array1<u64>;
}
trait CountPartials: ColumnWeights {
// self-contained partials (additive, no parameter)
fn partial_bray(&self) -> Array2<u64>;
fn partial_euclidean(&self) -> Array2<f64>;
fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2<u64>, Array2<u64>);
// normalised partials (global col_weights passed in cascade)
fn partial_relfreq_bray(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_relfreq_euclidean(&self, global: &Array1<u64>) -> Array2<f64>;
fn partial_hellinger(&self, global: &Array1<u64>) -> Array2<f64>;
// provided finalisation methods (default implementations)
fn bray_dist_matrix(&self) -> Array2<f64> { … }
fn euclidean_dist_matrix(&self) -> Array2<f64> { … }
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64> { … }
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> { … }
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64> { … }
fn hellinger_dist_matrix(&self) -> Array2<f64> { … }
}
trait BitPartials: ColumnWeights {
fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>);
fn partial_hamming(&self) -> Array2<u64>;
// provided
fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
fn hamming_dist_matrix(&self) -> Array2<u64> { … }
}
Leaf implementors (in obicompactvec):
| Type | Traits |
|---|---|
PersistentCompactIntMatrix |
ColumnWeights (via sum()), CountPartials |
PersistentBitMatrix |
ColumnWeights (via count_ones()), BitPartials |
PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.
LayeredStore<S> — obilayeredmap
A single generic wrapper replaces the need for named LayeredDataStore and PartitionedDataStore types:
pub struct LayeredStore<S>(Vec<S>);
Three blanket impls propagate the traits up the hierarchy:
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // same pattern
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // same pattern
Because the blanket impl is recursive, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — no separate PartitionedStore type is needed:
PersistentCompactIntMatrix implements CountPartials
LayeredStore<PersistentCompactIntMatrix> via blanket impl (= one partition)
LayeredStore<LayeredStore<…>> via blanket impl (= partitioned index)
Normalised metrics — two-pass cascade
The normalised finalisation methods call col_weights() first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
let global = self.col_weights(); // pass 1 — progressive sum at every level
let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
p.mapv(|v| 1.0 - v) // finalise (diagonal zeroed separately)
}
global is exact: each kmer belongs to exactly one (partition, layer) pair, so there is no double-counting across the hierarchy.
Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | LayeredStore<LayeredStore<S>> inner stores |
none — fully independent |
| Across layers within a partition | LayeredStore<S> inner stores |
none — disjoint kmer sets |
Normalised pass 1 (col_weights) |
per inner store | none — additive |
| Normalised pass 2 (partial) | per inner store | global broadcast read-only |
| Within a matrix (distance) | upper-triangle pair (i,j) |
none — rayon par_iter |
All levels use rayon par_iter internally; reduce_with performs a parallel tree reduction.
Query model
Point query — kmer → Option<Item>
minimiser(kmer) → partition p
for each layer l in p:
slot = MphfLayer_l.find(kmer)
if slot is Some:
return DataStore_l.get(slot)
return None
O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).
Aggregation — → Result
result = reduce(
for p in partitions: // parallel
for l in layers(p): // parallel
partial(DataStore_p_l)
)
For normalised metrics replace with the two-pass scheme above.
DataStore derivation
Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
count_store = open PersistentCompactIntMatrix at (p, l)
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one (partition, layer) pair.
Relationship to current implementation
What is implemented
obicompactvec::traits:ColumnWeights,CountPartials,BitPartialsare defined and implemented onPersistentCompactIntMatrixandPersistentBitMatrix.obilayeredmap::LayeredStore<S>: generic wrapper with blanket impls for all three traits.LayeredStore<LayeredStore<S>>is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.
What is not yet implemented
Layer<D: LayerData>still fusesMphfLayerand oneDataStore. Multiple data stores on the same MPHF are not supported.LayeredMapis a single-partition structure without distance matrix API; it does not yet useLayeredStore.- No
PartitionedIndextype for point queries with parallel partition dispatch.
Planned refactoring
- Extract
MphfLayerfromLayer<D>as an autonomous type. - Replace
LayerDatatrait with theDataStore/ColumnWeights/CountPartials/BitPartialssystem. - Rewire
LayeredMapto holdLayeredStore<PersistentCompactIntMatrix>(or bit variant) alongside the MPHF layers. - Implement
PartitionedIndexusingLayeredStore<LayeredStore<S>>for data and parallel dispatch for queries.