Kmer index architecture
+Fundamental invariant
+A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
Three-level hierarchy
+PartitionedIndex
+├── LayeredPartition (one per minimiser bucket)
+│ ├── MphfLayer 0 kmer → slot (immutable bijection)
+│ │ ├── DataStore A slot → T (e.g. counts)
+│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
+│ ├── MphfLayer 1
+│ │ └── DataStore A
+│ └── ...
+├── LayeredPartition
+│ └── ...
+PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
+LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
MphfLayer: the MPHF + evidence + unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.
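The routing step of PartitionedIndex can be illustrated with a minimal sketch. It assumes kmers are ASCII byte strings, that the minimiser is the canonical m-mer with the smallest hash, and that FNV-1a stands in for the real routing hash; the function names (`revcomp`, `canonical`, `fnv1a`, `partition_of`) and the parameter `m` are hypothetical and not taken from the crate.

```rust
/// Reverse complement of an ASCII DNA sequence (A, C, G, T; other bytes pass through).
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other,
        })
        .collect()
}

/// Canonical form: the lexicographically smaller of a sequence and its reverse complement.
fn canonical(seq: &[u8]) -> Vec<u8> {
    let rc = revcomp(seq);
    if rc.as_slice() < seq { rc } else { seq.to_vec() }
}

/// FNV-1a over bytes. Assumption: only a stand-in for the real routing hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Partition of a kmer: smallest hash over its canonical m-mers, modulo the partition count.
/// Panics if the kmer is shorter than `m` or if `m` is zero.
fn partition_of(kmer: &[u8], m: usize, n_partitions: usize) -> usize {
    let min_hash = kmer
        .windows(m)
        .map(|w| fnv1a(&canonical(w)))
        .min()
        .expect("kmer must be at least m bases long");
    (min_hash % n_partitions as u64) as usize
}
```

Because the minimiser is taken over canonical m-mers, a kmer and its reverse complement land in the same partition, which is what the fundamental invariant needs from the routing scheme.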
+
MphfLayer — autonomous mapping layer
+MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
+MphfLayer::n() -> usize // number of slots
+MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
+MphfLayer::open(dir: &Path) -> OLMResult<Self>
+find returns Some(slot) only if the kmer is actually in this layer (evidence check included). Returns None for kmers present in other layers or absent from the index.
The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.
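The role of the evidence check in find can be sketched with a toy layer. This is only an illustration under stated assumptions: kmers are plain `u64` values rather than CanonicalKmer, a `HashMap` imitates the minimal perfect hash, and `ToyMphfLayer` and `raw_slot` are hypothetical names. The real on-disk formats of mphf.bin and evidence.bin are not modelled.

```rust
use std::collections::HashMap;

/// Toy stand-in for an MPHF layer. `raw` imitates the minimal perfect hash,
/// `evidence` stores the kmer that owns each slot.
struct ToyMphfLayer {
    raw: HashMap<u64, usize>,
    evidence: Vec<u64>,
}

impl ToyMphfLayer {
    /// Assign slots 0..n to the distinct kmers of the build set.
    fn build(kmers: &[u64]) -> Self {
        let mut evidence: Vec<u64> = kmers.to_vec();
        evidence.sort_unstable();
        evidence.dedup();
        let raw = evidence
            .iter()
            .enumerate()
            .map(|(slot, &k)| (k, slot))
            .collect();
        ToyMphfLayer { raw, evidence }
    }

    /// Number of slots.
    fn n(&self) -> usize {
        self.evidence.len()
    }

    /// Like a real MPHF, return some valid slot even for kmers outside the build set.
    fn raw_slot(&self, kmer: u64) -> usize {
        match self.raw.get(&kmer) {
            Some(&slot) => slot,
            None => (kmer % self.n() as u64) as usize,
        }
    }

    /// Slot of `kmer`, or None when the evidence stored at that slot does not match.
    fn find(&self, kmer: u64) -> Option<usize> {
        if self.n() == 0 {
            return None;
        }
        let slot = self.raw_slot(kmer);
        if self.evidence[slot] == kmer { Some(slot) } else { None }
    }
}
```

Without the comparison against `evidence`, a kmer from another layer would silently receive an arbitrary slot, and point queries could no longer stop at the first layer that answers.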
+
DataStore — slot-indexed data
+trait DataStore {
+ type Item;
+ fn get(&self, slot: usize) -> Self::Item;
+ fn n(&self) -> usize;
+}
+Concrete types from obicompactvec:
| Type | Item | Column stats | Use |
|---|---|---|---|
| PersistentCompactIntMatrix | Box<[u32]> | sum() -> Array1<u64> | count per sample per slot |
| PersistentBitMatrix | Box<[bool]> | count_ones() -> Array1<u64> | presence per sample per slot |
sum() and count_ones() are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.
A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only.
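As an illustration of the trait, here is a minimal in-memory store. `VecCountStore` is a hypothetical type, not one of the obicompactvec types; it keeps counts row-major in a `Vec<u32>` and returns column sums as a `Vec<u64>` in place of `Array1<u64>`.

```rust
/// The trait as described above.
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical in-memory count store: one row per slot, one column per sample.
struct VecCountStore {
    n_cols: usize,
    data: Vec<u32>, // row-major, length n_slots * n_cols
}

impl DataStore for VecCountStore {
    type Item = Box<[u32]>;

    /// Counts of every sample at one slot. Panics if `slot` is out of range.
    fn get(&self, slot: usize) -> Self::Item {
        let start = slot * self.n_cols;
        self.data[start..start + self.n_cols]
            .to_vec()
            .into_boxed_slice()
    }

    /// Number of slots.
    fn n(&self) -> usize {
        if self.n_cols == 0 { 0 } else { self.data.len() / self.n_cols }
    }
}

impl VecCountStore {
    /// Column sums: total count of each sample over all slots of this store.
    fn sum(&self) -> Vec<u64> {
        let mut sums = vec![0u64; self.n_cols];
        for (idx, &v) in self.data.iter().enumerate() {
            sums[idx % self.n_cols] += v as u64;
        }
        sums
    }
}
```

Nothing in the store refers to kmers: the only key is the slot, so the same MphfLayer can serve any number of such stores.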
+
Distance matrix API on DataStore types
+Both PersistentCompactIntMatrix and PersistentBitMatrix expose two families of distance matrix methods.
Full distance matrices
+Compute the final n_cols × n_cols distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
// PersistentCompactIntMatrix
+fn bray_dist_matrix(&self) -> Array2<f64>
+fn relfreq_bray_dist_matrix(&self) -> Array2<f64>
+fn euclidean_dist_matrix(&self) -> Array2<f64>
+fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
+fn hellinger_dist_matrix(&self) -> Array2<f64>
+fn jaccard_dist_matrix(&self) -> Array2<f64>
+fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
+
+// PersistentBitMatrix
+fn jaccard_dist_matrix(&self) -> Array2<f64>
+fn hamming_dist_matrix(&self) -> Array2<u64>
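These are convenience methods for a single matrix. They cannot be applied per layer and then combined for a LayeredDataStore or PartitionedDataStore, because a final distance is a ratio (or a square root) of sums and is not additive across (partition, layer) pairs; the partial API is required.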
Partial distance matrices
+Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
+Category 1 — self-contained partials: additive without any external parameter.
+// PersistentCompactIntMatrix
+fn partial_bray_dist_matrix(&self)
+ -> (Array2<u64>, // sum_min[i,j]
+ Array1<u64>) // col_sums[k]
+
+fn partial_euclidean_dist_matrix(&self) -> Array2<f64> // sum of squared diffs
+fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
+ -> (Array2<u64>, // inter[i,j]
+ Array2<u64>) // union[i,j]
+
+// PersistentBitMatrix
+fn partial_jaccard_dist_matrix(&self)
+ -> (Array2<u64>, // inter[i,j]
+ Array2<u64>) // union[i,j]
+fn partial_hamming_dist_matrix(&self) -> Array2<u64> // differing bits
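The additivity of a self-contained partial can be checked on a small example. The sketch below assumes a row-major `Vec<Vec<u32>>` as a stand-in for one (partition, layer) matrix and plain nested vectors in place of `Array2`/`Array1`; `partial_bray` and `add_partials` are hypothetical free functions, not the crate's methods.

```rust
/// (sum_min[i][j], col_sums[k]) for one matrix.
type BrayPartial = (Vec<Vec<u64>>, Vec<u64>);

/// Bray-Curtis partial of one matrix: rows[slot][sample].
fn partial_bray(rows: &[Vec<u32>], n_cols: usize) -> BrayPartial {
    let mut sum_min = vec![vec![0u64; n_cols]; n_cols];
    let mut col_sums = vec![0u64; n_cols];
    for row in rows {
        for i in 0..n_cols {
            col_sums[i] += row[i] as u64;
            for j in 0..n_cols {
                sum_min[i][j] += row[i].min(row[j]) as u64;
            }
        }
    }
    (sum_min, col_sums)
}

/// Element-wise sum of two partials.
fn add_partials(mut a: BrayPartial, b: &BrayPartial) -> BrayPartial {
    for (ra, rb) in a.0.iter_mut().zip(&b.0) {
        for (x, y) in ra.iter_mut().zip(rb) {
            *x += *y;
        }
    }
    for (x, y) in a.1.iter_mut().zip(&b.1) {
        *x += *y;
    }
    a
}
```

Summing the partials of two disjoint layers gives the same result as computing the partial over the concatenation of their rows, which is why no external parameter is needed for this category.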
+Category 2 — normalised partials: require global column sums as input, computed beforehand across all (partition, layer) pairs.
+// PersistentCompactIntMatrix only
+fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
+ -> Array2<f64> // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
+
+fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
+ -> Array2<f64> // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
+
+fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
+ -> Array2<f64> // Σ_slot (√(a/sum_i) - √(b/sum_j))²
+The col_sums parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.
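A small numeric sketch shows why the sums must be global. It assumes nested vectors in place of the matrix types; `column_sums` and `partial_relfreq_euclidean` are hypothetical free functions that mirror the formula in the comment above, and a zero column sum is treated as frequency 0 (an assumption).

```rust
/// Column sums of one matrix: rows[slot][sample].
fn column_sums(rows: &[Vec<u32>], n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for row in rows {
        for (s, &c) in sums.iter_mut().zip(row) {
            *s += c as u64;
        }
    }
    sums
}

/// Σ_slot (a_slot/sum_i - b_slot/sum_j)² for one matrix, given column sums.
fn partial_relfreq_euclidean(rows: &[Vec<u32>], col_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = col_sums.len();
    let mut out = vec![vec![0.0f64; n]; n];
    for row in rows {
        // Relative frequency of each sample at this slot.
        let freq: Vec<f64> = row
            .iter()
            .zip(col_sums)
            .map(|(&c, &s)| if s == 0 { 0.0 } else { c as f64 / s as f64 })
            .collect();
        for i in 0..n {
            for j in 0..n {
                let d = freq[i] - freq[j];
                out[i][j] += d * d;
            }
        }
    }
    out
}
```

With two single-slot layers `[[1, 3]]` and `[[3, 1]]`, the global sums are `[4, 4]` and the summed partial for the pair (0, 1) is 0.25 + 0.25 = 0.5. Using each layer's own sums instead normalises every row to `[1, 1]` and yields 0, which is not the distance between the two samples.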
+
Progressive aggregation principle
+Aggregation is hierarchical: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.
+PersistentCompactIntMatrix::sum() — column sums for one (partition, layer) matrix
+ ↓ Σ across layers
+LayeredCompactIntMatrix::sum() — column sums for one partition
+ ↓ Σ across partitions
+PartitionedCompactIntMatrix::sum() — global column sums
+The same cascade applies to every partial computation:
+PersistentCompactIntMatrix::partial_bray_dist_matrix() — one (partition, layer)
+ ↓ element-wise Σ across layers
+LayeredCompactIntMatrix::partial_bray() — one partition
+ ↓ element-wise Σ across partitions
+PartitionedCompactIntMatrix::partial_bray() — global partial → final dist
+This means LayeredCompactIntMatrix never inspects individual PersistentCompactIntVec columns directly, and PartitionedCompactIntMatrix never inspects individual layers. Each level presents a stable API surface to the level above.
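The cascade can be sketched with three small types in which each level only calls the level directly below it. The types `Matrix`, `Layered` and `Partitioned` are simplified stand-ins for the named structures, the reduction is serial rather than parallel, and `n_cols` is passed explicitly; all of these are assumptions made to keep the sketch dependency-free.

```rust
/// One (partition, layer) matrix: rows[slot][sample].
struct Matrix {
    rows: Vec<Vec<u32>>,
}

/// All layers of one partition.
struct Layered {
    layers: Vec<Matrix>,
}

/// All partitions of the index.
struct Partitioned {
    partitions: Vec<Layered>,
}

/// Element-wise accumulation of column sums.
fn add_into(acc: &mut [u64], other: &[u64]) {
    for (a, b) in acc.iter_mut().zip(other) {
        *a += *b;
    }
}

impl Matrix {
    fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut sums = vec![0u64; n_cols];
        for row in &self.rows {
            for (s, &c) in sums.iter_mut().zip(row) {
                *s += c as u64;
            }
        }
        sums
    }
}

impl Layered {
    /// Only calls Matrix::sum; never touches rows directly.
    fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut sums = vec![0u64; n_cols];
        for m in &self.layers {
            add_into(&mut sums, &m.sum(n_cols));
        }
        sums
    }
}

impl Partitioned {
    /// Only calls Layered::sum; never touches individual layers.
    fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut sums = vec![0u64; n_cols];
        for p in &self.partitions {
            add_into(&mut sums, &p.sum(n_cols));
        }
        sums
    }
}
```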
+
LayeredDataStore — aggregation within one partition
+A LayeredDataStore holds one DataStore per layer within a single partition:
struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
+struct LayeredBitMatrix { layers: Vec<PersistentBitMatrix> }
+Column statistics
+// LayeredCompactIntMatrix
+fn sum(&self) -> Array1<u64>
+ // = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)
+
+// LayeredBitMatrix
+fn count_ones(&self) -> Array1<u64>
+ // = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
+Self-contained partials
+Each method reduces across layers by element-wise addition of per-layer matrices:
+fn partial_bray(&self) -> (Array2<u64>, Array1<u64>)
+ // Σ_l layer_l.partial_bray_dist_matrix()
+
+fn partial_euclidean(&self) -> Array2<f64>
+ // Σ_l layer_l.partial_euclidean_dist_matrix()
+
+fn partial_jaccard(&self) -> (Array2<u64>, Array2<u64>)
+ // Σ_l layer_l.partial_jaccard_dist_matrix() [bit matrix]
+ // Σ_l layer_l.partial_threshold_jaccard_dist_matrix() [int matrix]
+
+fn partial_hamming(&self) -> Array2<u64>
+ // Σ_l layer_l.partial_hamming_dist_matrix() [bit matrix]
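For the bit-matrix case, the layer reduction can be sketched as follows. Rows of `bool` stand in for PersistentBitMatrix, the function names are hypothetical, and the finalisation uses the usual Jaccard distance 1 - inter/union with an empty union mapped to distance 0 (that last convention is an assumption).

```rust
/// (inter[i][j], union[i][j]) accumulated over slots.
type JaccardPartial = (Vec<Vec<u64>>, Vec<Vec<u64>>);

/// Partial for one layer: rows[slot][sample].
fn partial_jaccard(rows: &[Vec<bool>], n: usize) -> JaccardPartial {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for row in rows {
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += (row[i] && row[j]) as u64;
                uni[i][j] += (row[i] || row[j]) as u64;
            }
        }
    }
    (inter, uni)
}

/// Element-wise sum of the per-layer partials of one partition.
fn layered_partial_jaccard(layers: &[Vec<Vec<bool>>], n: usize) -> JaccardPartial {
    let mut inter = vec![vec![0u64; n]; n];
    let mut uni = vec![vec![0u64; n]; n];
    for layer in layers {
        let (li, lu) = partial_jaccard(layer, n);
        for i in 0..n {
            for j in 0..n {
                inter[i][j] += li[i][j];
                uni[i][j] += lu[i][j];
            }
        }
    }
    (inter, uni)
}

/// Jaccard distance from summed partials; an empty union gives distance 0.
fn finalise_jaccard(inter: &[Vec<u64>], uni: &[Vec<u64>]) -> Vec<Vec<f64>> {
    inter
        .iter()
        .zip(uni)
        .map(|(ri, ru)| {
            ri.iter()
                .zip(ru)
                .map(|(&a, &b)| if b == 0 { 0.0 } else { 1.0 - a as f64 / b as f64 })
                .collect()
        })
        .collect()
}
```

The ratio is taken only once, after all layers have been summed; dividing per layer and averaging would give a different and incorrect value.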
+Normalised partials (require global sums from above)
+fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
+ // Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)
+
+fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
+ // Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)
+
+fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
+ // Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
+global_sums is provided by the PartitionedDataStore; this level does not compute it.
+
PartitionedDataStore — aggregation across all partitions
+A PartitionedDataStore holds one LayeredDataStore per partition:
struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
+struct PartitionedBitMatrix { partitions: Vec<LayeredBitMatrix> }
+Column statistics
+fn sum(&self) -> Array1<u64>
+ // = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
+p.sum() is itself a reduction across layers (see above) — the cascade is preserved.
Self-contained metrics — single pass
fn bray_dist_matrix(&self) -> Array2<f64> {
    // single pass: sum the per-partition partials element-wise (pseudocode)
    let (sum_min, col_sums) = self.partitions
        .par_iter()
        .map(|p| p.partial_bray())
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
}
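The finalisation step can be written out as a small function. This sketch assumes nested vectors in place of `Array2`/`Array1`; `finalise_bray` is a hypothetical name, and treating two empty samples as identical (distance 0) is an assumption, since the formula above is undefined when both column sums are zero.

```rust
/// Bray-Curtis distances from globally summed partials:
/// dist[i][j] = 1 - 2 * sum_min[i][j] / (col_sums[i] + col_sums[j]).
fn finalise_bray(sum_min: &[Vec<u64>], col_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = col_sums.len();
    let mut dist = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in 0..n {
            let denom = col_sums[i] + col_sums[j];
            // Assumption: two empty samples keep the default distance 0.
            if denom > 0 {
                dist[i][j] = 1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64;
            }
        }
    }
    dist
}
```

For example, with `sum_min = [[8, 6], [6, 8]]` and `col_sums = [8, 8]`, the off-diagonal distance is 1 - 12/16 = 0.25 and the diagonal is 0.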
+Normalised metrics — two passes
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    // pass 1 (progressive): PartitionedDataStore::sum()
    //   calls LayeredDataStore::sum() per partition (parallel)
    //   calls PersistentCompactIntMatrix::sum() per layer (parallel)
    let global_sums = self.sum();

    // pass 2: per-partition partial using global_sums (parallel, pseudocode)
    let matrix = self.partitions
        .par_iter()
        .map(|p| p.partial_relfreq_bray(&global_sums))
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - matrix[i,j]
}
+global_sums is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.
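The two passes can be traced end to end on a toy index. The sketch below is serial and uses nested vectors (`Partition = Vec<Matrix>`) as stand-ins for the layered and partitioned stores; all function names are hypothetical, and a sample with a zero global sum is given frequency 0 (an assumption).

```rust
type Matrix = Vec<Vec<u32>>; // rows[slot][sample], one (partition, layer)
type Partition = Vec<Matrix>; // layers of one partition

/// Column sums of one matrix.
fn matrix_sum(m: &Matrix, n: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n];
    for row in m {
        for (s, &c) in sums.iter_mut().zip(row) {
            *s += c as u64;
        }
    }
    sums
}

/// Σ_slot min(a_slot/sum_i, b_slot/sum_j) for one matrix, given global sums.
fn matrix_partial(m: &Matrix, global_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = global_sums.len();
    let mut out = vec![vec![0.0f64; n]; n];
    for row in m {
        let freq: Vec<f64> = row
            .iter()
            .zip(global_sums)
            .map(|(&c, &s)| if s == 0 { 0.0 } else { c as f64 / s as f64 })
            .collect();
        for i in 0..n {
            for j in 0..n {
                out[i][j] += freq[i].min(freq[j]);
            }
        }
    }
    out
}

/// Two-pass relative-frequency Bray-Curtis over all partitions and layers.
fn relfreq_bray(index: &[Partition], n: usize) -> Vec<Vec<f64>> {
    // Pass 1: global column sums.
    let mut global_sums = vec![0u64; n];
    for partition in index {
        for layer in partition {
            for (g, s) in global_sums.iter_mut().zip(matrix_sum(layer, n)) {
                *g += s;
            }
        }
    }
    // Pass 2: sum the normalised partials, then finalise.
    let mut acc = vec![vec![0.0f64; n]; n];
    for partition in index {
        for layer in partition {
            let part = matrix_partial(layer, &global_sums);
            for i in 0..n {
                for j in 0..n {
                    acc[i][j] += part[i][j];
                }
            }
        }
    }
    for row in acc.iter_mut() {
        for v in row.iter_mut() {
            *v = 1.0 - *v;
        }
    }
    acc
}
```

With slots `[1, 3]` and `[3, 1]` in two layers of one partition and `[4, 0]` in another partition, the global sums are `[8, 4]`, the summed minima are 0.125 + 0.25 + 0 = 0.375, and the distance between the two samples is 0.625.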
+
Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | LayeredDataStore | none: fully independent |
| Across layers (self-contained) | (partition, layer) pair | none: disjoint kmer sets |
| Across layers (normalised, pass 1) | (partition, layer) pair | none: sums are additive |
| Across layers (normalised, pass 2) | (partition, layer) pair | global_sums broadcast read-only |
| Within a DataStore (distance matrix) | upper-triangle pair (i,j) | none: rayon par_iter |
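The "no coordination" rows of the table can be illustrated with one worker per partition. To stay dependency-free, this sketch swaps rayon for `std::thread::scope` from the standard library; `partition_sum` and `parallel_sum` are hypothetical names and the nested vectors are stand-ins for the real stores.

```rust
use std::thread;

type Matrix = Vec<Vec<u32>>; // rows[slot][sample], one (partition, layer)
type Partition = Vec<Matrix>; // layers of one partition

/// Column sums of one partition, reduced over its layers.
fn partition_sum(p: &Partition, n: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n];
    for layer in p {
        for row in layer {
            for (s, &c) in sums.iter_mut().zip(row) {
                *s += c as u64;
            }
        }
    }
    sums
}

/// One scoped thread per partition; workers share nothing but read-only input.
fn parallel_sum(index: &[Partition], n: usize) -> Vec<u64> {
    let partials = thread::scope(|scope| {
        let handles: Vec<_> = index
            .iter()
            .map(|p| scope.spawn(move || partition_sum(p, n)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect::<Vec<Vec<u64>>>()
    });
    // The only sequential step is the final element-wise addition.
    let mut total = vec![0u64; n];
    for part in partials {
        for (t, x) in total.iter_mut().zip(part) {
            *t += x;
        }
    }
    total
}
```

No locks or channels are needed: each worker reads its own partition and returns an owned vector, and the results are combined after the join.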
+
Query model
+Point query — kmer → Option<Item>
+minimiser(kmer) → partition p
+for each layer l in p:
+ slot = MphfLayer_l.find(kmer)
+ if slot is Some:
+ return DataStore_l.get(slot)
+return None
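O(n_layers) MPHF probes in the worst case, reached when the kmer is absent or sits in the last layer; a hit in layer 0 costs a single probe, so the expected cost is O(1) when most queried kmers are in the first layers. No cross-layer fusion: the result comes from exactly one (partition, layer).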
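The point-query loop can be sketched with toy types. Here a `HashMap` plays the role of MphfLayer::find (evidence check included), kmers are `u64` values, a single `u32` per slot stands in for the DataStore item, and routing is a plain modulo rather than the canonical minimiser hash; `ToyLayer` and `ToyIndex` are hypothetical names.

```rust
use std::collections::HashMap;

/// Toy layer: kmer -> slot mapping plus one slot-indexed value array.
struct ToyLayer {
    slots: HashMap<u64, usize>,
    values: Vec<u32>,
}

impl ToyLayer {
    /// Build from (kmer, value) pairs; slots follow the input order.
    fn new(entries: &[(u64, u32)]) -> Self {
        let slots = entries
            .iter()
            .enumerate()
            .map(|(slot, &(kmer, _))| (kmer, slot))
            .collect();
        let values = entries.iter().map(|&(_, v)| v).collect();
        ToyLayer { slots, values }
    }

    /// Stand-in for MphfLayer::find.
    fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

/// partitions[p] holds the ordered layers of partition p.
struct ToyIndex {
    partitions: Vec<Vec<ToyLayer>>,
}

impl ToyIndex {
    /// Hypothetical routing; panics if there are no partitions.
    fn route(&self, kmer: u64) -> usize {
        (kmer % self.partitions.len() as u64) as usize
    }

    /// Probe the layers of one partition in order and stop at the first hit.
    fn get(&self, kmer: u64) -> Option<u32> {
        let layers = &self.partitions[self.route(kmer)];
        layers
            .iter()
            .find_map(|layer| layer.find(kmer).map(|slot| layer.values[slot]))
    }
}
```

Stopping at the first hit is safe only because layers within a partition are disjoint, so at most one layer can answer.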
+Aggregation — → Result
+result = reduce(
+ for p in partitions: // parallel
+ for l in layers(p): // parallel
+ partial(DataStore_p_l)
+)
+For normalised metrics replace with the two-pass scheme above.
DataStore derivation
+Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
// count → presence/absence, parallel across (partition, layer)
+for (p, l) in all_partition_layer_pairs().par_iter():
+ count_store = open PersistentCompactIntMatrix at (p, l)
+ presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
+Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one (partition, layer) pair.
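Two of these derivations can be sketched on in-memory rows. The functions below are hypothetical stand-ins, not the obicompactvec API; in particular, whether `from_count_matrix` treats a count equal to the threshold as present is not stated above, so the `>=` comparison is an assumption.

```rust
/// Threshold a count matrix into presence/absence, keeping the slot order.
/// Assumption: a count equal to the threshold counts as present.
fn presence_from_counts(counts: &[Vec<u32>], threshold: u32) -> Vec<Vec<bool>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c >= threshold).collect())
        .collect()
}

/// Merge two count matrices defined over the same slots with saturating addition.
fn merge_counts(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| {
            ra.iter()
                .zip(rb)
                .map(|(&x, &y)| x.saturating_add(y))
                .collect()
        })
        .collect()
}
```

Both functions read and write rows in slot order and never consult a kmer, which is why the derived store can be attached to the same MphfLayer as its source.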
+
Relationship to current implementation
+The current obilayeredmap crate implements a subset of this architecture. Key divergences:
- Layer<D: LayerData> fuses MphfLayer and one DataStore into a single generic type. Multiple data stores on the same MPHF are not supported.
- LayerData::open(dir) embeds the path convention (counts/, presence/) inside the store type, preventing the PartitionedIndex from managing paths externally.
- LayeredDataStore and PartitionedDataStore do not yet exist; LayeredMap is a single-partition structure without a distance matrix API.
- The partial distance methods exist on PersistentCompactIntMatrix and PersistentBitMatrix and are tested; they are not yet composed across layers and partitions.
+
Planned refactoring:
+1. Extract MphfLayer from Layer<D> as an autonomous type.
+2. Replace LayerData trait with DataStore trait (no path knowledge).
+3. Implement LayeredCompactIntMatrix / LayeredBitMatrix with the partial + full distance APIs described above.
+4. Implement PartitionedCompactIntMatrix / PartitionedBitMatrix with two-pass support for normalised metrics.
+5. Implement PartitionedIndex for point queries with parallel dispatch.