Skip to content

Kmer index architecture

Fundamental invariant

A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.


Three-level hierarchy

PartitionedIndex
├── LayeredPartition  (one per minimiser bucket)
│   ├── MphfLayer 0         kmer → slot  (immutable bijection)
│   │   ├── DataStore A     slot → T     (e.g. counts)
│   │   └── DataStore B     slot → T     (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...

PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.

MphfLayer: the MPHF + evidence + unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.


MphfLayer — autonomous mapping layer

MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>   // slot, or None if absent
MphfLayer::n() -> usize                                  // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>

find returns Some(slot) only if the kmer is actually in this layer (evidence check included). Returns None for kmers present in other layers or absent from the index.

The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.


DataStore — slot-indexed data

trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

Concrete types from obicompactvec:

Type Item Column stats Use
PersistentCompactIntMatrix Box<[u32]> sum() -> Array1<u64> count per sample per slot
PersistentBitMatrix Box<[bool]> count_ones() -> Array1<u64> presence per sample per slot

sum() and count_ones() are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.

A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only.


Distance matrix API on DataStore types

Both PersistentCompactIntMatrix and PersistentBitMatrix expose two families of distance matrix methods.

Full distance matrices

Compute the final n_cols × n_cols distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.

// PersistentCompactIntMatrix
fn bray_dist_matrix(&self)              -> Array2<f64>
fn relfreq_bray_dist_matrix(&self)      -> Array2<f64>
fn euclidean_dist_matrix(&self)         -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self)         -> Array2<f64>
fn jaccard_dist_matrix(&self)           -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>

// PersistentBitMatrix
fn jaccard_dist_matrix(&self)           -> Array2<f64>
fn hamming_dist_matrix(&self)           -> Array2<u64>

These are convenience methods. For a LayeredDataStore or PartitionedDataStore they cannot be used directly — the partial API is required.

Partial distance matrices

Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.

Category 1 — self-contained partials: additive without any external parameter.

// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
    -> (Array2<u64>,  // sum_min[i,j]
        Array1<u64>)  // col_sums[k]

fn partial_euclidean_dist_matrix(&self)       -> Array2<f64>   // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    -> (Array2<u64>,  // inter[i,j]
        Array2<u64>)  // union[i,j]

// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
    -> (Array2<u64>,  // inter[i,j]
        Array2<u64>)  // union[i,j]
fn partial_hamming_dist_matrix(&self)         -> Array2<u64>   // differing bits

Category 2 — normalised partials: require global column sums as input, computed beforehand across all (partition, layer) pairs.

// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)

fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²

fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²

The col_sums parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.


Progressive aggregation principle

Aggregation is hierarchical: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.

PersistentCompactIntMatrix::col_weights()   — column sums for one (partition, layer) matrix
        ↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights()   — column sums for one partition
        ↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights()              — global column sums

The same cascade applies to every partial:

PersistentCompactIntMatrix::partial_bray()   — one (partition, layer)
        ↓ element-wise Σ across layers
LayeredStore<PersistentCompactIntMatrix>::partial_bray()   — one partition
        ↓ element-wise Σ across partitions
LayeredStore<LayeredStore<…>>::partial_bray()              — global partial → final dist

Each level presents a stable trait surface to the level above; no level reaches two levels down.


Traits — obicompactvec::traits

Three traits unify the aggregation API across all levels of the hierarchy.

trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;
}

trait CountPartials: ColumnWeights {
    // self-contained partials (additive, no parameter)
    fn partial_bray(&self)                                          -> Array2<u64>;
    fn partial_euclidean(&self)                                     -> Array2<f64>;
    fn partial_threshold_jaccard(&self, threshold: u32)             -> (Array2<u64>, Array2<u64>);
    // normalised partials (global col_weights passed in cascade)
    fn partial_relfreq_bray(&self, global: &Array1<u64>)            -> Array2<f64>;
    fn partial_relfreq_euclidean(&self, global: &Array1<u64>)       -> Array2<f64>;
    fn partial_hellinger(&self, global: &Array1<u64>)               -> Array2<f64>;
    // provided finalisation methods (default implementations)
    fn bray_dist_matrix(&self)                                      -> Array2<f64> {  }
    fn euclidean_dist_matrix(&self)                                 -> Array2<f64> {  }
    fn threshold_jaccard_dist_matrix(&self, threshold: u32)         -> Array2<f64> {  }
    fn relfreq_bray_dist_matrix(&self)                              -> Array2<f64> {  }
    fn relfreq_euclidean_dist_matrix(&self)                         -> Array2<f64> {  }
    fn hellinger_dist_matrix(&self)                                 -> Array2<f64> {  }
}

trait BitPartials: ColumnWeights {
    fn partial_jaccard(&self)   -> (Array2<u64>, Array2<u64>);
    fn partial_hamming(&self)   -> Array2<u64>;
    // provided
    fn jaccard_dist_matrix(&self) -> Array2<f64> {  }
    fn hamming_dist_matrix(&self) -> Array2<u64> {  }
}

Leaf implementors (in obicompactvec):

Type Traits
PersistentCompactIntMatrix ColumnWeights (via sum()), CountPartials
PersistentBitMatrix ColumnWeights (via count_ones()), BitPartials

PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.


LayeredStore<S>obilayeredmap

A single generic wrapper replaces the need for named LayeredDataStore and PartitionedDataStore types:

pub struct LayeredStore<S>(Vec<S>);

Three blanket impls propagate the traits up the hierarchy:

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {  }  // Σ across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> {  }  // same pattern
impl<S: BitPartials>   BitPartials    for LayeredStore<S> {  }  // same pattern

Because the blanket impl is recursive, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — no separate PartitionedStore type is needed:

PersistentCompactIntMatrix            implements CountPartials
LayeredStore<PersistentCompactIntMatrix>    via blanket impl (= one partition)
LayeredStore<LayeredStore<…>>              via blanket impl (= partitioned index)

Normalised metrics — two-pass cascade

The normalised finalisation methods call col_weights() first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:

// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    let global = self.col_weights();            // pass 1 — progressive sum at every level
    let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
    p.mapv(|v| 1.0 - v)                         // finalise (diagonal zeroed separately)
}

global is exact: each kmer belongs to exactly one (partition, layer) pair, so there is no double-counting across the hierarchy.


Parallelism model

Level Unit Coordination
Across partitions LayeredStore<LayeredStore<S>> inner stores none — fully independent
Across layers within a partition LayeredStore<S> inner stores none — disjoint kmer sets
Normalised pass 1 (col_weights) per inner store none — additive
Normalised pass 2 (partial) per inner store global broadcast read-only
Within a matrix (distance) upper-triangle pair (i,j) none — rayon par_iter

All levels use rayon par_iter internally; reduce_with performs a parallel tree reduction.


Query model

Point query — kmer → Option<Item>

minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some:
        return DataStore_l.get(slot)
return None

O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).

Aggregation — → Result

result = reduce(
    for p in partitions:            // parallel
        for l in layers(p):         // parallel
            partial(DataStore_p_l)
)

For normalised metrics replace with the two-pass scheme above.


DataStore derivation

Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store   = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)

Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one (partition, layer) pair.


Relationship to current implementation

What is implemented

  • obicompactvec::traits: ColumnWeights, CountPartials, BitPartials are defined and implemented on PersistentCompactIntMatrix and PersistentBitMatrix.
  • obilayeredmap::LayeredStore<S>: generic wrapper with blanket impls for all three traits. LayeredStore<LayeredStore<S>> is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.

What is not yet implemented

  • Layer<D: LayerData> still fuses MphfLayer and one DataStore. Multiple data stores on the same MPHF are not supported.
  • LayeredMap is a single-partition structure without distance matrix API; it does not yet use LayeredStore.
  • No PartitionedIndex type for point queries with parallel partition dispatch.

Planned refactoring

  1. Extract MphfLayer from Layer<D> as an autonomous type.
  2. Replace LayerData trait with the DataStore / ColumnWeights / CountPartials / BitPartials system.
  3. Rewire LayeredMap to hold LayeredStore<PersistentCompactIntMatrix> (or bit variant) alongside the MPHF layers.
  4. Implement PartitionedIndex using LayeredStore<LayeredStore<S>> for data and parallel dispatch for queries.