obikmer/docmd/architecture/index_architecture.md
Eric Coissac 45d49ed501 docs: document k-mer index architecture and refactor distance metrics
Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
2026-05-15 21:24:30 +08:00


Kmer index architecture

Fundamental invariant

A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.


Three-level hierarchy

PartitionedIndex
├── LayeredPartition  (one per minimiser bucket)
│   ├── MphfLayer 0         kmer → slot  (immutable bijection)
│   │   ├── DataStore A     slot → T     (e.g. counts)
│   │   └── DataStore B     slot → T     (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...

PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.

MphfLayer: the minimal perfect hash function (MPHF), its evidence, and the unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.


MphfLayer — autonomous mapping layer

MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>   // slot, or None if absent
MphfLayer::n() -> usize                                  // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>

find returns Some(slot) only if the kmer is actually in this layer (the evidence check is included). It returns None for kmers that are present in other layers or absent from the index.

The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.
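The role of the evidence check in find can be illustrated with a toy layer. This is only a sketch under loud assumptions: kmers are plain u64 values, the "MPHF" is simulated by kmer % n (a bijection only for hand-picked key sets), and ToyMphfLayer with its field is a hypothetical name, not the obilayeredmap type.

```rust
/// Toy stand-in for an MPHF layer (hypothetical, not the obilayeredmap type).
/// `evidence[slot]` stores the kmer that owns the slot so membership can be verified.
pub struct ToyMphfLayer {
    evidence: Vec<u64>,
}

impl ToyMphfLayer {
    /// Builds a layer from keys whose residues modulo `keys.len()` are all distinct.
    /// Returns `None` when two keys collide, i.e. the toy hash is not a bijection.
    pub fn build(keys: &[u64]) -> Option<Self> {
        let n = keys.len();
        let mut owners: Vec<Option<u64>> = vec![None; n];
        for &k in keys {
            let slot = (k % n as u64) as usize;
            if owners[slot].is_some() {
                return None;
            }
            owners[slot] = Some(k);
        }
        // n keys landed in n distinct slots, so every slot is filled.
        Some(Self {
            evidence: owners.into_iter().flatten().collect(),
        })
    }

    /// Number of slots.
    pub fn n(&self) -> usize {
        self.evidence.len()
    }

    /// Returns the slot only if the kmer stored as evidence matches the query.
    /// A foreign kmer still hashes to some slot, but the comparison rejects it.
    pub fn find(&self, kmer: u64) -> Option<usize> {
        if self.evidence.is_empty() {
            return None;
        }
        let slot = (kmer % self.evidence.len() as u64) as usize;
        (self.evidence[slot] == kmer).then_some(slot)
    }
}
```

With keys 10, 21, 32, 43 the toy hash assigns slots 2, 1, 0, 3. Querying 25 lands on slot 1, which is owned by 21, so find returns None instead of a false positive.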


DataStore — slot-indexed data

trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

Concrete types from obicompactvec:

| Type | Item | Column stats | Use |
|---|---|---|---|
| PersistentCompactIntMatrix | Box<[u32]> | sum() -> Array1<u64> | count per sample per slot |
| PersistentBitMatrix | Box<[bool]> | count_ones() -> Array1<u64> | presence per sample per slot |

sum() and count_ones() are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.

A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only.
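As an illustration of how small the contract is, here is a minimal in-memory implementation of the trait. InMemoryCounts, its row-major layout, and the Vec<u64> return type of sum are assumptions made for this sketch; the real stores are persistent and return ndarray arrays.

```rust
/// The trait as given in this section.
pub trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical in-memory count matrix: row-major, `n_cols` counts per slot.
pub struct InMemoryCounts {
    n_cols: usize,
    data: Vec<u32>,
}

impl InMemoryCounts {
    pub fn new(n_cols: usize, data: Vec<u32>) -> Self {
        assert!(
            n_cols > 0 && data.len() % n_cols == 0,
            "data must hold whole rows"
        );
        Self { n_cols, data }
    }

    /// Column sums, playing the role of `sum()` for this toy store.
    pub fn sum(&self) -> Vec<u64> {
        let mut sums = vec![0u64; self.n_cols];
        for row in self.data.chunks(self.n_cols) {
            for (s, &c) in sums.iter_mut().zip(row) {
                *s += u64::from(c);
            }
        }
        sums
    }
}

impl DataStore for InMemoryCounts {
    type Item = Box<[u32]>;

    /// Returns the counts of every sample for one slot.
    fn get(&self, slot: usize) -> Box<[u32]> {
        let start = slot * self.n_cols;
        self.data[start..start + self.n_cols]
            .to_vec()
            .into_boxed_slice()
    }

    /// Number of slots (rows).
    fn n(&self) -> usize {
        self.data.len() / self.n_cols
    }
}
```

Nothing in this type refers to a kmer or a hash function, which is the property that lets several stores share one MphfLayer.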


Distance matrix API on DataStore types

Both PersistentCompactIntMatrix and PersistentBitMatrix expose two families of distance matrix methods.

Full distance matrices

Compute the final n_cols × n_cols distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.

// PersistentCompactIntMatrix
fn bray_dist_matrix(&self)              -> Array2<f64>
fn relfreq_bray_dist_matrix(&self)      -> Array2<f64>
fn euclidean_dist_matrix(&self)         -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self)         -> Array2<f64>
fn jaccard_dist_matrix(&self)           -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>

// PersistentBitMatrix
fn jaccard_dist_matrix(&self)           -> Array2<f64>
fn hamming_dist_matrix(&self)           -> Array2<u64>

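These are convenience methods for a single matrix. For a LayeredDataStore or PartitionedDataStore they cannot be used directly, because finished distances computed per layer cannot be combined into a global distance; the partial API is required.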

Partial distance matrices

Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.

Category 1 — self-contained partials: additive without any external parameter.

// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
    -> (Array2<u64>,  // sum_min[i,j]
        Array1<u64>)  // col_sums[k]

fn partial_euclidean_dist_matrix(&self)       -> Array2<f64>   // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    -> (Array2<u64>,  // inter[i,j]
        Array2<u64>)  // union[i,j]

// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
    -> (Array2<u64>,  // inter[i,j]
        Array2<u64>)  // union[i,j]
fn partial_hamming_dist_matrix(&self)         -> Array2<u64>   // differing bits
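The additivity of a self-contained partial can be checked on a small case. The sketch below assumes a bit matrix held as rows of bool (one row per slot) and nested Vec instead of Array2; partial_jaccard, add_counts and finalise_jaccard are hypothetical helper names, and returning 0.0 for an empty union is a convention chosen here.

```rust
/// Square matrix of counters, indexed as m[i][j].
pub type Counts = Vec<Vec<u64>>;

/// Per-layer Jaccard components: (inter, union) for every column pair.
pub fn partial_jaccard(rows: &[Vec<bool>], n_cols: usize) -> (Counts, Counts) {
    let mut inter = vec![vec![0u64; n_cols]; n_cols];
    let mut uni = vec![vec![0u64; n_cols]; n_cols];
    for row in rows {
        for i in 0..n_cols {
            for j in 0..n_cols {
                inter[i][j] += u64::from(row[i] && row[j]);
                uni[i][j] += u64::from(row[i] || row[j]);
            }
        }
    }
    (inter, uni)
}

/// Element-wise sum of two partial matrices of the same shape.
pub fn add_counts(a: &Counts, b: &Counts) -> Counts {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(x, y)| x + y).collect())
        .collect()
}

/// Final distance 1 - inter/union; an empty union is mapped to 0.0 (assumption).
pub fn finalise_jaccard(inter: &Counts, uni: &Counts) -> Vec<Vec<f64>> {
    inter
        .iter()
        .zip(uni)
        .map(|(ri, ru)| {
            ri.iter()
                .zip(ru)
                .map(|(&i, &u)| if u == 0 { 0.0 } else { 1.0 - i as f64 / u as f64 })
                .collect()
        })
        .collect()
}
```

Summing the partials of two layers gives the same (inter, union) pair as computing them on the concatenated slots, so the division only has to happen once, at the top of the hierarchy.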

Category 2 — normalised partials: require global column sums as input, computed beforehand across all (partition, layer) pairs.

// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)

fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²

fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²

The col_sums parameter must hold the global column sums, taken across all layers and all partitions. Passing per-layer sums gives a wrong result, because each layer would then normalise its counts by a different denominator. This constraint drives the two-pass algorithm described below.
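A two-layer toy case shows how far off per-layer sums can be. The functions below are hypothetical stand-ins working on rows of u32 and plain Vec, and they assume every column sum passed in is non-zero.

```rust
/// Column sums of one layer (rows = slots, columns = samples).
pub fn column_sums(rows: &[Vec<u32>], n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for row in rows {
        for (s, &c) in sums.iter_mut().zip(row) {
            *s += u64::from(c);
        }
    }
    sums
}

/// Category 2 partial for the relative-frequency Euclidean distance:
/// sum over slots of (a/sum_i - b/sum_j)^2, with `col_sums` supplied by the caller.
/// Assumes every entry of `col_sums` is non-zero.
pub fn partial_relfreq_euclidean(rows: &[Vec<u32>], col_sums: &[u64]) -> Vec<Vec<f64>> {
    let n = col_sums.len();
    let mut acc = vec![vec![0.0f64; n]; n];
    for row in rows {
        for i in 0..n {
            for j in 0..n {
                let a = f64::from(row[i]) / col_sums[i] as f64;
                let b = f64::from(row[j]) / col_sums[j] as f64;
                acc[i][j] += (a - b) * (a - b);
            }
        }
    }
    acc
}
```

Take layer 1 with the single slot [1, 3] and layer 2 with the single slot [3, 1]. The global sums are [4, 4], so the two samples have relative frequencies (0.25, 0.75) and (0.75, 0.25), and the summed partial for the pair (0, 1) is 0.25 + 0.25 = 0.5. If each layer is normalised by its own sums ([1, 3] and [3, 1]), every relative frequency becomes 1 and the partial collapses to 0, which would wrongly report the two samples as identical.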


Progressive aggregation principle

Aggregation is hierarchical: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.

PersistentCompactIntMatrix::sum()       — column sums for one (partition, layer) matrix
        ↓ Σ across layers
LayeredCompactIntMatrix::sum()          — column sums for one partition
        ↓ Σ across partitions
PartitionedCompactIntMatrix::sum()      — global column sums

The same cascade applies to every partial computation:

PersistentCompactIntMatrix::partial_bray_dist_matrix()   — one (partition, layer)
        ↓ element-wise Σ across layers
LayeredCompactIntMatrix::partial_bray()                   — one partition
        ↓ element-wise Σ across partitions
PartitionedCompactIntMatrix::partial_bray()               — global partial → final dist

This means LayeredCompactIntMatrix never inspects individual PersistentCompactIntVec columns directly, and PartitionedCompactIntMatrix never inspects individual layers. Each level presents a stable API surface to the level above.
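The cascade for column sums can be sketched with three small types, each calling only the level directly below it. Matrix, Layered and Partitioned are simplified hypothetical stand-ins for the types named above, the explicit n_cols argument is an artefact of the sketch, and sequential loops replace the parallel reductions.

```rust
/// One (partition, layer) matrix: one row per slot, one count per sample column.
pub struct Matrix {
    pub rows: Vec<Vec<u32>>,
}

/// All layers of one partition.
pub struct Layered {
    pub layers: Vec<Matrix>,
}

/// All partitions of the index.
pub struct Partitioned {
    pub partitions: Vec<Layered>,
}

/// Element-wise `acc += other`.
fn add_assign(acc: &mut [u64], other: &[u64]) {
    for (a, b) in acc.iter_mut().zip(other) {
        *a += b;
    }
}

impl Matrix {
    /// Column sums for one (partition, layer) matrix.
    pub fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut total = vec![0u64; n_cols];
        for row in &self.rows {
            for (t, &c) in total.iter_mut().zip(row) {
                *t += u64::from(c);
            }
        }
        total
    }
}

impl Layered {
    /// Column sums for one partition: only calls `Matrix::sum`.
    pub fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut total = vec![0u64; n_cols];
        for m in &self.layers {
            add_assign(&mut total, &m.sum(n_cols));
        }
        total
    }
}

impl Partitioned {
    /// Global column sums: only calls `Layered::sum`, never touches a matrix.
    pub fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut total = vec![0u64; n_cols];
        for p in &self.partitions {
            add_assign(&mut total, &p.sum(n_cols));
        }
        total
    }
}
```

Because Partitioned::sum never reaches into a Matrix, the storage of a layer could change without touching the top level.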


LayeredDataStore — aggregation within one partition

A LayeredDataStore holds one DataStore per layer within a single partition:

struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
struct LayeredBitMatrix         { layers: Vec<PersistentBitMatrix> }

Column statistics

// LayeredCompactIntMatrix
fn sum(&self) -> Array1<u64>
    // = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)

// LayeredBitMatrix
fn count_ones(&self) -> Array1<u64>
    // = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)

Self-contained partials

Each method reduces across layers by element-wise addition of per-layer matrices:

fn partial_bray(&self)          -> (Array2<u64>, Array1<u64>)
    // Σ_l layer_l.partial_bray_dist_matrix()

fn partial_euclidean(&self)      -> Array2<f64>
    // Σ_l layer_l.partial_euclidean_dist_matrix()

fn partial_jaccard(&self)        -> (Array2<u64>, Array2<u64>)
    // Σ_l layer_l.partial_jaccard_dist_matrix()  [bit matrix]
    // Σ_l layer_l.partial_threshold_jaccard_dist_matrix(threshold)  [int matrix; needs an extra threshold: u32 parameter]

fn partial_hamming(&self)        -> Array2<u64>
    // Σ_l layer_l.partial_hamming_dist_matrix()  [bit matrix]

Normalised partials (require global sums from above)

fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
    // Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)

fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
    // Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)

fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
    // Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)

global_sums is provided by the PartitionedDataStore; this level does not compute it.


PartitionedDataStore — aggregation across all partitions

A PartitionedDataStore holds one LayeredDataStore per partition:

struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
struct PartitionedBitMatrix         { partitions: Vec<LayeredBitMatrix> }

Column statistics

fn sum(&self) -> Array1<u64>
    // = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)

p.sum() is itself a reduction across layers (see above) — the cascade is preserved.

Self-contained metrics — single pass

fn bray_dist_matrix(&self) -> Array2<f64> {
    let (sum_min, col_sums) = self.partitions
        .par_iter()
        .map(|p| p.partial_bray())
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
}
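A runnable version of the single-pass scheme, under simplifying assumptions: every (partition, layer) matrix is a Vec of u32 rows, the hierarchy is flattened into one list of blocks, nested Vec replaces Array2, and the reduction is sequential rather than a rayon reduce. Mapping a zero denominator to distance 0.0 is a convention chosen for this sketch.

```rust
/// Partial Bray-Curtis components of one (partition, layer) matrix:
/// sum_min[i][j] = sum over slots of min(count_i, count_j), plus the column sums.
pub fn partial_bray(rows: &[Vec<u32>], n_cols: usize) -> (Vec<Vec<u64>>, Vec<u64>) {
    let mut sum_min = vec![vec![0u64; n_cols]; n_cols];
    let mut col_sums = vec![0u64; n_cols];
    for row in rows {
        for i in 0..n_cols {
            col_sums[i] += u64::from(row[i]);
            for j in 0..n_cols {
                sum_min[i][j] += u64::from(row[i].min(row[j]));
            }
        }
    }
    (sum_min, col_sums)
}

/// Reduces the partials of all blocks element-wise, then finalises once.
pub fn bray_dist_matrix(blocks: &[Vec<Vec<u32>>], n_cols: usize) -> Vec<Vec<f64>> {
    let mut sum_min = vec![vec![0u64; n_cols]; n_cols];
    let mut col_sums = vec![0u64; n_cols];
    for block in blocks {
        let (m, s) = partial_bray(block, n_cols);
        for i in 0..n_cols {
            col_sums[i] += s[i];
            for j in 0..n_cols {
                sum_min[i][j] += m[i][j];
            }
        }
    }
    // dist[i][j] = 1 - 2 * sum_min[i][j] / (col_sums[i] + col_sums[j])
    let mut dist = vec![vec![0.0f64; n_cols]; n_cols];
    for i in 0..n_cols {
        for j in 0..n_cols {
            let denom = col_sums[i] + col_sums[j];
            if denom > 0 {
                dist[i][j] = 1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64;
            }
        }
    }
    dist
}
```

For two blocks holding the slots [2, 1] and [2, 9], the reduced values are sum_min[0][1] = 1 + 2 = 3 and col_sums = [4, 10], giving a distance of 1 - 6/14 = 8/14.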

Normalised metrics — two passes

fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    // pass 1 — progressive: PartitionedDataStore::sum()
    //   calls LayeredDataStore::sum() per partition (parallel)
    //     calls PersistentCompactIntMatrix::sum() per layer (parallel)
    let global_sums = self.sum();

    // pass 2 — per-partition partial using global_sums (parallel)
    let matrix = self.partitions
        .par_iter()
        .map(|p| p.partial_relfreq_bray(&global_sums))
        .reduce(element-wise +);
    // finalise
    for (i,j): dist[i,j] = 1 - matrix[i,j]
}

global_sums is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.
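The two passes can be made concrete on the same flattened toy layout (a list of blocks, each a Vec of u32 rows). This is a sequential sketch, not the crate's code: the function name mirrors the pseudocode above, while the blocks parameter, the explicit n_cols, and the treatment of an empty column as relative frequency 0 are assumptions.

```rust
/// Relative-frequency Bray-Curtis over all (partition, layer) blocks, in two passes.
pub fn relfreq_bray_dist_matrix(blocks: &[Vec<Vec<u32>>], n_cols: usize) -> Vec<Vec<f64>> {
    // Pass 1: global column sums across every block.
    let mut global_sums = vec![0u64; n_cols];
    for block in blocks {
        for row in block {
            for (s, &c) in global_sums.iter_mut().zip(row) {
                *s += u64::from(c);
            }
        }
    }

    // Relative frequency of one count; an empty column yields 0.0 (assumption).
    let rel = |count: u32, total: u64| {
        if total == 0 {
            0.0
        } else {
            f64::from(count) / total as f64
        }
    };

    // Pass 2: accumulate min(a/sum_i, b/sum_j) per slot, using the global sums.
    let mut acc = vec![vec![0.0f64; n_cols]; n_cols];
    for block in blocks {
        for row in block {
            for i in 0..n_cols {
                for j in 0..n_cols {
                    acc[i][j] += rel(row[i], global_sums[i]).min(rel(row[j], global_sums[j]));
                }
            }
        }
    }

    // Finalise: dist[i][j] = 1 - acc[i][j].
    acc.iter()
        .map(|r| r.iter().map(|&m| 1.0 - m).collect())
        .collect()
}
```

With blocks holding the slots [2, 1] and [2, 9], pass 1 yields global sums [4, 10]. The relative frequencies are (0.5, 0.5) for sample 0 and (0.1, 0.9) for sample 1, so pass 2 accumulates 0.1 + 0.5 = 0.6 and the distance is 0.4.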


Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | LayeredDataStore | none (fully independent) |
| Across layers (self-contained) | (partition, layer) pair | none (disjoint kmer sets) |
| Across layers (normalised, pass 1) | (partition, layer) pair | none (sums are additive) |
| Across layers (normalised, pass 2) | (partition, layer) pair | global_sums broadcast read-only |
| Within a DataStore (distance matrix) | upper-triangle pair (i,j) | none (rayon par_iter) |
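The coordination-free rows of this model amount to an independent map followed by an additive reduce. The sketch below uses std::thread::scope from the standard library as a stand-in for rayon, with one thread per partition; parallel_column_sums and the Vec-based layout are hypothetical.

```rust
use std::thread;

/// Column sums of one partition (rows = slots, columns = samples).
pub fn column_sums(rows: &[Vec<u32>], n_cols: usize) -> Vec<u64> {
    let mut sums = vec![0u64; n_cols];
    for row in rows {
        for (s, &c) in sums.iter_mut().zip(row) {
            *s += u64::from(c);
        }
    }
    sums
}

/// Computes per-partition sums on separate threads, then adds them up.
/// Workers share no mutable state: each one only reads its own partition.
pub fn parallel_column_sums(partitions: &[Vec<Vec<u32>>], n_cols: usize) -> Vec<u64> {
    let partials: Vec<Vec<u64>> = thread::scope(|s| {
        // Spawn every worker first so they run concurrently.
        let handles: Vec<_> = partitions
            .iter()
            .map(|rows| s.spawn(move || column_sums(rows, n_cols)))
            .collect();
        // Then collect the results in partition order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    // Additive reduce: the order of the partials does not matter.
    let mut total = vec![0u64; n_cols];
    for p in &partials {
        for (t, v) in total.iter_mut().zip(p) {
            *t += v;
        }
    }
    total
}
```

A thread pool such as rayon's would avoid spawning one OS thread per partition; the scoped-thread version is used here only to keep the example free of external crates.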

Query model

Point query — kmer → Option<Item>

minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some:
        return DataStore_l.get(slot)
return None

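Worst case: O(n_layers) MPHF probes, reached for a kmer stored in the last layer or absent from the partition. Each probe is O(1), so the expected cost stays O(1) as long as the number of layers is small. No cross-layer fusion: the result comes from exactly one (partition, layer) pair.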
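The lookup loop can be sketched with toy types. Everything here is hypothetical: ToyLayer uses a HashMap in place of an MPHF with evidence, each slot stores a single u32 count, and kmer % n_partitions stands in for the canonical minimiser hash.

```rust
use std::collections::HashMap;

/// Toy layer: a kmer -> slot map plus one count per slot.
pub struct ToyLayer {
    slots: HashMap<u64, usize>,
    counts: Vec<u32>,
}

impl ToyLayer {
    /// Assigns slots in the order the (kmer, count) entries are given.
    pub fn new(entries: &[(u64, u32)]) -> Self {
        let slots: HashMap<u64, usize> = entries
            .iter()
            .enumerate()
            .map(|(slot, &(kmer, _))| (kmer, slot))
            .collect();
        let counts: Vec<u32> = entries.iter().map(|&(_, c)| c).collect();
        Self { slots, counts }
    }

    /// Slot of the kmer, or None if this layer does not hold it.
    pub fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }

    /// Data stored at a slot.
    pub fn get(&self, slot: usize) -> u32 {
        self.counts[slot]
    }
}

/// Toy index: partitions[p] is the ordered list of layers of partition p.
pub struct ToyIndex {
    partitions: Vec<Vec<ToyLayer>>,
}

impl ToyIndex {
    pub fn new(partitions: Vec<Vec<ToyLayer>>) -> Self {
        Self { partitions }
    }

    /// Routes the kmer to one partition, then probes its layers in order
    /// and stops at the first hit.
    pub fn query(&self, kmer: u64) -> Option<u32> {
        if self.partitions.is_empty() {
            return None;
        }
        let p = (kmer % self.partitions.len() as u64) as usize;
        self.partitions[p]
            .iter()
            .find_map(|layer| layer.find(kmer).map(|slot| layer.get(slot)))
    }
}
```

Since layers are disjoint, at most one layer can answer, so stopping at the first hit loses nothing.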

Aggregation — → Result

result = reduce(
    for p in partitions:            // parallel
        for l in layers(p):         // parallel
            partial(DataStore_p_l)
)

For normalised metrics, replace this single reduction with the two-pass scheme described above.


DataStore derivation

Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store   = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)

Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one (partition, layer) pair.
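The derivations listed here are all slot-wise maps, which is why they stay local to one (partition, layer) pair. A minimal sketch on in-memory rows follows; the function names are hypothetical, the rule "present when count >= threshold" is an assumed semantics for the threshold, and both inputs of a binary operation are assumed to have the same shape because they sit on the same MphfLayer.

```rust
/// Count matrix -> presence matrix. A cell is present when its count
/// reaches the threshold (assumed semantics: `count >= threshold`).
pub fn presence_from_counts(counts: &[Vec<u32>], threshold: u32) -> Vec<Vec<bool>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c >= threshold).collect())
        .collect()
}

/// Union of two presence matrices of the same shape.
pub fn union_presence(a: &[Vec<bool>], b: &[Vec<bool>]) -> Vec<Vec<bool>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(&x, &y)| x || y).collect())
        .collect()
}

/// Merge of two count matrices of the same shape with saturating addition,
/// so a cell caps at u32::MAX instead of wrapping around.
pub fn merge_counts(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| {
            ra.iter()
                .zip(rb)
                .map(|(&x, &y)| x.saturating_add(y))
                .collect()
        })
        .collect()
}
```

None of these functions needs a kmer or an MPHF: the slot index is the row position, so the output can be attached to the same MphfLayer as the input.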


Relationship to current implementation

The current obilayeredmap crate implements a subset of this architecture. Key divergences:

  • Layer<D: LayerData> fuses MphfLayer and one DataStore into a single generic type. Multiple data stores on the same MPHF are not supported.
  • LayerData::open(dir) embeds the path convention (counts/, presence/) inside the store type, preventing the PartitionedIndex from managing paths externally.
  • LayeredDataStore and PartitionedDataStore do not yet exist; LayeredMap is a single-partition structure without a distance matrix API.
  • The partial distance methods exist on PersistentCompactIntMatrix and PersistentBitMatrix and are tested; they are not yet composed across layers and partitions.

Planned refactoring:

  1. Extract MphfLayer from Layer<D> as an autonomous type.
  2. Replace LayerData trait with DataStore trait (no path knowledge).
  3. Implement LayeredCompactIntMatrix / LayeredBitMatrix with the partial + full distance APIs described above.
  4. Implement PartitionedCompactIntMatrix / PartitionedBitMatrix with two-pass support for normalised metrics.
  5. Implement PartitionedIndex for point queries with parallel dispatch.