obikmer/docmd/architecture/index_architecture.md

# Kmer index architecture

## Fundamental invariant

A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.

---

## Three-level hierarchy

```
PartitionedIndex
├── LayeredPartition  (one per minimiser bucket)
│   ├── MphfLayer 0         kmer → slot  (immutable bijection)
│   │   ├── DataStore A     slot → T     (e.g. counts)
│   │   └── DataStore B     slot → T     (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...
```

**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.

**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.

---

## MphfLayer — autonomous mapping layer

```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>   // slot, or None if absent
MphfLayer::n() -> usize                                  // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```

`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index.

The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.

---

## DataStore — slot-indexed data

```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}
```

Concrete types from `obicompactvec`:

| Type | `Item` | Column stats | Use |
|---|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |

`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they give the total weight of each column within one (partition, layer) pair, which can be summed to get global column weights.

A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.

---

## Distance matrix API on DataStore types

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.

### Full distance matrices

Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.

```rust
// PersistentCompactIntMatrix
fn bray_dist_matrix(&self)              -> Array2<f64>
fn relfreq_bray_dist_matrix(&self)      -> Array2<f64>
fn euclidean_dist_matrix(&self)         -> Array2<f64>
fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
fn hellinger_dist_matrix(&self)         -> Array2<f64>
fn jaccard_dist_matrix(&self)           -> Array2<f64>
fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>

// PersistentBitMatrix
fn jaccard_dist_matrix(&self)           -> Array2<f64>
fn hamming_dist_matrix(&self)           -> Array2<u64>
```

These are convenience methods. For a `LayeredDataStore` or `PartitionedDataStore` they cannot be used directly — the partial API is required.

### Partial distance matrices

Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.

**Category 1 — self-contained partials**: additive without any external parameter.

```rust
// PersistentCompactIntMatrix
fn partial_bray_dist_matrix(&self)
    -> (Array2<u64>,  // sum_min[i,j]
        Array1<u64>)  // col_sums[k]

fn partial_euclidean_dist_matrix(&self)       -> Array2<f64>   // sum of squared diffs
fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
    -> (Array2<u64>,  // inter[i,j]
        Array2<u64>)  // union[i,j]

// PersistentBitMatrix
fn partial_jaccard_dist_matrix(&self)
    -> (Array2<u64>,  // inter[i,j]
        Array2<u64>)  // union[i,j]
fn partial_hamming_dist_matrix(&self)         -> Array2<u64>   // differing bits
```

**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.

```rust
// PersistentCompactIntMatrix only
fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)

fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²

fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²
```

The `col_sums` parameter must reflect the GLOBAL count across all layers and all partitions — passing a per-layer sum would give a wrong result. This constraint drives the two-pass algorithm described below.

---

## Progressive aggregation principle

Aggregation is **hierarchical**: each level computes its contribution by aggregating from the level immediately below it. No level skips a level or collects raw data from two levels down.

```
PersistentCompactIntMatrix::col_weights()   — column sums for one (partition, layer) matrix
        ↓ Σ across layers
LayeredStore<PersistentCompactIntMatrix>::col_weights()   — column sums for one partition
        ↓ Σ across partitions
LayeredStore<LayeredStore<…>>::col_weights()              — global column sums
```

The same cascade applies to every partial:

```
PersistentCompactIntMatrix::partial_bray()   — one (partition, layer)
        ↓ element-wise Σ across layers
LayeredStore<PersistentCompactIntMatrix>::partial_bray()   — one partition
        ↓ element-wise Σ across partitions
LayeredStore<LayeredStore<…>>::partial_bray()              — global partial → final dist
```

Each level presents a stable trait surface to the level above; no level reaches two levels down.

---

## Traits — `obicompactvec::traits`

Three traits unify the aggregation API across all levels of the hierarchy.

```rust
trait ColumnWeights: Send + Sync {
    fn col_weights(&self) -> Array1<u64>;
}

trait CountPartials: ColumnWeights {
    // self-contained partials (additive, no parameter)
    fn partial_bray(&self)                                          -> Array2<u64>;
    fn partial_euclidean(&self)                                     -> Array2<f64>;
    fn partial_threshold_jaccard(&self, threshold: u32)             -> (Array2<u64>, Array2<u64>);
    // normalised partials (global col_weights passed in cascade)
    fn partial_relfreq_bray(&self, global: &Array1<u64>)            -> Array2<f64>;
    fn partial_relfreq_euclidean(&self, global: &Array1<u64>)       -> Array2<f64>;
    fn partial_hellinger(&self, global: &Array1<u64>)               -> Array2<f64>;
    // provided finalisation methods (default implementations)
    fn bray_dist_matrix(&self)                                      -> Array2<f64> { … }
    fn euclidean_dist_matrix(&self)                                 -> Array2<f64> { … }
    fn threshold_jaccard_dist_matrix(&self, threshold: u32)         -> Array2<f64> { … }
    fn relfreq_bray_dist_matrix(&self)                              -> Array2<f64> { … }
    fn relfreq_euclidean_dist_matrix(&self)                         -> Array2<f64> { … }
    fn hellinger_dist_matrix(&self)                                 -> Array2<f64> { … }
}

trait BitPartials: ColumnWeights {
    fn partial_jaccard(&self)   -> (Array2<u64>, Array2<u64>);
    fn partial_hamming(&self)   -> Array2<u64>;
    // provided
    fn jaccard_dist_matrix(&self) -> Array2<f64> { … }
    fn hamming_dist_matrix(&self) -> Array2<u64> { … }
}
```

**Leaf implementors** (in `obicompactvec`):

| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`), `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`), `BitPartials` |

`PersistentCompactIntVec` and `PersistentBitVec` do **not** implement these traits — they are single-column primitives, not matrix-level aggregators.

---

## `LayeredStore<S>` — `obilayeredmap`

A single generic wrapper replaces the need for named `LayeredDataStore` and `PartitionedDataStore` types:

```rust
pub struct LayeredStore<S>(Vec<S>);
```

Three blanket impls propagate the traits up the hierarchy:

```rust
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }  // Σ across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }  // same pattern
impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }  // same pattern
```

Because the blanket impl is recursive, **`LayeredStore<LayeredStore<S>>`** automatically inherits all three traits when `S` does — no separate `PartitionedStore` type is needed:

```
PersistentCompactIntMatrix            implements CountPartials
LayeredStore<PersistentCompactIntMatrix>    via blanket impl (= one partition)
LayeredStore<LayeredStore<…>>              via blanket impl (= partitioned index)
```

### Normalised metrics — two-pass cascade

The normalised finalisation methods call `col_weights()` first (pass 1), then the normalised partial (pass 2). Both calls go through the same blanket impl, so the cascade is automatic:

```rust
// called on LayeredStore<LayeredStore<PersistentCompactIntMatrix>>
fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
    let global = self.col_weights();            // pass 1 — progressive sum at every level
    let p = self.partial_relfreq_bray(&global); // pass 2 — global passed in cascade
    p.mapv(|v| 1.0 - v)                         // finalise (diagonal zeroed separately)
}
```

`global` is exact: each kmer belongs to exactly one `(partition, layer)` pair, so there is no double-counting across the hierarchy.

---

## Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredStore<LayeredStore<S>>` inner stores | none — fully independent |
| Across layers within a partition | `LayeredStore<S>` inner stores | none — disjoint kmer sets |
| Normalised pass 1 (`col_weights`) | per inner store | none — additive |
| Normalised pass 2 (partial) | per inner store | `global` broadcast read-only |
| Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` |

All levels use rayon `par_iter` internally; `reduce_with` performs a parallel tree reduction.

---

## Query model

### Point query — `kmer → Option<Item>`

```
minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some:
        return DataStore_l.get(slot)
return None
```

O(n_layers) MPHF probes worst case; O(1) expected. No cross-layer fusion — the result comes from exactly one (partition, layer).

### Aggregation — `→ Result`

```
result = reduce(
    for p in partitions:            // parallel
        for l in layers(p):         // parallel
            partial(DataStore_p_l)
)
```

For normalised metrics replace with the two-pass scheme above.

---

## DataStore derivation

Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store   = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
```

Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.

---

## Relationship to current implementation

### What is implemented

- **`obicompactvec::traits`**: `ColumnWeights`, `CountPartials`, `BitPartials` are defined and implemented on `PersistentCompactIntMatrix` and `PersistentBitMatrix`.
- **`obilayeredmap::LayeredStore<S>`**: generic wrapper with blanket impls for all three traits. `LayeredStore<LayeredStore<S>>` is the partitioned level — no separate type needed. Tests confirm that splitting data across layers and across partitions gives the same distance matrices as computing on flat combined data.

### What is not yet implemented

- `Layer<D: LayerData>` still fuses `MphfLayer` and one `DataStore`. Multiple data stores on the same MPHF are not supported.
- `LayeredMap` is a single-partition structure without distance matrix API; it does not yet use `LayeredStore`.
- No `PartitionedIndex` type for point queries with parallel partition dispatch.

### Planned refactoring

1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.

---

## Multi-genome column invariant

### The invariant

After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode).

### How it is maintained

The invariant is established and preserved by three coordinated operations:

**1. Existing layers — column append.**
When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns.

**2. New layers — absent columns prepended.**
If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation.

**3. First merge, Presence mode — `init_presence_matrix`.**
The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2.

### Why the invariant is required

The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics.

The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape.