docs: document k-mer index architecture and refactor distance metrics

Add comprehensive documentation for the `obilayeredmap` crate, `PersistentCompactIntVec`, `PersistentBitVec`, and the hierarchical k-mer index architecture, including sidebar navigation updates across all documentation pages. Refactor the Bray-Curtis distance computation in `obicompactvec` to decouple numerator and denominator calculations, replacing direct pairwise calls with explicit loops over precomputed sums. Update tests to verify column sum accuracy and align with the simplified API.
2026-05-15 21:07:23 +08:00
parent 8409c852ef
commit 45d49ed501
25 changed files with 8842 additions and 117 deletions
@@ -58,12 +58,230 @@ trait DataStore {

 Concrete types from `obicompactvec`:

-| Type | `Item` | Use |
-|---|---|---|
-| `PersistentCompactIntMatrix` | `Box<[u32]>` | count per sample per slot |
-| `PersistentBitMatrix` | `Box<[bool]>` | presence per sample per slot |
+| Type | `Item` | Column stats | Use |
+|---|---|---|---|
+| `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() -> Array1<u64>` | count per sample per slot |
+| `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() -> Array1<u64>` | presence per sample per slot |

-A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. The path to its on-disk files is managed by the `LayeredPartition`, not embedded in the store type.
+`sum()` and `count_ones()` are the bridge between the per-matrix level and cross-layer aggregation: they return the total weight of each column within one (partition, layer) pair, and these per-pair vectors can be added element-wise to obtain global column weights.
+
+A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only.
+
+---
+
+## Distance matrix API on DataStore types
+
+Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` expose two families of distance matrix methods.
+
+### Full distance matrices
+
+Compute the final `n_cols × n_cols` distance matrix from data within a single matrix. Internally parallelised over the upper triangle via rayon.
+
+```rust
+// PersistentCompactIntMatrix
+fn bray_dist_matrix(&self)              -> Array2<f64>
+fn relfreq_bray_dist_matrix(&self)      -> Array2<f64>
+fn euclidean_dist_matrix(&self)         -> Array2<f64>
+fn relfreq_euclidean_dist_matrix(&self) -> Array2<f64>
+fn hellinger_dist_matrix(&self)         -> Array2<f64>
+fn jaccard_dist_matrix(&self)           -> Array2<f64>
+fn threshold_jaccard_dist_matrix(&self, threshold: u32) -> Array2<f64>
+
+// PersistentBitMatrix
+fn jaccard_dist_matrix(&self)           -> Array2<f64>
+fn hamming_dist_matrix(&self)           -> Array2<u64>
+```
+
+These are convenience methods for a single matrix. They cannot be applied directly to a `LayeredDataStore` or `PartitionedDataStore`, because a finalised distance is not additive across matrices; the partial API is required instead.
+
+### Partial distance matrices
+
+Return additive components that can be summed element-wise across (partition, layer) pairs before computing the final distance. This is what makes cross-layer and cross-partition aggregation possible.
+
+**Category 1 — self-contained partials**: additive without any external parameter.
+
+```rust
+// PersistentCompactIntMatrix
+fn partial_bray_dist_matrix(&self)
+    -> (Array2<u64>,  // sum_min[i,j]
+        Array1<u64>)  // col_sums[k]
+
+fn partial_euclidean_dist_matrix(&self)       -> Array2<f64>   // sum of squared diffs
+fn partial_threshold_jaccard_dist_matrix(&self, threshold: u32)
+    -> (Array2<u64>,  // inter[i,j]
+        Array2<u64>)  // union[i,j]
+
+// PersistentBitMatrix
+fn partial_jaccard_dist_matrix(&self)
+    -> (Array2<u64>,  // inter[i,j]
+        Array2<u64>)  // union[i,j]
+fn partial_hamming_dist_matrix(&self)         -> Array2<u64>   // differing bits
+```
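A minimal sketch of why the Jaccard partials are additive, assuming plain `Vec`s in place of `Array2` and a hypothetical in-memory row layout for one layer. The names `Layer`, `partial_jaccard`, `add` and `jaccard_dist` are illustrative only and are not the crate's API; treating an empty union as distance 0 is likewise an assumption.

```rust
/// Hypothetical stand-in for one (partition, layer) presence matrix:
/// one row per slot, one bool per sample column.
type Layer = Vec<Vec<bool>>;

/// Returns (inter, union) counts for every column pair of one layer.
fn partial_jaccard(layer: &Layer, n_cols: usize) -> (Vec<Vec<u64>>, Vec<Vec<u64>>) {
    let mut inter = vec![vec![0u64; n_cols]; n_cols];
    let mut union = vec![vec![0u64; n_cols]; n_cols];
    for row in layer {
        for i in 0..n_cols {
            for j in 0..n_cols {
                inter[i][j] += (row[i] && row[j]) as u64;
                union[i][j] += (row[i] || row[j]) as u64;
            }
        }
    }
    (inter, union)
}

/// Element-wise sum of two matrices of the same shape.
fn add(a: &[Vec<u64>], b: &[Vec<u64>]) -> Vec<Vec<u64>> {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(x, y)| x + y).collect())
        .collect()
}

/// Finalise once, after all partials are summed: 1 - inter / union.
/// An empty union is mapped to distance 0 (assumption of this sketch).
fn jaccard_dist(inter: &[Vec<u64>], union: &[Vec<u64>]) -> Vec<Vec<f64>> {
    inter
        .iter()
        .zip(union)
        .map(|(ri, ru)| {
            ri.iter()
                .zip(ru)
                .map(|(&i, &u)| if u == 0 { 0.0 } else { 1.0 - i as f64 / u as f64 })
                .collect()
        })
        .collect()
}
```

Summing the `(inter, union)` pairs of two layers and finalising gives the same matrix as computing the partial on the concatenated rows, which is the property the layered and partitioned levels rely on.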
+
+**Category 2 — normalised partials**: require global column sums as input, computed beforehand across all (partition, layer) pairs.
+
+```rust
+// PersistentCompactIntMatrix only
+fn partial_relfreq_bray_dist_matrix(&self, col_sums: &Array1<u64>)
+    -> Array2<f64>   // Σ_slot min(a_slot/sum_i, b_slot/sum_j)
+
+fn partial_relfreq_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
+    -> Array2<f64>   // Σ_slot (a_slot/sum_i - b_slot/sum_j)²
+
+fn partial_hellinger_euclidean_dist_matrix(&self, col_sums: &Array1<u64>)
+    -> Array2<f64>   // Σ_slot (√(a/sum_i) - √(b/sum_j))²
+```
+
+The `col_sums` parameter must reflect the global count across all layers and all partitions. Passing a per-layer sum gives a wrong result, because each (partition, layer) pair would then normalise its slots by a different denominator. This constraint drives the two-pass algorithm described below.
+
+---
+
+## Progressive aggregation principle
+
+Aggregation is **hierarchical**: each level computes its contribution by aggregating only from the level immediately below it. No level bypasses its direct child to collect raw data from two levels down.
+
+```
+PersistentCompactIntMatrix::sum()       — column sums for one (partition, layer) matrix
+        ↓ Σ across layers
+LayeredCompactIntMatrix::sum()          — column sums for one partition
+        ↓ Σ across partitions
+PartitionedCompactIntMatrix::sum()      — global column sums
+```
+
+The same cascade applies to every partial computation:
+
+```
+PersistentCompactIntMatrix::partial_bray_dist_matrix()   — one (partition, layer)
+        ↓ element-wise Σ across layers
+LayeredCompactIntMatrix::partial_bray()                   — one partition
+        ↓ element-wise Σ across partitions
+PartitionedCompactIntMatrix::partial_bray()               — global partial → final dist
+```
+
+This means `LayeredCompactIntMatrix` never inspects individual `PersistentCompactIntVec` columns directly, and `PartitionedCompactIntMatrix` never inspects individual layers. Each level presents a stable API surface to the level above.
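The cascade can be sketched as follows, assuming hypothetical in-memory stand-ins (`Matrix`, `Layered`, `Partitioned`) for the on-disk types and a sequential loop where the design calls for `par_iter()`. The explicit `n_cols` parameter is also an assumption made to keep the sketch short.

```rust
// Hypothetical in-memory stand-ins for the on-disk types; names are illustrative.
struct Matrix { rows: Vec<Vec<u32>> }            // one (partition, layer): slot rows by sample columns
struct Layered { layers: Vec<Matrix> }           // one partition
struct Partitioned { partitions: Vec<Layered> }  // whole index

/// Element-wise `acc += other`.
fn add_into(acc: &mut [u64], other: &[u64]) {
    for (a, b) in acc.iter_mut().zip(other) {
        *a += *b;
    }
}

impl Matrix {
    /// Column sums of one matrix: the only level that reads raw counts.
    fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut s = vec![0u64; n_cols];
        for row in &self.rows {
            for (acc, &c) in s.iter_mut().zip(row) {
                *acc += u64::from(c);
            }
        }
        s
    }
}

impl Layered {
    /// Adds up the per-layer column sums; never reads rows itself.
    fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut s = vec![0u64; n_cols];
        for m in &self.layers {
            add_into(&mut s, &m.sum(n_cols));
        }
        s
    }
}

impl Partitioned {
    /// Adds up the per-partition column sums; never reads layers itself.
    fn sum(&self, n_cols: usize) -> Vec<u64> {
        let mut s = vec![0u64; n_cols];
        for p in &self.partitions {
            add_into(&mut s, &p.sum(n_cols));
        }
        s
    }
}
```

Each `sum` calls only the `sum` of its direct children, so replacing the storage of one level does not affect the levels above it.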
+
+---
+
+## LayeredDataStore — aggregation within one partition
+
+A `LayeredDataStore` holds one `DataStore` per layer within a single partition:
+
+```rust
+struct LayeredCompactIntMatrix { layers: Vec<PersistentCompactIntMatrix> }
+struct LayeredBitMatrix         { layers: Vec<PersistentBitMatrix> }
+```
+
+### Column statistics
+
+```rust
+// LayeredCompactIntMatrix
+fn sum(&self) -> Array1<u64>
+    // = layers.par_iter().map(|m| m.sum()).reduce(element-wise +)
+
+// LayeredBitMatrix
+fn count_ones(&self) -> Array1<u64>
+    // = layers.par_iter().map(|m| m.count_ones()).reduce(element-wise +)
+```
+
+### Self-contained partials
+
+Each method reduces across layers by element-wise addition of per-layer matrices:
+
+```rust
+fn partial_bray(&self)          -> (Array2<u64>, Array1<u64>)
+    // Σ_l layer_l.partial_bray_dist_matrix()
+
+fn partial_euclidean(&self)      -> Array2<f64>
+    // Σ_l layer_l.partial_euclidean_dist_matrix()
+
+fn partial_jaccard(&self)        -> (Array2<u64>, Array2<u64>)
+    // Σ_l layer_l.partial_jaccard_dist_matrix()  [bit matrix]
+    // Σ_l layer_l.partial_threshold_jaccard_dist_matrix()  [int matrix]
+
+fn partial_hamming(&self)        -> Array2<u64>
+    // Σ_l layer_l.partial_hamming_dist_matrix()  [bit matrix]
+```
+
+### Normalised partials (require global sums from above)
+
+```rust
+fn partial_relfreq_bray(&self, global_sums: &Array1<u64>) -> Array2<f64>
+    // Σ_l layer_l.partial_relfreq_bray_dist_matrix(global_sums)
+
+fn partial_relfreq_euclidean(&self, global_sums: &Array1<u64>) -> Array2<f64>
+    // Σ_l layer_l.partial_relfreq_euclidean_dist_matrix(global_sums)
+
+fn partial_hellinger(&self, global_sums: &Array1<u64>) -> Array2<f64>
+    // Σ_l layer_l.partial_hellinger_euclidean_dist_matrix(global_sums)
+```
+
+`global_sums` is provided by the `PartitionedDataStore`; this level does not compute it.
+
+---
+
+## PartitionedDataStore — aggregation across all partitions
+
+A `PartitionedDataStore` holds one `LayeredDataStore` per partition:
+
+```rust
+struct PartitionedCompactIntMatrix { partitions: Vec<LayeredCompactIntMatrix> }
+struct PartitionedBitMatrix         { partitions: Vec<LayeredBitMatrix> }
+```
+
+### Column statistics
+
+```rust
+fn sum(&self) -> Array1<u64>
+    // = partitions.par_iter().map(|p| p.sum()).reduce(element-wise +)
+```
+
+`p.sum()` is itself a reduction across layers (see above) — the cascade is preserved.
+
+### Self-contained metrics — single pass
+
+```rust
+fn bray_dist_matrix(&self) -> Array2<f64> {
+    let (sum_min, col_sums) = partitions
+        .par_iter()
+        .map(|p| p.partial_bray())
+        .reduce(element-wise +);
+    // finalise
+    for (i,j): dist[i,j] = 1 - 2·sum_min[i,j] / (col_sums[i] + col_sums[j])
+}
+```
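The pseudocode above can be made concrete with plain `Vec`s. This is a minimal sketch under stated assumptions: `Counts` is a hypothetical in-memory stand-in for one (partition, layer) count matrix, the reduction is sequential rather than parallel, and mapping a zero denominator to distance 0 is a choice of this sketch, not something the source specifies.

```rust
/// Hypothetical stand-in for one (partition, layer) count matrix:
/// one row per slot, one count per sample column.
type Counts = Vec<Vec<u32>>;

/// Additive Bray-Curtis components of one matrix: (sum_min[i][j], col_sums[k]).
fn partial_bray(m: &Counts, n: usize) -> (Vec<Vec<u64>>, Vec<u64>) {
    let mut sum_min = vec![vec![0u64; n]; n];
    let mut col_sums = vec![0u64; n];
    for row in m {
        for i in 0..n {
            col_sums[i] += u64::from(row[i]);
            for j in 0..n {
                sum_min[i][j] += u64::from(row[i].min(row[j]));
            }
        }
    }
    (sum_min, col_sums)
}

/// Sums the partials of every matrix, then finalises exactly once.
fn bray_dist_matrix(parts: &[Counts], n: usize) -> Vec<Vec<f64>> {
    let mut sum_min = vec![vec![0u64; n]; n];
    let mut col_sums = vec![0u64; n];
    for m in parts {
        let (sm, cs) = partial_bray(m, n);
        for i in 0..n {
            col_sums[i] += cs[i];
            for j in 0..n {
                sum_min[i][j] += sm[i][j];
            }
        }
    }
    let mut dist = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in 0..n {
            let denom = col_sums[i] + col_sums[j];
            // Two empty columns keep distance 0 (assumption of this sketch).
            if denom > 0 {
                dist[i][j] = 1.0 - 2.0 * sum_min[i][j] as f64 / denom as f64;
            }
        }
    }
    dist
}
```

Numerator and denominator are accumulated as integers and divided only at the end, so the result does not depend on how the slots are split across matrices.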
+
+### Normalised metrics — two passes
+
+```rust
+fn relfreq_bray_dist_matrix(&self) -> Array2<f64> {
+    // pass 1 — progressive: PartitionedDataStore::sum()
+    //   calls LayeredDataStore::sum() per partition (parallel)
+    //     calls PersistentCompactIntMatrix::sum() per layer (parallel)
+    let global_sums = self.sum();
+
+    // pass 2 — per-partition partial using global_sums (parallel)
+    let matrix = partitions
+        .par_iter()
+        .map(|p| p.partial_relfreq_bray(&global_sums))
+        .reduce(element-wise +);
+    // finalise
+    for (i,j): dist[i,j] = 1 - matrix[i,j]
+}
+```
+
+`global_sums` is exact because each kmer belongs to exactly one (partition, layer) pair — no double-counting. Pass 1 is itself fully parallel at every level of the hierarchy.
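A minimal sketch of the two-pass scheme for relative-frequency Bray-Curtis, assuming plain `Vec`s, a hypothetical `Counts` type for one (partition, layer) matrix, and sequential loops in place of `par_iter()`. Treating a column with zero total as contributing frequency 0 is an assumption of the sketch.

```rust
/// Hypothetical stand-in for one (partition, layer) count matrix:
/// one row per slot, one count per sample column.
type Counts = Vec<Vec<u32>>;

/// Column sums of one matrix.
fn col_sums(m: &Counts, n: usize) -> Vec<u64> {
    let mut s = vec![0u64; n];
    for row in m {
        for (acc, &c) in s.iter_mut().zip(row) {
            *acc += u64::from(c);
        }
    }
    s
}

/// Partial for one matrix: sum over slots of min(a/sum_i, b/sum_j).
fn partial_relfreq_bray(m: &Counts, sums: &[u64]) -> Vec<Vec<f64>> {
    let n = sums.len();
    // A column whose total is zero contributes frequency 0 (assumption).
    let freq = |c: u32, s: u64| -> f64 {
        if s == 0 { 0.0 } else { f64::from(c) / s as f64 }
    };
    let mut out = vec![vec![0.0f64; n]; n];
    for row in m {
        for i in 0..n {
            for j in 0..n {
                out[i][j] += freq(row[i], sums[i]).min(freq(row[j], sums[j]));
            }
        }
    }
    out
}

fn relfreq_bray_dist_matrix(parts: &[Counts], n: usize) -> Vec<Vec<f64>> {
    // Pass 1: global column sums across every matrix.
    let mut global = vec![0u64; n];
    for m in parts {
        for (g, s) in global.iter_mut().zip(col_sums(m, n)) {
            *g += s;
        }
    }
    // Pass 2: per-matrix partials, all normalised by the same global sums.
    let mut acc = vec![vec![0.0f64; n]; n];
    for m in parts {
        let p = partial_relfreq_bray(m, &global);
        for i in 0..n {
            for j in 0..n {
                acc[i][j] += p[i][j];
            }
        }
    }
    // Finalise: dist = 1 - accumulated minimum of relative frequencies.
    acc.iter().map(|r| r.iter().map(|x| 1.0 - x).collect()).collect()
}
```

With two one-slot matrices holding `[1, 1]` and `[3, 1]`, the global sums are `[4, 2]` and the distance between the two columns is 0.25. Normalising each matrix by its own sums instead accumulates 2.0 before finalisation, which would yield a negative distance; this is the failure the `col_sums` constraint rules out.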
+
+---
+
+## Parallelism model
+
+| Level | Unit | Coordination |
+|---|---|---|
+| Across partitions | `LayeredDataStore` | none — fully independent |
+| Across layers (self-contained) | `(partition, layer)` pair | none — disjoint kmer sets |
+| Across layers (normalised, pass 1) | `(partition, layer)` pair | none — sums are additive |
+| Across layers (normalised, pass 2) | `(partition, layer)` pair | global_sums broadcast read-only |
+| Within a DataStore (distance matrix) | upper-triangle pair `(i,j)` | none — rayon par_iter |

 ---

@@ -80,65 +298,19 @@ for each layer l in p:
 return None
 ```

-O(n_layers) MPHF probes in the worst case; O(1) expected (kmer in layer 0). No cross-layer data fusion — the result comes from exactly one layer.
+O(n_layers) MPHF probes in the worst case; O(1) expected. No cross-layer fusion: the result comes from exactly one (partition, layer) pair.

-### Sequence scan — `sequence → Vec<(kmer, Option<Item>)>`
-
-Decompose into canonical kmers, group by partition, dispatch to each partition in parallel. Within a partition, probe layers in order per kmer. Collect results.
-
-Parallelism: across partitions (independent). Within a partition: per-kmer probing is sequential across layers but different kmers are independent.
-
-### Aggregation — `→ Accumulator`
-
-For operations that traverse all kmers (distance, presence matrix, global counts):
+### Aggregation — `→ Result`

 ```
 result = reduce(
-    for p in partitions:             // parallel
-        for l in layers(p):          // parallel
+    for p in partitions:            // parallel
+        for l in layers(p):         // parallel
            partial(DataStore_p_l)
 )
 ```

-Each `(partition, layer)` contributes an independent `Partial`. Global result = `reduce(all partials)`.
-
---
-
-## Aggregator pattern
-
-```rust
-trait Aggregator<D: DataStore> {
-    type Partial: Send;
-    type Result;
-    fn partial(&self, store: &D) -> Self::Partial;
-    fn reduce(&self, parts: impl Iterator<Item=Self::Partial>) -> Self::Result;
-}
-```
-
-Concrete aggregators:
-
-| Aggregator | `Partial` | `Result` |
-|---|---|---|
-| `BrayCurtis(i, j)` | `(sum_min, sum_a, sum_b): (u64, u64, u64)` | `f64` |
-| `Jaccard(i, j)` | `(inter, union): (u64, u64)` | `f64` |
-| `Hellinger(i, j)` | `(sum_sqrt_prod, sum_a, sum_b): (f64, f64, f64)` | `f64` |
-| `DistanceMatrix(metric)` | `n×n partial matrix` | `n×n f64 matrix` |
-| `PresenceQuery(kmer)` | — | routed to point query |
-
-The `partial` for `BrayCurtis(i, j)` on a `PersistentCompactIntMatrix` with columns i and j already exists as `PersistentCompactIntVec::partial_bray_dist` — it needs to be lifted to the column-pair level on the matrix.
-
---
-
-## Parallelism model
-
-| Level | Unit | Coordination |
-|---|---|---|
-| Across partitions | `LayeredPartition` | none — fully independent |
-| Across layers (aggregation) | `(partition, layer)` pair | none — disjoint kmer sets |
-| Within a layer (point query) | n/a — single layer per kmer | n/a |
-| DataStore derivation | one `(partition, layer)` per task | none |
-
-The dispatch model: `PartitionedIndex::aggregate(aggregator)` fans out over partitions (rayon `par_iter`), each partition fans out over its layers, collects partials, then a top-level `reduce` combines.
+For normalised metrics, replace this single reduction with the two-pass scheme described above.

 ---

@@ -149,17 +321,11 @@ Because the `MphfLayer` is independent of its data stores, new stores can be der
 ```
 // count → presence/absence, parallel across (partition, layer)
 for (p, l) in all_partition_layer_pairs().par_iter():
-    count_store = open PersistentCompactIntMatrix at (p, l)
+    count_store   = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
-    attach presence_store to MphfLayer(p, l)
 ```

-Other derivations:
- Threshold a count matrix → binary presence matrix
- Union two presence matrices (same MPHF, different samples)
- Merge two count matrices (saturating add, column-wise)
-
-All derivations are local to a `(partition, layer)` pair and fully parallelisable.
+Other derivations: threshold a count matrix → binary presence matrix; union two presence matrices; merge two count matrices (saturating add, column-wise). All are local to one `(partition, layer)` pair.

 ---

@@ -169,11 +335,12 @@ The current `obilayeredmap` crate implements a subset of this architecture. Key

 - `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
 - `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- The `Aggregator` pattern is not yet implemented; partial distance methods exist on `PersistentCompactIntVec` but are not composed across layers and partitions.
- No `PartitionedIndex` type exists; `LayeredMap` is a single-partition structure.
+- `LayeredDataStore` and `PartitionedDataStore` do not yet exist; `LayeredMap` is a single-partition structure without a distance matrix API.
+- The partial distance methods exist on `PersistentCompactIntMatrix` and `PersistentBitMatrix` and are tested; they are not yet composed across layers and partitions.

 Planned refactoring:
 1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
 2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
-3. Implement `LayeredPartition` that holds `Vec<MphfLayer>` and attaches data stores externally.
-4. Implement `PartitionedIndex` with parallel dispatch and the `Aggregator` pattern.
+3. Implement `LayeredCompactIntMatrix` / `LayeredBitMatrix` with the partial + full distance APIs described above.
+4. Implement `PartitionedCompactIntMatrix` / `PartitionedBitMatrix` with two-pass support for normalised metrics.
+5. Implement `PartitionedIndex` for point queries with parallel dispatch.