Files
obikmer/docmd/architecture/index_architecture.md
T
Eric Coissac 1881e98bad feat(bitvec): add partial Jaccard, fix padding, optimize constructor
Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.
2026-05-15 15:04:10 +08:00

180 lines
7.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Kmer index architecture
## Fundamental invariant
A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
---
## Three-level hierarchy
```
PartitionedIndex
├── LayeredPartition (one per minimiser bucket)
│ ├── MphfLayer 0 kmer → slot (immutable bijection)
│ │ ├── DataStore A slot → T (e.g. counts)
│ │ └── DataStore B slot → T (e.g. presence/absence, derived)
│ ├── MphfLayer 1
│ │ └── DataStore A
│ └── ...
├── LayeredPartition
│ └── ...
```
**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.
---
## MphfLayer — autonomous mapping layer
```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
MphfLayer::n() -> usize // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```
`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). Returns `None` for kmers present in other layers or absent from the index.
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.
---
## DataStore — slot-indexed data
```rust
trait DataStore {
type Item;
fn get(&self, slot: usize) -> Self::Item;
fn n(&self) -> usize;
}
```
Concrete types from `obicompactvec`:
| Type | `Item` | Use |
|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | presence per sample per slot |
A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. The path to its on-disk files is managed by the `LayeredPartition`, not embedded in the store type.
---
## Query model
### Point query — `kmer → Option<Item>`
```
minimiser(kmer) → partition p
for each layer l in p:
slot = MphfLayer_l.find(kmer)
if slot is Some:
return DataStore_l.get(slot)
return None
```
O(n_layers) MPHF probes in the worst case; O(1) expected (kmer in layer 0). No cross-layer data fusion — the result comes from exactly one layer.
### Sequence scan — `sequence → Vec<(kmer, Option<Item>)>`
Decompose into canonical kmers, group by partition, dispatch to each partition in parallel. Within a partition, probe layers in order per kmer. Collect results.
Parallelism: across partitions (independent). Within a partition: per-kmer probing is sequential across layers but different kmers are independent.
### Aggregation — `→ Accumulator`
For operations that traverse all kmers (distance, presence matrix, global counts):
```
result = reduce(
for p in partitions: // parallel
for l in layers(p): // parallel
partial(DataStore_p_l)
)
```
Each `(partition, layer)` contributes an independent `Partial`. Global result = `reduce(all partials)`.
---
## Aggregator pattern
```rust
trait Aggregator<D: DataStore> {
type Partial: Send;
type Result;
fn partial(&self, store: &D) -> Self::Partial;
fn reduce(&self, parts: impl Iterator<Item=Self::Partial>) -> Self::Result;
}
```
Concrete aggregators:
| Aggregator | `Partial` | `Result` |
|---|---|---|
| `BrayCurtis(i, j)` | `(sum_min, sum_a, sum_b): (u64, u64, u64)` | `f64` |
| `Jaccard(i, j)` | `(inter, union): (u64, u64)` | `f64` |
| `Hellinger(i, j)` | `(sum_sqrt_prod, sum_a, sum_b): (f64, f64, f64)` | `f64` |
| `DistanceMatrix(metric)` | `n×n partial matrix` | `n×n f64 matrix` |
| `PresenceQuery(kmer)` | — | routed to point query |
The `partial` for `BrayCurtis(i, j)` on a `PersistentCompactIntMatrix` with columns i and j already exists as `PersistentCompactIntVec::partial_bray_dist` — it needs to be lifted to the column-pair level on the matrix.
---
## Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredPartition` | none — fully independent |
| Across layers (aggregation) | `(partition, layer)` pair | none — disjoint kmer sets |
| Within a layer (point query) | n/a — single layer per kmer | n/a |
| DataStore derivation | one `(partition, layer)` per task | none |
The dispatch model: `PartitionedIndex::aggregate(aggregator)` fans out over partitions (rayon `par_iter`), each partition fans out over its layers, collects partials, then a top-level `reduce` combines.
---
## DataStore derivation
Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
count_store = open PersistentCompactIntMatrix at (p, l)
presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
attach presence_store to MphfLayer(p, l)
```
Other derivations:
- Threshold a count matrix → binary presence matrix
- Union two presence matrices (same MPHF, different samples)
- Merge two count matrices (saturating add, column-wise)
All derivations are local to a `(partition, layer)` pair and fully parallelisable.
---
## Relationship to current implementation
The current `obilayeredmap` crate implements a subset of this architecture. Key divergences:
- `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- The `Aggregator` pattern is not yet implemented; partial distance methods exist on `PersistentCompactIntVec` but are not composed across layers and partitions.
- No `PartitionedIndex` type exists; `LayeredMap` is a single-partition structure.
Planned refactoring:
1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
3. Implement `LayeredPartition` that holds `Vec<MphfLayer>` and attaches data stores externally.
4. Implement `PartitionedIndex` with parallel dispatch and the `Aggregator` pattern.