# Kmer index architecture

## Fundamental invariant

A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.

---

## Three-level hierarchy

```
PartitionedIndex
├── LayeredPartition          (one per minimiser bucket)
│   ├── MphfLayer 0           kmer → slot  (immutable bijection)
│   │   ├── DataStore A       slot → T     (e.g. counts)
│   │   └── DataStore B       slot → T     (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...
```

**PartitionedIndex**: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

**LayeredPartition**: one directory per minimiser bucket. Holds a `Vec<MphfLayer>`. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.

**MphfLayer**: the MPHF + evidence + unitig spine. Maps `kmer → slot` for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

**DataStore**: a slot-indexed data array (e.g. `PersistentCompactIntMatrix`, `PersistentBitMatrix`). Attached to a `MphfLayer` externally. Multiple stores of different types can coexist on the same `MphfLayer`.

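As a rough illustration of the ownership and routing described above, the three levels can be sketched with plain structs. Everything in this block is an assumption made for illustration: the field names, the `new`, `route` and `total_slots` methods, and the use of `hash % n_partitions` for routing are hypothetical, and the real types also carry on-disk state that is left out here.

```rust
/// Hypothetical stand-in for the MPHF + evidence of one layer.
struct MphfLayer {
    n_slots: usize,
}

/// One minimiser bucket: an ordered list of disjoint layers.
struct LayeredPartition {
    layers: Vec<MphfLayer>,
}

/// Top level: owns the partitions; their count is fixed at creation.
struct PartitionedIndex {
    partitions: Vec<LayeredPartition>,
}

impl PartitionedIndex {
    /// Create an index with a fixed number of empty partitions.
    fn new(n_partitions: usize) -> Self {
        assert!(n_partitions > 0, "at least one partition is required");
        let partitions = (0..n_partitions)
            .map(|_| LayeredPartition { layers: Vec::new() })
            .collect();
        PartitionedIndex { partitions }
    }

    /// Map a canonical minimiser hash to a partition id.
    /// Assumption: simple modulo routing; the real scheme may differ.
    fn route(&self, minimiser_hash: u64) -> usize {
        (minimiser_hash % self.partitions.len() as u64) as usize
    }

    /// Total number of slots across every (partition, layer) pair.
    fn total_slots(&self) -> usize {
        self.partitions
            .iter()
            .flat_map(|p| p.layers.iter())
            .map(|l| l.n_slots)
            .sum()
    }
}
```
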
---

## MphfLayer — autonomous mapping layer

```rust
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>    // slot, or None if absent
MphfLayer::n() -> usize                                  // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
```

`find` returns `Some(slot)` only if the kmer is actually in this layer (evidence check included). It returns `None` for kmers that are present only in other layers or absent from the index altogether. The evidence check matters because an MPHF on its own maps any input, including kmers outside its build set, to some slot.

The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same `MphfLayer`.

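The role of the evidence check in `find` can be illustrated with a toy layer. This is a minimal sketch under loud assumptions: `ToyLayer`, its `evidence` vector, and the `kmer % n` slot function are hypothetical stand-ins for a real MPHF and for the on-disk evidence format, which this document does not specify. The point is only that the raw hash always yields a slot, so the slot must be verified before it is returned.

```rust
/// Toy layer: `kmer % n` plays the role of the MPHF and `evidence[slot]`
/// stores the kmer that owns each slot. Both are illustrative assumptions.
struct ToyLayer {
    evidence: Vec<u64>,
}

impl ToyLayer {
    /// Build from a key set on which `kmer % n` happens to be a bijection.
    /// Returns `None` when two keys collide, i.e. the toy hash is not perfect.
    fn build(kmers: &[u64]) -> Option<Self> {
        let n = kmers.len();
        let mut evidence: Vec<Option<u64>> = vec![None; n];
        for &kmer in kmers {
            let slot = (kmer % n as u64) as usize;
            if evidence[slot].is_some() {
                return None; // collision
            }
            evidence[slot] = Some(kmer);
        }
        // Every slot is filled because n keys went into n distinct slots.
        Some(ToyLayer {
            evidence: evidence.into_iter().flatten().collect(),
        })
    }

    /// Number of slots.
    fn n(&self) -> usize {
        self.evidence.len()
    }

    /// Slot of `kmer`, or `None` if the kmer is not part of this layer.
    fn find(&self, kmer: u64) -> Option<usize> {
        if self.evidence.is_empty() {
            return None;
        }
        // Raw hash: always yields a slot, even for foreign kmers.
        let slot = (kmer % self.n() as u64) as usize;
        // Evidence check: accept the slot only if it really belongs to `kmer`.
        (self.evidence[slot] == kmer).then_some(slot)
    }
}
```
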
---

## DataStore — slot-indexed data

```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}
```

Concrete types from `obicompactvec`:

| Type | `Item` | Use |
|---|---|---|
| `PersistentCompactIntMatrix` | `Box<[u32]>` | count per sample per slot |
| `PersistentBitMatrix` | `Box<[bool]>` | presence per sample per slot |

A `DataStore` knows nothing about kmers or MPHFs. It is indexed by `usize` slot only. The path to its on-disk files is managed by the `LayeredPartition`, not embedded in the store type.

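To make the "slot in, row out" contract concrete, here is a minimal in-memory implementation of the trait. `InMemoryCounts`, its row-major layout, and its constructor are hypothetical and unrelated to how `obicompactvec` stores data; the sketch only shows that a store needs no knowledge of kmers, MPHFs, or paths.

```rust
/// The trait from the section above, repeated so the sketch compiles alone.
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical in-memory count store: one row of per-sample counts per slot,
/// stored row-major in a flat vector.
struct InMemoryCounts {
    n_samples: usize,
    data: Vec<u32>,
}

impl InMemoryCounts {
    /// `data.len()` must be a multiple of `n_samples`.
    fn new(n_samples: usize, data: Vec<u32>) -> Self {
        assert!(n_samples > 0 && data.len() % n_samples == 0);
        InMemoryCounts { n_samples, data }
    }
}

impl DataStore for InMemoryCounts {
    type Item = Box<[u32]>;

    /// Copy out the counts of every sample for `slot`. Panics if out of range.
    fn get(&self, slot: usize) -> Box<[u32]> {
        let start = slot * self.n_samples;
        self.data[start..start + self.n_samples].into()
    }

    /// Number of slots.
    fn n(&self) -> usize {
        self.data.len() / self.n_samples
    }
}
```
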
---

## Query model

### Point query — `kmer → Option<Item>`

```
minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some:
        return DataStore_l.get(slot)
return None
```

A point query costs O(n_layers) MPHF probes in the worst case, and a single probe when the kmer sits in layer 0, which is the expected common case. There is no cross-layer data fusion: the result comes from exactly one layer.

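The probing loop above can be written compactly with `find_map`. This is a sketch under stated assumptions: `ToyLayer` uses a `HashMap` in place of the MPHF and evidence, a `Vec<u32>` in place of an attached store, and routing to the partition is assumed to have happened already. None of these names come from the crate.

```rust
use std::collections::HashMap;

/// Toy layer: a `HashMap` stands in for MPHF + evidence, and a `Vec<u32>`
/// stands in for the attached count store. Both are illustrative assumptions.
struct ToyLayer {
    slots: HashMap<u64, usize>,
    counts: Vec<u32>,
}

impl ToyLayer {
    /// Build a layer from `(kmer, count)` pairs; slots follow input order.
    fn from_pairs(pairs: &[(u64, u32)]) -> Self {
        let slots = pairs.iter().enumerate().map(|(i, &(k, _))| (k, i)).collect();
        let counts = pairs.iter().map(|&(_, c)| c).collect();
        ToyLayer { slots, counts }
    }

    /// Slot of `kmer` in this layer, or `None` if it is not part of it.
    fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

/// Probe the layers of one partition in order and return the data of the
/// first (and, by the invariant, only) layer that contains `kmer`.
fn point_query(layers: &[ToyLayer], kmer: u64) -> Option<u32> {
    layers
        .iter()
        .find_map(|layer| layer.find(kmer).map(|slot| layer.counts[slot]))
}
```

With two layers, a kmer stored in the second layer costs two probes, and a kmer absent from the partition costs one probe per layer before `None` is returned.
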
### Sequence scan — `sequence → Vec<(kmer, Option<Item>)>`

Decompose the sequence into canonical kmers, group them by partition, and dispatch each group to its partition in parallel. Within a partition, probe the layers in order for each kmer, then collect the results.

Parallelism comes from two places. Partitions are independent of one another, and within a partition the probe for a single kmer walks the layers sequentially while distinct kmers can be probed independently.

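The decomposition and grouping steps might look as follows. This is only a sketch: the 2-bit packing (A=0, C=1, G=2, T=3), the helper names `encode`, `canonical_kmers` and `group_by_partition`, and above all the use of `kmer % n_partitions` in place of the canonical-minimiser routing are assumptions made to keep the example short.

```rust
use std::collections::BTreeMap;

/// 2-bit code of a nucleotide, or `None` for anything other than A, C, G, T.
fn encode(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Canonical kmers of `seq`: the smaller of the forward kmer and its reverse
/// complement, as 2-bit packed integers. Windows containing a non-ACGT base
/// are skipped. Requires 1 <= k <= 32.
fn canonical_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    assert!((1..=32).contains(&k));
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let (mut fwd, mut rev, mut valid) = (0u64, 0u64, 0usize);
    let mut out = Vec::new();
    for &base in seq {
        match encode(base) {
            Some(code) => {
                fwd = ((fwd << 2) | code) & mask;
                // The complement is 3 - code; it enters the reverse strand at the top.
                rev = (rev >> 2) | ((3 - code) << (2 * (k - 1)));
                valid += 1;
            }
            None => valid = 0, // restart after an ambiguous base
        }
        if valid >= k {
            out.push(fwd.min(rev));
        }
    }
    out
}

/// Group kmers by partition. Assumption: `kmer % n_partitions` stands in for
/// the canonical-minimiser routing of the real index.
fn group_by_partition(kmers: &[u64], n_partitions: u64) -> BTreeMap<u64, Vec<u64>> {
    assert!(n_partitions > 0);
    let mut groups: BTreeMap<u64, Vec<u64>> = BTreeMap::new();
    for &kmer in kmers {
        groups.entry(kmer % n_partitions).or_default().push(kmer);
    }
    groups
}
```

Each group can then be handed to its partition independently, which is where the cross-partition parallelism comes from.
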
### Aggregation — `→ Accumulator`

For operations that traverse all kmers (distance, presence matrix, global counts):

```
result = reduce(
    for p in partitions:           // parallel
        for l in layers(p):        // parallel
            partial(DataStore_p_l)
)
```

Each `(partition, layer)` contributes an independent `Partial`. Global result = `reduce(all partials)`.

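A small sequential sketch shows why the decomposition is sound: because every kmer lives in exactly one `(partition, layer)` pair, summing per-store partials counts each kmer once. The nested `Vec` layout (partitions, then layers, then per-slot counts) and the function names are assumptions for illustration only.

```rust
/// Partial for one `(partition, layer)` store: number of slots with a
/// non-zero count. `&[u32]` is a stand-in for a real count store.
fn partial(store: &[u32]) -> u64 {
    store.iter().filter(|&&c| c > 0).count() as u64
}

/// Global result: the partials are simply summed, because every kmer lives
/// in exactly one `(partition, layer)` pair and is therefore counted once.
fn count_present(index: &[Vec<Vec<u32>>]) -> u64 {
    index
        .iter() // partitions
        .flat_map(|layers| layers.iter()) // layers of each partition
        .map(|store| partial(store))
        .sum()
}
```
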
---

## Aggregator pattern

```rust
trait Aggregator<D: DataStore> {
    type Partial: Send;
    type Result;
    fn partial(&self, store: &D) -> Self::Partial;
    fn reduce(&self, parts: impl Iterator<Item = Self::Partial>) -> Self::Result;
}
```

Concrete aggregators:

| Aggregator | `Partial` | `Result` |
|---|---|---|
| `BrayCurtis(i, j)` | `(sum_min, sum_a, sum_b): (u64, u64, u64)` | `f64` |
| `Jaccard(i, j)` | `(inter, union): (u64, u64)` | `f64` |
| `Hellinger(i, j)` | `(sum_sqrt_prod, sum_a, sum_b): (f64, f64, f64)` | `f64` |
| `DistanceMatrix(metric)` | `n×n partial matrix` | `n×n f64 matrix` |
| `PresenceQuery(kmer)` | — | routed to point query |

The `partial` for `BrayCurtis(i, j)` already exists at the vector level as `PersistentCompactIntVec::partial_bray_dist`; it still needs to be lifted to operate on a column pair (i, j) of a `PersistentCompactIntMatrix`.

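As an illustration of the pattern, here is one way `BrayCurtis(i, j)` could implement the trait on a toy store. The `Counts` store is hypothetical, the sketch does not call `partial_bray_dist`, and the final formula `1 - 2·sum_min / (sum_a + sum_b)` is the standard Bray-Curtis dissimilarity, which is assumed to be what the `f64` result represents. Treating two empty samples as distance 0 is also an assumption.

```rust
/// Traits from this document, repeated so the sketch compiles on its own.
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

trait Aggregator<D: DataStore> {
    type Partial: Send;
    type Result;
    fn partial(&self, store: &D) -> Self::Partial;
    fn reduce(&self, parts: impl Iterator<Item = Self::Partial>) -> Self::Result;
}

/// Hypothetical in-memory count store: `rows[slot][sample]`.
struct Counts {
    rows: Vec<Vec<u32>>,
}

impl DataStore for Counts {
    type Item = Box<[u32]>;
    fn get(&self, slot: usize) -> Box<[u32]> {
        self.rows[slot].clone().into_boxed_slice()
    }
    fn n(&self) -> usize {
        self.rows.len()
    }
}

/// Bray-Curtis dissimilarity between samples (columns) `i` and `j`.
struct BrayCurtis {
    i: usize,
    j: usize,
}

impl Aggregator<Counts> for BrayCurtis {
    type Partial = (u64, u64, u64); // (sum_min, sum_a, sum_b)
    type Result = f64;

    /// Accumulate the three sums over every slot of one store.
    fn partial(&self, store: &Counts) -> Self::Partial {
        (0..store.n()).fold((0u64, 0u64, 0u64), |(m, a, b), slot| {
            let row = store.get(slot);
            let (x, y) = (u64::from(row[self.i]), u64::from(row[self.j]));
            (m + x.min(y), a + x, b + y)
        })
    }

    /// Add the partials component-wise, then apply the formula once.
    fn reduce(&self, parts: impl Iterator<Item = Self::Partial>) -> f64 {
        let (m, a, b) = parts.fold((0u64, 0u64, 0u64), |(m, a, b), (pm, pa, pb)| {
            (m + pm, a + pa, b + pb)
        });
        if a + b == 0 {
            return 0.0; // assumption: two empty samples are treated as identical
        }
        1.0 - 2.0 * m as f64 / (a + b) as f64
    }
}
```
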
---

## Parallelism model

| Level | Unit | Coordination |
|---|---|---|
| Across partitions | `LayeredPartition` | none — fully independent |
| Across layers (aggregation) | `(partition, layer)` pair | none — disjoint kmer sets |
| Within a layer (point query) | n/a — single layer per kmer | n/a |
| DataStore derivation | one `(partition, layer)` per task | none |

Dispatch model: `PartitionedIndex::aggregate(aggregator)` fans out over partitions (rayon `par_iter`), each partition fans out over its layers and collects their partials, and a top-level `reduce` combines them into the final result.

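The fan-out can be sketched without external crates. This version deliberately swaps rayon for `std::thread::scope` so that it stays dependency-free, spawns one thread per partition, and folds the layers of a partition sequentially inside that thread rather than fanning out a second time. The `aggregate` free function, the nested `Vec` layout, and the choice of "sum of counts" as the partial are all assumptions.

```rust
use std::thread;

/// Fan out over partitions with scoped threads, compute one partial per
/// `(partition, layer)` store, then reduce. `Vec<u32>` stands in for a store
/// and the partial is the sum of its counts; both are assumptions.
fn aggregate(partitions: &[Vec<Vec<u32>>]) -> u64 {
    let partials: Vec<u64> = thread::scope(|scope| {
        // Spawn every worker first so that they all run concurrently.
        let handles: Vec<_> = partitions
            .iter()
            .map(|layers| {
                // One thread per partition; its layers are folded inside it.
                scope.spawn(move || {
                    layers
                        .iter()
                        .map(|store| store.iter().map(|&c| u64::from(c)).sum::<u64>())
                        .sum::<u64>()
                })
            })
            .collect();
        // Then collect one partial per partition.
        handles
            .into_iter()
            .map(|h| h.join().expect("partition worker panicked"))
            .collect()
    });
    // Top-level reduce over the per-partition partials.
    partials.into_iter().sum()
}
```
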
---

## DataStore derivation

Because the `MphfLayer` is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

```
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store    = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
    attach presence_store to MphfLayer(p, l)
```

Other derivations:

- Threshold a count matrix → binary presence matrix
- Union two presence matrices (same MPHF, different samples)
- Merge two count matrices (saturating add, column-wise)

All derivations are local to a `(partition, layer)` pair and fully parallelisable.

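Two of these derivations can be sketched on plain in-memory matrices, with rows as slots and columns as samples. The function names are hypothetical, "present" is assumed to mean `count >= threshold`, and the merge assumes both matrices hold the same samples in the same column order. The persistent types may define these details differently.

```rust
/// Threshold a count matrix into a presence matrix: a cell is present when
/// its count is at least `threshold`. Rows are slots, columns are samples.
fn presence_from_counts(counts: &[Vec<u32>], threshold: u32) -> Vec<Vec<bool>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c >= threshold).collect())
        .collect()
}

/// Merge two count matrices built on the same layer: cell-wise saturating add.
fn merge_counts(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    assert_eq!(a.len(), b.len(), "both stores must cover the same slots");
    a.iter()
        .zip(b)
        .map(|(ra, rb)| {
            assert_eq!(ra.len(), rb.len(), "both stores must cover the same samples");
            ra.iter().zip(rb).map(|(&x, &y)| x.saturating_add(y)).collect()
        })
        .collect()
}
```
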
---

## Relationship to current implementation

The current `obilayeredmap` crate implements a subset of this architecture. Key divergences:

- `Layer<D: LayerData>` fuses `MphfLayer` and one `DataStore` into a single generic type. Multiple data stores on the same MPHF are not supported.
- `LayerData::open(dir)` embeds the path convention (`counts/`, `presence/`) inside the store type, preventing the `PartitionedIndex` from managing paths externally.
- The `Aggregator` pattern is not yet implemented; partial distance methods exist on `PersistentCompactIntVec` but are not composed across layers and partitions.
- No `PartitionedIndex` type exists; `LayeredMap` is a single-partition structure.

Planned refactoring:

1. Extract `MphfLayer` from `Layer<D>` as an autonomous type.
2. Replace `LayerData` trait with `DataStore` trait (no path knowledge).
3. Implement `LayeredPartition` that holds `Vec<MphfLayer>` and attaches data stores externally.
4. Implement `PartitionedIndex` with parallel dispatch and the `Aggregator` pattern.