Kmer index architecture
Fundamental invariant
A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.
Three-level hierarchy
PartitionedIndex
├── LayeredPartition (one per minimiser bucket)
│   ├── MphfLayer 0        kmer → slot (immutable bijection)
│   │   ├── DataStore A    slot → T (e.g. counts)
│   │   └── DataStore B    slot → T (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...
PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.
LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.
MphfLayer: the MPHF + evidence + unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.
DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.
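The routing step performed by PartitionedIndex can be sketched as follows. This is a minimal illustration, not the crate's code: it assumes kmers are plain ACGT strings, that the canonical minimiser is the lexicographically smallest canonical m-mer, and that FNV-1a stands in for whatever hash the index actually uses. All function names here are hypothetical.

```rust
/// Reverse complement of a DNA string over the alphabet ACGT.
fn reverse_complement(seq: &str) -> String {
    seq.chars()
        .rev()
        .map(|c| match c {
            'A' => 'T',
            'C' => 'G',
            'G' => 'C',
            'T' => 'A',
            other => other,
        })
        .collect()
}

/// Canonical form: the lexicographically smaller of a sequence and its reverse complement.
fn canonical(seq: &str) -> String {
    let rc = reverse_complement(seq);
    if rc.as_str() < seq { rc } else { seq.to_string() }
}

/// Assumed minimiser definition: smallest canonical m-mer inside the kmer.
/// Requires 1 <= m <= kmer.len().
fn canonical_minimiser(kmer: &str, m: usize) -> String {
    (0..=kmer.len() - m)
        .map(|i| canonical(&kmer[i..i + m]))
        .min()
        .expect("kmer must contain at least one m-mer")
}

/// FNV-1a, used only as a deterministic stand-in hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Partition index for a kmer; n_partitions is fixed when the index is created.
fn route(kmer: &str, m: usize, n_partitions: usize) -> usize {
    (fnv1a(canonical_minimiser(kmer, m).as_bytes()) % n_partitions as u64) as usize
}
```

Because the minimiser is computed on canonical m-mers, a kmer and its reverse complement are routed to the same partition, which is what lets each canonical kmer belong to exactly one partition.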
MphfLayer — autonomous mapping layer
MphfLayer::find(kmer: CanonicalKmer) -> Option<usize> // slot, or None if absent
MphfLayer::n() -> usize // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>
find returns Some(slot) only if the kmer is actually in this layer (evidence check included). Returns None for kmers present in other layers or absent from the index.
The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.
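The role of the evidence check in find can be illustrated with a toy layer. This sketch makes several assumptions: kmers are encoded as u64, a HashMap plus a modulo fallback stands in for the real MPHF (which would also return some in-range slot for foreign keys), and the evidence array stores the full kmer per slot. The actual on-disk format of evidence.bin is not described here, and ToyMphfLayer is a hypothetical name.

```rust
use std::collections::HashMap;

/// Toy stand-in for an MphfLayer over u64-encoded kmers.
struct ToyMphfLayer {
    table: HashMap<u64, usize>, // bijection kmer -> slot for member kmers
    evidence: Vec<u64>,         // evidence[slot] = the kmer that owns this slot
}

impl ToyMphfLayer {
    /// Build from a list of distinct kmers; slot = position in the list.
    fn build(kmers: &[u64]) -> Self {
        let table = kmers.iter().enumerate().map(|(slot, &k)| (k, slot)).collect();
        ToyMphfLayer { table, evidence: kmers.to_vec() }
    }

    /// Number of slots.
    fn n(&self) -> usize {
        self.evidence.len()
    }

    /// Raw hash evaluation: members get their unique slot, foreign keys
    /// get an arbitrary in-range slot, as a real MPHF would do.
    fn raw_slot(&self, kmer: u64) -> usize {
        match self.table.get(&kmer) {
            Some(&slot) => slot,
            None => (kmer as usize) % self.n(),
        }
    }

    /// Some(slot) only if the evidence confirms the kmer belongs to this layer.
    fn find(&self, kmer: u64) -> Option<usize> {
        if self.n() == 0 {
            return None;
        }
        let slot = self.raw_slot(kmer);
        (self.evidence[slot] == kmer).then_some(slot)
    }
}
```

Without the comparison against the evidence array, a kmer from another layer would silently be assigned a slot that belongs to a different kmer.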
DataStore — slot-indexed data
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}
Concrete types from obicompactvec:
| Type | Item | Use |
|---|---|---|
| PersistentCompactIntMatrix | Box<[u32]> | count per sample per slot |
| PersistentBitMatrix | Box<[bool]> | presence per sample per slot |
A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only. The path to its on-disk files is managed by the LayeredPartition, not embedded in the store type.
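To make the contract concrete, here is a minimal in-memory implementation of the trait. InMemoryCounts is a hypothetical type invented for illustration; it is not one of the obicompactvec types, and the row-major layout is an assumption.

```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

/// Hypothetical in-memory count matrix, stored row-major:
/// one row per slot, one column per sample.
struct InMemoryCounts {
    n_samples: usize,
    data: Vec<u32>,
}

impl DataStore for InMemoryCounts {
    type Item = Box<[u32]>;

    /// Counts of every sample for the given slot.
    fn get(&self, slot: usize) -> Box<[u32]> {
        let start = slot * self.n_samples;
        self.data[start..start + self.n_samples].into()
    }

    /// Number of slots (rows).
    fn n(&self) -> usize {
        if self.n_samples == 0 {
            0
        } else {
            self.data.len() / self.n_samples
        }
    }
}
```

Note that nothing in the type refers to kmers, hashing, or file paths; the only key is the slot index.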
Query model
Point query — kmer → Option<Item>
minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some:
        return DataStore_l.get(slot)
return None
Cost: O(n_layers) MPHF probes in the worst case, and a single probe when the kmer is in layer 0. No cross-layer data fusion is needed: by the fundamental invariant, the result comes from exactly one layer.
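The point query within one partition can be sketched as below. The Layer type is a hypothetical simplification: a HashMap replaces the MPHF plus evidence check, and a single u32 per slot replaces the attached DataStore.

```rust
use std::collections::HashMap;

/// Hypothetical layer: kmer -> slot lookup plus one attached count array.
struct Layer {
    slots: HashMap<u64, usize>,
    counts: Vec<u32>,
}

impl Layer {
    /// Slot i is assigned to kmers[i]; counts[i] is the data for that slot.
    fn new(kmers: &[u64], counts: Vec<u32>) -> Self {
        assert_eq!(kmers.len(), counts.len(), "one count per slot");
        let slots = kmers.iter().enumerate().map(|(i, &k)| (k, i)).collect();
        Layer { slots, counts }
    }

    /// Stand-in for MphfLayer::find.
    fn find(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }
}

/// Probe layers in order and return the data of the first (and only) layer
/// that contains the kmer. Layers are disjoint, so at most one can match.
fn point_query(layers: &[Layer], kmer: u64) -> Option<u32> {
    layers
        .iter()
        .find_map(|layer| layer.find(kmer).map(|slot| layer.counts[slot]))
}
```

find_map stops at the first hit, so a kmer stored in layer 0 costs exactly one probe.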
Sequence scan — sequence → Vec<(kmer, Option<Item>)>
Decompose into canonical kmers, group by partition, dispatch to each partition in parallel. Within a partition, probe layers in order per kmer. Collect results.
Parallelism: across partitions (independent). Within a partition: per-kmer probing is sequential across layers but different kmers are independent.
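The decomposition and grouping step of a sequence scan might look like the sketch below. It assumes string kmers over ACGT and takes the routing rule as a caller-supplied closure; group_by_partition and its return shape are hypothetical. Positions are kept so that results can be reassembled in sequence order after the per-partition lookups.

```rust
use std::collections::BTreeMap;

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(kmer: &str) -> String {
    let rc: String = kmer
        .chars()
        .rev()
        .map(|c| match c {
            'A' => 'T',
            'C' => 'G',
            'G' => 'C',
            'T' => 'A',
            other => other,
        })
        .collect();
    if rc.as_str() < kmer { rc } else { kmer.to_string() }
}

/// Split a sequence into canonical kmers and group them by partition.
/// Each entry keeps the kmer's start position in the sequence.
fn group_by_partition<F>(seq: &str, k: usize, route: F) -> BTreeMap<usize, Vec<(usize, String)>>
where
    F: Fn(&str) -> usize,
{
    let mut groups: BTreeMap<usize, Vec<(usize, String)>> = BTreeMap::new();
    if k == 0 || seq.len() < k {
        return groups; // no complete kmer in the sequence
    }
    for pos in 0..=seq.len() - k {
        let kmer = canonical(&seq[pos..pos + k]);
        groups.entry(route(&kmer)).or_default().push((pos, kmer));
    }
    groups
}
```

Each map entry can then be handed to its partition independently of the others.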
Aggregation — → Accumulator
For operations that traverse all kmers (distance, presence matrix, global counts):
result = reduce(
    for p in partitions:      // parallel
        for l in layers(p):   // parallel
            partial(DataStore_p_l)
)
Each (partition, layer) contributes an independent Partial. Global result = reduce(all partials).
Aggregator pattern
trait Aggregator<D: DataStore> {
    type Partial: Send;
    type Result;
    fn partial(&self, store: &D) -> Self::Partial;
    fn reduce(&self, parts: impl Iterator<Item=Self::Partial>) -> Self::Result;
}
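As an illustration of the pattern, a Jaccard aggregator over a toy presence store could be written as follows. Presence is a hypothetical in-memory store, the result is taken to be the Jaccard distance 1 - inter/union, and returning 0.0 for an empty union is an assumed convention rather than something the design specifies.

```rust
trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

trait Aggregator<D: DataStore> {
    type Partial: Send;
    type Result;
    fn partial(&self, store: &D) -> Self::Partial;
    fn reduce(&self, parts: impl Iterator<Item = Self::Partial>) -> Self::Result;
}

/// Hypothetical in-memory presence store: one row of booleans per slot.
struct Presence(Vec<Vec<bool>>);

impl DataStore for Presence {
    type Item = Box<[bool]>;
    fn get(&self, slot: usize) -> Box<[bool]> {
        self.0[slot].clone().into_boxed_slice()
    }
    fn n(&self) -> usize {
        self.0.len()
    }
}

/// Jaccard distance between sample columns i and j.
struct Jaccard {
    i: usize,
    j: usize,
}

impl Aggregator<Presence> for Jaccard {
    type Partial = (u64, u64); // (intersection size, union size)
    type Result = f64;

    /// Count intersection and union over the slots of one (partition, layer).
    fn partial(&self, store: &Presence) -> Self::Partial {
        let (mut inter, mut union) = (0u64, 0u64);
        for slot in 0..store.n() {
            let row = store.get(slot);
            inter += (row[self.i] && row[self.j]) as u64;
            union += (row[self.i] || row[self.j]) as u64;
        }
        (inter, union)
    }

    /// Sum the partial counts, then turn them into a distance.
    fn reduce(&self, parts: impl Iterator<Item = Self::Partial>) -> Self::Result {
        let (inter, union) = parts.fold((0u64, 0u64), |acc, p| (acc.0 + p.0, acc.1 + p.1));
        if union == 0 {
            0.0 // assumed convention for two empty samples
        } else {
            1.0 - inter as f64 / union as f64
        }
    }
}
```

The partials are plain sums, so they can be combined in any order. This only works because layers hold disjoint kmer sets: no kmer is counted twice.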
Concrete aggregators:
| Aggregator | Partial | Result |
|---|---|---|
| BrayCurtis(i, j) | (sum_min, sum_a, sum_b): (u64, u64, u64) | f64 |
| Jaccard(i, j) | (inter, union): (u64, u64) | f64 |
| Hellinger(i, j) | (sum_sqrt_prod, sum_a, sum_b): (f64, f64, f64) | f64 |
| DistanceMatrix(metric) | n×n partial matrix | n×n f64 matrix |
| PresenceQuery(kmer) | n/a | routed to point query |
The partial for BrayCurtis(i, j) on a PersistentCompactIntMatrix with columns i and j already exists as PersistentCompactIntVec::partial_bray_dist — it needs to be lifted to the column-pair level on the matrix.
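A column-pair version of the Bray-Curtis partial might look like the sketch below. It works on a plain slice of rows rather than on PersistentCompactIntMatrix, and the names partial_bray and bray_curtis are hypothetical; the signature of the existing partial_bray_dist is not shown here and may differ. The final value uses the usual definition 1 - 2·sum_min / (sum_a + sum_b), with 0.0 assumed when both samples are empty.

```rust
/// Partial sums for columns i and j over the rows (slots) of one layer:
/// (sum of per-slot minima, sum of column i, sum of column j).
fn partial_bray(rows: &[Vec<u32>], i: usize, j: usize) -> (u64, u64, u64) {
    rows.iter().fold((0, 0, 0), |(m, a, b), row| {
        let (x, y) = (row[i] as u64, row[j] as u64);
        (m + x.min(y), a + x, b + y)
    })
}

/// Combine partials from all (partition, layer) pairs into one dissimilarity.
fn bray_curtis(parts: impl Iterator<Item = (u64, u64, u64)>) -> f64 {
    let (m, a, b) = parts.fold((0u64, 0u64, 0u64), |acc, p| {
        (acc.0 + p.0, acc.1 + p.1, acc.2 + p.2)
    });
    if a + b == 0 {
        0.0 // assumed convention: two empty samples are identical
    } else {
        1.0 - 2.0 * m as f64 / (a + b) as f64
    }
}
```

Widening to u64 before summing keeps the partials from overflowing even when individual counts are close to u32::MAX.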
Parallelism model
| Level | Unit | Coordination |
|---|---|---|
| Across partitions | LayeredPartition | none (fully independent) |
| Across layers (aggregation) | (partition, layer) pair | none (disjoint kmer sets) |
| Within a layer (point query) | n/a (single layer per kmer) | n/a |
| DataStore derivation | one (partition, layer) per task | none |
The dispatch model: PartitionedIndex::aggregate(aggregator) fans out over partitions (rayon par_iter), each partition fans out over its layers, collects partials, then a top-level reduce combines.
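The fan-out and reduce shape of this dispatch can be sketched with the standard library alone. The example below uses std::thread::scope as a stand-in for rayon, spawns one thread per partition, and processes the layers of a partition sequentially, which is a simplification of the model described above. The nested Vec layout and the aggregate_total function are hypothetical; the aggregation is a plain global count sum.

```rust
use std::thread;

/// index[p][l] holds the counts of layer l in partition p
/// (one u32 per slot, single sample).
fn aggregate_total(index: &[Vec<Vec<u32>>]) -> u64 {
    thread::scope(|s| {
        // Fan out: one worker per partition, no shared mutable state.
        let handles: Vec<_> = index
            .iter()
            .map(|partition| {
                s.spawn(move || {
                    // Partial per layer, then reduce within the partition.
                    partition
                        .iter()
                        .map(|layer| layer.iter().map(|&c| c as u64).sum::<u64>())
                        .sum::<u64>()
                })
            })
            .collect();

        // Top-level reduce over the per-partition results.
        handles
            .into_iter()
            .map(|h| h.join().expect("partition worker panicked"))
            .sum::<u64>()
    })
}
```

No locks or channels are needed: each worker reads only its own partition and returns a value, which mirrors the "none" entries in the coordination column.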
DataStore derivation
Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:
// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
    attach presence_store to MphfLayer(p, l)
Other derivations:
- Threshold a count matrix → binary presence matrix
- Union two presence matrices (same MPHF, different samples)
- Merge two count matrices (saturating add, column-wise)
All derivations are local to a (partition, layer) pair and fully parallelisable.
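Two of these derivations can be illustrated on plain in-memory matrices (one row per slot, one column per sample). The function names are hypothetical, and treating a count as present when it is greater than or equal to the threshold is an assumption; the actual comparison used by from_count_matrix is not stated here.

```rust
/// Threshold a count matrix into a presence matrix.
/// Assumption: a kmer is present in a sample when count >= threshold.
fn presence_from_counts(counts: &[Vec<u32>], threshold: u32) -> Vec<Vec<bool>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| c >= threshold).collect())
        .collect()
}

/// Merge two count matrices built on the same MPHF (same slots, same columns)
/// with element-wise saturating addition.
fn merge_counts(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    assert_eq!(a.len(), b.len(), "both matrices must cover the same slots");
    a.iter()
        .zip(b)
        .map(|(ra, rb)| {
            assert_eq!(ra.len(), rb.len(), "both matrices must have the same columns");
            ra.iter().zip(rb).map(|(&x, &y)| x.saturating_add(y)).collect()
        })
        .collect()
}
```

Both functions read and write slot-indexed data only, so they never touch the MPHF and can run on every (partition, layer) pair independently.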
Relationship to current implementation
The current obilayeredmap crate implements a subset of this architecture. Key divergences:
- Layer<D: LayerData> fuses MphfLayer and one DataStore into a single generic type. Multiple data stores on the same MPHF are not supported.
- LayerData::open(dir) embeds the path convention (counts/, presence/) inside the store type, preventing the PartitionedIndex from managing paths externally.
- The Aggregator pattern is not yet implemented; partial distance methods exist on PersistentCompactIntVec but are not composed across layers and partitions.
- No PartitionedIndex type exists; LayeredMap is a single-partition structure.
Planned refactoring:
- Extract MphfLayer from Layer<D> as an autonomous type.
- Replace LayerData trait with DataStore trait (no path knowledge).
- Implement LayeredPartition that holds Vec<MphfLayer> and attaches data stores externally.
- Implement PartitionedIndex with parallel dispatch and the Aggregator pattern.