Files
obikmer/docmd/architecture/index_architecture.md
T
Eric Coissac 1881e98bad feat(bitvec): add partial Jaccard, fix padding, optimize constructor
Introduces `partial_jaccard_dist` to return raw intersection and union counts, improving Jaccard distance flexibility. Corrects `not()` to explicitly zero padding bits in the final word, ensuring accurate bit-counting for partially-filled words. Adds an optimized `build_from_counts` constructor.
2026-05-15 15:04:10 +08:00

7.1 KiB
Raw Blame History

Kmer index architecture

Fundamental invariant

A given canonical kmer belongs to exactly one partition and exactly one layer within that partition. This is the property that makes all aggregation operations decomposable and parallelisable without coordination.


Three-level hierarchy

PartitionedIndex
├── LayeredPartition  (one per minimiser bucket)
│   ├── MphfLayer 0         kmer → slot  (immutable bijection)
│   │   ├── DataStore A     slot → T     (e.g. counts)
│   │   └── DataStore B     slot → T     (e.g. presence/absence, derived)
│   ├── MphfLayer 1
│   │   └── DataStore A
│   └── ...
├── LayeredPartition
│   └── ...

PartitionedIndex: routes queries to partitions via canonical minimiser hash. Owns the partition count and routing scheme (fixed at creation). Dispatches aggregations across partitions in parallel.

LayeredPartition: one directory per minimiser bucket. Holds a Vec<MphfLayer>. Each layer covers a disjoint kmer set — layer 0 is built from dataset A; layer 1 covers kmers in B absent from layer 0; and so on. Layers within a partition are always disjoint.

MphfLayer: the MPHF + evidence + unitig spine. Maps kmer → slot for its disjoint kmer set. Immutable once built. Independent of any data attached to it.

DataStore: a slot-indexed data array (e.g. PersistentCompactIntMatrix, PersistentBitMatrix). Attached to a MphfLayer externally. Multiple stores of different types can coexist on the same MphfLayer.


MphfLayer — autonomous mapping layer

MphfLayer::find(kmer: CanonicalKmer) -> Option<usize>   // slot, or None if absent
MphfLayer::n() -> usize                                  // number of slots
MphfLayer::build(dir: &Path) -> OLMResult<(Self, usize)> // from unitigs.bin
MphfLayer::open(dir: &Path) -> OLMResult<Self>

find returns Some(slot) only if the kmer is actually in this layer (evidence check included). Returns None for kmers present in other layers or absent from the index.

The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once and never rebuilt. All data derivation operations (count → presence, thresholding, merging) reuse the same MphfLayer.


DataStore — slot-indexed data

trait DataStore {
    type Item;
    fn get(&self, slot: usize) -> Self::Item;
    fn n(&self) -> usize;
}

Concrete types from obicompactvec:

Type Item Use
PersistentCompactIntMatrix Box<[u32]> count per sample per slot
PersistentBitMatrix Box<[bool]> presence per sample per slot

A DataStore knows nothing about kmers or MPHFs. It is indexed by usize slot only. The path to its on-disk files is managed by the LayeredPartition, not embedded in the store type.


Query model

Point query — kmer → Option<Item>

minimiser(kmer) → partition p
for each layer l in p:
    slot = MphfLayer_l.find(kmer)
    if slot is Some:
        return DataStore_l.get(slot)
return None

O(n_layers) MPHF probes in the worst case; O(1) expected (kmer in layer 0). No cross-layer data fusion — the result comes from exactly one layer.

Sequence scan — sequence → Vec<(kmer, Option<Item>)>

Decompose into canonical kmers, group by partition, dispatch to each partition in parallel. Within a partition, probe layers in order per kmer. Collect results.

Parallelism: across partitions (independent). Within a partition: per-kmer probing is sequential across layers but different kmers are independent.

Aggregation — → Accumulator

For operations that traverse all kmers (distance, presence matrix, global counts):

result = reduce(
    for p in partitions:             // parallel
        for l in layers(p):          // parallel
            partial(DataStore_p_l)
)

Each (partition, layer) contributes an independent Partial. Global result = reduce(all partials).


Aggregator pattern

trait Aggregator<D: DataStore> {
    type Partial: Send;
    type Result;
    fn partial(&self, store: &D) -> Self::Partial;
    fn reduce(&self, parts: impl Iterator<Item=Self::Partial>) -> Self::Result;
}

Concrete aggregators:

Aggregator Partial Result
BrayCurtis(i, j) (sum_min, sum_a, sum_b): (u64, u64, u64) f64
Jaccard(i, j) (inter, union): (u64, u64) f64
Hellinger(i, j) (sum_sqrt_prod, sum_a, sum_b): (f64, f64, f64) f64
DistanceMatrix(metric) n×n partial matrix n×n f64 matrix
PresenceQuery(kmer) routed to point query

The partial for BrayCurtis(i, j) on a PersistentCompactIntMatrix with columns i and j already exists as PersistentCompactIntVec::partial_bray_dist — it needs to be lifted to the column-pair level on the matrix.


Parallelism model

Level Unit Coordination
Across partitions LayeredPartition none — fully independent
Across layers (aggregation) (partition, layer) pair none — disjoint kmer sets
Within a layer (point query) n/a — single layer per kmer n/a
DataStore derivation one (partition, layer) per task none

The dispatch model: PartitionedIndex::aggregate(aggregator) fans out over partitions (rayon par_iter), each partition fans out over its layers, collects partials, then a top-level reduce combines.


DataStore derivation

Because the MphfLayer is independent of its data stores, new stores can be derived from existing ones without rebuilding the MPHF:

// count → presence/absence, parallel across (partition, layer)
for (p, l) in all_partition_layer_pairs().par_iter():
    count_store = open PersistentCompactIntMatrix at (p, l)
    presence_store = PersistentBitMatrix::from_count_matrix(count_store, threshold, dir)
    attach presence_store to MphfLayer(p, l)

Other derivations:

  • Threshold a count matrix → binary presence matrix
  • Union two presence matrices (same MPHF, different samples)
  • Merge two count matrices (saturating add, column-wise)

All derivations are local to a (partition, layer) pair and fully parallelisable.


Relationship to current implementation

The current obilayeredmap crate implements a subset of this architecture. Key divergences:

  • Layer<D: LayerData> fuses MphfLayer and one DataStore into a single generic type. Multiple data stores on the same MPHF are not supported.
  • LayerData::open(dir) embeds the path convention (counts/, presence/) inside the store type, preventing the PartitionedIndex from managing paths externally.
  • The Aggregator pattern is not yet implemented; partial distance methods exist on PersistentCompactIntVec but are not composed across layers and partitions.
  • No PartitionedIndex type exists; LayeredMap is a single-partition structure.

Planned refactoring:

  1. Extract MphfLayer from Layer<D> as an autonomous type.
  2. Replace LayerData trait with DataStore trait (no path knowledge).
  3. Implement LayeredPartition that holds Vec<MphfLayer> and attaches data stores externally.
  4. Implement PartitionedIndex with parallel dispatch and the Aggregator pattern.