# Kmer index architecture ## Fundamental invariant A given canonical kmer belongs to **exactly one partition** and **exactly one layer** within that partition. This property makes all aggregation operations decomposable and parallelisable without coordination. --- ## Three-level hierarchy ``` KmerIndex (index.meta + KmerPartition) ├── partition_0/index/ one directory per minimiser bucket │ ├── meta.json PartitionMeta { n_layers } │ ├── layer_0/ │ │ ├── layer_meta.json LayerMeta { evidence: EvidenceKind } │ │ ├── mphf.bin PtrHash MPHF │ │ ├── unitigs.bin unitig spine (never overwritten) │ │ ├── evidence.bin exact evidence (Exact only) │ │ ├── unitigs.bin.idx block index (Exact only) │ │ ├── fingerprint.bin fingerprints (Approx only) │ │ ├── counts/ PersistentCompactIntMatrix (with_counts = true) │ │ └── presence/ PersistentBitMatrix │ └── layer_1/ │ └── ... └── partition_1/index/ └── ... ``` **KmerIndex**: root entry point. Owns `IndexMeta` (written to `index.meta`) and a `KmerPartition` that routes canonical kmers to partition directories. All partition-level operations are dispatched in parallel via rayon. **Partition directory**: one directory per minimiser bucket. `PartitionMeta` (stored as `meta.json`) records `n_layers`. Layers within a partition cover disjoint kmer sets. **Layer directory**: one `MphfLayer` plus optional data stores. `LayerMeta` (stored as `layer_meta.json`) records which `EvidenceKind` was used. The MPHF and `unitigs.bin` are immutable once built; evidence files are the only part replaced by `reindex`. --- ## IndexConfig and IndexMeta ```rust pub struct IndexConfig { pub kmer_size: usize, pub minimizer_size: usize, pub n_bits: usize, // log2(n_partitions) pub with_counts: bool, pub evidence: EvidenceKind, pub block_bits: u8, // .idx granularity: 2^block_bits unitigs/block; 0 = one entry per unitig } pub struct IndexMeta { pub version: u32, pub config: IndexConfig, pub genomes: Vec, // ordered; index = genome column number } pub struct GenomeInfo { pub label: String, pub meta: HashMap, // arbitrary categorical metadata } ``` `IndexMeta` is serialised as `index.meta` (JSON). It is the authority for the ordered list of genomes and for the parameters that govern all subsequent operations on the index. --- ## EvidenceKind ```rust pub enum EvidenceKind { Exact, Approx { b: u8, z: u8 }, } ``` Controls which files are written per layer and which query path is taken: | Variant | Files written | False-positive rate | |---|---|---| | `Exact` | `evidence.bin`, `unitigs.bin.idx` | 0 | | `Approx { b, z }` | `fingerprint.bin` | ≈ W / 2^(b·z) per read (Findere) | `EvidenceKind` is stored both in `IndexConfig` (index-wide default, updated by `reindex`) and in each `LayerMeta` (per-layer record of what was actually built). --- ## MphfLayer — autonomous kmer → slot mapping ```rust pub struct MphfLayer { mphf: PtrHash<…>, ev: LayerEvidence, // Exact { evidence, unitigs } | Approx { fingerprint } n: usize, } ``` `MphfLayer::find(kmer)` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time (read from `layer_meta.json`). Returns `Some(slot)` only if the kmer is confirmed present; `None` for absent or out-of-range. ``` find_exact: slot = mphf(kmer); decode evidence → (chunk_id, rank); verify kmer in unitigs find_approx: slot = mphf(kmer); check fingerprint[slot] == seq_hash(kmer) ``` `block_bits` controls the `.idx` file written alongside `evidence.bin`. At `block_bits = 0`, every unitig chunk has an index entry, giving O(1) random access; larger values trade access time for a smaller `.idx`. The MPHF and `unitigs.bin` are never rebuilt by any post-build operation. --- ## Layer\ — MPHF + data payload ```rust pub struct Layer { mphf: MphfLayer, data: D, } ``` `D` selects the attached data payload: | `D` | Data directory | `Item` returned by `query` | |---|---|---| | `()` | — | `()` (set membership only) | | `PersistentCompactIntMatrix` | `counts/` | `Box<[u32]>` (counts per genome) | | `PersistentBitMatrix` | `presence/` | `Box<[bool]>` (presence per genome) | `Layer::query(kmer)` delegates to `MphfLayer::find`, then calls `data.read(slot)` if a slot is returned. Both exact and approximate evidence are handled transparently; the caller sees only `Option>`. Build-time entry points: ```rust Layer<()>::build(out_dir, block_bits) // set membership Layer::build(out_dir, block_bits, count_of) Layer::build_presence(out_dir, block_bits, n_genomes, present_in) Layer::<()>::build_evidence(layer_dir, kind, block_bits) // evidence only (reindex path) ``` --- ## DataStore — slot-indexed data `PersistentCompactIntMatrix` and `PersistentBitMatrix` are slot-indexed stores. They know nothing about kmers or MPHFs. | Type | `Item` | Aggregation method | Use | |---|---|---|---| | `PersistentCompactIntMatrix` | `Box<[u32]>` | `sum() → Array1` | counts per genome per slot | | `PersistentBitMatrix` | `Box<[bool]>` | `count_ones() → Array1` | presence per genome per slot | --- ## Aggregation traits — `obicompactvec::traits` Three traits unify the aggregation API across all hierarchy levels. ```rust trait ColumnWeights: Send + Sync { fn col_weights(&self) -> Array1; } trait CountPartials: ColumnWeights { fn partial_bray(&self) -> Array2; fn partial_euclidean(&self) -> Array2; fn partial_threshold_jaccard(&self, threshold: u32) -> (Array2, Array2); fn partial_relfreq_bray(&self, global: &Array1) -> Array2; fn partial_relfreq_euclidean(&self, global: &Array1) -> Array2; fn partial_hellinger(&self, global: &Array1) -> Array2; // provided finalisation methods with default impls fn bray_dist_matrix(&self) -> Array2 { … } fn relfreq_bray_dist_matrix(&self) -> Array2 { … } // … } trait BitPartials: ColumnWeights { fn partial_jaccard(&self) -> (Array2, Array2); fn partial_hamming(&self) -> Array2; // provided fn jaccard_dist_matrix(&self) -> Array2 { … } fn hamming_dist_matrix(&self) -> Array2 { … } } ``` Leaf implementors: | Type | Traits | |---|---| | `PersistentCompactIntMatrix` | `ColumnWeights`, `CountPartials` | | `PersistentBitMatrix` | `ColumnWeights`, `BitPartials` | --- ## LayeredStore\ — recursive aggregation wrapper ```rust pub struct LayeredStore(Vec); ``` Three blanket impls propagate all traits up the hierarchy: ```rust impl ColumnWeights for LayeredStore { … } impl CountPartials for LayeredStore { … } impl BitPartials for LayeredStore { … } ``` This makes `LayeredStore>` automatically implement `CountPartials` — no separate `PartitionedStore` type is needed: ``` PersistentCompactIntMatrix leaf (one layer) LayeredStore one partition (layers are disjoint) LayeredStore> whole index (partitions are independent) ``` Normalised metrics require global column sums — computed in a two-pass cascade: ```rust // on LayeredStore> fn relfreq_bray_dist_matrix(&self) -> Array2 { let global = self.col_weights(); // pass 1 — sums up hierarchy let p = self.partial_relfreq_bray(&global); // pass 2 — global broadcast read-only p.mapv(|v| 1.0 - v) } ``` Because each kmer belongs to exactly one `(partition, layer)` pair, `col_weights()` has no double-counting across the hierarchy. --- ## Progressive aggregation principle No level reaches two levels down. Each level sums contributions from the level immediately below: ``` PersistentCompactIntMatrix::col_weights() — one (partition, layer) ↓ Σ across layers LayeredStore::col_weights() — one partition ↓ Σ across partitions LayeredStore>::col_weights() — global ``` The same cascade applies to every partial method. --- ## Multi-genome column invariant After any merge, every layer in every partition has exactly `n_genomes` columns, where `n_genomes` is the current total in `index.meta`. This holds for both `PersistentCompactIntMatrix` and `PersistentBitMatrix`. Maintained by three coordinated operations: **Existing layers — column append.** `Layer::append_genome_column` appends one column to each existing layer. Slots matching the incoming genome receive its count or `true`; all other slots receive 0 or `false`. **New layers — absent columns prepended.** When a new layer is created for kmers unique to the incoming genome, `n_existing_genomes` absent columns are prepended before the incoming genome's column, so the new layer immediately has the same column count as all other layers. **First merge, Presence mode — `init_presence_matrix`.** The initial single-genome index has no `presence/` directory (presence is implicit). On the first merge, `Layer<()>::init_presence_matrix` materialises genome 0's presence column (all `true`) retroactively, raising the column count from 0 to 1 before appending column 1. This invariant is the precondition for correct progressive aggregation: every level can blindly sum matrices from below because all matrices have the same shape. --- ## Query model ### Point query ``` minimiser(kmer) → partition p for each layer l in p: if let Some(slot) = MphfLayer_l.find(kmer): return data_l.read(slot) return None ``` O(n_layers) MPHF probes worst case; O(1) expected. The result comes from exactly one `(partition, layer)`. ### Aggregation ``` result = reduce( for p in partitions: // parallel for l in layers(p): // parallel partial(data_p_l) ) ``` For normalised metrics, replace with the two-pass cascade. --- ## Parallelism model | Level | Unit | Coordination | |---|---|---| | Across partitions | inner stores of `LayeredStore>` | none | | Across layers within a partition | inner stores of `LayeredStore` | none — disjoint kmer sets | | Normalised pass 1 (`col_weights`) | per inner store | none — additive | | Normalised pass 2 (partial) | per inner store | `global` broadcast read-only | | Within a matrix (distance) | upper-triangle pair `(i,j)` | none — rayon `par_iter` | --- ## reindex — evidence conversion in place `KmerIndex::reindex(target, block_bits)` converts every layer's evidence bundle to `target` without touching the MPHF or `unitigs.bin`: - `→ Exact`: builds `evidence.bin` + `unitigs.bin.idx`; removes `fingerprint.bin` - `→ Approx { b, z }`: builds `fingerprint.bin`; removes `evidence.bin` + `unitigs.bin.idx` On success, `IndexConfig::evidence` and `IndexConfig::block_bits` are updated in `index.meta`. Each layer's `layer_meta.json` is also rewritten with the new `EvidenceKind`. --- ## estimate — parameter dry-run `estimate` resolves approximate-evidence parameters (`z`, `b`, target FP rate) and prints the resulting effective kmer size and per-kmer / per-z-window false-positive rates without touching any index. Used to calibrate `Approx { b, z }` before building or reindexing.