# obilayeredmap — layered kmer index crate ## Purpose `obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. --- ## Three usage modes The MPHF + evidence infrastructure is the same for all modes. The **payload** varies. | Mode | Description | Payload type | Storage | |---|---|---|---| | 1. Set | membership test only | `()` | — | | 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory | | 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory | Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. --- ## MphfLayer — autonomous kmer → slot mapping `MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data. ```rust pub struct MphfLayer { mphf: Mphf, evidence: Evidence, unitigs: UnitigFileReader, n: usize, // number of indexed kmers = number of MPHF slots } ``` Public API: ```rust impl MphfLayer { pub fn open(dir: &Path) -> OLMResult pub fn find(&self, kmer: CanonicalKmer) -> Option // Some(slot) or None pub fn n(&self) -> usize pub fn unitig_writer(dir: &Path) -> OLMResult pub(crate) fn build( dir: &Path, fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>, ) -> OLMResult } ``` `find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed. It returns `None` for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test). `build` runs two sequential passes over `unitigs.bin`: 1. **Pass 1**: iterate all canonical kmers in parallel via rayon, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. 2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`. --- ## Layer\ — MPHF + payload `Layer` pairs an `MphfLayer` with one payload store. ```rust pub trait LayerData: Sized { type Item; fn open(layer_dir: &Path) -> OLMResult; fn read(&self, slot: usize) -> Self::Item; } pub struct Layer { mphf: MphfLayer, data: D, } pub struct Hit { pub slot: usize, pub data: T, } ``` `LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not in the trait. | Type | `Item` | Description | |---|---|---| | `()` | `()` | mode 1 — membership only | | `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2 — count matrix (one u32 per column per slot) | | `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — presence matrix (one bit per genome per slot) | **Build signatures:** ```rust // mode 1 impl Layer<()> { pub fn build(out_dir: &Path) -> OLMResult } // mode 2 impl Layer { pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult pub fn build_from_map(out_dir: &Path, counts: &HashMap) -> OLMResult } // mode 3 impl Layer { pub fn build_presence( out_dir: &Path, n_genomes: usize, present_in: impl Fn(CanonicalKmer, usize) -> bool, ) -> OLMResult } ``` All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`. --- ## LayeredStore\ and aggregation traits `LayeredStore` is a generic aggregation wrapper over `Vec`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls: ```rust pub struct LayeredStore(pub Vec); impl ColumnWeights for LayeredStore { … } // Σ col_weights across inner stores impl CountPartials for LayeredStore { … } // element-wise Σ partials impl BitPartials for LayeredStore { … } // element-wise Σ partials ``` Because blanket impls compose, `LayeredStore>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type. **Aggregation hierarchy:** ``` PersistentCompactIntMatrix implements CountPartials LayeredStore via blanket impl (one partition) LayeredStore> via blanket impl (partitioned index) ``` **Leaf implementors** (in `obicompactvec`): | Type | Traits | |---|---| | `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` | | `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` | `PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits — they are single-column primitives, not matrix-level aggregators. See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern. --- ## On-disk structure ``` index_root/ ← LayeredMap (collection) meta.json part_00000/ ← Partition layer_0/ ← Layer mphf.bin — ptr_hash MPHF (epserde format) unitigs.bin — packed 2-bit nucleotide sequences unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[] evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE counts/ [mode 2] PersistentCompactIntMatrix meta.json {"n": N, "n_cols": 1} col_000000.pciv presence/ [mode 3] PersistentBitMatrix meta.json {"n": N, "n_cols": G} col_000000.pbiv … layer_1/ … part_00001/ … ``` **Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel. **Layer** (`layer_N/`): one `MphfLayer` plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint. --- ## Evidence encoding `evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot: ``` bits [31:7] = chunk_id (25 bits) — index of the unitig chunk bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based) ``` Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`. For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer. --- ## ptr_hash configuration ```rust type Mphf = PtrHash< u64, // key type: canonical kmer raw encoding CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99 CachelineEfVec>, // remap: 11.6 bits/entry (Elias-Fano) Xx64, // hasher: XXH3-64 with seed Vec, // pilots >; ``` `Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly. `CubicEps` with `PtrHashParams::::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, 20% less space. --- ## Query path ```rust pub fn query(&self, kmer: CanonicalKmer) -> Option> { self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) }) } ``` `MphfLayer::find` probes the MPHF, decodes evidence, and verifies the kmer — returning `Some(slot)` on match, `None` otherwise. `data.read(slot)` is called only on a confirmed hit. In `LayeredMap`, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0. --- ## Add-layer algorithm When adding dataset B to an existing index: 1. For each partition, probe existing layers for kmers of B routed to that partition. 2. Collect kmers absent from all layers → `B \ index`. 3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`. 4. Call `Layer::build` on the new directory. 5. Update `meta.json`. Each partition's new layer is built independently; the operation is fully parallel across partitions. --- ## Dependencies | crate | role | |---|---| | `ptr_hash 1.1` | MPHF per layer | | `cacheline-ef 1.1` | compact remap inside ptr_hash | | `epserde 0.8` | zero-copy MPHF serialisation | | `memmap2 0.9` | mmap of evidence and payload files | | `obiskio` | unitig file writer/reader | | `obicompactvec` | payload types + aggregation traits | | `rayon 1` | parallel MPHF construction pass | | `ndarray 0.16` | aggregation output arrays | --- ## Column append and merge support These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command. ### Matrix column append ```rust impl PersistentCompactIntMatrix { pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()> } impl PersistentBitMatrix { pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()> } ``` Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change. ### Presence matrix initialisation ```rust impl Layer<()> { pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> } ``` Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended. ### Layer-level genome column append ```rust impl Layer { pub fn append_genome_column( layer_dir: &Path, value_of: impl Fn(usize) -> bool, ) -> OLMResult<()> } impl Layer { pub fn append_genome_column( layer_dir: &Path, value_of: impl Fn(usize) -> u32, ) -> OLMResult<()> } ``` These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction. ### Why the MPHF is never rebuilt The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.