# obilayeredmap — layered kmer index crate ## Purpose `obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. --- ## Four usage modes The MPHF + evidence infrastructure is fixed for all modes. The **payload** — data associated with each slot — is orthogonal and varies by mode. | Mode | Description | Payload type | Storage | |---|---|---|---| | 1. Set | membership test only | `()` | — | | 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory | | 3. Presence/absence matrix | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory | | 4. Count matrix | occurrences per kmer per genome | `PersistentCompactIntMatrix` | `counts/` directory | Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate. Mode 3 has a build path (`Layer::::build_presence`); mode 4 is not yet implemented. ### Payload for modes 2/4: PersistentCompactIntMatrix `PersistentCompactIntMatrix` is a column-major matrix stored in a directory: one `col_NNNNNN.pciv` file per column, plus a `meta.json`. Each column is a `PersistentCompactIntVec` — a mmap'd PCIV file with a `u8` primary array (255 = overflow sentinel), a sorted overflow section of `(slot: u64, value: u32)` entries, and a sparse L1-fitting index. Mode 2 writes 1 column per layer (one sample). Mode 4 writes G columns (one per genome). `read(slot)` returns `Box<[u32]>` — the full row across all columns. ### Payload for mode 3: PersistentBitMatrix `PersistentBitMatrix` is a column-major bit matrix stored in a directory: one `col_NNNNNN.pbiv` per genome, plus `meta.json`. Each column is a `PersistentBitVec` — a mmap'd PBIV file with u64 word-level bulk operations (AND, OR, XOR, NOT, POPCNT, Jaccard, Hamming). `read(slot)` returns `Box<[bool]>` — the presence vector across all genomes. Column-major layout makes per-genome set operations cache-friendly; the full row is assembled on demand at query time. --- ## Payload architecture The payload is orthogonal to the MPHF + evidence layer. `Layer` is parameterised by `D: LayerData`: ```rust pub trait LayerData: Sized { type Item; fn open(layer_dir: &Path) -> OLMResult; fn read(&self, slot: usize) -> Self::Item; } pub struct Layer { mphf: Mphf, evidence: Evidence, unitigs: UnitigFileReader, data: D, } pub struct Hit { pub slot: usize, pub data: T, } ``` `LayerData` covers the **read path only** (`open` + `read`). The write path (build) is intentionally not in the trait — build signatures differ between modes and forcing this into a trait would require an associated `Context` type with no benefit over specialized `impl` blocks. Implemented concrete types: | Type | `Item` | Description | |---|---|---| | `()` | `()` | mode 1 — membership only | | `PersistentCompactIntMatrix` | `Box<[u32]>` | modes 2/4 — one count per column | | `PersistentBitMatrix` | `Box<[bool]>` | mode 3 — one presence bit per column | `LayeredMap` mirrors the same parameterisation: `LayeredMap`. --- ## Three-level hierarchy ``` index_root/ ← LayeredMap (collection) meta.json part_00000/ ← Partition layer_0/ ← Layer mphf.bin unitigs.bin unitigs.bin.idx evidence.bin counts/ [modes 2/4] meta.json {"n": N, "n_cols": 1} col_000000.pciv presence/ [mode 3] meta.json {"n": N, "n_cols": G} col_000000.pbiv col_000001.pbiv ... layer_1/ ... part_00001/ layer_0/ ... ``` **Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry. **Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`. **Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer. --- ## Layer file layout ``` layer_N/ mphf.bin — ptr_hash MPHF (epserde, ptr_hash native format) unitigs.bin — packed 2-bit nucleotide sequences (obiskio binary format) unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[] evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7) counts/ — [modes 2/4] PersistentCompactIntMatrix presence/ — [mode 3] PersistentBitMatrix ``` `unitigs.bin` is the packed-2-bit sequence file produced by `obiskio::UnitigFileWriter`. The companion `.idx` file stores: magic `UIDX`, `n_unitigs: u32`, `n_kmers: u64`, `seqls: [u8; n_unitigs]` (kmer count − 1 per chunk), and `packed_offsets: [u32; n_unitigs + 1]` (byte offsets into `unitigs.bin`, sentinel-terminated). This gives O(1) random access to any unitig and the total kmer count without scanning the sequence file. ### Evidence encoding Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction. ``` slot s → unitig_id: u25 | rank: u7 ``` Packed into a `u32` (29 bits used, 3 spare). Decoding: ``` kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice ``` `rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient. --- ## ptr_hash configuration The MPHF per layer is configured as: ```rust type Mphf = PtrHash< u64, // key type: canonical kmer raw encoding CubicEps, // bucket fn: balanced (2.4 bits/key, λ=3.5) CachelineEfVec>, // remap: 11.6 bits/entry vs 32 for Vec Xx64, // hasher: XXH3-64 with seed, handles structured keys Vec, // pilots >; ``` **Hasher choice — `Xx64`:** k-mer raw values are left-aligned u64 with structural zeros in low bits (42 zeros for k=11, 2 zeros for k=31). `FxHash` (single multiply) distributes these poorly. `Xx64` (XXH3 64-bit, seeded) handles structured input correctly. **Bucket function — `CubicEps` with `PtrHashParams::::default()`:** λ=3.5, α=0.99. Balanced tradeoff: 2× slower construction than `Linear/λ=3.0` (the `default_fast` preset), 20% less space. `default_compact` (λ=4.0) saves a further 12.5% at 2× more construction time and reduced reliability — not chosen. **Remap — `CachelineEfVec`:** Elias-Fano variant packing 44 sorted 40-bit values per 64-byte cacheline (11.6 bits/value vs 32 for `Vec`). Already a transitive dependency of `ptr_hash`. One cacheline per query vs one u32 read; space win dominates for billion-scale key sets. --- ## Build path The build path is not part of `LayerData`. Each mode exposes its own `impl Layer::build` with the exact signature it needs. Two private module-level helpers avoid code duplication: **`build_mphf(out_dir, n) -> OLMResult`**: first pass — opens `unitigs.bin`, iterates all canonical kmers in parallel via `new_from_par_iter`, stores `mphf.bin`. O(n). **`build_second_pass(out_dir, n, mphf, fill_slot) -> OLMResult<()>`**: second pass — opens `unitigs.bin` again, fills `evidence.bin` and a compact n/8-byte seen-bitset (MPHF correctness check inline), calls `fill_slot(slot, kmer)` once per kmer for the mode-specific payload. O(n). ```rust // mode 1 impl Layer<()> { pub fn build(out_dir: &Path) -> OLMResult } // modes 2/4 impl Layer { pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult pub fn build_from_map(out_dir: &Path, counts: &HashMap) -> OLMResult } // mode 3 impl Layer { pub fn build_presence( out_dir: &Path, n_genomes: usize, present_in: impl Fn(CanonicalKmer, usize) -> bool, ) -> OLMResult } ``` Mode 2 creates a `PersistentCompactIntMatrixBuilder` with 1 column and fills it via `build_second_pass`. Mode 3 creates a `PersistentBitMatrixBuilder` with `n_genomes` columns and fills all columns in a single pass. Any duplicate slot or out-of-bounds index detected during `build_second_pass` returns `OLMError::Mphf`. `new_from_par_iter` avoids materialising all keys as `Vec`. --- ## Query path A kmer query routes through all three levels: 1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`. 2. **Layer probing**: iterate layers in order; for each layer compute `slot = mphf.index(kmer)`, decode evidence, compare to query. First match wins. 3. **Data access**: `layer.data.read(slot)` returns `D::Item`. ```rust // pseudo-code fn query(kmer) -> Option<(usize, Hit)>: for (i, layer) in self.layers.iter().enumerate(): slot = layer.mphf.index(&kmer.raw()) if layer.evidence.decode(slot) == kmer: return Some((i, Hit { slot, data: layer.data.read(slot) })) return None ``` Expected probe depth: 1 for kmers in layer 0, increasing for later layers. For mode 2, `hit.data` is `Box<[u32]>` with 1 element; `hit.data[0]` is the count. For mode 3, `hit.data` is `Box<[bool]>` with G elements, one per genome. --- ## Add-layer algorithm When adding dataset B to an existing index: 1. For each partition, iterate kmers of B routed to that partition. 2. Probe existing layers; collect kmers absent from all layers → `B \ index`. 3. Build a new layer from `B \ index`. 4. Append the new layer directory under each `part_XXXXX/`. 5. Update `meta.json` (layer count, sample registry). Each partition's new layer is built independently; the operation is fully parallel across partitions. --- ## Dependencies | crate | role | |---|---| | `ptr_hash 1.1` | MPHF per layer (epserde serialisation) | | `cacheline-ef 1.1` | compact remap storage inside ptr_hash | | `epserde 0.8` | zero-copy serialisation of MPHF | | `memmap2` | mmap of layer files | | `obiskio` | unitig file writer/reader | | `obicompactvec` | payload types: `PersistentCompactIntMatrix`, `PersistentBitMatrix` | --- ## Relationship to target architecture The target architecture (see [Kmer index architecture](../architecture/index_architecture.md)) separates `MphfLayer` from data stores entirely and introduces a `PartitionedIndex` with parallel dispatch and an `Aggregator` pattern. The current implementation is a stepping stone: `obicompactvec` types are already fully decoupled from the MPHF; the remaining refactoring is within `obilayeredmap` itself. --- ## Open questions - **Mode 4**: count matrix (n_kmers × n_genomes × bytes_per_count) is structurally identical to mode 3 but uses `PersistentCompactIntMatrix` with G columns. Build API not yet implemented. Scale concern: hundreds of GB for large collections — a sparse representation may be required at high genome counts. - **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model. - **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time. - **`try_new_from_par_iter`**: `ptr_hash::new_from_par_iter` silently discards construction failure. Post-construction verification (current workaround) is correct but does not allow retry. A `try_new_from_par_iter` PR upstream would close this gap.