# obilayeredmap: layered kmer index crate

## Purpose

`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

---

## Three usage modes

The MPHF + evidence infrastructure is the same for all modes. The **payload** varies.

| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | `()` | none |
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.

---

## Evidence kinds

Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:

```rust
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}
```

`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.

- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z), where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required.

---

## MphfLayer: autonomous kmer → slot mapping

`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.

```rust
pub struct MphfLayer {
    mphf: Mphf,
    ev: LayerEvidence, // loaded at open() time
    n: usize,
}
```

`LayerEvidence` is an internal enum, not public:

```rust
enum LayerEvidence {
    Exact { evidence: Evidence, unitigs: UnitigFileReader },
    Approx { fingerprint: FingerprintVec },
}
```

### Query API

Three public query methods, all returning `Option<usize>` (slot index):

```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
```

- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.

`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.

### Build surface

```rust
// Full MPHF + exact evidence build (two-pass, parallel)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult

// Evidence-only builds (MPHF already present in dir)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult
pub fn build_approx_evidence(dir, b, z) -> OLMResult
pub fn build_evidence(dir, kind, block_bits) -> OLMResult // dispatch
```

`MphfLayer::build` runs two sequential passes over `unitigs.bin`:

1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.

`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.

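
The injectivity check in pass 2 can be pictured with a small standalone sketch. It is not the crate's code: `verify_injective` and the `slot_of` closure are hypothetical names, and the closure stands in for the real MPHF lookup. It shows only the one-bit-per-slot bookkeeping, rounded up to whole bytes.

```rust
/// Checks that `slot_of` maps every key to a distinct slot in 0..n.
/// `slot_of` is a hypothetical stand-in for the MPHF lookup.
fn verify_injective(keys: &[u64], slot_of: impl Fn(u64) -> usize) -> Result<(), String> {
    let n = keys.len();
    // One bit per slot, rounded up to whole bytes.
    let mut seen = vec![0u8; (n + 7) / 8];
    for &key in keys {
        let slot = slot_of(key);
        if slot >= n {
            return Err(format!("slot {slot} out of range for n = {n}"));
        }
        let byte = slot / 8;
        let mask = 1u8 << (slot % 8);
        if seen[byte] & mask != 0 {
            return Err(format!("slot {slot} assigned twice"));
        }
        seen[byte] |= mask;
    }
    Ok(())
}
```

A mapping that is a bijection onto 0..n passes; a collision or an out-of-range slot is reported at the first offending key.
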

For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.

---

## Layer\<D\>: MPHF + payload

`Layer<D>` pairs an `MphfLayer` with one payload store.

```rust
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T> {
    pub slot: usize,
    pub data: T,
}
```

`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.

| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1: membership only |
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2: count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3: presence matrix (one bit per genome per slot) |

### Build signatures

```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, block_bits: u8, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult
    pub fn build_from_map(out_dir: &Path, block_bits: u8, counts: &HashMap) -> OLMResult
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(out_dir: &Path, block_bits: u8, n_genomes: usize, present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult
}
```

All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.

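
As an illustration of how the read path composes, here is a minimal in-memory sketch. Everything in it is an assumption made for the example: the trait is reduced to `read` (the real `open` takes a layer directory), `ToyLayer` uses a `HashMap` where the crate uses an `MphfLayer`, and `ToyCounts` stands in for a persistent matrix.

```rust
use std::collections::HashMap;

/// Read-path trait, simplified for the sketch (no `open`).
trait LayerData {
    type Item;
    fn read(&self, slot: usize) -> Self::Item;
}

/// Mode 1: the payload carries no data.
impl LayerData for () {
    type Item = ();
    fn read(&self, _slot: usize) -> Self::Item {}
}

/// Toy count payload: one row of per-sample counts per slot.
struct ToyCounts {
    rows: Vec<Vec<u32>>,
}

impl LayerData for ToyCounts {
    type Item = Box<[u32]>;
    fn read(&self, slot: usize) -> Self::Item {
        self.rows[slot].clone().into_boxed_slice()
    }
}

struct Hit<T> {
    slot: usize,
    data: T,
}

/// Toy layer: a HashMap plays the role of the kmer -> slot mapping.
struct ToyLayer<D> {
    slots: HashMap<u64, usize>,
    data: D,
}

impl<D: LayerData> ToyLayer<D> {
    /// Resolve the slot first, then read the payload stored at that slot.
    fn query(&self, kmer: u64) -> Option<Hit<D::Item>> {
        let slot = *self.slots.get(&kmer)?;
        Some(Hit { slot, data: self.data.read(slot) })
    }
}
```

The same `query` body serves every mode; only the `Item` type returned by `read` changes.
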
### Evidence build helpers on Layer

```rust
impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult
    pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult
}
```

These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer` level.

---

## FingerprintVec and FingerprintVecWriter

Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.

```
fingerprint.bin format:
  magic:   b"FPVF"    (4 bytes)
  b:       u8         (bits per fingerprint, 1..=64)
  padding: [0u8; 3]
  n:       u64 LE     (number of slots)
  data:    packed bits, ceil(n*b/8) bytes, Lsb0 order
```

```rust
impl FingerprintVec {
    pub fn open(path: &Path) -> OLMResult<Self>
    pub fn get(&self, slot: usize) -> u64
    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
    pub fn n(&self) -> usize
    pub fn b(&self) -> u8
}
```

`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find_approx`.

---

## LayeredMap\<D\>: collection of layers

`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.

```rust
pub struct LayeredMap<D: LayerData> {
    root: PathBuf,
    meta: PartitionMeta,
    layers: Vec<Layer<D>>,
}
```

`PartitionMeta` (`meta.json` at the partition root) stores `n_layers`.

### Common methods

```rust
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult
```

`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.

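
The first-match probing order can be sketched in a few lines. This is an illustrative model only: each layer is a plain `HashMap` from a `u64` kmer code to a slot, and the free function `query` is a hypothetical stand-in for the method of the same name.

```rust
use std::collections::HashMap;

/// Probe layers in order and return (layer_index, slot) for the first hit.
/// Layers are disjoint by construction, so at most one layer can match.
fn query(layers: &[HashMap<u64, usize>], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.get(&kmer).map(|&slot| (i, slot)))
}
```

A kmer stored in layer 0 is resolved after a single probe; a kmer absent from the index costs one probe per layer before `None` is returned.
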
### push_layer

`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:

```rust
// mode 1
impl LayeredMap<()> {
    pub fn push_layer(&mut self) -> OLMResult
}

// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult
    pub fn push_layer_from_map(&mut self, counts: &HashMap) -> OLMResult
}
```

Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer::build_presence`.

---

## LayeredStore\<S\> and aggregation traits

`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:

```rust
pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // element-wise Σ partials
impl<S: BitPartials> BitPartials for LayeredStore<S> { … }     // element-wise Σ partials
```

Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does, which provides the partitioned level without a separate type.

**Leaf implementors** (in `obicompactvec`):

| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |

See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.

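
The composition of blanket impls described above can be demonstrated with a self-contained sketch. The trait below is a hypothetical, simplified stand-in for `obicompactvec::traits::ColumnWeights` (the real signature is not shown in this document), and `Leaf` is a toy store invented for the example.

```rust
/// Hypothetical, simplified stand-in for the real `ColumnWeights` trait.
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: a store of stores sums the per-column weights of its members.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let weights = store.col_weights();
            // Grow the accumulator if an inner store has more columns.
            if total.len() < weights.len() {
                total.resize(weights.len(), 0);
            }
            for (t, w) in total.iter_mut().zip(weights) {
                *t += w;
            }
        }
        total
    }
}

/// Toy leaf store with fixed per-column weights.
struct Leaf(Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}
```

Because the impl is generic over any `S: ColumnWeights`, wrapping a `Vec<LayeredStore<Leaf>>` in another `LayeredStore` needs no extra code: the outer level sums the partition totals in the same way the inner level sums the layers.
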

---

## On-disk structure

```
partition_root/                  ← LayeredMap (one partition)
  meta.json                      {"n_layers": N}
  layer_0/                       ← Layer<D>
    layer_meta.json              {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
    mphf.bin                     ptr_hash MPHF (epserde format)
    unitigs.bin                  packed 2-bit nucleotide sequences
    unitigs.bin.idx              UIDX index (exact evidence only)
    evidence.bin                 [u32; n], LE (exact evidence only)
    fingerprint.bin              packed b-bit array (approx evidence only)
    counts/                      [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
    presence/                    [mode 3] PersistentBitMatrix
      meta.json
      col_000000.pbiv
      …
  layer_1/
    …
```

`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.

---

## Evidence encoding (exact)

`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:

```
bits [31:7] = chunk_id (25 bits): index of the unitig chunk
bits [6:0]  = rank     (7 bits):  kmer index within the chunk (0-based)
```

`chunk_id = raw >> 7`, `rank = raw & 0x7F`.

Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).

For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 7-bit rank range (0..=127).

---

## ptr_hash configuration

```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;
```

`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.

`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5): construction is about 2× slower than with `Linear` at λ=3.0, in exchange for ~20% less space.

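
The structural-zero counts quoted above follow from the encoding width: k nucleotides at 2 bits each occupy the top 2k bits of the u64, leaving 64 − 2k low bits that are zero for every kmer. A minimal sketch, assuming that left-aligned layout; `left_align` and `structural_zero_bits` are hypothetical helper names, not crate functions:

```rust
/// Left-align a 2-bit-per-base kmer code in a u64 (assumed layout).
/// Valid for 1 <= k <= 32.
fn left_align(code: u64, k: u32) -> u64 {
    code << (64 - 2 * k)
}

/// Number of low bits that are zero for every kmer of length k.
fn structural_zero_bits(k: u32) -> u32 {
    64 - 2 * k
}
```

Applying `trailing_zeros()` to a left-aligned all-ones code gives exactly the structural count, because in that code no base contributes zero bits of its own.
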

---

## Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF.

### Layer-level genome column append

```rust
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
```

Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).

### Presence matrix initialisation

```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```

Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.

### Why the MPHF is never rebuilt

The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set; it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.

---

## Add-layer algorithm

When adding dataset B to an existing index:

1. For each partition, probe existing layers for kmers of B routed to that partition.
2. Collect kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
4. Call `Layer::build` (or `build_presence`) on the new layer directory.
5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.

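
The steps above can be modelled in memory for a single partition. This is a sketch under stated assumptions, not the crate's implementation: layers are `HashSet<u64>` instead of on-disk MPHF layers, and `novel_kmers` and `add_layer` are hypothetical names.

```rust
use std::collections::HashSet;

/// Steps 1-2: keep the kmers of B that no existing layer contains,
/// dropping duplicates within B while preserving first-seen order.
fn novel_kmers(layers: &[HashSet<u64>], dataset_b: &[u64]) -> Vec<u64> {
    let mut seen = HashSet::new();
    dataset_b
        .iter()
        .copied()
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        .filter(|k| seen.insert(*k))
        .collect()
}

/// Steps 3-5, in memory: the novel set becomes the next layer.
/// Returns the number of kmers in the new layer.
fn add_layer(layers: &mut Vec<HashSet<u64>>, dataset_b: &[u64]) -> usize {
    let novel = novel_kmers(layers.as_slice(), dataset_b);
    let n = novel.len();
    layers.push(novel.into_iter().collect());
    n
}
```

Since only `B \ index` enters the new layer, layers stay pairwise disjoint, which is the property the first-match `query` relies on.
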

Each partition's new layer is built independently; the operation is fully parallel across partitions.

---

## Dependencies

| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and fingerprint files |
| `bitvec` | packed b-bit fingerprint storage |
| `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |