# obilayeredmap — layered kmer index crate ## Purpose `obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **collection → partition → layer**. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers. --- ## Three-level hierarchy ``` index_root/ ← LayeredMap (collection) meta.json part_00000/ ← Partition layer_0/ ← Layer mphf.bin unitigs.bin evidence.bin counts.bin presence.bin layer_1/ ... part_00001/ layer_0/ layer_1/ ... ``` **Collection** (`index_root/`): global metadata — kmer size k, number of partitions, layer count, sample registry. **Partition** (`part_XXXXX/`): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in `part_XXXXX`. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in `meta.json`. **Layer** (`layer_N/`): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer. --- ## Layer file layout ``` layer_N/ mphf.bin — ptr_hash MPHF (exact key count, phase-2 construction) unitigs.bin — packed 2-bit nucleotide sequences (concatenated, variable-length) unitig_offsets.bin — u32 per unitig: nucleotide offset of unitig j in unitigs.bin evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7) counts.bin — u32 per MPHF slot (total kmer occurrences) presence.bin — bit matrix: n_slots × n_samples [optional] ``` Unitigs have variable lengths. Each record in `unitigs.bin` is self-delimiting: it begins with a varint `seql` (sequence length in nucleotides) followed by `(seql+3)/4` packed bytes — the streaming format defined in `obiskio`. Sequential scan is always possible using the in-record `seql`. For O(1) random access, `unitig_offsets.bin` is a **precomputed derived index**: a u32 array of byte offsets into `unitigs.bin`, with n_unitigs + 1 entries (sentinel = total byte size). Built once at construction by a single sequential scan; reconstructible from `unitigs.bin` if lost. Access: `unitigs.bin[offsets[j] .. offsets[j+1]]`. All files except `mphf.bin` are flat arrays of fixed-size elements, serialised with **rkyv** for zero-copy mmap access. `mphf.bin` uses ptr_hash's native serialisation; rkyv integration is deferred (see open questions). ### Evidence encoding Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction. ``` slot s → unitig_id: u25 | rank: u7 ``` Packed into a `u32` (29 bits used, 3 spare). Decoding: ``` kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice ``` `rank` is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A `u7` (0–127) is sufficient. ### Presence/absence matrix Column-major bit matrix: column j (sample j) is a contiguous `n_slots`-bit array. This layout makes per-sample operations (union, intersection, diff over a column) cache-friendly. For large matrices (e.g. 10 M slots × 1 000 samples ≈ 1.25 GB per partition), rkyv + mmap avoids loading the full matrix at open time. --- ## Query path A kmer query routes through all three levels: 1. **Partition routing**: hash canonical minimiser of the query kmer → partition index → open `part_XXXXX/`. 2. **Layer probing**: iterate layers in order within the partition; for each layer compute `slot = mphf.query(kmer)`, then verify `evidence.decode(slot) == kmer`. First match wins. 3. **Data access**: read `counts[slot]` and/or `presence[slot]` from the matching layer. ``` fn query(kmer) → Option: part = partition_of(kmer) for (i, layer) in part.layers.iter().enumerate(): slot = layer.mphf.query(kmer) if layer.evidence.decode(slot) == kmer: return Some(Hit { layer: i, slot }) return None ``` Expected probe depth: 1 for kmers in layer 0, increasing for later layers. In practice the dominant dataset should be layer 0. --- ## Add-layer algorithm When adding dataset B to an existing index: 1. For each partition, iterate kmers of B routed to that partition. 2. Probe existing layers; collect kmers absent from all layers → `B \ index`. 3. Build a new layer from `B \ index` using the two-phase pipeline (FMPHGO provisional → ptr_hash definitive). 4. Append the new layer directory under each `part_XXXXX/`. 5. Update `meta.json` (layer count, sample registry). Each partition's new layer is built independently; the operation is fully parallel across partitions. --- ## Core API (sketch) ```rust // Open an existing index let map = LayeredMap::open(path)?; // Query a canonical kmer across all partitions and layers match map.query(kmer) { Some(hit) => { let count = hit.count(); let present = hit.presence_row(); // bit slice over samples } None => { /* absent */ } } // Non-destructive extension with a new dataset // unitigs produced by the two-phase pipeline, one per partition let layer_idx = map.add_layer(unitigs_dir, counts_dir, presence_path)?; ``` --- ## Dependencies | crate | role | |---|---| | `ptr_hash` | phase-2 MPHF per layer | | `ph` (FMPHGO) | phase-1 provisional MPHF during layer construction | | `rkyv` | zero-copy serialisation of flat arrays (evidence, counts, presence) | | `memmap2` | mmap of layer files | | `bitm` | bit-packed presence matrix | --- ## Serialisation strategy All flat arrays use `rkyv::Archive`: ```rust #[derive(Archive, Serialize, Deserialize)] struct Evidence { slots: Vec } // packed (unitig_id: 25 | rank: 7) #[derive(Archive, Serialize, Deserialize)] struct Counts { data: Vec } ``` At open time, each file is mmapped and cast to its archived type — no allocation, no copy. The MPHF is loaded via ptr_hash's own API; a rkyv wrapper is a future refinement. --- ## Open questions - **ptr_hash + rkyv**: ptr_hash's internals are flat bit arrays; a rkyv-compatible wrapper is structurally feasible. Assess upstream willingness or implement a thin newtype wrapper. - **Presence matrix layout**: column-major favours per-sample operations; row-major favours per-kmer queries. Decide based on dominant access pattern. - **Layer merge**: merging two `LayeredMap` instances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path. - **Canonical kmer orientation**: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.