2026-05-09 17:20:08 +08:00
# obilayeredmap — layered kmer index crate
## Purpose
`obilayeredmap` implements a persistent, incrementally extensible kmer index. The index is organised in three levels: **index root → partition → layer**. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

---

## Three usage modes

The MPHF + evidence infrastructure is the same for all modes. Only the **payload** varies.

| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | `()` | none |
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.

---

## MphfLayer — autonomous kmer → slot mapping
`MphfLayer` encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
```rust
pub struct MphfLayer {
    mphf: Mphf,
    evidence: Evidence,
    unitigs: UnitigFileReader,
    n: usize, // number of indexed kmers = number of MPHF slots
}
```
Public API:
```rust
// Signatures only; bodies omitted.
impl MphfLayer {
    pub fn open(dir: &Path) -> OLMResult<Self>
    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None
    pub fn n(&self) -> usize
    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
    pub(crate) fn build(
        dir: &Path,
        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
    ) -> OLMResult<usize>
}
```

`find` returns `Some(slot)` only after verifying via evidence that the kmer is actually indexed, and `None` for absent keys. `ptr_hash` maps any input, indexed or not, to a valid slot, so evidence verification is the only reliable membership test.
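
Why verification is needed can be shown with a toy layer. This is a minimal sketch, not the crate's code: `ToyLayer`, `slot_of`, and the `key % n` hash are hypothetical, and the evidence here is the owning key itself, whereas the real layer stores a compact unitig position.

```rust
/// Toy stand-in for an MPHF layer: `slot_of` maps ANY key to a slot,
/// so membership must be confirmed against per-slot evidence.
struct ToyLayer {
    /// Evidence: the key that owns each slot (an assumption of this sketch;
    /// the real layer stores a unitig position instead).
    owner: Vec<u64>,
}

impl ToyLayer {
    /// Hypothetical hash: collision-free only for the keys the layer was built from.
    fn slot_of(&self, key: u64) -> usize {
        (key % self.owner.len() as u64) as usize
    }

    /// Returns `Some(slot)` only when the evidence confirms the key.
    fn find(&self, key: u64) -> Option<usize> {
        if self.owner.is_empty() {
            return None;
        }
        let slot = self.slot_of(key);
        (self.owner[slot] == key).then_some(slot)
    }
}

/// A layer over keys 10, 21, 32, which land in slots 1, 0, 2 under `key % 3`.
fn toy_layer() -> ToyLayer {
    ToyLayer { owner: vec![21, 10, 32] }
}
```

Key 13 hashes to slot 1 just like key 10 does, but the evidence check rejects it, so `find(13)` yields `None`.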

`build` runs two sequential passes over `unitigs.bin`:

1. **Pass 1**: iterate all canonical kmers in parallel via rayon, then construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
2. **Pass 2**: iterate again sequentially, fill `evidence.bin`, and call `fill_slot(slot, kmer)` once per kmer to populate the payload. A compact seen-bitset of `n/8` bytes verifies MPHF injectivity inline.
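
The second pass can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: keys are plain `u64`, errors are `String`, and `fill_slots`, `slot_of`, and `demo` are hypothetical names. Only the callback shape and the one-bit-per-slot check follow the description above.

```rust
/// Sketch of pass 2: visit each key once, hand `(slot, key)` to `fill_slot`,
/// and use a bitset of about n/8 bytes to detect two keys sharing a slot.
fn fill_slots(
    keys: &[u64],
    n: usize,
    slot_of: impl Fn(u64) -> usize,
    fill_slot: &mut impl FnMut(usize, u64) -> Result<(), String>,
) -> Result<usize, String> {
    let mut seen = vec![0u8; (n + 7) / 8]; // one bit per slot
    for &key in keys {
        let slot = slot_of(key);
        if slot >= n {
            return Err(format!("slot {slot} out of range for key {key}"));
        }
        let (byte, bit) = (slot / 8, 1u8 << (slot % 8));
        if seen[byte] & bit != 0 {
            return Err(format!("slot {slot} assigned twice (key {key})"));
        }
        seen[byte] |= bit;
        fill_slot(slot, key)?;
    }
    Ok(keys.len())
}

/// Hypothetical payload fill: store each key's low byte as its "count".
fn demo(keys: &[u64]) -> Result<Vec<u32>, String> {
    let n = keys.len();
    let mut counts = vec![0u32; n];
    let mut fill = |slot: usize, key: u64| -> Result<(), String> {
        counts[slot] = (key & 0xFF) as u32;
        Ok(())
    };
    // `key % n` stands in for the MPHF; it is injective only for some key sets.
    fill_slots(keys, n, |k| (k % n as u64) as usize, &mut fill)?;
    Ok(counts)
}
```

With keys 10, 21, 32 every slot is hit exactly once. With keys 10, 13, 32 two keys collide in slot 1 and the bitset check reports the error before the payload is overwritten.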

For empty layers (n = 0), `build` returns `Ok(0)` immediately after creating empty `mphf.bin` and `evidence.bin`.

---

## Layer\<D: LayerData\> — MPHF + payload

`Layer<D>` pairs an `MphfLayer` with one payload store.

```rust
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}
```

`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.

| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1: membership only |
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2: count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3: presence matrix (one bit per genome per slot) |

**Build signatures:**

```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(
        out_dir: &Path,
        n_genomes: usize,
        present_in: impl Fn(CanonicalKmer, usize) -> bool,
    ) -> OLMResult<usize>
}
```

All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Mode 2 pre-reads `n_kmers` from `unitigs.bin` to size the `PersistentCompactIntMatrixBuilder` before calling `MphfLayer::build`. Mode 3 does the same for `PersistentBitMatrixBuilder`.

---

## LayeredStore\<S\> and aggregation traits

`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:

```rust
pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // element-wise Σ partials
impl<S: BitPartials> BitPartials for LayeredStore<S> { … }     // element-wise Σ partials
```

Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does, which provides the partitioned level without a separate type.

**Aggregation hierarchy:**

```
PersistentCompactIntMatrix                  implements CountPartials
  LayeredStore<PersistentCompactIntMatrix>  via blanket impl (one partition)
    LayeredStore<LayeredStore<…>>           via blanket impl (partitioned index)
```
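
How the blanket impls compose can be seen in a small self-contained sketch. The trait below is a hypothetical stand-in for `obicompactvec::traits::ColumnWeights`: its method name and return type are assumptions, and `Leaf` is a toy store invented for the example.

```rust
/// Hypothetical stand-in for `obicompactvec::traits::ColumnWeights`;
/// the real trait's method names and types may differ.
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

/// Toy leaf store: one weight per column.
struct Leaf(Vec<u64>);

impl ColumnWeights for Leaf {
    fn col_weights(&self) -> Vec<u64> {
        self.0.clone()
    }
}

struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: a stack of stores sums its members' weights column by column.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Two partitions, the first holding two layers; no extra type is needed
/// for the outer level because the blanket impl applies recursively.
fn partitioned_weights() -> Vec<u64> {
    let part0 = LayeredStore(vec![Leaf(vec![1, 2]), Leaf(vec![10, 20])]);
    let part1 = LayeredStore(vec![Leaf(vec![100, 200])]);
    LayeredStore(vec![part0, part1]).col_weights()
}
```

The outer `LayeredStore<LayeredStore<Leaf>>` gets `col_weights` for free, which is the property the partitioned index relies on.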

**Leaf implementors** (in `obicompactvec`):

| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |

`PersistentCompactIntVec` and `PersistentBitVec` do not implement these traits: they are single-column primitives, not matrix-level aggregators.

See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.

---

## On-disk structure

```
index_root/                      ← LayeredMap (collection)
  meta.json
  part_00000/                    ← Partition
    layer_0/                     ← Layer
      mphf.bin                   ptr_hash MPHF (epserde format)
      unitigs.bin                packed 2-bit nucleotide sequences
      unitigs.bin.idx            UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
      evidence.bin               n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
      counts/                    [mode 2] PersistentCompactIntMatrix
        meta.json                {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/                  [mode 3] PersistentBitMatrix
        meta.json                {"n": N, "n_cols": G}
        col_000000.pbiv
        …
    layer_1/
      …
  part_00001/
    …
```

**Partition** (`part_XXXXX/`): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.

**Layer** (`layer_N/`): one `MphfLayer` plus an optional payload. Layer 0 covers dataset A; layer 1 covers the kmers of B that are absent from A; and so on. Layers within a partition are always disjoint.
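
The directory names in the tree can be produced with a pair of small helpers. This is a sketch: `layer_dir` and `count_column` are hypothetical names, and the zero-padding widths (five digits for partitions, six for columns) are inferred from the example names `part_00000` and `col_000000` rather than taken from the crate.

```rust
use std::path::{Path, PathBuf};

/// Path of one layer directory, e.g. `index_root/part_00001/layer_0`.
/// The 5-digit partition padding is inferred from the tree above.
fn layer_dir(root: &Path, partition: usize, layer: usize) -> PathBuf {
    root.join(format!("part_{partition:05}"))
        .join(format!("layer_{layer}"))
}

/// Path of one mode-2 count column inside a layer directory.
/// The 6-digit column padding is inferred from `col_000000.pciv`.
fn count_column(layer_dir: &Path, col: usize) -> PathBuf {
    layer_dir.join("counts").join(format!("col_{col:06}.pciv"))
}
```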

---

## Evidence encoding

`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:

```
bits [31:7] = chunk_id (25 bits): index of the unitig chunk
bits [6:0]  = rank     (7 bits):  kmer index within the chunk (0-based)
```

Decoding: `chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id`.
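
The bit layout translates directly into a pair of helpers. This is a minimal sketch: the function names are hypothetical, and reading from a byte slice stands in for whatever memory-mapped view the crate actually uses.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1; // 25-bit field

/// Packs (chunk_id, rank) into one u32, or `None` if either field overflows.
fn encode_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Splits a raw evidence word into (chunk_id, rank).
fn decode_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Reads slot `slot` from the raw bytes of a headerless little-endian u32 array.
fn read_slot(bytes: &[u8], slot: usize) -> Option<(u32, u32)> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let w = bytes.get(start..end)?;
    Some(decode_evidence(u32::from_le_bytes([w[0], w[1], w[2], w[3]])))
}
```

For example, chunk 5 at rank 3 encodes to `(5 << 7) | 3 = 643`, and decoding 643 gives back `(5, 3)`.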

For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within what the 7-bit rank field can address (ranks 0 to 127). The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.

---

## ptr_hash configuration
```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;
```

`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
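
The zero counts follow from the 2-bit encoding: a k-mer occupies the top 2k bits of the u64, leaving 64 − 2k low bits at zero. The sketch below checks that arithmetic. The nucleotide codes (A=0, C=1, G=2, T=3) and both function names are assumptions made for the example, and canonicalisation is left out.

```rust
/// Number of always-zero low bits in a left-aligned 2-bit encoding of a k-mer.
fn low_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k));
    64 - 2 * k
}

/// Packs a DNA string into a left-aligned u64.
/// The code A=0, C=1, G=2, T=3 is an assumption of this sketch.
fn pack_left_aligned(seq: &str) -> Option<u64> {
    let k = seq.len() as u32;
    if k == 0 || k > 32 {
        return None;
    }
    let mut v: u64 = 0;
    for b in seq.bytes() {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        v = (v << 2) | code;
    }
    // Shift the 2k significant bits to the top of the word.
    Some(v << low_zero_bits(k))
}
```

An 11-mer therefore always ends in 42 zero bits, whatever its sequence, which is the structure the hasher has to cope with.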

`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5) is a balanced tradeoff: 2× slower construction than `Linear/λ=3.0`, in exchange for 20% less space.

---

## Query path

```rust
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
    self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
}
```

`MphfLayer::find` probes the MPHF, decodes the evidence, and verifies the kmer, returning `Some(slot)` on a match and `None` otherwise. `data.read(slot)` is called only on a confirmed hit.

In `LayeredMap`, layers are probed in order; the first match wins. A kmer in layer 0 is resolved with a single probe, a kmer first added in layer i needs i + 1 probes, and an absent kmer probes every layer.
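
The layered lookup can be sketched with plain hash maps standing in for layers. `MapLayer`, `query_layers`, and `demo_layers` are hypothetical names; each map replaces the MPHF, evidence, and payload of one real layer.

```rust
use std::collections::HashMap;

/// Toy layer: a kmer → payload map stands in for MPHF + evidence + data.
type MapLayer = HashMap<u64, u32>;

/// Probes layers in order and returns (layer index, payload) for the first hit.
fn query_layers(layers: &[MapLayer], kmer: u64) -> Option<(usize, u32)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.get(&kmer).map(|&v| (i, v)))
}

/// Two disjoint layers, as the index guarantees within a partition.
fn demo_layers() -> Vec<MapLayer> {
    vec![
        HashMap::from([(1u64, 10u32), (2, 20)]),
        HashMap::from([(7u64, 70u32)]),
    ]
}
```

Because layers are disjoint, "first match wins" never hides a second match; the ordering only affects how many probes a hit costs.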

---

## Add-layer algorithm

When adding dataset B to an existing index:

1. For each partition, probe existing layers for the kmers of B routed to that partition.
2. Collect the kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `MphfLayer::unitig_writer`.
4. Call `Layer<D>::build` (or the mode-specific variant, such as `build_presence`) on the new directory.
5. Update `meta.json`.

Each partition's new layer is built independently; the operation is fully parallel across partitions.
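
For a single partition, steps 1 and 2 amount to a set difference. The sketch below models each layer as a `BTreeSet<u64>` and leaves out unitig writing, MPHF construction, and the `meta.json` update; all names in it are hypothetical.

```rust
use std::collections::BTreeSet;

/// Returns the kmers of `dataset` that no existing layer contains (`B \ index`).
fn new_layer_kmers(layers: &[BTreeSet<u64>], dataset: &[u64]) -> BTreeSet<u64> {
    dataset
        .iter()
        .copied()
        .filter(|k| !layers.iter().any(|l| l.contains(k)))
        .collect()
}

/// Appends the new layer for one partition and returns how many kmers it holds.
/// An empty difference still yields a (possibly empty) layer in this sketch.
fn add_layer(layers: &mut Vec<BTreeSet<u64>>, dataset: &[u64]) -> usize {
    let fresh = new_layer_kmers(&layers[..], dataset);
    let added = fresh.len();
    layers.push(fresh);
    added
}

/// One partition whose layer 0 holds kmers 1, 2, 3.
fn demo_index() -> Vec<BTreeSet<u64>> {
    vec![[1u64, 2, 3].into_iter().collect()]
}
```

Adding a dataset with kmers 2, 3, 4, 5 creates a layer holding only 4 and 5, so existing layers are never touched and the layers stay disjoint.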

---

## Dependencies

| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and payload files |
| `obiskio` | unitig file writer/reader |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `ndarray 0.16` | aggregation output arrays |