# obilayeredmap — layered kmer index crate
## Purpose
`obilayeredmap` implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a `ptr_hash` MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

---
## Three usage modes
The MPHF + evidence infrastructure is the same for all modes. Only the **payload** varies.

| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | `()` | — |
| 2. Count | occurrences per kmer per sample | `PersistentCompactIntMatrix` | `counts/` directory |
| 3. Presence/absence | which genomes contain each kmer | `PersistentBitMatrix` | `presence/` directory |

Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obicompactvec` crate.

---
## Index mode (homogeneity invariant)
A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at `LayeredMap::open()` from `PartitionMeta.mode` and passed to each `Layer::open()` — no per-layer file is read.
```rust
#[derive(Serialize, Deserialize, Default)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum IndexMode {
    #[default]
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}
```
`IndexMode` is stored once in `PartitionMeta` (`meta.json` at partition root). There is no `layer_meta.json`.

- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
- **Approx**: writes `fingerprint.bin` only. FP rate per kmer = 1/2^b; with the Findere z-parameter, z consecutive kmers must all match, giving an effective window FP ≈ 1/2^(b·z). No `.idx` is written or required.
- **Hybrid**: writes both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (fast, O(1)); `find_strict()` uses exact evidence.
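As a rough check on the false-positive figures above, the following sketch evaluates both formulas for a given `b` and `z`. The function names are illustrative and not part of the crate, and the window estimate assumes that fingerprint collisions on consecutive kmers are independent.

```rust
/// Per-kmer false-positive probability of a b-bit fingerprint: 1 / 2^b.
fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(b as i32)
}

/// Probability that z consecutive kmers all match by chance, under the
/// independence assumption stated above: 1 / 2^(b*z).
fn window_fp_rate(b: u8, z: u8) -> f64 {
    0.5f64.powi(b as i32 * z as i32)
}
```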
---
## MphfLayer — autonomous kmer → slot mapping
`MphfLayer` encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
```rust
pub struct MphfLayer {
    mphf: Mphf,
    ev: LayerEvidence, // loaded at open() time
    n: usize,
}
```
`LayerEvidence` is an internal enum, not public:
```rust
enum LayerEvidence {
    Exact { evidence: Evidence, unitigs: UnitigFileReader },
    Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
    Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
}
```
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read.
### Query API
Two public query methods, both returning `Option<usize>` (slot index):
```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
```
- `find`: O(1) auto-dispatch. Exact → exact evidence check. Approx/Hybrid → fingerprint comparison.
- `find_strict`: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no `.idx`).

There are no `find_exact`/`find_approx` methods; panicking dispatch is eliminated.
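The dispatch described above can be pictured with a small in-memory model. This is only a sketch: `ToyEvidence`, `toy_hash` and `low_bits` are hypothetical names, the slot is assumed to have been computed by the MPHF beforehand, and the real `LayerEvidence` works on mmapped files rather than vectors.

```rust
/// Toy stand-in for the evidence held by one layer (in-memory assumption).
enum ToyEvidence {
    /// kmers[slot] is the exact kmer stored at that slot.
    Exact { kmers: Vec<u64> },
    /// fps[slot] is a b-bit fingerprint; `scan` mimics reading unitigs.bin.
    Approx { fps: Vec<u64>, b: u8, scan: Vec<u64> },
    Hybrid { kmers: Vec<u64>, fps: Vec<u64>, b: u8 },
}

/// Hypothetical hash, used only to derive fingerprints in this sketch.
fn toy_hash(kmer: u64) -> u64 {
    kmer.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31)
}

/// Low `b` bits of `h`.
fn low_bits(h: u64, b: u8) -> u64 {
    if b >= 64 { h } else { h & ((1u64 << b) - 1) }
}

impl ToyEvidence {
    /// Fast path: exact check in Exact mode, fingerprint check otherwise.
    fn find(&self, slot: usize, kmer: u64) -> Option<usize> {
        let hit = match self {
            ToyEvidence::Exact { kmers } => kmers.get(slot) == Some(&kmer),
            ToyEvidence::Approx { fps, b, .. } | ToyEvidence::Hybrid { fps, b, .. } => {
                fps.get(slot) == Some(&low_bits(toy_hash(kmer), *b))
            }
        };
        if hit { Some(slot) } else { None }
    }

    /// Always exact: O(1) with evidence, O(n) scan in Approx mode.
    fn find_strict(&self, slot: usize, kmer: u64) -> Option<usize> {
        let hit = match self {
            ToyEvidence::Exact { kmers } | ToyEvidence::Hybrid { kmers, .. } => {
                kmers.get(slot) == Some(&kmer)
            }
            ToyEvidence::Approx { scan, .. } => scan.contains(&kmer),
        };
        if hit { Some(slot) } else { None }
    }
}
```

Because every variant has a branch in both methods, neither call can panic on an unsupported mode; the cost of that guarantee is the linear scan in the Approx arm of `find_strict`.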
### Build surface
```rust
// Full MPHF + evidence build (two-pass)
pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>

// Evidence-only post-hoc builds (MPHF already present)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
```
`MphfLayer::build` runs two passes over `unitigs.bin`:

1. **Pass 1** (parallel via rayon): a `CanonicalKmerIter` (clonable, `Arc<Mmap>`, no file reopening) is passed to `new_from_par_iter` via `par_bridge()`. Produces `mphf.bin`. No `.idx` is read or created at this stage.
2. **Pass 2** (sequential): fill evidence files; call `fill_slot(slot, kmer)` per kmer. `.idx` is written last for Exact/Hybrid modes (query-time only).
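The two-pass shape can be illustrated without `ptr_hash`. In the sketch below, a sorted key vector stands in for the MPHF (slot = rank of the kmer), which is an assumption made purely for illustration; `ToyLayer` and `build_toy` are hypothetical names, and the first pass is sequential here rather than parallel.

```rust
/// Toy layer: `keys` plays the role of the MPHF (slot = rank in sorted order),
/// `evidence[slot]` records where the kmer was seen in the input.
struct ToyLayer {
    keys: Vec<u64>,
    evidence: Vec<u32>,
}

fn build_toy(kmers: &[u64], mut fill_slot: impl FnMut(usize, u64)) -> ToyLayer {
    // Pass 1: derive the kmer -> slot mapping from the full key set.
    let mut keys = kmers.to_vec();
    keys.sort_unstable();
    keys.dedup();

    // Pass 2: sequential walk, filling evidence and the caller's payload.
    let mut evidence = vec![0u32; keys.len()];
    for (pos, &kmer) in kmers.iter().enumerate() {
        let slot = keys.binary_search(&kmer).expect("kmer was seen in pass 1");
        evidence[slot] = pos as u32;
        fill_slot(slot, kmer);
    }
    ToyLayer { keys, evidence }
}
```

The callback is the only place where the payload differs between modes, which is why a single build routine can serve all of them.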
There is no `build_evidence` dispatch wrapper; callers invoke `build_exact_evidence` or `build_approx_evidence` directly.

For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.

---
## Layer\<D: LayerData\> — MPHF + payload
`Layer<D>` pairs an `MphfLayer` with one payload store.
```rust
pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}
```
`LayerData` covers the **read path only** (`open` + `read`). Build signatures differ between modes and are not part of the trait.

| Type | `Item` | Description |
|---|---|---|
| `()` | `()` | mode 1: membership only |
| `PersistentCompactIntMatrix` | `Box<[u32]>` | mode 2: count matrix (one u32 per column per slot) |
| `PersistentBitMatrix` | `Box<[bool]>` | mode 3: presence matrix (one bit per genome per slot) |
### Build signatures
```rust
// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                 count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                          counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
                          n_genomes: usize,
                          present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}
```
All build impls delegate to `MphfLayer::build` via a mode-specific `fill_slot` callback. The `mode` parameter is forwarded directly; no `LayerMeta` is written.

Evidence-only post-hoc builds are accessible directly on `Layer<D>`:
```rust
impl<D: LayerData> Layer<D> {
    pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
    pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
}
```
There is no `build_evidence` dispatch wrapper.

---
## FingerprintVec and FingerprintVecWriter
Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.
```
fingerprint.bin format:
magic: b"FPVF" (4 bytes)
b: u8 (bits per fingerprint, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
```
```rust
impl FingerprintVec {
    pub fn open(path: &Path) -> OLMResult<Self>
    pub fn get(&self, slot: usize) -> u64
    pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
    pub fn n(&self) -> usize
    pub fn b(&self) -> u8
}
```
`matches(slot, hash)` extracts the b-bit fingerprint stored at `slot` and compares it to the low b bits of `hash`. It is the core operation of `find` in Approx and Hybrid modes.
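To make the packed layout concrete, here is a minimal sketch of reading and writing b-bit values in a byte slice. It assumes Lsb0 means that stream bit `i` is bit `i % 8` of byte `i / 8`, and that the least-significant bit of each fingerprint is stored first; the helper names are hypothetical and the crate itself relies on `bitvec` rather than hand-written loops.

```rust
/// Bytes needed for n fingerprints of b bits each: ceil(n*b/8).
fn packed_len(n: usize, b: u8) -> usize {
    (n * b as usize + 7) / 8
}

/// Store the low `b` bits of `value` at `slot`, one bit at a time.
fn set_packed(data: &mut [u8], b: u8, slot: usize, value: u64) {
    let start = slot * b as usize;
    for i in 0..b as usize {
        let bit = start + i;
        let mask = 1u8 << (bit % 8);
        if (value >> i) & 1 == 1 {
            data[bit / 8] |= mask;
        } else {
            data[bit / 8] &= !mask;
        }
    }
}

/// Read back the b-bit value stored at `slot`.
fn get_packed(data: &[u8], b: u8, slot: usize) -> u64 {
    let start = slot * b as usize;
    let mut value = 0u64;
    for i in 0..b as usize {
        let bit = start + i;
        let set = (data[bit / 8] >> (bit % 8)) & 1;
        value |= (set as u64) << i;
    }
    value
}

/// Compare the stored fingerprint with the low `b` bits of `hash`.
fn matches(data: &[u8], b: u8, slot: usize, hash: u64) -> bool {
    let mask = if b >= 64 { u64::MAX } else { (1u64 << b) - 1 };
    get_packed(data, b, slot) == (hash & mask)
}
```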
---
## LayeredMap\<D\> — collection of layers
`LayeredMap<D>` wraps `Vec<Layer<D>>` for a single partition directory.
```rust
pub struct LayeredMap<D: LayerData = ()> {
    root: PathBuf,
    meta: PartitionMeta,
    layers: Vec<Layer<D>>,
}
```
`PartitionMeta` (`meta.json` at the partition root) stores `n_layers` and the index `mode`.
### Common methods
```rust
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn mode(&self) -> &IndexMode
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
```
`open` reads `PartitionMeta` once, extracts `mode`, and passes it to every `Layer::open`; no per-layer file is read. `create` stores the given mode in `PartitionMeta`.

`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
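The first-match probe can be sketched with plain hash maps standing in for layers. `query_layers` is a hypothetical helper, and each map from kmer to slot is an assumption replacing the MPHF plus evidence check.

```rust
use std::collections::HashMap;

/// Probe layers in order and return (layer_index, slot) for the first
/// layer that contains `kmer`, or None if no layer does.
fn query_layers(layers: &[HashMap<u64, usize>], kmer: u64) -> Option<(usize, usize)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.get(&kmer).map(|&slot| (i, slot)))
}
```

Since layers hold disjoint kmer sets, at most one layer can match, so stopping at the first hit loses nothing.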
### push_layer
`push_layer` builds the next layer from a `unitigs.bin` already written via `next_layer_writer`, using `DEFAULT_BLOCK_BITS`:
```rust
// mode 1
impl LayeredMap<()> {
    pub fn push_layer(&mut self) -> OLMResult<usize>
}

// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
    pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
```
Mode 3 (`PersistentBitMatrix`) has no `push_layer` on `LayeredMap`; callers build directly via `Layer<PersistentBitMatrix>::build_presence`.

---
## LayeredStore\<S\> and aggregation traits
`LayeredStore<S>` is a generic aggregation wrapper over `Vec<S>`. It propagates three traits from `obicompactvec::traits` up the hierarchy via blanket impls:
```rust
pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // element-wise Σ partials
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wise Σ partials
```
Because blanket impls compose, `LayeredStore<LayeredStore<S>>` automatically inherits all three traits when `S` does — providing the partitioned level without a separate type.
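A reduced model shows how the composition works. The `ColumnWeights` trait below is a simplified, hypothetical version with a single method (the real signature in `obicompactvec::traits` may differ), and `ToyCounts` is an invented leaf type.

```rust
/// Simplified stand-in for the column-weight trait (assumed signature).
trait ColumnWeights {
    fn col_weights(&self) -> Vec<u64>;
}

struct LayeredStore<S>(pub Vec<S>);

/// Blanket impl: the weights of a store of stores are the element-wise sums.
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {
    fn col_weights(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let w = store.col_weights();
            if total.len() < w.len() {
                total.resize(w.len(), 0);
            }
            for (t, x) in total.iter_mut().zip(w) {
                *t += x;
            }
        }
        total
    }
}

/// Toy leaf: one row of per-column counts per slot.
struct ToyCounts(Vec<Vec<u64>>);

impl ColumnWeights for ToyCounts {
    fn col_weights(&self) -> Vec<u64> {
        let n_cols = self.0.first().map_or(0, |row| row.len());
        let mut sums = vec![0u64; n_cols];
        for row in &self.0 {
            for (s, &x) in sums.iter_mut().zip(row) {
                *s += x;
            }
        }
        sums
    }
}
```

With these definitions, `LayeredStore<LayeredStore<ToyCounts>>` gets `col_weights` for free: the outer impl sums over partitions and the inner one over layers.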
**Leaf implementors** (in `obicompactvec`):

| Type | Traits |
|---|---|
| `PersistentCompactIntMatrix` | `ColumnWeights` (via `sum()`) + `CountPartials` |
| `PersistentBitMatrix` | `ColumnWeights` (via `count_ones()`) + `BitPartials` |

See [Kmer index architecture](../architecture/index_architecture.md) for the full trait API and the two-pass normalised-metric pattern.

---
## On-disk structure
```
partition_root/          ← LayeredMap (one partition)
  meta.json              — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
  layer_0/               ← Layer
    mphf.bin             — ptr_hash MPHF (epserde format)
    unitigs.bin          — packed 2-bit nucleotide sequences
    unitigs.bin.idx      — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
    evidence.bin         — [u32; n], LE (Exact/Hybrid only)
    fingerprint.bin      — packed b-bit array (Approx/Hybrid only)
    counts/              [mode 2] PersistentCompactIntMatrix
      meta.json
      col_000000.pciv
    presence/            [mode 3] PersistentBitMatrix
      meta.json
      col_000000.pbiv …
  layer_1/
    …
```
There is no `layer_meta.json`. The mode is stored once in `PartitionMeta` and is valid for all layers. `unitigs.bin.idx` is built at the end of `build_exact_evidence`, never during MPHF construction, and is consumed at query time only.

---
## Evidence encoding (exact)
`evidence.bin` is a flat `[u32; n]` array with no header. Each u32 encodes one slot:
```
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)
```
`chunk_id = raw >> 7`, `rank = raw & 0x7F`. Reconstructing the kmer: read k nucleotides at position `rank` within unitig `chunk_id` (requires `unitigs.bin.idx` for random access).
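The bit layout translates directly into a pair of helpers. The following is a sketch with hypothetical function names; it also shows how one slot could be read from the raw little-endian bytes of the file, assuming the whole file content is available as a byte slice.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = 0x7F;
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one u32; None if either field overflows.
fn encode_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Split a raw u32 back into (chunk_id, rank).
fn decode_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Read the little-endian u32 stored for `slot` in a headerless byte buffer.
fn read_slot(bytes: &[u8], slot: usize) -> Option<u32> {
    let c = bytes.get(slot * 4..slot * 4 + 4)?;
    Some(u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
}
```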
For k=31, m=11, the observed maximum is ~46 kmers per chunk, well within the 128 ranks (0..=127) that the 7-bit field can address.

---
## ptr_hash configuration
```rust
type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;
```
`Xx64` is chosen over `FxHash` because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
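The zero counts quoted above follow from 2 bits per nucleotide: a kmer occupies 2k bits, leaving 64 − 2k low bits empty once left-aligned. A small sketch, with hypothetical helper names and assuming the first nucleotide sits in the most significant bits and k is in 1..=32:

```rust
/// Number of always-zero low bits for a left-aligned kmer of length k (k <= 32).
fn structural_zero_bits(k: u32) -> u32 {
    64 - 2 * k
}

/// Pack 2-bit nucleotide codes (0..=3) into a left-aligned u64.
fn left_align(codes: &[u8]) -> u64 {
    let k = codes.len() as u32;
    if k == 0 {
        return 0;
    }
    let mut v = 0u64;
    for &c in codes {
        v = (v << 2) | (c as u64 & 3);
    }
    v << (64 - 2 * k)
}
```

For k=11 this leaves 42 trailing zero bits in every key, so a hash that mixes low bits poorly sees very little of the key's entropy.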
`CubicEps` with `PtrHashParams::<CubicEps>::default()` (λ=3.5): roughly 2× slower construction than `Linear/λ=3.0`, but ~20% less space.

---
## Column append and merge support
These methods extend existing layers with new genome columns without touching the MPHF.
### Layer-level genome column append
```rust
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
```
Both delegate to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They write a new column file (`col_NNNNNN.pbiv` / `col_NNNNNN.pciv`) and update `meta.json` to increment `n_cols`. `value_of` is called once per slot (0..n).
### Presence matrix initialisation
```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```
Called on the first merge of a Presence-mode index. Creates `presence/` with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
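An in-memory model of initialisation followed by a column append might look like the sketch below. `ToyPresence` is a hypothetical type that keeps columns in vectors instead of `.pbiv` files, so it only illustrates the slot and column bookkeeping.

```rust
/// Toy presence matrix: one Vec<bool> per genome column, indexed by slot.
struct ToyPresence {
    n: usize,
    cols: Vec<Vec<bool>>,
}

impl ToyPresence {
    /// Start with a single all-true column for genome 0.
    fn init(n_kmers: usize) -> Self {
        ToyPresence { n: n_kmers, cols: vec![vec![true; n_kmers]] }
    }

    /// Append one genome column; `value_of` is called once per slot.
    fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        let col: Vec<bool> = (0..self.n).map(|slot| value_of(slot)).collect();
        self.cols.push(col);
    }

    /// Presence of the kmer at `slot` across all genomes.
    fn read(&self, slot: usize) -> Vec<bool> {
        self.cols.iter().map(|c| c[slot]).collect()
    }
}
```

Appending never touches existing columns or the slot numbering, which mirrors the on-disk behaviour of writing one new column file.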
### Why the MPHF is never rebuilt
The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set; it only appends a new data column indexed by the same slot numbers. The only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update.

---
## Add-layer algorithm
When adding dataset B to an existing index:
1. For each partition, probe existing layers for kmers of B routed to that partition.
2. Collect kmers absent from all layers → `B \ index`.
3. Write `B \ index` to a new `unitigs.bin` via `next_layer_writer()`.
4. Call `Layer<D>::build` (or `build_presence`) on the new layer directory.
5. Call `push_layer` (or `append_layer`) to register the new layer in `meta.json`.

Each partition's new layer is built independently; the operation is fully parallel across partitions.
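For a single partition, the set logic of the steps above can be sketched as follows. Layers are modelled as `HashSet<u64>` and the helper names are hypothetical; writing unitigs, building the MPHF and updating `meta.json` are all collapsed into pushing a new set.

```rust
use std::collections::HashSet;

/// Kmers of `dataset` absent from every existing layer (B \ index),
/// sorted and deduplicated.
fn novel_kmers(layers: &[HashSet<u64>], dataset: &[u64]) -> Vec<u64> {
    let mut novel: Vec<u64> = dataset
        .iter()
        .copied()
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        .collect();
    novel.sort_unstable();
    novel.dedup();
    novel
}

/// Register B \ index as a new layer and return its size.
fn add_layer(layers: &mut Vec<HashSet<u64>>, dataset: &[u64]) -> usize {
    let novel = novel_kmers(layers.as_slice(), dataset);
    let n = novel.len();
    layers.push(novel.into_iter().collect());
    n
}
```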
---
## Dependencies
| crate | role |
|---|---|
| `ptr_hash 1.1` | MPHF per layer |
| `cacheline-ef 1.1` | compact remap inside ptr_hash |
| `epserde 0.8` | zero-copy MPHF serialisation |
| `memmap2 0.9` | mmap of evidence and fingerprint files |
| `bitvec` | packed b-bit fingerprint storage |
| `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `serde / serde_json` | `PartitionMeta` serialisation |