obilayeredmap — layered kmer index crate
Purpose
obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: index root → partition → layer. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
Three usage modes
The MPHF + evidence infrastructure is the same for all modes. The payload varies.
| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | () |
— |
| 2. Count | occurrences per kmer per sample | PersistentCompactIntMatrix |
counts/ directory |
| 3. Presence/absence | which genomes contain each kmer | PersistentBitMatrix |
presence/ directory |
Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.
MphfLayer — autonomous kmer → slot mapping
MphfLayer encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.
pub struct MphfLayer {
mphf: Mphf,
evidence: Evidence,
unitigs: UnitigFileReader,
n: usize, // number of indexed kmers = number of MPHF slots
}
Public API:
impl MphfLayer {
pub fn open(dir: &Path) -> OLMResult<Self>
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> // Some(slot) or None
pub fn n(&self) -> usize
pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
pub(crate) fn build(
dir: &Path,
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
) -> OLMResult<usize>
}
find returns Some(slot) only after verifying via evidence that the kmer is actually indexed. It returns None for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).
build runs two sequential passes over unitigs.bin:
- Pass 1: iterate all canonical kmers in parallel via rayon, construct and store
mphf.bin.new_from_par_iteravoids materialising a full keyVec. - Pass 2: iterate again sequentially, fill
evidence.bin, callfill_slot(slot, kmer)once per kmer for payload population. A compactn/8-byte seen-bitset verifies MPHF injectivity inline.
For empty layers (n = 0), build returns Ok(0) immediately after creating empty mphf.bin and evidence.bin.
Layer\<D: LayerData> — MPHF + payload
Layer<D> pairs an MphfLayer with one payload store.
pub trait LayerData: Sized {
type Item;
fn open(layer_dir: &Path) -> OLMResult<Self>;
fn read(&self, slot: usize) -> Self::Item;
}
pub struct Layer<D: LayerData = ()> {
mphf: MphfLayer,
data: D,
}
pub struct Hit<T = ()> {
pub slot: usize,
pub data: T,
}
LayerData covers the read path only (open + read). Build signatures differ between modes and are not in the trait.
| Type | Item |
Description |
|---|---|---|
() |
() |
mode 1 — membership only |
PersistentCompactIntMatrix |
Box<[u32]> |
mode 2 — count matrix (one u32 per column per slot) |
PersistentBitMatrix |
Box<[bool]> |
mode 3 — presence matrix (one bit per genome per slot) |
Build signatures:
// mode 1
impl Layer<()> {
pub fn build(out_dir: &Path) -> OLMResult<usize>
}
// mode 2
impl Layer<PersistentCompactIntMatrix> {
pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
// mode 3
impl Layer<PersistentBitMatrix> {
pub fn build_presence(
out_dir: &Path,
n_genomes: usize,
present_in: impl Fn(CanonicalKmer, usize) -> bool,
) -> OLMResult<usize>
}
All build impls delegate MPHF + evidence construction to MphfLayer::build via a mode-specific fill_slot callback. Mode 2 pre-reads n_kmers from unitigs.bin to size the PersistentCompactIntMatrixBuilder before calling MphfLayer::build. Mode 3 does the same for PersistentBitMatrixBuilder.
LayeredStore\<S> and aggregation traits
LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:
pub struct LayeredStore<S>(pub Vec<S>);
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // element-wise Σ partials
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wise Σ partials
Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.
Aggregation hierarchy:
PersistentCompactIntMatrix implements CountPartials
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
Leaf implementors (in obicompactvec):
| Type | Traits |
|---|---|
PersistentCompactIntMatrix |
ColumnWeights (via sum()) + CountPartials |
PersistentBitMatrix |
ColumnWeights (via count_ones()) + BitPartials |
PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.
See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.
On-disk structure
index_root/ ← LayeredMap (collection)
meta.json
part_00000/ ← Partition
layer_0/ ← Layer
mphf.bin — ptr_hash MPHF (epserde format)
unitigs.bin — packed 2-bit nucleotide sequences
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
counts/ [mode 2] PersistentCompactIntMatrix
meta.json {"n": N, "n_cols": 1}
col_000000.pciv
presence/ [mode 3] PersistentBitMatrix
meta.json {"n": N, "n_cols": G}
col_000000.pbiv
…
layer_1/
…
part_00001/
…
Partition (part_XXXXX/): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.
Layer (layer_N/): one MphfLayer plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.
Evidence encoding
evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
Decoding: chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id.
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.
ptr_hash configuration
type Mphf = PtrHash<
u64, // key type: canonical kmer raw encoding
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
Xx64, // hasher: XXH3-64 with seed
Vec<u8>, // pilots
>;
Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5) is a balanced tradeoff: 2× slower construction than Linear/λ=3.0, 20% less space.
Query path
pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
}
MphfLayer::find probes the MPHF, decodes evidence, and verifies the kmer — returning Some(slot) on match, None otherwise. data.read(slot) is called only on a confirmed hit.
In LayeredMap, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.
Add-layer algorithm
When adding dataset B to an existing index:
- For each partition, probe existing layers for kmers of B routed to that partition.
- Collect kmers absent from all layers →
B \ index. - Write
B \ indexto a newunitigs.binviaMphfLayer::unitig_writer. - Call
Layer<D>::buildon the new directory. - Update
meta.json.
Each partition's new layer is built independently; the operation is fully parallel across partitions.
Dependencies
| crate | role |
|---|---|
ptr_hash 1.1 |
MPHF per layer |
cacheline-ef 1.1 |
compact remap inside ptr_hash |
epserde 0.8 |
zero-copy MPHF serialisation |
memmap2 0.9 |
mmap of evidence and payload files |
obiskio |
unitig file writer/reader |
obicompactvec |
payload types + aggregation traits |
rayon 1 |
parallel MPHF construction pass |
ndarray 0.16 |
aggregation output arrays |