Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
14 KiB
obilayeredmap — layered kmer index crate
Purpose
obilayeredmap implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
Three usage modes
The MPHF + evidence infrastructure is the same for all modes. The payload varies.
| Mode | Description | Payload type | Storage |
|---|---|---|---|
| 1. Set | membership test only | () |
— |
| 2. Count | occurrences per kmer per sample | PersistentCompactIntMatrix |
counts/ directory |
| 3. Presence/absence | which genomes contain each kmer | PersistentBitMatrix |
presence/ directory |
Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.
Index mode (homogeneity invariant)
A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at LayeredMap::open() from PartitionMeta.mode and passed to each Layer::open() — no per-layer file is read.
#[derive(Serialize, Deserialize, Default)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum IndexMode {
#[default]
Exact,
Approx { b: u8, z: u8 },
Hybrid { b: u8, z: u8 },
}
IndexMode is stored once in PartitionMeta (meta.json at partition root). There is no layer_meta.json.
- Exact: writes
evidence.bin+unitigs.bin.idx. Zero false positives. - Approx: writes
fingerprint.binonly. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No.idxwritten or required. - Hybrid: writes both
fingerprint.binandevidence.bin+.idx.find()uses the fingerprint (fast, O(1));find_strict()uses exact evidence.
MphfLayer — autonomous kmer → slot mapping
MphfLayer encapsulates the MPHF and evidence store for one layer. It is independent of any payload.
pub struct MphfLayer {
mphf: Mphf,
ev: LayerEvidence, // loaded at open() time
n: usize,
}
LayerEvidence is an internal enum, not public:
enum LayerEvidence {
Exact { evidence: Evidence, unitigs: UnitigFileReader },
Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
}
MphfLayer::open(dir, mode: &IndexMode) receives the mode from PartitionMeta — no per-layer file is read.
Query API
Two public query methods, both returning Option<usize> (slot index):
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
find: O(1) auto-dispatch. Exact/Hybrid → exact evidence check. Approx/Hybrid → fingerprint comparison.find_strict: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no.idx).
There are no find_exact/find_approx methods; panicking dispatch is eliminated.
Build surface
// Full MPHF + evidence build (two-pass)
pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>
// Evidence-only post-hoc builds (MPHF already present)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
MphfLayer::build runs two passes over unitigs.bin:
- Pass 1 (parallel via rayon): a
CanonicalKmerIter(clonable,Arc<Mmap>, no file reopening) is passed tonew_from_par_iterviapar_bridge(). Producesmphf.bin. No.idxis read or created at this stage. - Pass 2 (sequential): fill evidence files; call
fill_slot(slot, kmer)per kmer..idxis written last for Exact/Hybrid modes (query-time only).
There is no build_evidence dispatch wrapper — callers invoke build_exact_evidence or build_approx_evidence directly.
For empty layers (n = 0), all build variants return Ok(0) immediately after creating empty output files.
Layer<D: LayerData> — MPHF + payload
Layer<D> pairs an MphfLayer with one payload store.
pub trait LayerData: Sized {
type Item;
fn open(layer_dir: &Path) -> OLMResult<Self>;
fn read(&self, slot: usize) -> Self::Item;
}
pub struct Layer<D: LayerData = ()> {
mphf: MphfLayer,
data: D,
}
pub struct Hit<T = ()> {
pub slot: usize,
pub data: T,
}
LayerData covers the read path only (open + read). Build signatures differ between modes and are not part of the trait.
| Type | Item |
Description |
|---|---|---|
() |
() |
mode 1 — membership only |
PersistentCompactIntMatrix |
Box<[u32]> |
mode 2 — count matrix (one u32 per column per slot) |
PersistentBitMatrix |
Box<[bool]> |
mode 3 — presence matrix (one bit per genome per slot) |
Build signatures
// mode 1
impl Layer<()> {
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
}
// mode 2
impl Layer<PersistentCompactIntMatrix> {
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
// mode 3
impl Layer<PersistentBitMatrix> {
pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
n_genomes: usize,
present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
}
All build impls delegate to MphfLayer::build via a mode-specific fill_slot callback. The mode parameter is forwarded directly — no LayerMeta is written.
Evidence-only post-hoc builds are accessible directly on Layer<D>:
impl<D: LayerData> Layer<D> {
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
}
There is no build_evidence dispatch wrapper.
FingerprintVec and FingerprintVecWriter
Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.
fingerprint.bin format:
magic: b"FPVF" (4 bytes)
b: u8 (bits per fingerprint, 1..=64)
padding: [0u8; 3]
n: u64 LE (number of slots)
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
impl FingerprintVec {
pub fn open(path: &Path) -> OLMResult<Self>
pub fn get(&self, slot: usize) -> u64
pub fn matches(&self, slot: usize, fingerprint: u64) -> bool
pub fn n(&self) -> usize
pub fn b(&self) -> u8
}
matches(slot, hash) extracts the b-bit fingerprint stored at slot and compares it to the low b bits of hash. It is the core operation of find_approx.
LayeredMap<D> — collection of layers
LayeredMap<D> wraps Vec<Layer<D>> for a single partition directory.
pub struct LayeredMap<D: LayerData = ()> {
root: PathBuf,
meta: PartitionMeta,
layers: Vec<Layer<D>>,
}
PartitionMeta (meta.json at the partition root) stores n_layers.
Common methods
pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D>
pub fn mode(&self) -> &IndexMode
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
open reads PartitionMeta once, extracts mode, and passes it to every Layer::open — no per-layer file is read. create stores the given mode in PartitionMeta.
query probes layers in order and returns (layer_index, Hit) on the first match. Expected probe depth: 1 for kmers in layer 0.
push_layer
push_layer builds the next layer from a unitigs.bin already written via next_layer_writer, using DEFAULT_BLOCK_BITS:
// mode 1
impl LayeredMap<()> {
pub fn push_layer(&mut self) -> OLMResult<usize>
}
// mode 2
impl LayeredMap<PersistentCompactIntMatrix> {
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}
Mode 3 (PersistentBitMatrix) has no push_layer on LayeredMap; callers build directly via Layer<PersistentBitMatrix>::build_presence.
LayeredStore<S> and aggregation traits
LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:
pub struct LayeredStore<S>(pub Vec<S>);
impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … } // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials for LayeredStore<S> { … } // element-wise Σ partials
impl<S: BitPartials> BitPartials for LayeredStore<S> { … } // element-wise Σ partials
Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.
Leaf implementors (in obicompactvec):
| Type | Traits |
|---|---|
PersistentCompactIntMatrix |
ColumnWeights (via sum()) + CountPartials |
PersistentBitMatrix |
ColumnWeights (via count_ones()) + BitPartials |
See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.
On-disk structure
partition_root/ ← LayeredMap (one partition)
meta.json — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
layer_0/ ← Layer
mphf.bin — ptr_hash MPHF (epserde format)
unitigs.bin — packed 2-bit nucleotide sequences
unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
evidence.bin — [u32; n], LE (Exact/Hybrid only)
fingerprint.bin — packed b-bit array (Approx/Hybrid only)
counts/ [mode 2] PersistentCompactIntMatrix
meta.json
col_000000.pciv
presence/ [mode 3] PersistentBitMatrix
meta.json
col_000000.pbiv …
layer_1/
…
There is no layer_meta.json. The mode is stored once in PartitionMeta and is valid for all layers. unitigs.bin.idx is built at the end of build_exact_evidence — never during MPHF construction — and is consumed at query time only.
Evidence encoding (exact)
evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:
bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id (requires unitigs.bin.idx for random access).
For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.
ptr_hash configuration
type Mphf = PtrHash<
u64, // key type: canonical kmer raw encoding
CubicEps, // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
CachelineEfVec<Vec<CachelineEf>>, // remap: Elias-Fano
Xx64, // hasher: XXH3-64 with seed
Vec<u8>, // pilots
>;
Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.
CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5): 2× slower construction than Linear/λ=3.0, ~20% less space.
Column append and merge support
These methods extend existing layers with new genome columns without touching the MPHF.
Layer-level genome column append
impl Layer<PersistentBitMatrix> {
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
impl Layer<PersistentCompactIntMatrix> {
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
Both delegate to the corresponding PersistentBitMatrix::append_column / PersistentCompactIntMatrix::append_column. They write a new column file (col_NNNNNN.pbiv / col_NNNNNN.pciv) and update meta.json to increment n_cols. value_of is called once per slot (0..n).
Presence matrix initialisation
impl Layer<()> {
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
Called on the first merge of a Presence-mode index. Creates presence/ with meta.json {"n": n_kmers, "n_cols": 1} and col_000000.pbiv set entirely to true. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.
Why the MPHF is never rebuilt
The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new .pciv/.pbiv file and a single meta.json update.
Add-layer algorithm
When adding dataset B to an existing index:
- For each partition, probe existing layers for kmers of B routed to that partition.
- Collect kmers absent from all layers →
B \ index. - Write
B \ indexto a newunitigs.binvianext_layer_writer(). - Call
Layer<D>::build(orbuild_presence) on the new layer directory. - Call
push_layer(orappend_layer) to register the new layer inmeta.json.
Each partition's new layer is built independently; the operation is fully parallel across partitions.
Dependencies
| crate | role |
|---|---|
ptr_hash 1.1 |
MPHF per layer |
cacheline-ef 1.1 |
compact remap inside ptr_hash |
epserde 0.8 |
zero-copy MPHF serialisation |
memmap2 0.9 |
mmap of evidence and fingerprint files |
bitvec |
packed b-bit fingerprint storage |
obiskio |
unitig file writer/reader + .idx build |
obicompactvec |
payload types + aggregation traits |
rayon 1 |
parallel MPHF construction pass |
serde / serde_json |
PartitionMeta serialisation |