Introduce the `obilayeredmap` specification and persistent MPHF-based index architecture for incremental multi-dataset indexing. Implement chunked binary serialization with a fixed `u8` k-mer count limit (256) and overlapping super-kmer segments. Add memory-mapped I/O and a companion `.idx` index file for allocation-free, O(1) unitig access. Update MkDocs navigation, enhance the k-mer comparison script, and add comprehensive tests for serialization, partitioning, and file I/O pipelines.
7.1 KiB
obilayeredmap — layered kmer index crate
Purpose
obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: collection → partition → layer. Each layer covers a disjoint kmer set (kmers absent from all earlier layers), wrapping a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.
Three-level hierarchy
index_root/ ← LayeredMap (collection)
meta.json
part_00000/ ← Partition
layer_0/ ← Layer
mphf.bin
unitigs.bin
evidence.bin
counts.bin
presence.bin
layer_1/
...
part_00001/
layer_0/
layer_1/
...
Collection (index_root/): global metadata — kmer size k, number of partitions, layer count, sample registry.
Partition (part_XXXXX/): one directory per hash bucket. All kmers whose canonical minimiser hashes to bucket X land in part_XXXXX. Partitions are independent and can be processed in parallel. The partition count and routing scheme (minimiser → bucket) are fixed at collection creation and recorded in meta.json.
Layer (layer_N/): within a partition, a layer is the MPHF and its associated data for one dataset addition. Layer 0 is built from the first dataset A; layer 1 covers kmers in B not present in layer 0; and so on. Layers within a partition are disjoint: each kmer belongs to exactly one layer.
Layer file layout
layer_N/
mphf.bin — ptr_hash MPHF (exact key count, phase-2 construction)
unitigs.bin — packed 2-bit nucleotide sequences (concatenated, variable-length)
unitig_offsets.bin — u32 per unitig: nucleotide offset of unitig j in unitigs.bin
evidence.bin — u32 per MPHF slot: (unitig_id: 25 | rank: 7)
counts.bin — u32 per MPHF slot (total kmer occurrences)
presence.bin — bit matrix: n_slots × n_samples [optional]
Unitigs have variable lengths. Each record in unitigs.bin is self-delimiting: it begins with a varint seql (sequence length in nucleotides) followed by (seql+3)/4 packed bytes — the streaming format defined in obiskio. Sequential scan is always possible using the in-record seql.
For O(1) random access, unitig_offsets.bin is a precomputed derived index: a u32 array of byte offsets into unitigs.bin, with n_unitigs + 1 entries (sentinel = total byte size). Built once at construction by a single sequential scan; reconstructible from unitigs.bin if lost. Access: unitigs.bin[offsets[j] .. offsets[j+1]].
All files except mphf.bin are flat arrays of fixed-size elements, serialised with rkyv for zero-copy mmap access. mphf.bin uses ptr_hash's native serialisation; rkyv integration is deferred (see open questions).
Evidence encoding
Evidence maps each MPHF slot to its kmer's location in the unitig file. It serves two roles: membership verification (ptr_hash maps any input to a valid slot; decoding evidence and comparing to the query detects absent keys) and kmer reconstruction.
slot s → unitig_id: u25 | rank: u7
Packed into a u32 (29 bits used, 3 spare). Decoding:
kmer = unitigs[unitig_id][rank .. rank + k] // 2-bit packed slice
rank is the kmer's 0-based index within the unitig (kmer units, not nucleotides). For k=31, m=11, the structural maximum is k − m + 1 = 21 kmers per unitig; the empirical maximum observed is ~46 kmers. A u7 (0–127) is sufficient.
Presence/absence matrix
Column-major bit matrix: column j (sample j) is a contiguous n_slots-bit array. This layout makes per-sample operations (union, intersection, diff over a column) cache-friendly. For large matrices (e.g. 10 M slots × 1 000 samples ≈ 1.25 GB per partition), rkyv + mmap avoids loading the full matrix at open time.
Query path
A kmer query routes through all three levels:
- Partition routing: hash canonical minimiser of the query kmer → partition index → open
part_XXXXX/. - Layer probing: iterate layers in order within the partition; for each layer compute
slot = mphf.query(kmer), then verifyevidence.decode(slot) == kmer. First match wins. - Data access: read
counts[slot]and/orpresence[slot]from the matching layer.
fn query(kmer) → Option<Hit>:
part = partition_of(kmer)
for (i, layer) in part.layers.iter().enumerate():
slot = layer.mphf.query(kmer)
if layer.evidence.decode(slot) == kmer:
return Some(Hit { layer: i, slot })
return None
Expected probe depth: 1 for kmers in layer 0, increasing for later layers. In practice the dominant dataset should be layer 0.
Add-layer algorithm
When adding dataset B to an existing index:
- For each partition, iterate kmers of B routed to that partition.
- Probe existing layers; collect kmers absent from all layers →
B \ index. - Build a new layer from
B \ indexusing the two-phase pipeline (FMPHGO provisional → ptr_hash definitive). - Append the new layer directory under each
part_XXXXX/. - Update
meta.json(layer count, sample registry).
Each partition's new layer is built independently; the operation is fully parallel across partitions.
Core API (sketch)
// Open an existing index
let map = LayeredMap::open(path)?;
// Query a canonical kmer across all partitions and layers
match map.query(kmer) {
Some(hit) => {
let count = hit.count();
let present = hit.presence_row(); // bit slice over samples
}
None => { /* absent */ }
}
// Non-destructive extension with a new dataset
// unitigs produced by the two-phase pipeline, one per partition
let layer_idx = map.add_layer(unitigs_dir, counts_dir, presence_path)?;
Dependencies
| crate | role |
|---|---|
ptr_hash |
phase-2 MPHF per layer |
ph (FMPHGO) |
phase-1 provisional MPHF during layer construction |
rkyv |
zero-copy serialisation of flat arrays (evidence, counts, presence) |
memmap2 |
mmap of layer files |
bitm |
bit-packed presence matrix |
Serialisation strategy
All flat arrays use rkyv::Archive:
#[derive(Archive, Serialize, Deserialize)]
struct Evidence { slots: Vec<u32> } // packed (unitig_id: 25 | rank: 7)
#[derive(Archive, Serialize, Deserialize)]
struct Counts { data: Vec<u32> }
At open time, each file is mmapped and cast to its archived type — no allocation, no copy. The MPHF is loaded via ptr_hash's own API; a rkyv wrapper is a future refinement.
Open questions
- ptr_hash + rkyv: ptr_hash's internals are flat bit arrays; a rkyv-compatible wrapper is structurally feasible. Assess upstream willingness or implement a thin newtype wrapper.
- Presence matrix layout: column-major favours per-sample operations; row-major favours per-kmer queries. Decide based on dominant access pattern.
- Layer merge: merging two
LayeredMapinstances into a single-layer index requires full rebuild. Define API and cost model; maintenance operation, not query-path. - Canonical kmer orientation: evidence stores canonical kmer; strand recovery requires one 64-bit revcomp comparison at query time.