Files
obikmer/docmd/implementation/obilayeredmap.md
T
Eric Coissac f36b095ce2 docs: clarify MPHF indexing, storage layout, and distance traits
Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.
2026-05-17 15:59:10 +08:00

9.4 KiB
Raw Blame History

obilayeredmap — layered kmer index crate

Purpose

obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: index root → partition → layer. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.


Three usage modes

The MPHF + evidence infrastructure is the same for all modes. The payload varies.

Mode Description Payload type Storage
1. Set membership test only ()
2. Count occurrences per kmer per sample PersistentCompactIntMatrix counts/ directory
3. Presence/absence which genomes contain each kmer PersistentBitMatrix presence/ directory

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.


MphfLayer — autonomous kmer → slot mapping

MphfLayer encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.

pub struct MphfLayer {
    mphf:     Mphf,
    evidence: Evidence,
    unitigs:  UnitigFileReader,
    n:        usize,   // number of indexed kmers = number of MPHF slots
}

Public API:

impl MphfLayer {
    pub fn open(dir: &Path) -> OLMResult<Self>
    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>   // Some(slot) or None
    pub fn n(&self) -> usize
    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
    pub(crate) fn build(
        dir: &Path,
        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
    ) -> OLMResult<usize>
}

find returns Some(slot) only after verifying via evidence that the kmer is actually indexed. It returns None for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).

build runs two sequential passes over unitigs.bin:

  1. Pass 1: iterate all canonical kmers in parallel via rayon, construct and store mphf.bin. new_from_par_iter avoids materialising a full key Vec.
  2. Pass 2: iterate again sequentially, fill evidence.bin, call fill_slot(slot, kmer) once per kmer for payload population. A compact n/8-byte seen-bitset verifies MPHF injectivity inline.

For empty layers (n = 0), build returns Ok(0) immediately after creating empty mphf.bin and evidence.bin.


Layer<D: LayerData> — MPHF + payload

Layer<D> pairs an MphfLayer with one payload store.

pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

LayerData covers the read path only (open + read). Build signatures differ between modes and are not in the trait.

Type Item Description
() () mode 1 — membership only
PersistentCompactIntMatrix Box<[u32]> mode 2 — count matrix (one u32 per column per slot)
PersistentBitMatrix Box<[bool]> mode 3 — presence matrix (one bit per genome per slot)

Build signatures:

// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(
        out_dir: &Path,
        n_genomes: usize,
        present_in: impl Fn(CanonicalKmer, usize) -> bool,
    ) -> OLMResult<usize>
}

All build impls delegate MPHF + evidence construction to MphfLayer::build via a mode-specific fill_slot callback. Mode 2 pre-reads n_kmers from unitigs.bin to size the PersistentCompactIntMatrixBuilder before calling MphfLayer::build. Mode 3 does the same for PersistentBitMatrixBuilder.


LayeredStore<S> and aggregation traits

LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:

pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> {  }  // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> {  }  // element-wise Σ partials
impl<S: BitPartials>   BitPartials    for LayeredStore<S> {  }  // element-wise Σ partials

Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.

Aggregation hierarchy:

PersistentCompactIntMatrix                  implements CountPartials
LayeredStore<PersistentCompactIntMatrix>         via blanket impl  (one partition)
LayeredStore<LayeredStore<…>>                    via blanket impl  (partitioned index)

Leaf implementors (in obicompactvec):

Type Traits
PersistentCompactIntMatrix ColumnWeights (via sum()) + CountPartials
PersistentBitMatrix ColumnWeights (via count_ones()) + BitPartials

PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.

See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.


On-disk structure

index_root/                        ← LayeredMap (collection)
  meta.json
  part_00000/                      ← Partition
    layer_0/                       ← Layer
      mphf.bin           — ptr_hash MPHF (epserde format)
      unitigs.bin        — packed 2-bit nucleotide sequences
      unitigs.bin.idx    — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
      evidence.bin       — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
      counts/            [mode 2] PersistentCompactIntMatrix
        meta.json          {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/          [mode 3] PersistentBitMatrix
        meta.json          {"n": N, "n_cols": G}
        col_000000.pbiv
        …
    layer_1/
      …
  part_00001/
    …

Partition (part_XXXXX/): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.

Layer (layer_N/): one MphfLayer plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.


Evidence encoding

evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:

bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)

Decoding: chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id.

For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.


ptr_hash configuration

type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;

Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.

CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5) is a balanced tradeoff: 2× slower construction than Linear/λ=3.0, 20% less space.


Query path

pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
    self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
}

MphfLayer::find probes the MPHF, decodes evidence, and verifies the kmer — returning Some(slot) on match, None otherwise. data.read(slot) is called only on a confirmed hit.

In LayeredMap, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.


Add-layer algorithm

When adding dataset B to an existing index:

  1. For each partition, probe existing layers for kmers of B routed to that partition.
  2. Collect kmers absent from all layers → B \ index.
  3. Write B \ index to a new unitigs.bin via MphfLayer::unitig_writer.
  4. Call Layer<D>::build on the new directory.
  5. Update meta.json.

Each partition's new layer is built independently; the operation is fully parallel across partitions.


Dependencies

crate role
ptr_hash 1.1 MPHF per layer
cacheline-ef 1.1 compact remap inside ptr_hash
epserde 0.8 zero-copy MPHF serialisation
memmap2 0.9 mmap of evidence and payload files
obiskio unitig file writer/reader
obicompactvec payload types + aggregation traits
rayon 1 parallel MPHF construction pass
ndarray 0.16 aggregation output arrays