Files

T

Eric Coissac f36b095ce2 docs: clarify MPHF indexing, storage layout, and distance traits

Formalize the two-phase MPHF indexing architecture and update Phase 6 to use `evidence.bin` for direct kmer extraction. Simplify the evidence and unitig storage layouts to flat packed formats enabling O(1) random access. Introduce aggregation traits (`ColumnWeights`, `CountPartials`, `BitPartials`) to support additive distance metric decomposition across partitions. Narrow the documented scope from metagenomic to individual genome datasets, and replace speculative open questions with concrete implementation specifications.

2026-05-17 15:59:10 +08:00

9.4 KiB

Raw Blame History

obilayeredmap — layered kmer index crate

Purpose

obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: index root → partition → layer. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

Three usage modes

The MPHF + evidence infrastructure is the same for all modes. The payload varies.

Mode	Description	Payload type	Storage
1. Set	membership test only	`()`	—
2. Count	occurrences per kmer per sample	`PersistentCompactIntMatrix`	`counts/` directory
3. Presence/absence	which genomes contain each kmer	`PersistentBitMatrix`	`presence/` directory

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.

MphfLayer — autonomous kmer → slot mapping

MphfLayer encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.

pub struct MphfLayer {
    mphf:     Mphf,
    evidence: Evidence,
    unitigs:  UnitigFileReader,
    n:        usize,   // number of indexed kmers = number of MPHF slots
}

Public API:

impl MphfLayer {
    pub fn open(dir: &Path) -> OLMResult<Self>
    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>   // Some(slot) or None
    pub fn n(&self) -> usize
    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
    pub(crate) fn build(
        dir: &Path,
        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
    ) -> OLMResult<usize>
}

find returns Some(slot) only after verifying via evidence that the kmer is actually indexed. It returns None for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).

build runs two sequential passes over unitigs.bin:

Pass 1: iterate all canonical kmers in parallel via rayon, construct and store mphf.bin. new_from_par_iter avoids materialising a full key Vec.
Pass 2: iterate again sequentially, fill evidence.bin, call fill_slot(slot, kmer) once per kmer for payload population. A compact n/8-byte seen-bitset verifies MPHF injectivity inline.

For empty layers (n = 0), build returns Ok(0) immediately after creating empty mphf.bin and evidence.bin.

Layer<D: LayerData> — MPHF + payload

Layer<D> pairs an MphfLayer with one payload store.

pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

LayerData covers the read path only (open + read). Build signatures differ between modes and are not in the trait.

Type	`Item`	Description
`()`	`()`	mode 1 — membership only
`PersistentCompactIntMatrix`	`Box<[u32]>`	mode 2 — count matrix (one u32 per column per slot)
`PersistentBitMatrix`	`Box<[bool]>`	mode 3 — presence matrix (one bit per genome per slot)

Build signatures:

// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(
        out_dir: &Path,
        n_genomes: usize,
        present_in: impl Fn(CanonicalKmer, usize) -> bool,
    ) -> OLMResult<usize>
}

All build impls delegate MPHF + evidence construction to MphfLayer::build via a mode-specific fill_slot callback. Mode 2 pre-reads n_kmers from unitigs.bin to size the PersistentCompactIntMatrixBuilder before calling MphfLayer::build. Mode 3 does the same for PersistentBitMatrixBuilder.

LayeredStore<S> and aggregation traits

LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:

pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }  // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }  // element-wise Σ partials
impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }  // element-wise Σ partials

Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.

Aggregation hierarchy:

PersistentCompactIntMatrix                  implements CountPartials
LayeredStore<PersistentCompactIntMatrix>         via blanket impl  (one partition)
LayeredStore<LayeredStore<…>>                    via blanket impl  (partitioned index)

Leaf implementors (in obicompactvec):

Type	Traits
`PersistentCompactIntMatrix`	`ColumnWeights` (via `sum()`) + `CountPartials`
`PersistentBitMatrix`	`ColumnWeights` (via `count_ones()`) + `BitPartials`

PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.

See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.

On-disk structure

index_root/                        ← LayeredMap (collection)
  meta.json
  part_00000/                      ← Partition
    layer_0/                       ← Layer
      mphf.bin           — ptr_hash MPHF (epserde format)
      unitigs.bin        — packed 2-bit nucleotide sequences
      unitigs.bin.idx    — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
      evidence.bin       — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
      counts/            [mode 2] PersistentCompactIntMatrix
        meta.json          {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/          [mode 3] PersistentBitMatrix
        meta.json          {"n": N, "n_cols": G}
        col_000000.pbiv
        …
    layer_1/
      …
  part_00001/
    …

Partition (part_XXXXX/): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.

Layer (layer_N/): one MphfLayer plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.

Evidence encoding

evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:

bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)

Decoding: chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id.

For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.

ptr_hash configuration

type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;

Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.

CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5) is a balanced tradeoff: 2× slower construction than Linear/λ=3.0, 20% less space.

Query path

pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
    self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
}

MphfLayer::find probes the MPHF, decodes evidence, and verifies the kmer — returning Some(slot) on match, None otherwise. data.read(slot) is called only on a confirmed hit.

In LayeredMap, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.

Add-layer algorithm

When adding dataset B to an existing index:

For each partition, probe existing layers for kmers of B routed to that partition.
Collect kmers absent from all layers → B \ index.
Write B \ index to a new unitigs.bin via MphfLayer::unitig_writer.
Call Layer<D>::build on the new directory.
Update meta.json.

Each partition's new layer is built independently; the operation is fully parallel across partitions.

Dependencies

crate	role
`ptr_hash 1.1`	MPHF per layer
`cacheline-ef 1.1`	compact remap inside ptr_hash
`epserde 0.8`	zero-copy MPHF serialisation
`memmap2 0.9`	mmap of evidence and payload files
`obiskio`	unitig file writer/reader
`obicompactvec`	payload types + aggregation traits
`rayon 1`	parallel MPHF construction pass
`ndarray 0.16`	aggregation output arrays

9.4 KiB Raw Blame History Unescape Escape