Files

T

Eric Coissac e1d59fde54 feat: add merge command to consolidate k-mer indexes

Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.

2026-05-21 13:44:50 +02:00

12 KiB

Raw Blame History

obilayeredmap — layered kmer index crate

Purpose

obilayeredmap implements a persistent, incrementally extensible kmer index. The index is organised in three levels: index root → partition → layer. Each layer covers a disjoint kmer set and wraps a ptr_hash MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.

Three usage modes

The MPHF + evidence infrastructure is the same for all modes. The payload varies.

Mode	Description	Payload type	Storage
1. Set	membership test only	`()`	—
2. Count	occurrences per kmer per sample	`PersistentCompactIntMatrix`	`counts/` directory
3. Presence/absence	which genomes contain each kmer	`PersistentBitMatrix`	`presence/` directory

Both PersistentCompactIntMatrix and PersistentBitMatrix come from the obicompactvec crate.

MphfLayer — autonomous kmer → slot mapping

MphfLayer encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.

pub struct MphfLayer {
    mphf:     Mphf,
    evidence: Evidence,
    unitigs:  UnitigFileReader,
    n:        usize,   // number of indexed kmers = number of MPHF slots
}

Public API:

impl MphfLayer {
    pub fn open(dir: &Path) -> OLMResult<Self>
    pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>   // Some(slot) or None
    pub fn n(&self) -> usize
    pub fn unitig_writer(dir: &Path) -> OLMResult<UnitigFileWriter>
    pub(crate) fn build(
        dir: &Path,
        fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
    ) -> OLMResult<usize>
}

find returns Some(slot) only after verifying via evidence that the kmer is actually indexed. It returns None for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).

build runs two sequential passes over unitigs.bin:

Pass 1: iterate all canonical kmers in parallel via rayon, construct and store mphf.bin. new_from_par_iter avoids materialising a full key Vec.
Pass 2: iterate again sequentially, fill evidence.bin, call fill_slot(slot, kmer) once per kmer for payload population. A compact n/8-byte seen-bitset verifies MPHF injectivity inline.

For empty layers (n = 0), build returns Ok(0) immediately after creating empty mphf.bin and evidence.bin.

Layer<D: LayerData> — MPHF + payload

Layer<D> pairs an MphfLayer with one payload store.

pub trait LayerData: Sized {
    type Item;
    fn open(layer_dir: &Path) -> OLMResult<Self>;
    fn read(&self, slot: usize) -> Self::Item;
}

pub struct Layer<D: LayerData = ()> {
    mphf: MphfLayer,
    data: D,
}

pub struct Hit<T = ()> {
    pub slot: usize,
    pub data: T,
}

LayerData covers the read path only (open + read). Build signatures differ between modes and are not in the trait.

Type	`Item`	Description
`()`	`()`	mode 1 — membership only
`PersistentCompactIntMatrix`	`Box<[u32]>`	mode 2 — count matrix (one u32 per column per slot)
`PersistentBitMatrix`	`Box<[bool]>`	mode 3 — presence matrix (one bit per genome per slot)

Build signatures:

// mode 1
impl Layer<()> {
    pub fn build(out_dir: &Path) -> OLMResult<usize>
}

// mode 2
impl Layer<PersistentCompactIntMatrix> {
    pub fn build(out_dir: &Path, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
    pub fn build_from_map(out_dir: &Path, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
}

// mode 3
impl Layer<PersistentBitMatrix> {
    pub fn build_presence(
        out_dir: &Path,
        n_genomes: usize,
        present_in: impl Fn(CanonicalKmer, usize) -> bool,
    ) -> OLMResult<usize>
}

All build impls delegate MPHF + evidence construction to MphfLayer::build via a mode-specific fill_slot callback. Mode 2 pre-reads n_kmers from unitigs.bin to size the PersistentCompactIntMatrixBuilder before calling MphfLayer::build. Mode 3 does the same for PersistentBitMatrixBuilder.

LayeredStore<S> and aggregation traits

LayeredStore<S> is a generic aggregation wrapper over Vec<S>. It propagates three traits from obicompactvec::traits up the hierarchy via blanket impls:

pub struct LayeredStore<S>(pub Vec<S>);

impl<S: ColumnWeights> ColumnWeights for LayeredStore<S> { … }  // Σ col_weights across inner stores
impl<S: CountPartials> CountPartials  for LayeredStore<S> { … }  // element-wise Σ partials
impl<S: BitPartials>   BitPartials    for LayeredStore<S> { … }  // element-wise Σ partials

Because blanket impls compose, LayeredStore<LayeredStore<S>> automatically inherits all three traits when S does — providing the partitioned level without a separate type.

Aggregation hierarchy:

PersistentCompactIntMatrix                  implements CountPartials
LayeredStore<PersistentCompactIntMatrix>         via blanket impl  (one partition)
LayeredStore<LayeredStore<…>>                    via blanket impl  (partitioned index)

Leaf implementors (in obicompactvec):

Type	Traits
`PersistentCompactIntMatrix`	`ColumnWeights` (via `sum()`) + `CountPartials`
`PersistentBitMatrix`	`ColumnWeights` (via `count_ones()`) + `BitPartials`

PersistentCompactIntVec and PersistentBitVec do not implement these traits — they are single-column primitives, not matrix-level aggregators.

See Kmer index architecture for the full trait API and the two-pass normalised-metric pattern.

On-disk structure

index_root/                        ← LayeredMap (collection)
  meta.json
  part_00000/                      ← Partition
    layer_0/                       ← Layer
      mphf.bin           — ptr_hash MPHF (epserde format)
      unitigs.bin        — packed 2-bit nucleotide sequences
      unitigs.bin.idx    — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
      evidence.bin       — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
      counts/            [mode 2] PersistentCompactIntMatrix
        meta.json          {"n": N, "n_cols": 1}
        col_000000.pciv
      presence/          [mode 3] PersistentBitMatrix
        meta.json          {"n": N, "n_cols": G}
        col_000000.pbiv
        …
    layer_1/
      …
  part_00001/
    …

Partition (part_XXXXX/): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.

Layer (layer_N/): one MphfLayer plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.

Evidence encoding

evidence.bin is a flat [u32; n] array with no header. Each u32 encodes one slot:

bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
bits [6:0]  = rank     (7 bits)  — kmer index within the chunk (0-based)

Decoding: chunk_id = raw >> 7, rank = raw & 0x7F. Reconstructing the kmer: read k nucleotides at position rank within unitig chunk_id.

For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.

ptr_hash configuration

type Mphf = PtrHash<
    u64,                              // key type: canonical kmer raw encoding
    CubicEps,                         // bucket fn: 2.4 bits/key, λ=3.5, α=0.99
    CachelineEfVec<Vec<CachelineEf>>, // remap: 11.6 bits/entry (Elias-Fano)
    Xx64,                             // hasher: XXH3-64 with seed
    Vec<u8>,                          // pilots
>;

Xx64 is chosen over FxHash because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.

CubicEps with PtrHashParams::<CubicEps>::default() (λ=3.5) is a balanced tradeoff: 2× slower construction than Linear/λ=3.0, 20% less space.

Query path

pub fn query(&self, kmer: CanonicalKmer) -> Option<Hit<D::Item>> {
    self.mphf.find(kmer).map(|slot| Hit { slot, data: self.data.read(slot) })
}

MphfLayer::find probes the MPHF, decodes evidence, and verifies the kmer — returning Some(slot) on match, None otherwise. data.read(slot) is called only on a confirmed hit.

In LayeredMap, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.

Add-layer algorithm

When adding dataset B to an existing index:

For each partition, probe existing layers for kmers of B routed to that partition.
Collect kmers absent from all layers → B \ index.
Write B \ index to a new unitigs.bin via MphfLayer::unitig_writer.
Call Layer<D>::build on the new directory.
Update meta.json.

Each partition's new layer is built independently; the operation is fully parallel across partitions.

Dependencies

crate	role
`ptr_hash 1.1`	MPHF per layer
`cacheline-ef 1.1`	compact remap inside ptr_hash
`epserde 0.8`	zero-copy MPHF serialisation
`memmap2 0.9`	mmap of evidence and payload files
`obiskio`	unitig file writer/reader
`obicompactvec`	payload types + aggregation traits
`rayon 1`	parallel MPHF construction pass
`ndarray 0.16`	aggregation output arrays

Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the merge command.

Matrix column append

impl PersistentCompactIntMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl PersistentBitMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

Both methods write a new column file (col_NNNNNN.pciv / col_NNNNNN.pbiv) and update meta.json to increment n_cols. The value_of closure is called once per slot (indexed 0..n) to populate the column. The matrix n (row count) is read from the existing meta.json and must not change.

Presence matrix initialisation

impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}

Called on the first merge of a Presence-mode index. Creates the presence/ subdirectory with meta.json {"n": n_kmers, "n_cols": 1} and col_000000.pbiv set entirely to true. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.

Layer-level genome column append

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> bool,
    ) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> u32,
    ) -> OLMResult<()>
}

These delegate directly to the corresponding PersistentBitMatrix::append_column / PersistentCompactIntMatrix::append_column. They are typed at the Layer level to make call sites mode-aware without exposing the inner matrix path construction.

Why the MPHF is never rebuilt

The MPHF (mphf.bin, evidence.bin, unitigs.bin) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new .pciv/.pbiv file and a single meta.json update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.

12 KiB Raw Blame History Unescape Escape