obikmer/docmd/implementation/merge.md
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00


Merge command

Purpose

obikmer merge combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.


Modes

pub enum MergeMode { Presence, Count }

Default mode is Presence. Count mode requires all source indexes to have with_counts=true; mixing count and non-count sources is rejected at validation.

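| Mode     | Column type                                              | Constraint                   |
|----------|----------------------------------------------------------|------------------------------|
| Presence | PersistentBitMatrix (one bit per genome per slot)        | none                         |
| Count    | PersistentCompactIntMatrix (one u32 per genome per slot) | all sources with_counts=true |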
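
The mode constraint can be sketched as follows. This is a minimal illustration, not the crate's code: `SourceInfo` and `check_mode` are hypothetical names, the derives on `MergeMode` are added for the example, and the `usize` error stands in for `OKIError::MismatchedMode`.

```rust
/// Merge mode, as declared above (derives added for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeMode {
    Presence,
    Count,
}

/// Hypothetical stand-in for the per-source metadata read at validation.
pub struct SourceInfo {
    pub with_counts: bool,
}

/// Returns Err(i) with the index of the first source built without counts
/// when Count mode is requested; Presence mode places no constraint.
pub fn check_mode(mode: MergeMode, sources: &[SourceInfo]) -> Result<(), usize> {
    if mode == MergeMode::Presence {
        return Ok(());
    }
    match sources.iter().position(|s| !s.with_counts) {
        Some(i) => Err(i),
        None => Ok(()),
    }
}
```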

Input / output constraints

All source indexes must satisfy:

  • IndexState::Indexed (fully built — index.done sentinel present)
  • Same kmer_size, minimizer_size, n_partitions
  • Same evidence kind: all Exact, or all Approx with identical (b, z) parameters
  • If Count mode: all sources must have with_counts=true
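
As an illustration of the configuration constraint, the sketch below compares every source against the first one. `CoreConfig` and `first_config_mismatch` are hypothetical names, and the field types are assumptions; the real check reports `OKIError::IncompatibleConfig`.

```rust
/// Hypothetical subset of the index configuration that must match across sources.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CoreConfig {
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_partitions: usize,
}

/// Returns the index of the first source whose configuration differs from
/// source 0, or None when all agree (or when the slice is empty).
pub fn first_config_mismatch(configs: &[CoreConfig]) -> Option<usize> {
    let first = configs.first()?;
    configs.iter().position(|c| c != first)
}
```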

--force: if the output directory already exists, it is deleted before the merge begins.


Evidence compatibility

validate_evidence_compat(sources) is called before any I/O. It compares each source's EvidenceKind against sources[0]:

  • All Exact → accepted, output uses Exact
  • All Approx { b, z } with same (b, z) → accepted, output uses those parameters
  • Any other combination → OKIError::IncompatibleEvidence, with a message directing the user to run reindex first

Mixed exact/approx is a hard error, not a silent conversion.

fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind>
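
A simplified version of this rule, operating on the evidence kinds alone rather than on `&KmerIndex`, might look like the sketch below. The `u8` field types, the `check_evidence` name, and the `String` error (standing in for `OKIError::IncompatibleEvidence`) are assumptions made for the example.

```rust
/// Evidence kind as described above; the field types are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Accepts the slice only when every kind equals the first one and returns
/// that shared kind. Any difference (Exact vs Approx, or different b/z) fails.
pub fn check_evidence(kinds: &[EvidenceKind]) -> Result<EvidenceKind, String> {
    let first = *kinds.first().ok_or_else(|| "no sources".to_string())?;
    for (i, k) in kinds.iter().enumerate().skip(1) {
        if *k != first {
            return Err(format!(
                "source {i} has {k:?} but source 0 has {first:?}; run reindex first"
            ));
        }
    }
    Ok(first)
}
```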

Genome label deduplication

compute_labels(sources, rename_duplicates) assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive .1, .2, … suffixes when rename_duplicates is true, or trigger OKIError::DuplicateGenomeLabel otherwise.
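
The suffixing rule can be sketched on a flat list of labels. `dedup_labels` is a hypothetical helper, not the signature of compute_labels, and the `String` error stands in for `OKIError::DuplicateGenomeLabel`. The sketch does not guard against a generated name such as `a.1` colliding with a label that already exists under that name; whether the real function does is not stated here.

```rust
use std::collections::HashMap;

/// Returns the final labels, or Err(label) on the first duplicate when
/// renaming is disabled. The first occurrence keeps its name; later ones
/// receive ".1", ".2", ... in order of appearance.
pub fn dedup_labels(labels: &[&str], rename_duplicates: bool) -> Result<Vec<String>, String> {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(labels.len());
    for &label in labels {
        // n = how many times this label has been seen before this occurrence
        let n = seen.entry(label).or_insert(0);
        if *n == 0 {
            out.push(label.to_string());
        } else if rename_duplicates {
            out.push(format!("{label}.{n}"));
        } else {
            return Err(label.to_string());
        }
        *n += 1;
    }
    Ok(out)
}
```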


Algorithm

1. Validation

Check all sources against the constraints above. Abort on any mismatch.

2. Bootstrap output from first source

Recursively copy sources[0] into the output directory. Immediately after the copy:

  • index.meta is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
  • In Presence mode, any counts/ directories inherited from sources[0] are removed.
  • spectrums/ from sources[0] is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.

This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … n_dst_genomes - 1 in the destination.

3. For each subsequent source (parallel across partitions)

KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits) is called for each partition index i. block_bits is taken from dst.meta.config.block_bits.

Each entry in sources is (&KmerPartition, n_genomes) where n_genomes is the column count that source contributes (> 1 when the source is itself a merged index).

First merge, Presence mode: when n_dst_genomes == 1, Layer::<()>::init_presence_matrix is called on every existing destination layer before any source column is appended. This creates presence/col_000000.pbiv with every bit set, because genome 0 is present in every slot of the layers built from it.

Pass 1 — classify kmers

Iterate all kmers from all source partitions (via UnitigFileReader + canonical kmer iteration). For each kmer, probe the destination LayeredMap<()>:

  • Hit: kmer already in the destination; record for Pass 2.
  • Miss: push kmer into a GraphDeBruijn accumulator.
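
The hit/miss split can be illustrated with standard collections. In this sketch a `HashSet<u64>` stands in for the destination LayeredMap<()> probe and a second `HashSet` stands in for the GraphDeBruijn accumulator; encoding a canonical kmer as `u64` and the name `classify` are assumptions for the example.

```rust
use std::collections::HashSet;

/// Splits incoming kmers into hits (already in the destination) and
/// misses (to be accumulated for the new layer).
pub fn classify(dst: &HashSet<u64>, incoming: &[u64]) -> (Vec<u64>, HashSet<u64>) {
    let mut hits = Vec::new();
    let mut misses = HashSet::new();
    for &kmer in incoming {
        if dst.contains(&kmer) {
            hits.push(kmer); // recorded for Pass 2
        } else {
            misses.insert(kmer); // a new kmer seen in several sources is kept once
        }
    }
    (hits, misses)
}
```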

New layer construction

If the accumulator is non-empty, compute de Bruijn unitigs and call Layer::<()>::build(&new_layer_dir, block_bits). All kmers absent from the destination — across all sources — accumulate into a single graph, producing one new layer per merge operation (not one per source).

Pass 2 — fill column builders

For each source and each of its layers, re-iterate unitigs and look up stored values via SrcLayerData::lookup(kmer, src_n):

  • SrcLayerData::SetMembership — no data directory exists; every kmer returns vec![1; n_genomes]
  • SrcLayerData::Presence — reads PersistentBitMatrix from presence/
  • SrcLayerData::Count — reads PersistentCompactIntMatrix from counts/

Hits are routed to exist_builders[dst_layer][src_col]; misses are routed to new_src_builders[src_col].
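
The three lookup variants can be modelled in memory as below. This is only a sketch: the real variants read persistent matrices from disk, the column-major `Vec<Vec<_>>` layout is an assumption, and this `lookup` takes a slot index where the real method takes a kmer.

```rust
/// In-memory stand-in for the three source layer variants.
/// Matrices are stored as [genome column][slot].
pub enum SrcLayerData {
    SetMembership,
    Presence(Vec<Vec<bool>>),
    Count(Vec<Vec<u32>>),
}

impl SrcLayerData {
    /// Returns one value per source genome for the given slot.
    pub fn lookup(&self, slot: usize, n_genomes: usize) -> Vec<u32> {
        match self {
            // No data directory: every kmer of the layer is present in every genome.
            SrcLayerData::SetMembership => vec![1; n_genomes],
            SrcLayerData::Presence(cols) => cols.iter().map(|c| u32::from(c[slot])).collect(),
            SrcLayerData::Count(cols) => cols.iter().map(|c| c[slot]).collect(),
        }
    }
}
```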

Column prepending for new layers

Before source columns are written to the new layer, n_dst_genomes absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation.
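
For Presence mode, the prepending step amounts to the following in-memory sketch. `new_layer_columns` is a hypothetical helper; the actual code writes column files through the persistent builders.

```rust
/// Builds the column set of a new layer: n_dst_genomes all-false columns
/// (genomes already in the index cannot contain the new kmers), followed by
/// the columns contributed by the sources.
pub fn new_layer_columns(
    n_slots: usize,
    n_dst_genomes: usize,
    src_cols: Vec<Vec<bool>>,
) -> Vec<Vec<bool>> {
    let mut cols = vec![vec![false; n_slots]; n_dst_genomes];
    cols.extend(src_cols);
    cols
}
```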

Close and update metadata

Close all builders; update presence/meta.json or counts/meta.json with {"n": N, "n_cols": n_dst_genomes + n_src_total}; increment PartitionMeta::n_layers if a new layer was added.

4. Update index metadata

index.meta was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.


append_genome_column

Defined on two concrete specialisations of Layer<D>:

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}

Each appends one column file to the matrix subdirectory (counts/ or presence/). In merge_partition, columns are written directly via PersistentBitVecBuilder / PersistentCompactIntVecBuilder rather than through these helpers, but the invariant they enforce is the same.
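
An in-memory analogue of the presence specialisation is sketched below. It assumes that the `usize` passed to value_of is the slot index, which the signatures above suggest but do not state; `append_column` and the explicit `n_slots` parameter are hypothetical.

```rust
/// Appends one column to an in-memory presence matrix by evaluating
/// value_of once for each slot index in 0..n_slots.
pub fn append_column(
    matrix: &mut Vec<Vec<bool>>,
    n_slots: usize,
    value_of: impl Fn(usize) -> bool,
) {
    matrix.push((0..n_slots).map(value_of).collect());
}
```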


Column count invariant

After any merge, every layer in every partition has exactly n_genomes columns, where n_genomes is the total genome count in the index at that point.

Maintained by three mechanisms:

  1. Existing layers: n_src_total columns appended (one per source genome).
  2. New layers created during merge: n_dst_genomes absent columns prepended before source columns.
  3. First merge, Presence mode: init_presence_matrix retroactively creates presence/col_000000.pbiv all-true for genome 0.

The invariant is a precondition of LayeredStore aggregation traits: col_weights() and all partial distance methods assume every inner store has the same column count.
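
A check of the invariant could look like the sketch below, given the n_cols value of each layer grouped by partition. `first_violation` and the nested-`Vec` input are hypothetical; in the index itself these values live in the per-layer meta.json files.

```rust
/// n_cols[p][l] is the column count of layer l in partition p.
/// Returns the (partition, layer) of the first layer whose column count
/// differs from n_genomes, or None when the invariant holds everywhere.
pub fn first_violation(n_cols: &[Vec<usize>], n_genomes: usize) -> Option<(usize, usize)> {
    n_cols.iter().enumerate().find_map(|(p, layers)| {
        layers
            .iter()
            .position(|&c| c != n_genomes)
            .map(|l| (p, l))
    })
}
```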


Error variants relevant to merge

| Variant                               | Condition                                            |
|---------------------------------------|------------------------------------------------------|
| OKIError::NotIndexed(path)            | Source not in Indexed state                          |
| OKIError::IncompatibleConfig          | Mismatched kmer_size, minimizer_size, or n_partitions |
| OKIError::MismatchedMode              | Count mode but a source has with_counts=false        |
| OKIError::IncompatibleEvidence(msg)   | Mixed exact/approx or different approx (b, z)        |
| OKIError::DuplicateGenomeLabel(label) | Duplicate label and rename_duplicates=false          |

On-disk impact

After merging G genomes in total (sources[0] contributes G0, subsequent sources the rest):

partitions/
  part_00000/
    index/
      meta.json              ← n_layers updated if new layer added
      layer_0/
        mphf.bin             ← unchanged
        unitigs.bin          ← unchanged
        evidence.bin         ← unchanged
        presence/            ← created on first merge (Presence mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pbiv      ← all-true (genome 0 … G0-1)
          col_000001.pbiv      ← next source
          ...
        counts/              ← extended (Count mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pciv      ← genome 0 counts (from original build)
          col_000001.pciv      ← next source
          ...
      layer_N/               ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json            {"n": N1, "n_cols": G}
          col_000000.pbiv      ← all-false (absent for existing genomes)
          ...
spectrums/
  <label>.json               ← one file per genome, rebuilt from all sources
index.meta                   ← complete genome list + evidence kind written at bootstrap
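
The column file names in the listing above follow a six-digit, zero-padded pattern. The helper below reproduces that pattern as inferred from the listing; `column_file_name` is a hypothetical name and not necessarily how the crate builds its paths.

```rust
/// Formats a column file name: "pbiv" for presence columns, "pciv" for
/// count columns, with the column index zero-padded to six digits.
pub fn column_file_name(col: usize, counts: bool) -> String {
    let ext = if counts { "pciv" } else { "pbiv" };
    format!("col_{col:06}.{ext}")
}
```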