feat: add merge command to consolidate k-mer indexes

Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
2026-05-21 05:53:55 +02:00
parent bfa436ad15
commit e1d59fde54
17 changed files with 799 additions and 8 deletions
@@ -0,0 +1,133 @@
+# Merge command
+
+## Purpose
+
+`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.
+
+---
+
+## Modes
+
+```rust
+pub enum MergeMode { Presence, Count }
+```
+
+Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.
+
+| Mode | Column type | Constraint |
+|---|---|---|
+| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
+| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |
+
+---
+
+## Input / output constraints
+
+All source indexes must satisfy:
+
+- `IndexState::Indexed` (fully built — `index.done` sentinel present)
+- Same `kmer_size`, `minimizer_size`, `n_bits`
+- If `Count` mode: all sources must have `with_counts=true`
+
+`--force`: if the output directory already exists, it is deleted before the merge begins.
+
+---
+
+## Algorithm
+
+### 1. Validation
+
+Check all sources against the constraints above. Abort on any mismatch.
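The validation step can be sketched as below. This is a minimal illustration, not the code from this commit: `SourceMeta`, `ValidationError`, and `validate` are hypothetical names, and the real implementation presumably reads these fields from each source's index metadata and uses its own error variants.

```rust
// Hypothetical stand-ins for illustration only; the real types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeMode {
    Presence,
    Count,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceMeta {
    pub indexed: bool, // true when the `index.done` sentinel is present
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_bits: usize,
    pub with_counts: bool,
}

// Each variant carries the position of the offending source.
#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    NoSources,
    NotIndexed(usize),
    ConfigMismatch(usize),
    MissingCounts(usize),
}

/// Checks every source against the first one and against the merge mode.
pub fn validate(sources: &[SourceMeta], mode: MergeMode) -> Result<(), ValidationError> {
    let first = sources.first().ok_or(ValidationError::NoSources)?;
    let reference = (first.kmer_size, first.minimizer_size, first.n_bits);
    for (i, s) in sources.iter().enumerate() {
        if !s.indexed {
            return Err(ValidationError::NotIndexed(i));
        }
        if (s.kmer_size, s.minimizer_size, s.n_bits) != reference {
            return Err(ValidationError::ConfigMismatch(i));
        }
        if mode == MergeMode::Count && !s.with_counts {
            return Err(ValidationError::MissingCounts(i));
        }
    }
    Ok(())
}
```

Returning on the first failing source keeps the "abort on any mismatch" behaviour explicit; collecting all failures before aborting would be an equally valid design.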
+
+### 2. Bootstrap output from first source
+
+Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
+
+### 3. For each subsequent source (parallel across partitions)
+
+Partitions are processed in parallel. Within each partition, sources are processed sequentially, one at a time:
+
+**a. Classify kmers**
+
+Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:
+
+- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
+- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
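The hit/miss classification above can be sketched with simplified stand-ins. Everything here is an assumption made for illustration: kmers are plain `u64` values, the destination `LayeredMap<()>` is modelled as a `HashMap` from kmer to `(layer, slot)`, and misses are kept in a map only, without the `GraphDeBruijn` accumulator.

```rust
use std::collections::HashMap;

// Hypothetical simplification: a canonical kmer encoded as an integer.
type Kmer = u64;

#[derive(Debug, Default, PartialEq)]
pub struct Classified {
    /// (dst_layer, slot, value) for kmers already in the destination.
    pub hits: Vec<(usize, usize, u32)>,
    /// Value recorded for each kmer absent from the destination.
    pub misses: HashMap<Kmer, u32>,
}

/// Splits the kmers of one source partition into hits and misses.
/// `source` pairs each kmer with its count in the source index.
pub fn classify(
    source: &[(Kmer, u32)],
    dest: &HashMap<Kmer, (usize, usize)>,
    count_mode: bool,
) -> Classified {
    let mut out = Classified::default();
    for &(kmer, count) in source {
        // Count mode keeps the source count; Presence mode records 1.
        let value = if count_mode { count } else { 1 };
        match dest.get(&kmer) {
            Some(&(layer, slot)) => out.hits.push((layer, slot, value)),
            None => {
                out.misses.insert(kmer, value);
            }
        }
    }
    out
}
```

The hits feed the column append for existing layers, while the misses are the input to the new layer.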
+
+**b. Extend existing layers**
+
+For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).
+
+If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).
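A minimal in-memory sketch of the column append follows. `LayerColumns` is a hypothetical stand-in for the persistent matrices, and the signature given to `append_genome_column` here is an assumption; only the method name comes from this document.

```rust
/// Hypothetical in-memory model of one layer's columns.
/// Each column holds one value per MPHF slot.
#[derive(Debug)]
pub struct LayerColumns {
    n_slots: usize,
    columns: Vec<Vec<u32>>,
}

impl LayerColumns {
    /// Mirrors the first-merge case in Presence mode: genome 0 is
    /// marked present (1) in every slot of the layer.
    pub fn with_genome_zero_present(n_slots: usize) -> Self {
        Self {
            n_slots,
            columns: vec![vec![1; n_slots]],
        }
    }

    /// Appends one genome column. Slots listed in `hits` receive their
    /// recorded value; all other slots stay 0 (absent).
    /// Panics if a slot index is out of range.
    pub fn append_genome_column(&mut self, hits: &[(usize, u32)]) {
        let mut col = vec![0u32; self.n_slots];
        for &(slot, value) in hits {
            col[slot] = value;
        }
        self.columns.push(col);
    }

    pub fn n_cols(&self) -> usize {
        self.columns.len()
    }

    pub fn column(&self, i: usize) -> &[u32] {
        &self.columns[i]
    }
}
```

Because the MPHF is unchanged, the slot count stays fixed and only the number of columns grows.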
+
+**c. Build new layer for new kmers**
+
+If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation.
+
+**d. Update partition metadata**
+
+Write updated `meta.json` with the incremented `n_layers` if a new layer was added.
+
+### 4. Update index metadata
+
+Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.
+
+---
+
+## Column count invariant
+
+After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.
+
+This is maintained by three mechanisms:
+
+1. **Existing layers**: one column appended per source genome (`append_genome_column`).
+2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
+3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0 before any source column is appended.
+
+The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
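The invariant can be expressed as a small check. The sketch below assumes a single source being merged in Presence mode and models a layer as a plain list of boolean columns. `ColumnSet`, `new_layer`, and `invariant_holds` are hypothetical names, not part of the codebase.

```rust
/// Hypothetical model: one `Vec<bool>` per genome, one bit per slot.
pub type ColumnSet = Vec<Vec<bool>>;

/// Builds the columns of a layer created during a merge: first
/// `n_existing_genomes` absent columns, then the source's own column.
/// Assumes every kmer in the new layer comes from that single source.
pub fn new_layer(n_slots: usize, n_existing_genomes: usize) -> ColumnSet {
    let mut layer: ColumnSet = vec![vec![false; n_slots]; n_existing_genomes];
    layer.push(vec![true; n_slots]);
    layer
}

/// True when every layer has exactly `n_genomes` columns.
pub fn invariant_holds(layers: &[ColumnSet], n_genomes: usize) -> bool {
    layers.iter().all(|layer| layer.len() == n_genomes)
}
```

A check of this kind could run after each merged source, for example in a debug assertion, so that a missing column is caught before any aggregation method reads the stores.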
+
+---
+
+## New layer construction
+
+All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.
+
+De Bruijn reconstruction ensures that kmers from different sources whose `k-1` suffixes and prefixes overlap are merged into compact unitigs before MPHF construction.
+
+---
+
+## On-disk impact
+
+After merging, the index holds `G` genomes (1 existing + `G-1` new sources):
+
+```
+partitions/
+  part_00000/
+    index/
+      meta.json              ← n_layers updated if new layer added
+      layer_0/
+        mphf.bin             ← unchanged (MPHF immutable)
+        unitigs.bin          ← unchanged
+        evidence.bin         ← unchanged
+        presence/            ← created on first merge (Presence mode)
+          meta.json            {"n": N, "n_cols": G}
+          col_000000.pbiv      ← all-true (genome 0)
+          col_000001.pbiv      ← source 1 presence
+          ...
+        counts/              ← extended (Count mode)
+          meta.json            {"n": N, "n_cols": G}
+          col_000000.pciv      ← genome 0 counts (from original build)
+          col_000001.pciv      ← source 1 counts
+          ...
+      layer_1/               ← new layer (if new kmers found)
+        mphf.bin
+        unitigs.bin
+        evidence.bin
+        presence/ or counts/
+          meta.json            {"n": N1, "n_cols": G}
+          col_000000.pbiv      ← all-false (genome 0 absent from this layer)
+          col_000001.pbiv      ← source 1 presence in this new layer
+          ...
+```
+
+The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.
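The column file names in the tree above follow a zero-padded pattern that can be reproduced as below. `column_path` is a hypothetical helper written for illustration; the directory names and the `.pbiv`/`.pciv` extensions are taken from the layout shown, and the six-digit padding is inferred from `col_000000`.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeMode {
    Presence,
    Count,
}

/// Path of column `col` inside a layer directory, for the given mode.
pub fn column_path(layer_dir: &Path, mode: MergeMode, col: usize) -> PathBuf {
    let (dir, ext) = match mode {
        MergeMode::Presence => ("presence", "pbiv"),
        MergeMode::Count => ("counts", "pciv"),
    };
    // `{:06}` pads the column index with zeros to six digits.
    layer_dir.join(dir).join(format!("col_{:06}.{}", col, ext))
}
```

With this helper, column 1 of `layer_0` in Presence mode resolves to `layer_0/presence/col_000001.pbiv` on Unix-like systems.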