feat: add merge command to consolidate k-mer indexes

Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
2026-05-21 05:53:55 +02:00
parent bfa436ad15
commit e1d59fde54
17 changed files with 799 additions and 8 deletions
@@ -0,0 +1,133 @@
+# Merge command
+
+## Purpose
+
+`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.
+
+---
+
+## Modes
+
+```rust
+pub enum MergeMode { Presence, Count }
+```
+
+Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.
+
+| Mode | Column type | Constraint |
+|---|---|---|
+| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
+| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |
+
+---
+
+## Input / output constraints
+
+All source indexes must satisfy:
+
+- `IndexState::Indexed` (fully built — `index.done` sentinel present)
+- Same `kmer_size`, `minimizer_size`, `n_bits`
+- If `Count` mode: all sources must have `with_counts=true`
+
+`--force`: if the output directory already exists, it is deleted before the merge begins.
+
+---
+
+## Algorithm
+
+### 1. Validation
+
+Check all sources against the constraints above. Abort on any mismatch.
+
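The validation step can be sketched as below. This is a minimal illustration, not the actual implementation: `SourceConfig`, its fields, and `validate_sources` are hypothetical names, errors are plain strings rather than the new error variants, and `MergeMode` is redeclared only to keep the block self-contained.

```rust
/// Hypothetical summary of the settings a source index must agree on.
/// Field names mirror the constraints listed above; the struct is illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceConfig {
    pub indexed: bool, // stands in for `IndexState::Indexed`
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_bits: usize,
    pub with_counts: bool,
}

/// Redeclared here so the sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeMode {
    Presence,
    Count,
}

/// Returns Ok(()) when every source is compatible with the first one,
/// otherwise a message naming the first offending source.
pub fn validate_sources(sources: &[SourceConfig], mode: MergeMode) -> Result<(), String> {
    let first = sources
        .first()
        .ok_or_else(|| "no source index given".to_string())?;
    for (i, src) in sources.iter().enumerate() {
        if !src.indexed {
            return Err(format!("source {} is not fully indexed", i));
        }
        if src.kmer_size != first.kmer_size
            || src.minimizer_size != first.minimizer_size
            || src.n_bits != first.n_bits
        {
            return Err(format!("source {} has mismatching parameters", i));
        }
        if mode == MergeMode::Count && !src.with_counts {
            return Err(format!("source {} has no counts (Count mode)", i));
        }
    }
    Ok(())
}
```

Checking everything before the bootstrap copy means a rejected merge leaves the output directory untouched.
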
+### 2. Bootstrap output from first source
+
+Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
+
+### 3. For each subsequent source (parallel across partitions)
+
+Partitions are processed in parallel. Within each partition, sources are processed one at a time, in order:
+
+**a. Classify kmers**
+
+Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:
+
+- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
+- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
+
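A minimal sketch of the hit/miss classification, under simplifying assumptions: a `u64` stands in for `CanonicalKmer`, a plain `HashMap` stands in for the destination `LayeredMap<()>` probe, and misses keep a single value per kmer instead of a `Vec<u32>`. The names `Hit` and `classify` are hypothetical.

```rust
use std::collections::HashMap;

/// Stand-in for a canonical kmer; the real `CanonicalKmer` type is not shown here.
type Kmer = u64;

/// A hit recorded against an existing destination slot.
#[derive(Debug, PartialEq, Eq)]
pub struct Hit {
    pub layer: usize,
    pub slot: usize,
    pub value: u32,
}

/// Splits source kmers into hits and misses.
/// `dest` maps a kmer to its (layer, slot) in the destination partition.
/// `count_mode` selects the recorded value: the source count, or 1 for presence.
pub fn classify(
    source: &[(Kmer, u32)],
    dest: &HashMap<Kmer, (usize, usize)>,
    count_mode: bool,
) -> (Vec<Hit>, HashMap<Kmer, u32>) {
    let mut hits = Vec::new();
    let mut misses = HashMap::new();
    for &(kmer, count) in source {
        let value = if count_mode { count } else { 1 };
        match dest.get(&kmer) {
            // Known kmer: remember which slot of which layer to populate.
            Some(&(layer, slot)) => hits.push(Hit { layer, slot, value }),
            // Unknown kmer: keep it for the new layer built later.
            None => {
                misses.insert(kmer, value);
            }
        }
    }
    (hits, misses)
}
```

In the real pipeline the misses would also be fed to the `GraphDeBruijn` accumulator; that part is omitted here.
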
+**b. Extend existing layers**
+
+For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).
+
+If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).
+
+**c. Build new layer for new kmers**
+
+If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation.
+
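The padding of a freshly built layer can be illustrated with an in-memory model. This sketch assumes Presence mode and represents each column as a `Vec<bool>`; `new_layer_columns` is a hypothetical helper, not part of the crate.

```rust
/// Builds the presence columns of a freshly created layer.
/// In-memory model: one Vec<bool> per genome, one entry per slot.
pub fn new_layer_columns(
    n_slots: usize,
    n_existing_genomes: usize,
    source_present: impl Fn(usize) -> bool,
) -> Vec<Vec<bool>> {
    // One all-absent column per genome already in the index.
    let mut cols = vec![vec![false; n_slots]; n_existing_genomes];
    // Then the column of the source that contributed the new kmers.
    cols.push((0..n_slots).map(source_present).collect());
    cols
}
```

With two genomes already in the index, a new layer of three slots ends up with three columns: two all-false and one filled by the source.
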
+**d. Update partition metadata**
+
+Write updated `meta.json` with the incremented `n_layers` if a new layer was added.
+
+### 4. Update index metadata
+
+Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.
+
+---
+
+## Column count invariant
+
+After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.
+
+This is maintained by three mechanisms:
+
+1. **Existing layers**: one column appended per source genome (`append_genome_column`).
+2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
+3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0 before any source column is appended.
+
+The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
+
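A check of the invariant could look like the following sketch. It assumes the per-layer `n_cols` values have already been read from each layer's `meta.json` into a nested vector; `find_column_mismatch` is a hypothetical helper name.

```rust
/// Checks the column count invariant for one index.
/// `col_counts[p][l]` is the `n_cols` of layer `l` in partition `p`.
/// Returns the first (partition, layer) that violates the invariant, if any.
pub fn find_column_mismatch(
    col_counts: &[Vec<usize>],
    n_genomes: usize,
) -> Option<(usize, usize)> {
    for (p, layers) in col_counts.iter().enumerate() {
        for (l, &n_cols) in layers.iter().enumerate() {
            if n_cols != n_genomes {
                return Some((p, l));
            }
        }
    }
    None
}
```

Such a check is cheap enough to run after every merge, for instance as a debug assertion or in an integration test.
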
+---
+
+## New layer construction
+
+All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.
+
+De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction.
+
+---
+
+## On-disk impact
+
+After a merge that yields `G` genomes in total (1 existing + `G-1` new sources):
+
+```
+partitions/
+  part_00000/
+    index/
+      meta.json              ← n_layers updated if new layer added
+      layer_0/
+        mphf.bin             ← unchanged (MPHF immutable)
+        unitigs.bin          ← unchanged
+        evidence.bin         ← unchanged
+        presence/            ← created on first merge (Presence mode)
+          meta.json            {"n": N, "n_cols": G}
+          col_000000.pbiv      ← all-true (genome 0)
+          col_000001.pbiv      ← source 1 presence
+          ...
+        counts/              ← extended (Count mode)
+          meta.json            {"n": N, "n_cols": G}
+          col_000000.pciv      ← genome 0 counts (from original build)
+          col_000001.pciv      ← source 1 counts
+          ...
+      layer_1/               ← new layer (if new kmers found)
+        mphf.bin
+        unitigs.bin
+        evidence.bin
+        presence/ or counts/
+          meta.json            {"n": N1, "n_cols": G}
+          col_000000.pbiv      ← all-false (genome 0 absent from this layer)
+          col_000001.pbiv      ← source 1 presence in this new layer
+          ...
+```
+
+The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.
@@ -255,3 +255,57 @@ Each partition's new layer is built independently; the operation is fully parall
 | `obicompactvec` | payload types + aggregation traits |
 | `rayon 1` | parallel MPHF construction pass |
 | `ndarray 0.16` | aggregation output arrays |
+
+---
+
+## Column append and merge support
+
+These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.
+
+### Matrix column append
+
+```rust
+impl PersistentCompactIntMatrix {
+    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
+}
+
+impl PersistentBitMatrix {
+    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
+}
+```
+
+Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change.
+
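The contract of `append_column` (fixed row count, one closure call per slot, `n_cols` incremented by one) can be modelled in memory. This is only a sketch: `BitMatrixModel` is a hypothetical type that keeps columns in a `Vec` instead of writing `.pbiv` files and `meta.json`, and it returns the file name the column would receive under the `col_NNNNNN` scheme.

```rust
/// In-memory model of a column-oriented bit matrix with a fixed row count.
pub struct BitMatrixModel {
    n: usize,             // row count, fixed at creation
    cols: Vec<Vec<bool>>, // one Vec per column
}

impl BitMatrixModel {
    pub fn new(n: usize) -> Self {
        Self { n, cols: Vec::new() }
    }

    /// Appends one column, calling `value_of` once per slot in 0..n.
    /// Returns the file name the new column would get on disk.
    pub fn append_column(&mut self, value_of: impl Fn(usize) -> bool) -> String {
        let idx = self.cols.len();
        self.cols.push((0..self.n).map(value_of).collect());
        format!("col_{:06}.pbiv", idx)
    }

    /// Number of columns appended so far (the model's `n_cols`).
    pub fn n_cols(&self) -> usize {
        self.cols.len()
    }

    /// Value stored for `slot` in column `col`.
    pub fn get(&self, slot: usize, col: usize) -> bool {
        self.cols[col][slot]
    }
}
```

Appending an all-true column first and a source column second reproduces, in miniature, the first Presence-mode merge: column 0 for the original genome, column 1 for the merged source.
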
+### Presence matrix initialisation
+
+```rust
+impl Layer<()> {
+    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
+}
+```
+
+Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.
+
+### Layer-level genome column append
+
+```rust
+impl Layer<PersistentBitMatrix> {
+    pub fn append_genome_column(
+        layer_dir: &Path,
+        value_of: impl Fn(usize) -> bool,
+    ) -> OLMResult<()>
+}
+
+impl Layer<PersistentCompactIntMatrix> {
+    pub fn append_genome_column(
+        layer_dir: &Path,
+        value_of: impl Fn(usize) -> u32,
+    ) -> OLMResult<()>
+}
+```
+
+These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.
+
+### Why the MPHF is never rebuilt
+
+The MPHF (`mphf.bin`), together with its companion files `evidence.bin` and `unitigs.bin`, is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set; it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column, so no structural change to the layer is required.