docmd/implementation/merge.md

# Merge command

## Purpose

`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.

---

## Modes

```rust
pub enum MergeMode { Presence, Count }
```

Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.

| Mode | Column type | Constraint |
|---|---|---|
| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |

---

## Input / output constraints

All source indexes must satisfy:

- `IndexState::Indexed` (fully built — `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_bits`
- If `Count` mode: all sources must have `with_counts=true`

`--force`: if the output directory already exists, it is deleted before the merge begins.
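
The `--force` handling can be sketched as follows. This is a minimal illustration, not the actual implementation: `prepare_output` is a hypothetical helper, and rejecting an existing directory when `--force` is absent is an assumption that the text above does not state.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: make sure the output path is free before merging.
/// With `force`, an existing directory is deleted. Without it, an existing
/// directory is reported as an error (assumed behaviour, not documented).
fn prepare_output(output: &Path, force: bool) -> io::Result<()> {
    if output.exists() {
        if !force {
            return Err(io::Error::new(
                io::ErrorKind::AlreadyExists,
                format!("output {} already exists (use --force)", output.display()),
            ));
        }
        // Delete the whole tree so the merge starts from a clean directory.
        fs::remove_dir_all(output)?;
    }
    Ok(())
}
```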

---

## Algorithm

### 1. Validation

Check all sources against the constraints above. Abort on any mismatch.
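
A minimal sketch of this check, assuming the relevant metadata has been read into a plain struct. `SourceMeta` and `validate_sources` are hypothetical names used for illustration only; the `indexed` flag stands in for `IndexState::Indexed`.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum MergeMode { Presence, Count }

/// Hypothetical summary of the fields that validation compares.
#[derive(Clone, Debug)]
pub struct SourceMeta {
    pub indexed: bool, // stands in for IndexState::Indexed
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_bits: usize,
    pub with_counts: bool,
}

/// Returns the first constraint violation found, if any.
pub fn validate_sources(sources: &[SourceMeta], mode: MergeMode) -> Result<(), String> {
    let first = sources.first().ok_or_else(|| "no source index given".to_string())?;
    for (i, s) in sources.iter().enumerate() {
        if !s.indexed {
            return Err(format!("source {i} is not fully indexed"));
        }
        // All sources must agree with the first one on these parameters.
        if (s.kmer_size, s.minimizer_size, s.n_bits)
            != (first.kmer_size, first.minimizer_size, first.n_bits)
        {
            return Err(format!("source {i} differs in kmer_size/minimizer_size/n_bits"));
        }
        if mode == MergeMode::Count && !s.with_counts {
            return Err(format!("source {i} was built without counts"));
        }
    }
    Ok(())
}
```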

### 2. Bootstrap output from first source

Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
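
The bootstrap copy could look like the sketch below, which uses only the standard library. `copy_dir_recursive` is a hypothetical name, and the sketch does not handle symlinks or preserve permissions.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Copy every file and subdirectory of `src` into `dst`, creating `dst` if needed.
fn copy_dir_recursive(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            // Descend into partitions/, part_*/, layer_*/ and so on.
            copy_dir_recursive(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}
```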

### 3. For each subsequent source (parallel across partitions)

For each partition, process one source at a time sequentially:

**a. Classify kmers**

Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
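
The hit/miss split can be illustrated with in-memory stand-ins. In this sketch a `HashMap` plays the role of the destination `LayeredMap<()>`, a `u64` stands in for `CanonicalKmer`, and each miss keeps a single value rather than a `Vec<u32>` because only one source genome is considered. `Classified` and `classify` are hypothetical names.

```rust
use std::collections::HashMap;

type Kmer = u64; // stand-in for CanonicalKmer

#[derive(Default, Debug)]
struct Classified {
    /// (dst_layer, slot, value) for kmers already in the destination.
    hits: Vec<(usize, usize, u32)>,
    /// Value of each kmer that is absent from the destination.
    misses: HashMap<Kmer, u32>,
}

/// `source` yields (kmer, count) pairs; `dest` maps a kmer to (layer, slot).
fn classify(
    source: &[(Kmer, u32)],
    dest: &HashMap<Kmer, (usize, usize)>,
    count_mode: bool,
) -> Classified {
    let mut out = Classified::default();
    for &(kmer, count) in source {
        // Presence mode records 1; Count mode records the source count.
        let value = if count_mode { count } else { 1 };
        match dest.get(&kmer) {
            Some(&(layer, slot)) => out.hits.push((layer, slot, value)),
            None => {
                out.misses.insert(kmer, value);
            }
        }
    }
    out
}
```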

**b. Extend existing layers**

For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).

If this is the **first merge** and the mode is Presence, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set to all-true (genome 0 is present in every slot of this layer).
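
The column logic of step b can be sketched with plain vectors. `LayerColumns` is a hypothetical in-memory stand-in for the persistent matrices; its two methods only mirror the roles of `init_presence_matrix` and `append_genome_column` and are not their real signatures.

```rust
/// In-memory stand-in for the per-genome columns of one layer.
struct LayerColumns {
    n_slots: usize,
    cols: Vec<Vec<u32>>,
}

impl LayerColumns {
    /// First merge, Presence mode: genome 0 occupies every slot of the layer.
    fn init_presence(n_slots: usize) -> Self {
        LayerColumns { n_slots, cols: vec![vec![1; n_slots]] }
    }

    /// Append one genome column: hit slots take their recorded value,
    /// every other slot stays 0 (absent).
    fn append_column(&mut self, hits: &[(usize, u32)]) {
        let mut col = vec![0u32; self.n_slots];
        for &(slot, value) in hits {
            col[slot] = value;
        }
        self.cols.push(col);
    }
}
```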

**c. Build new layer for new kmers**

If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation.
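
The padding step can be sketched as below, again with plain vectors in place of the persistent column files. `new_layer_columns` is a hypothetical helper; `source_values` holds the source's value for each slot of the new layer.

```rust
/// Build the column set of a freshly created layer: one all-zero column per
/// genome already in the index, followed by the source's own column.
fn new_layer_columns(
    n_slots: usize,
    n_existing_genomes: usize,
    source_values: &[u32],
) -> Vec<Vec<u32>> {
    assert_eq!(source_values.len(), n_slots, "one value per slot expected");
    // Existing genomes are absent from every kmer of the new layer.
    let mut cols = vec![vec![0u32; n_slots]; n_existing_genomes];
    cols.push(source_values.to_vec());
    cols
}
```

With two genomes already in the index, the new layer starts with three columns, which matches the genome count once the source has been merged.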

**d. Update partition metadata**

Write updated `meta.json` with the incremented `n_layers` if a new layer was added.

### 4. Update index metadata

Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.

---

## Column count invariant

After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.

This is maintained by three mechanisms:

1. **Existing layers**: one column appended per source genome (`append_genome_column`).
2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0 before any source column is appended.

The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
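
A check of the invariant might look like the following sketch. It assumes the `n_cols` values have already been collected per partition and per layer; `check_column_invariant` is a hypothetical helper, not part of the documented API.

```rust
/// `layer_cols[p][l]` is the column count of layer `l` in partition `p`
/// (a hypothetical flattened view of the per-layer metadata).
fn check_column_invariant(layer_cols: &[Vec<usize>], n_genomes: usize) -> Result<(), String> {
    for (p, layers) in layer_cols.iter().enumerate() {
        for (l, &n_cols) in layers.iter().enumerate() {
            if n_cols != n_genomes {
                return Err(format!(
                    "partition {p}, layer {l}: {n_cols} columns, expected {n_genomes}"
                ));
            }
        }
    }
    Ok(())
}
```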

---

## New layer construction

All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.

De Bruijn reconstruction ensures that kmers from different sources whose (k-1)-base suffixes and prefixes overlap are merged into a minimal set of unitigs before MPHF construction.

---

## On-disk impact

After merging `G` genomes (1 existing + G-1 new sources):

```
partitions/
  part_00000/
    index/
      meta.json              ← n_layers updated if new layer added
      layer_0/
        mphf.bin             ← unchanged (MPHF immutable)
        unitigs.bin          ← unchanged
        evidence.bin         ← unchanged
        presence/            ← created on first merge (Presence mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pbiv      ← all-true (genome 0)
          col_000001.pbiv      ← source 1 presence
          ...
        counts/              ← extended (Count mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pciv      ← genome 0 counts (from original build)
          col_000001.pciv      ← source 1 counts
          ...
      layer_1/               ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json            {"n": N1, "n_cols": G}
          col_000000.pbiv      ← all-false (genome 0 absent from this layer)
          col_000001.pbiv      ← source 1 presence in this new layer
          ...
```

The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.
feat: add merge command to consolidate k-mer indexes 2026-05-21 05:53:55 +02:00