# Merge command

## Purpose

`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains every kmer from every source, with presence/absence or count data stored for each genome in each layer.

---

## Modes

```rust
pub enum MergeMode { Presence, Count }
```

Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.

| Mode | Column type | Constraint |
|---|---|---|
| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |

---

## Input / output constraints

All source indexes must satisfy:

- `IndexState::Indexed` (fully built — `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_partitions`
- Same evidence kind: all `Exact`, or all `Approx` with identical `(b, z)` parameters
- If `Count` mode: all sources must have `with_counts=true`
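
The checks listed above can be sketched as a single pass that compares every source against the first one. This is a minimal illustration, not the actual validation code: `SourceSummary`, `CheckError`, and `check_sources` are hypothetical names, the `NoSources` case is an assumption, and evidence compatibility is left out.

```rust
/// Hypothetical, simplified view of the fields the validation step reads.
#[derive(Debug, Clone, PartialEq)]
pub struct SourceSummary {
    pub indexed: bool,
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_partitions: usize,
    pub with_counts: bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MergeMode {
    Presence,
    Count,
}

/// Illustrative error type; each variant carries the offending source position.
#[derive(Debug, PartialEq)]
pub enum CheckError {
    NoSources, // assumption: an empty source list is rejected
    NotIndexed(usize),
    IncompatibleConfig(usize),
    MismatchedMode(usize),
}

pub fn check_sources(sources: &[SourceSummary], mode: MergeMode) -> Result<(), CheckError> {
    let first = sources.first().ok_or(CheckError::NoSources)?;
    for (i, s) in sources.iter().enumerate() {
        // Every source must be fully built.
        if !s.indexed {
            return Err(CheckError::NotIndexed(i));
        }
        // Structural parameters must match the first source exactly.
        if (s.kmer_size, s.minimizer_size, s.n_partitions)
            != (first.kmer_size, first.minimizer_size, first.n_partitions)
        {
            return Err(CheckError::IncompatibleConfig(i));
        }
        // Count mode needs count data in every source.
        if mode == MergeMode::Count && !s.with_counts {
            return Err(CheckError::MismatchedMode(i));
        }
    }
    Ok(())
}
```

With this sketch, a source without counts passes in `Presence` mode and fails in `Count` mode with `MismatchedMode`.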

`--force`: if the output directory already exists, it is deleted before the merge begins.

---

## Evidence compatibility

`validate_evidence_compat(sources)` is called before any I/O. It compares each source's `EvidenceKind` against `sources[0]`:

- All `Exact` → accepted, output uses `Exact`
- All `Approx { b, z }` with same `(b, z)` → accepted, output uses those parameters
- Any other combination → `OKIError::IncompatibleEvidence`, with a message directing the user to run `reindex` first

Mixed exact/approx is a hard error, not a silent conversion.

```rust
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind>
```
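
The comparison rule can be illustrated on the evidence kinds alone. The sketch below is an assumption-laden simplification: it takes a slice of `EvidenceKind` values rather than `&[&KmerIndex]`, uses `u8` for `b` and `z` (the real field types are not stated here), and returns a `String` message in place of `OKIError::IncompatibleEvidence`.

```rust
/// Simplified stand-in for the real `EvidenceKind`; field types are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Returns the shared evidence kind, or a message naming the first mismatch.
pub fn evidence_compat(kinds: &[EvidenceKind]) -> Result<EvidenceKind, String> {
    let first = *kinds
        .first()
        .ok_or_else(|| "no source index given".to_string())?;
    for (i, kind) in kinds.iter().enumerate().skip(1) {
        // Derived equality covers both cases: Exact vs Approx, and differing (b, z).
        if *kind != first {
            return Err(format!(
                "source {} has evidence {:?} but source 0 has {:?}; run `reindex` first",
                i, kind, first
            ));
        }
    }
    Ok(first)
}
```

Because equality is derived structurally, two `Approx` values with different `(b, z)` are rejected in the same way as an `Exact`/`Approx` mix.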

---

## Genome label deduplication

`compute_labels(sources, rename_duplicates)` assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true, or trigger `OKIError::DuplicateGenomeLabel` otherwise.
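
A minimal sketch of this suffixing rule, assuming each source is reduced to its list of labels. `dedup_labels` is a hypothetical name, the error is returned as the bare duplicate label rather than `OKIError::DuplicateGenomeLabel`, and the case where a generated name such as `a.1` collides with a label that already exists is not handled.

```rust
use std::collections::HashMap;

/// Assigns final labels in source order; `sources[i]` holds the labels of source i.
pub fn dedup_labels(
    sources: &[Vec<String>],
    rename_duplicates: bool,
) -> Result<Vec<String>, String> {
    // Number of times each original label has been seen so far.
    let mut seen: HashMap<String, usize> = HashMap::new();
    let mut out = Vec::new();
    for labels in sources {
        for label in labels {
            let count = seen.entry(label.clone()).or_insert(0);
            if *count == 0 {
                // First occurrence keeps the original name.
                out.push(label.clone());
            } else if rename_duplicates {
                // Later occurrences get .1, .2, ... in order of appearance.
                out.push(format!("{}.{}", label, *count));
            } else {
                return Err(label.clone());
            }
            *count += 1;
        }
    }
    Ok(out)
}
```

For three sources labelled `[a, b]`, `[a, c]`, and `[a]`, this yields `a, b, a.1, c, a.2` when renaming is enabled and an error on `a` otherwise.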

---

## Algorithm

### 1. Validation

Check all sources against the constraints above. Abort on any mismatch.

### 2. Bootstrap output from first source

Recursive file copy of `sources[0]` → `output`. Immediately after the copy:

- `index.meta` is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
- In `Presence` mode, any `counts/` directories inherited from `sources[0]` are removed.
- `spectrums/` inherited from `sources[0]` is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.

This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … `n_dst_genomes - 1` in the destination.
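
The copy step itself can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual bootstrap code: `copy_dir_recursive` and `bootstrap_output` are hypothetical names, the post-copy rewrites listed above are omitted, and returning an error when the output exists without `--force` is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Copies `src` into `dst`, creating directories as needed.
pub fn copy_dir_recursive(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_recursive(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Seeds `output` with a full copy of the first source.
pub fn bootstrap_output(first_source: &Path, output: &Path, force: bool) -> io::Result<()> {
    if output.exists() {
        if !force {
            // Assumption: without --force an existing output is refused.
            return Err(io::Error::new(
                io::ErrorKind::AlreadyExists,
                "output directory exists",
            ));
        }
        // With --force the old output is deleted before anything is copied.
        fs::remove_dir_all(output)?;
    }
    copy_dir_recursive(first_source, output)
}
```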

### 3. For each subsequent source (parallel across partitions)

`KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits)` is called for each partition index `i`. `block_bits` is taken from `dst.meta.config.block_bits`.

Each entry in `sources` is `(&KmerPartition, n_genomes)` where `n_genomes` is the column count that source contributes (> 1 when the source is itself a merged index).

**First merge, Presence mode**: when `n_dst_genomes == 1`, `Layer::<()>::init_presence_matrix` is called on every existing destination layer before any source column is appended. This creates `presence/col_000000.pbiv` with every bit set to true, because genome 0 is present in every slot of the layers it was built from.

**Pass 1 — classify kmers**

Iterate all kmers from all source partitions (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

- **Hit**: kmer already in the destination; record for Pass 2.
- **Miss**: push kmer into a `GraphDeBruijn` accumulator.
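
The hit/miss split can be sketched with plain collections. In this illustration a `HashSet<u64>` stands in for the destination `LayeredMap<()>`, another `HashSet<u64>` stands in for the `GraphDeBruijn` accumulator, and kmers are assumed to be already canonical and encoded as `u64`; `classify_kmers` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Splits source kmers into those already in the destination (hits)
/// and those that must go into a new layer (misses).
pub fn classify_kmers(
    destination: &HashSet<u64>,
    source_kmers: impl IntoIterator<Item = u64>,
) -> (Vec<u64>, HashSet<u64>) {
    let mut hits = Vec::new();
    let mut new_kmers = HashSet::new();
    for kmer in source_kmers {
        if destination.contains(&kmer) {
            // Hit: remembered so Pass 2 can fill its columns.
            hits.push(kmer);
        } else {
            // Miss: one shared accumulator, so a kmer seen in several
            // sources is stored once.
            new_kmers.insert(kmer);
        }
    }
    (hits, new_kmers)
}
```

Feeding the kmers of all sources through one accumulator is what leads to a single new layer per merge rather than one per source.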

**New layer construction**

If the accumulator is non-empty, compute de Bruijn unitigs and call `Layer::<()>::build(&new_layer_dir, block_bits)`. All kmers absent from the destination — across **all** sources — accumulate into a **single** graph, producing one new layer per merge operation (not one per source).

**Pass 2 — fill column builders**

For each source and each of its layers, re-iterate unitigs and look up stored values via `SrcLayerData::lookup(kmer, src_n)`:

- `SrcLayerData::SetMembership` — no data directory exists; every kmer returns `vec![1; n_genomes]`
- `SrcLayerData::Presence` — reads `PersistentBitMatrix` from `presence/`
- `SrcLayerData::Count` — reads `PersistentCompactIntMatrix` from `counts/`

Hits are routed to `exist_builders[dst_layer][src_col]`; misses are routed to `new_src_builders[src_col]`.
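
The three lookup behaviours can be illustrated with in-memory columns. This is a simplified sketch, not the real `SrcLayerData`: the hypothetical `SrcData` holds `Vec` columns instead of persistent matrices, `lookup` takes an already-resolved slot index rather than a kmer, and all values are widened to `u32`.

```rust
/// In-memory stand-in for the three kinds of source layer data.
/// Columns are indexed as `columns[genome][slot]`.
pub enum SrcData {
    SetMembership,
    Presence(Vec<Vec<bool>>),
    Count(Vec<Vec<u32>>),
}

impl SrcData {
    /// Returns one value per source genome for the given slot.
    pub fn lookup(&self, slot: usize, n_genomes: usize) -> Vec<u32> {
        match self {
            // No data directory: every indexed kmer is present in every genome.
            SrcData::SetMembership => vec![1; n_genomes],
            // Presence bits are mapped to 0 / 1.
            SrcData::Presence(cols) => cols.iter().map(|c| u32::from(c[slot])).collect(),
            // Counts are returned as stored.
            SrcData::Count(cols) => cols.iter().map(|c| c[slot]).collect(),
        }
    }
}
```

Each element of the returned vector corresponds to one source column, which is the unit routed to a builder.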

**Column prepending for new layers**

Before source columns are written to the new layer, `n_dst_genomes` absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation.
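
A small sketch of the resulting column order in `Presence` mode, using `Vec<bool>` columns in place of the persistent builders; `new_layer_columns` is a hypothetical helper.

```rust
/// Builds the column list of a freshly created layer:
/// `n_dst_genomes` all-false columns first, then the source columns.
pub fn new_layer_columns(
    n_dst_genomes: usize,
    n_slots: usize,
    source_columns: Vec<Vec<bool>>,
) -> Vec<Vec<bool>> {
    // Genomes already in the index cannot contain any kmer of the new layer.
    let mut columns = vec![vec![false; n_slots]; n_dst_genomes];
    columns.extend(source_columns);
    columns
}
```

With two existing genomes and one incoming source column, the new layer ends up with three columns, and the incoming genome sits at column index 2, the same position it has in every other layer.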

**Close and update metadata**

Close all builders; update `presence/meta.json` or `counts/meta.json` with `{"n": N, "n_cols": n_dst_genomes + n_src_total}`; increment `PartitionMeta::n_layers` if a new layer was added.

### 4. Update index metadata

`index.meta` was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.

---

## `append_genome_column`

Defined on two concrete specialisations of `Layer<D>`:

```rust
impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```

Each appends one column file to the matrix subdirectory (`counts/` or `presence/`). In `merge_partition`, columns are written directly via `PersistentBitVecBuilder` / `PersistentCompactIntVecBuilder` rather than through these helpers, but the invariant they enforce is the same.
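
The closure-driven shape of these helpers can be sketched in memory. The code below is an illustration only: `append_column` and `column_file_name` are hypothetical, the matrix is a `Vec` of columns rather than files on disk, and the file name pattern is inferred from names such as `col_000000.pbiv`.

```rust
/// File name of column `col`, zero-padded to six digits.
pub fn column_file_name(col: usize, ext: &str) -> String {
    format!("col_{:06}.{}", col, ext)
}

/// Appends one column by evaluating `value_of` at every slot.
/// Returns the index of the new column.
pub fn append_column<T>(
    matrix: &mut Vec<Vec<T>>,
    n_slots: usize,
    value_of: impl Fn(usize) -> T,
) -> usize {
    let column: Vec<T> = (0..n_slots).map(value_of).collect();
    matrix.push(column);
    matrix.len() - 1
}
```

The same generic function covers both specialisations here: `T = bool` mirrors the presence case and `T = u32` the count case.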

---

## Column count invariant

After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.

Maintained by three mechanisms:

1. **Existing layers**: `n_src_total` columns appended (one per source genome).
2. **New layers created during merge**: `n_dst_genomes` absent columns prepended before source columns.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv`, all-true, for genome 0.

The invariant is a precondition of `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
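
The invariant is easy to state as a check over per-layer column counts. The sketch below assumes the counts have already been read (for example from each layer's `meta.json`) into `partitions[p][l]`; `check_column_invariant` is a hypothetical helper.

```rust
/// Verifies that every layer of every partition has `n_genomes` columns.
/// On failure, returns (partition index, layer index, column count found).
pub fn check_column_invariant(
    partitions: &[Vec<usize>],
    n_genomes: usize,
) -> Result<(), (usize, usize, usize)> {
    for (p, layers) in partitions.iter().enumerate() {
        for (l, &n_cols) in layers.iter().enumerate() {
            if n_cols != n_genomes {
                return Err((p, l, n_cols));
            }
        }
    }
    Ok(())
}
```

A layer that missed the prepending step would show up here with fewer columns than `n_genomes`.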

---

## Error variants relevant to merge

| Variant | Condition |
|---|---|
| `OKIError::NotIndexed(path)` | Source not in `Indexed` state |
| `OKIError::IncompatibleConfig` | Mismatched `kmer_size`, `minimizer_size`, or `n_partitions` |
| `OKIError::MismatchedMode` | Count mode but a source has `with_counts=false` |
| `OKIError::IncompatibleEvidence(msg)` | Mixed exact/approx or different approx `(b, z)` |
| `OKIError::DuplicateGenomeLabel(label)` | Duplicate label and `rename_duplicates=false` |

---

## On-disk impact

After merging `G` genomes in total (`sources[0]` contributes `G0`, the subsequent sources contribute the remaining `G - G0`):

```
partitions/
  part_00000/
    index/
      meta.json              ← n_layers updated if new layer added
      layer_0/
        mphf.bin             ← unchanged
        unitigs.bin          ← unchanged
        evidence.bin         ← unchanged
        presence/            ← created on first merge (Presence mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pbiv      ← all-true (genome 0, created when G0 = 1)
          col_000001.pbiv      ← next source
          ...
        counts/              ← extended (Count mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pciv      ← genome 0 counts (from original build)
          col_000001.pciv      ← next source
          ...
      layer_N/               ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json            {"n": N1, "n_cols": G}
          col_000000.pbiv      ← all-false (absent for existing genomes)
          ...
spectrums/
  <label>.json               ← one file per genome, rebuilt from all sources
index.meta                   ← complete genome list + evidence kind written at bootstrap
```