docmd/implementation/merge.md

# Merge command

## Purpose

`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.

---

## Modes

```rust
pub enum MergeMode { Presence, Count }
```

Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.

| Mode | Column type | Constraint |
|---|---|---|
| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |

---

## Input / output constraints

All source indexes must satisfy:

- `IndexState::Indexed` (fully built — `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_partitions`
- Same evidence kind: all `Exact`, or all `Approx` with identical `(b, z)` parameters
- If `Count` mode: all sources must have `with_counts=true`
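The checks listed above can be sketched as a single pass over the sources. This is a minimal illustration, not the crate's code: `SourceSummary`, `SourceCheckError` and `check_sources` are hypothetical names, and the error variants only mirror the `OKIError` variants by name. Evidence compatibility is left out because it is checked separately.

```rust
/// Hypothetical summary of the fields that the merge checks on each source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceSummary {
    pub indexed: bool,
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_partitions: usize,
    pub with_counts: bool,
}

/// Hypothetical error type; each variant carries the position of the offending source.
#[derive(Debug, PartialEq, Eq)]
pub enum SourceCheckError {
    NotIndexed(usize),
    IncompatibleConfig(usize),
    MismatchedMode(usize),
}

/// Check every source against the first one and against the requested mode.
pub fn check_sources(sources: &[SourceSummary], count_mode: bool) -> Result<(), SourceCheckError> {
    let first = match sources.first() {
        Some(f) => f,
        None => return Ok(()), // nothing to check
    };
    for (i, src) in sources.iter().enumerate() {
        if !src.indexed {
            return Err(SourceCheckError::NotIndexed(i));
        }
        let same_config = src.kmer_size == first.kmer_size
            && src.minimizer_size == first.minimizer_size
            && src.n_partitions == first.n_partitions;
        if !same_config {
            return Err(SourceCheckError::IncompatibleConfig(i));
        }
        if count_mode && !src.with_counts {
            return Err(SourceCheckError::MismatchedMode(i));
        }
    }
    Ok(())
}
```

The sketch returns on the first failing source; whether the real validation reports one mismatch or all of them is not stated here.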

`--force`: if the output directory already exists, it is deleted before the merge begins.

---

## Evidence compatibility

`validate_evidence_compat(sources)` is called before any I/O. It compares each source's `EvidenceKind` against `sources[0]`:

- All `Exact` → accepted, output uses `Exact`
- All `Approx { b, z }` with same `(b, z)` → accepted, output uses those parameters
- Any other combination → `OKIError::IncompatibleEvidence`, with a message directing the user to run `reindex` first

Mixed exact/approx is a hard error, not a silent conversion.

```rust
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind>
```
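A minimal sketch of the comparison rule, working on bare evidence kinds rather than on `&KmerIndex`. The shape of `EvidenceKind` (in particular the `u8` types of `b` and `z`), the `IncompatibleEvidence` struct and the name `check_evidence` are assumptions made for illustration.

```rust
/// Simplified stand-in for the index's evidence kind (field types are assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Hypothetical error type standing in for `OKIError::IncompatibleEvidence`.
#[derive(Debug, PartialEq, Eq)]
pub struct IncompatibleEvidence(pub String);

/// Compare every evidence kind against the first one and return the shared kind.
pub fn check_evidence(kinds: &[EvidenceKind]) -> Result<EvidenceKind, IncompatibleEvidence> {
    let first = match kinds.first() {
        Some(k) => *k,
        None => return Err(IncompatibleEvidence("no source index given".to_string())),
    };
    for (i, kind) in kinds.iter().enumerate().skip(1) {
        // Derived equality covers both cases: Exact vs Approx, and differing (b, z).
        if *kind != first {
            return Err(IncompatibleEvidence(format!(
                "source {} uses {:?} but source 0 uses {:?}; run `reindex` first",
                i, kind, first
            )));
        }
    }
    Ok(first)
}
```

Because the derived `PartialEq` compares the variant and its fields together, one equality test is enough to reject both mixed exact/approx inputs and mismatched approx parameters.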

---

## Genome label deduplication

`compute_labels(sources, rename_duplicates)` assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true, or trigger `OKIError::DuplicateGenomeLabel` otherwise.
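The renaming rule can be sketched on a flat list of labels. `dedup_labels` and `DuplicateGenomeLabel` are hypothetical names, and the sketch assumes a simple per-label counter; it does not handle the case where a generated name such as `a.1` collides with a label that already exists, which the text above does not specify.

```rust
use std::collections::HashMap;

/// Hypothetical error type standing in for `OKIError::DuplicateGenomeLabel`.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateGenomeLabel(pub String);

/// Assign final labels in input order: the first occurrence keeps its name,
/// later occurrences get `.1`, `.2`, ... or cause an error.
pub fn dedup_labels(
    labels: &[&str],
    rename_duplicates: bool,
) -> Result<Vec<String>, DuplicateGenomeLabel> {
    // Number of times each label has been seen so far.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(labels.len());
    for &label in labels {
        let count = seen.entry(label).or_insert(0);
        let n = *count;
        *count += 1;
        if n == 0 {
            out.push(label.to_string());
        } else if rename_duplicates {
            out.push(format!("{}.{}", label, n));
        } else {
            return Err(DuplicateGenomeLabel(label.to_string()));
        }
    }
    Ok(out)
}
```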

---

## Algorithm

### 1. Validation

Check all sources against the constraints above. Abort on any mismatch.

### 2. Bootstrap output from first source

The directory of `sources[0]` is copied recursively to `output`. Immediately after the copy:

- `index.meta` is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
- In `Presence` mode, any `counts/` directories inherited from `sources[0]` are removed.
- `spectrums/` inherited from `sources[0]` is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.

This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … `n_dst_genomes - 1` in the destination.

### 3. For each subsequent source (parallel across partitions)

`KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits)` is called for each partition index `i`. `block_bits` is taken from `dst.meta.config.block_bits`.

Each entry in `sources` is `(&KmerPartition, n_genomes)` where `n_genomes` is the column count that source contributes (> 1 when the source is itself a merged index).

**First merge, Presence mode**: when `n_dst_genomes == 1`, `Layer::<()>::init_presence_matrix` is called on every existing destination layer before any source column is appended. This creates `presence/col_000000.pbiv` set all-true (genome 0 is present in every slot).

**Pass 1 — classify kmers**

Iterate all kmers from all source partitions (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

- **Hit**: kmer already in the destination; record for Pass 2.
- **Miss**: push kmer into a `GraphDeBruijn` accumulator.
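The hit/miss split can be illustrated with ordinary collections. In this sketch a `HashSet<u64>` stands in for the destination `LayeredMap<()>` and a deduplicated `Vec<u64>` stands in for the `GraphDeBruijn` accumulator; both substitutions, the `u64` kmer encoding and the name `classify_kmers` are assumptions.

```rust
use std::collections::HashSet;

/// Split source kmers into hits (already in the destination) and
/// misses (to be placed in the new layer).
pub fn classify_kmers(dst: &HashSet<u64>, src_kmers: &[u64]) -> (Vec<u64>, Vec<u64>) {
    let mut hits = Vec::new();
    let mut misses = Vec::new();
    // A kmer missing from the destination may occur in several sources;
    // it must enter the accumulator only once.
    let mut seen_miss = HashSet::new();
    for &kmer in src_kmers {
        if dst.contains(&kmer) {
            hits.push(kmer);
        } else if seen_miss.insert(kmer) {
            misses.push(kmer);
        }
    }
    (hits, misses)
}
```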

**New layer construction**

If the accumulator is non-empty, compute de Bruijn unitigs and call `Layer::<()>::build(&new_layer_dir, block_bits)`. All kmers absent from the destination — across **all** sources — accumulate into a **single** graph, producing one new layer per merge operation (not one per source).

**Pass 2 — fill column builders**

For each source and each of its layers, re-iterate unitigs and look up stored values via `SrcLayerData::lookup(kmer, src_n)`:

- `SrcLayerData::SetMembership` — no data directory exists; every kmer returns `vec![1; n_genomes]`
- `SrcLayerData::Presence` — reads `PersistentBitMatrix` from `presence/`
- `SrcLayerData::Count` — reads `PersistentCompactIntMatrix` from `counts/`

Hits are routed to `exist_builders[dst_layer][src_col]`; misses are routed to `new_src_builders[src_col]`.
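The three lookup cases can be sketched with in-memory columns. This is a simplified model under stated assumptions: the enum below reuses the name `SrcLayerData` for readability but holds plain `Vec` columns instead of persistent matrices, and `lookup` takes a slot index rather than a kmer, so the MPHF step that maps a kmer to its slot is omitted.

```rust
/// In-memory stand-in for the per-layer source data (columns[genome][slot]).
pub enum SrcLayerData {
    /// No data directory: the layer only records set membership.
    SetMembership,
    /// One boolean column per genome.
    Presence(Vec<Vec<bool>>),
    /// One count column per genome.
    Count(Vec<Vec<u32>>),
}

impl SrcLayerData {
    /// Return one value per source genome for the given slot.
    pub fn lookup(&self, slot: usize, n_genomes: usize) -> Vec<u32> {
        match self {
            // Every kmer of the layer belongs to every genome of the source.
            SrcLayerData::SetMembership => vec![1; n_genomes],
            // Presence bits are widened to 0 / 1.
            SrcLayerData::Presence(cols) => cols.iter().map(|c| u32::from(c[slot])).collect(),
            SrcLayerData::Count(cols) => cols.iter().map(|c| c[slot]).collect(),
        }
    }
}
```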

**Column prepending for new layers**

Before source columns are written to the new layer, `n_dst_genomes` absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation.
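A small in-memory sketch of the prepending step in `Presence` mode. Columns are modelled as `Vec<bool>` rather than as `.pbiv` files, and `new_layer_columns` is a hypothetical helper.

```rust
/// Build the column list of a freshly created layer:
/// `n_dst_genomes` all-false columns first, then the source columns.
pub fn new_layer_columns(
    n_slots: usize,
    n_dst_genomes: usize,
    src_cols: Vec<Vec<bool>>,
) -> Vec<Vec<bool>> {
    let mut cols = Vec::with_capacity(n_dst_genomes + src_cols.len());
    // Genomes already in the index cannot contain any kmer of the new layer.
    for _ in 0..n_dst_genomes {
        cols.push(vec![false; n_slots]);
    }
    for col in src_cols {
        assert_eq!(col.len(), n_slots, "source column length must match the layer size");
        cols.push(col);
    }
    cols
}
```

With two genomes already in the index and one source genome, the new layer ends up with three columns, matching the column count of every existing layer after the merge.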

**Close and update metadata**

Close all builders; update `presence/meta.json` or `counts/meta.json` with `{"n": N, "n_cols": n_dst_genomes + n_src_total}`; increment `PartitionMeta::n_layers` if a new layer was added.
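As an illustration of the metadata arithmetic, the following sketch formats the matrix `meta.json` payload by hand. `matrix_meta_json` is a hypothetical helper; how the crate actually serialises the file is not described here.

```rust
/// Render the matrix metadata after a merge: the slot count is unchanged,
/// the column count grows by the number of source genomes.
pub fn matrix_meta_json(n_slots: usize, n_dst_genomes: usize, n_src_total: usize) -> String {
    format!(
        "{{\"n\": {}, \"n_cols\": {}}}",
        n_slots,
        n_dst_genomes + n_src_total
    )
}
```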

### 4. Update index metadata

`index.meta` was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.

---

## `append_genome_column`

Defined on two concrete specialisations of `Layer<D>`:

```rust
impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```

Each appends one column file to the matrix subdirectory (`counts/` or `presence/`). In `merge_partition`, columns are written directly via `PersistentBitVecBuilder` / `PersistentCompactIntVecBuilder` rather than through these helpers, but the invariant they enforce is the same.

---

## Column count invariant

After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.

Maintained by three mechanisms:

1. **Existing layers**: `n_src_total` columns appended (one per source genome).
2. **New layers created during merge**: `n_dst_genomes` absent columns prepended before source columns.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0.

The invariant is a precondition of `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
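The invariant can be expressed as a simple check over per-layer column counts. The sketch assumes the counts have already been read from each layer's `meta.json` into `layer_cols[partition][layer]`; `first_column_violation` is a hypothetical name.

```rust
/// Return the first `(partition, layer, n_cols)` whose column count differs
/// from `n_genomes`, or `None` when the invariant holds everywhere.
pub fn first_column_violation(
    layer_cols: &[Vec<usize>],
    n_genomes: usize,
) -> Option<(usize, usize, usize)> {
    for (partition, layers) in layer_cols.iter().enumerate() {
        for (layer, &n_cols) in layers.iter().enumerate() {
            if n_cols != n_genomes {
                return Some((partition, layer, n_cols));
            }
        }
    }
    None
}
```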

---

## Error variants relevant to merge

| Variant | Condition |
|---|---|
| `OKIError::NotIndexed(path)` | Source not in `Indexed` state |
| `OKIError::IncompatibleConfig` | Mismatched `kmer_size`, `minimizer_size`, or `n_partitions` |
| `OKIError::MismatchedMode` | Count mode but a source has `with_counts=false` |
| `OKIError::IncompatibleEvidence(msg)` | Mixed exact/approx or different approx `(b, z)` |
| `OKIError::DuplicateGenomeLabel(label)` | Duplicate label and `rename_duplicates=false` |

---

## On-disk impact

After merging `G` genomes in total (`sources[0]` contributes `G0`, the subsequent sources contribute the rest):

```
partitions/
  part_00000/
    index/
      meta.json              ← n_layers updated if new layer added
      layer_0/
        mphf.bin             ← unchanged
        unitigs.bin          ← unchanged
        evidence.bin         ← unchanged
        presence/            ← created on first merge (Presence mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pbiv      ← all-true (genome 0 … G0-1)
          col_000001.pbiv      ← next source
          ...
        counts/              ← extended (Count mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pciv      ← genome 0 counts (from original build)
          col_000001.pciv      ← next source
          ...
      layer_N/               ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json            {"n": N1, "n_cols": G}
          col_000000.pbiv      ← all-false (absent for existing genomes)
          ...
spectrums/
  <label>.json               ← one file per genome, rebuilt from all sources
index.meta                   ← complete genome list + evidence kind written at bootstrap
```