# Merge command

## Purpose

`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.

---
## Modes

```rust
pub enum MergeMode { Presence, Count }
```

Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.

| Mode | Column type | Constraint |
|---|---|---|
| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |

---
## Input / output constraints

All source indexes must satisfy:

- `IndexState::Indexed` (fully built, `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_bits`
- If `Count` mode: all sources must have `with_counts=true`

`--force`: if the output directory already exists, it is deleted before the merge begins.

---
## Algorithm

### 1. Validation

Check all sources against the constraints above. Abort on any mismatch.
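
The checks above can be sketched as follows. This is a minimal illustration, not the actual `obikmer` code: `SourceInfo`, its field layout, and `validate_sources` are hypothetical stand-ins for whatever the index metadata really exposes, and only the constraints listed in this document are checked.

```rust
/// Merge mode, as defined in the Modes section.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum MergeMode { Presence, Count }

/// Hypothetical summary of one source index (illustrative field names).
#[derive(Clone, Debug)]
pub struct SourceInfo {
    pub indexed: bool, // true when the `index.done` sentinel is present
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_bits: usize,
    pub with_counts: bool,
}

/// Returns Ok(()) when every source satisfies the merge constraints,
/// otherwise a message describing the first violation found.
pub fn validate_sources(sources: &[SourceInfo], mode: MergeMode) -> Result<(), String> {
    let first = sources
        .first()
        .ok_or_else(|| "no source index given".to_string())?;
    for (i, src) in sources.iter().enumerate() {
        if !src.indexed {
            return Err(format!("source {} is not fully indexed", i));
        }
        // All sources must share the parameters of the first one.
        if (src.kmer_size, src.minimizer_size, src.n_bits)
            != (first.kmer_size, first.minimizer_size, first.n_bits)
        {
            return Err(format!("source {} has different kmer parameters", i));
        }
        if mode == MergeMode::Count && !src.with_counts {
            return Err(format!("source {} has no counts (Count mode)", i));
        }
    }
    Ok(())
}
```

Returning on the first violation matches the "abort on any mismatch" rule; a real implementation might prefer to collect every violation before aborting.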
### 2. Bootstrap output from first source

Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
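
A sketch of this step using only the Rust standard library is shown below, combined with the `--force` behaviour described under the input / output constraints. The function names are hypothetical, and refusing to proceed when the output exists without `--force` is an assumption: the document only specifies what happens when `--force` is given.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copies `src` into `dst`, creating directories as needed.
fn copy_dir_recursive(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_recursive(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Hypothetical bootstrap step: delete an existing `output` when `force`
/// is set (assumed: error out otherwise), then copy the first source.
pub fn bootstrap_output(first_source: &Path, output: &Path, force: bool) -> io::Result<()> {
    if output.exists() {
        if force {
            fs::remove_dir_all(output)?;
        } else {
            return Err(io::Error::new(
                io::ErrorKind::AlreadyExists,
                "output directory already exists",
            ));
        }
    }
    copy_dir_recursive(first_source, output)
}
```

Copying whole files is sufficient here because the existing MPHFs, unitigs, and evidence files are reused as they are; only column data and metadata change in later steps.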
### 3. For each subsequent source (parallel across partitions)

Partitions are processed in parallel; within each partition, sources are processed sequentially, one at a time:

**a. Classify kmers**

Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
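
The hit/miss split can be illustrated with plain standard-library types. In this sketch a `HashMap<u64, (usize, usize)>` stands in for the destination `LayeredMap<()>` lookup and a `u64` stands in for `CanonicalKmer`; both are assumptions made to keep the example self-contained, and the `GraphDeBruijn` accumulator is reduced to the miss map. The `Vec<u32>` is assumed to hold one value per source genome, and the sketch handles a single-genome source.

```rust
use std::collections::HashMap;

/// Stand-in for `CanonicalKmer`: a kmer packed into a u64 (assumption).
type Kmer = u64;

/// Result of classifying one source partition against the destination.
#[derive(Default, Debug)]
pub struct Classified {
    /// Hits: (dst_layer, slot, value).
    pub hits: Vec<(usize, usize, u32)>,
    /// Misses: kmer -> recorded values, destined for the new layer.
    pub misses: HashMap<Kmer, Vec<u32>>,
}

/// `dst` maps a kmer to its (layer, slot) when the destination has it.
/// `source` yields (kmer, count) pairs; `count_mode` selects Count vs Presence.
pub fn classify(
    dst: &HashMap<Kmer, (usize, usize)>,
    source: &[(Kmer, u32)],
    count_mode: bool,
) -> Classified {
    let mut out = Classified::default();
    for &(kmer, count) in source {
        // Presence mode records 1 regardless of the source count.
        let value = if count_mode { count } else { 1 };
        match dst.get(&kmer) {
            Some(&(layer, slot)) => out.hits.push((layer, slot, value)),
            None => out.misses.entry(kmer).or_default().push(value),
        }
    }
    out
}
```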
**b. Extend existing layers**

For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).

If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).
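
The column logic for existing layers can be sketched in memory as below. `PresenceLayer` and its methods are hypothetical stand-ins that only mirror the roles of `init_presence_matrix` and `append_genome_column`; the real types are persistent and their signatures are not shown in this document.

```rust
/// Builds the column appended to one existing layer for one source genome.
/// `hits` holds (dst_layer, slot, value) triples from the classification
/// step; slots without a hit stay at 0 (absent).
pub fn column_for_layer(layer: usize, n_slots: usize, hits: &[(usize, usize, u32)]) -> Vec<u32> {
    let mut column = vec![0u32; n_slots];
    for &(hit_layer, slot, value) in hits {
        if hit_layer == layer {
            column[slot] = value;
        }
    }
    column
}

/// Hypothetical in-memory stand-in for a layer's presence columns.
pub struct PresenceLayer {
    pub n_slots: usize,
    pub columns: Vec<Vec<bool>>,
}

impl PresenceLayer {
    /// On the first merge, genome 0 gets an all-true column because it
    /// is present in every slot of a layer built from it alone.
    pub fn init_if_first_merge(&mut self) {
        if self.columns.is_empty() {
            self.columns.push(vec![true; self.n_slots]);
        }
    }

    /// Appends one source genome's column; any non-zero value means present.
    pub fn append_genome_column(&mut self, values: &[u32]) {
        assert_eq!(values.len(), self.n_slots);
        self.columns.push(values.iter().map(|&v| v != 0).collect());
    }
}
```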
**c. Build new layer for new kmers**

If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0), one per genome already in the index, so that the column count invariant holds immediately after layer creation.
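
The padding rule is small enough to show directly. The function below is an illustrative helper, not part of the documented API; it returns the columns of a freshly built layer in genome order, using `Vec<u32>` for both modes for simplicity.

```rust
/// Columns of a freshly built layer, in genome order.
/// `n_existing_genomes` zero-filled columns come first (those genomes
/// contain none of the new kmers), followed by the source's own column.
pub fn new_layer_columns(n_existing_genomes: usize, source_column: Vec<u32>) -> Vec<Vec<u32>> {
    let n_slots = source_column.len();
    let mut columns = vec![vec![0u32; n_slots]; n_existing_genomes];
    columns.push(source_column);
    columns
}
```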
**d. Update partition metadata**

Write updated `meta.json` with the incremented `n_layers` if a new layer was added.

### 4. Update index metadata

Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.

---
## Column count invariant

After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.
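
Stated as code, the invariant is a simple universal check. The sketch below assumes the column counts have already been read into a nested vector (partitions, then layers); that representation and the function name are illustrative only.

```rust
/// Returns the first (partition, layer) whose column count differs from
/// `n_genomes`, or None when the invariant holds everywhere.
/// `counts[p][l]` is the number of columns of layer `l` in partition `p`.
pub fn find_invariant_violation(counts: &[Vec<usize>], n_genomes: usize) -> Option<(usize, usize)> {
    for (p, layers) in counts.iter().enumerate() {
        for (l, &n_cols) in layers.iter().enumerate() {
            if n_cols != n_genomes {
                return Some((p, l));
            }
        }
    }
    None
}
```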
This is maintained by three mechanisms:

1. **Existing layers**: one column appended per source genome (`append_genome_column`).
2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0 before any source column is appended.

The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.

---
## New layer construction

All kmers absent from the destination index, across **all** sources being merged, accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.

De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction.

---
## On-disk impact

After merging `G` genomes (1 existing + G-1 new sources):

```
partitions/
  part_00000/
    index/
      meta.json                 ← n_layers updated if new layer added
      layer_0/
        mphf.bin                ← unchanged (MPHF immutable)
        unitigs.bin             ← unchanged
        evidence.bin            ← unchanged
        presence/               ← created on first merge (Presence mode)
          meta.json             {"n": N, "n_cols": G}
          col_000000.pbiv       ← all-true (genome 0)
          col_000001.pbiv       ← source 1 presence
          ...
        counts/                 ← extended (Count mode)
          meta.json             {"n": N, "n_cols": G}
          col_000000.pciv       ← genome 0 counts (from original build)
          col_000001.pciv       ← source 1 counts
          ...
      layer_1/                  ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json             {"n": N1, "n_cols": G}
          col_000000.pbiv       ← all-false (genome 0 absent from this layer)
          ...
          col_000001.pbiv       ← source 1 presence in this new layer
```
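
The column file names in the tree above appear to follow a fixed pattern: `col_` plus a six-digit zero-padded genome index, with `.pbiv` under `presence/` and `.pciv` under `counts/`. The helper below reproduces that pattern; it is inferred from the listing only, and `column_path` is a hypothetical name rather than an `obikmer` function.

```rust
use std::path::{Path, PathBuf};

/// Path of column `index` inside a layer directory.
/// Assumes the naming seen in the tree: `col_` + six zero-padded digits,
/// `.pbiv` under `presence/` and `.pciv` under `counts/`.
pub fn column_path(layer_dir: &Path, count_mode: bool, index: usize) -> PathBuf {
    let (dir, ext) = if count_mode {
        ("counts", "pciv")
    } else {
        ("presence", "pbiv")
    };
    layer_dir.join(dir).join(format!("col_{:06}.{}", index, ext))
}
```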
The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.