feat: add merge command to consolidate k-mer indexes
Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
@@ -0,0 +1,133 @@
# Merge command

## Purpose

`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.

---

## Modes

```rust
pub enum MergeMode { Presence, Count }
```

Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.

| Mode | Column type | Constraint |
|---|---|---|
| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |

---

## Input / output constraints

All source indexes must satisfy:

- `IndexState::Indexed` (fully built: `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_bits`
- If `Count` mode: all sources must have `with_counts=true`

`--force`: if the output directory already exists, it is deleted before the merge begins.

---

## Algorithm

### 1. Validation

Check all sources against the constraints above. Abort on any mismatch.
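The validation step can be sketched as a single pass over the source metadata. This is a minimal illustration, not the crate's implementation: `SourceMeta`, `ValidationError`, `validate_sources`, and the `Building` state are hypothetical names, and only the fields named in the constraints above are modelled.

```rust
// Hypothetical stand-ins for the real index metadata types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexState { Building, Indexed }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeMode { Presence, Count }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceMeta {
    pub state: IndexState,
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_bits: usize,
    pub with_counts: bool,
}

// Hypothetical error type; the real error variants may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    NoSources,
    NotIndexed { source: usize },
    ConfigMismatch { source: usize },
    MissingCounts { source: usize },
}

/// Checks every source against the first one and against the merge mode.
/// Returns the first violation found, identified by source position.
pub fn validate_sources(
    sources: &[SourceMeta],
    mode: MergeMode,
) -> Result<(), ValidationError> {
    let first = sources.first().ok_or(ValidationError::NoSources)?;
    for (i, src) in sources.iter().enumerate() {
        // Every source must be fully built.
        if src.state != IndexState::Indexed {
            return Err(ValidationError::NotIndexed { source: i });
        }
        // All sources must share the same kmer configuration.
        if (src.kmer_size, src.minimizer_size, src.n_bits)
            != (first.kmer_size, first.minimizer_size, first.n_bits)
        {
            return Err(ValidationError::ConfigMismatch { source: i });
        }
        // Count mode needs counts in every source.
        if mode == MergeMode::Count && !src.with_counts {
            return Err(ValidationError::MissingCounts { source: i });
        }
    }
    Ok(())
}
```

Returning the position of the offending source keeps the abort message actionable when many indexes are passed on the command line.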

### 2. Bootstrap output from first source

Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.

### 3. For each subsequent source (parallel across partitions)

For each partition, process one source at a time sequentially:

**a. Classify kmers**

Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
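The hit/miss split can be illustrated with plain standard-library types. In this sketch a `HashMap<u64, (usize, usize)>` stands in for the destination `LayeredMap<()>`, a `u64` stands in for `CanonicalKmer`, and each miss stores a single `u32` rather than a `Vec<u32>`; `Classified` and `classify` are hypothetical names, and the de Bruijn accumulator is omitted.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeMode { Presence, Count }

/// Result of classifying one source partition (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Classified {
    /// (dst_layer, slot, value) for kmers already in the destination.
    pub hits: Vec<(usize, usize, u32)>,
    /// kmer -> value for kmers absent from the destination.
    pub misses: HashMap<u64, u32>,
}

/// Probes the destination for every source kmer and records a hit or a miss.
pub fn classify(
    source_kmers: &[(u64, u32)],
    destination: &HashMap<u64, (usize, usize)>,
    mode: MergeMode,
) -> Classified {
    let mut out = Classified::default();
    for &(kmer, count) in source_kmers {
        // Presence mode ignores the source count and records 1.
        let value = match mode {
            MergeMode::Count => count,
            MergeMode::Presence => 1,
        };
        match destination.get(&kmer) {
            Some(&(layer, slot)) => out.hits.push((layer, slot, value)),
            None => {
                out.misses.insert(kmer, value);
            }
        }
    }
    out
}
```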

**b. Extend existing layers**

For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).
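One way to picture how the recorded hits become a column is the sketch below, which builds the kind of `value_of` closure that a column append would consume. `column_from_hits` is a hypothetical helper and the column is materialised in memory purely for illustration.

```rust
use std::collections::HashMap;

/// Builds the column for one destination layer from recorded hits.
/// Slots without a hit receive 0 (absent).
pub fn column_from_hits(
    n_slots: usize,
    layer: usize,
    hits: &[(usize, usize, u32)],
) -> Vec<u32> {
    // Keep only the hits that landed in this layer, keyed by slot.
    let by_slot: HashMap<usize, u32> = hits
        .iter()
        .filter(|&&(l, _, _)| l == layer)
        .map(|&(_, slot, value)| (slot, value))
        .collect();
    // This closure has the same shape as the `value_of` argument
    // described for the column append methods.
    let value_of = |slot: usize| by_slot.get(&slot).copied().unwrap_or(0);
    (0..n_slots).map(value_of).collect()
}
```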

If this is the **first merge** and the index operates in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).

**c. Build new layer for new kmers**

If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0), one per genome already in the index, so that the column count invariant holds immediately after layer creation.
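The column padding for a freshly built layer can be sketched in memory as follows. `new_layer_columns` is a hypothetical helper; the real code writes column files rather than returning vectors.

```rust
/// Returns the columns of a new layer: `n_existing_genomes` all-zero columns
/// followed by the merged source's own column.
pub fn new_layer_columns(
    n_existing_genomes: usize,
    source_column: Vec<u32>,
) -> Vec<Vec<u32>> {
    let n_slots = source_column.len();
    // Genomes already in the index contain none of the new kmers.
    let mut columns = vec![vec![0u32; n_slots]; n_existing_genomes];
    columns.push(source_column);
    columns
}
```

With two genomes already indexed, the new layer ends up with three columns, matching the genome count after the merge.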

**d. Update partition metadata**

Write updated `meta.json` with the incremented `n_layers` if a new layer was added.

### 4. Update index metadata

Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.

---

## Column count invariant

After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.

This is maintained by three mechanisms:

1. **Existing layers**: one column appended per source genome (`append_genome_column`).
2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0 before any source column is appended.

The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
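A post-merge sanity check for this invariant might look like the following sketch. It assumes the per-layer column counts have already been read from each layer's `meta.json`; `ColumnCountViolation` and `check_column_invariant` are hypothetical names.

```rust
/// Describes the first layer whose column count disagrees with `n_genomes`.
#[derive(Debug, PartialEq, Eq)]
pub struct ColumnCountViolation {
    pub partition: usize,
    pub layer: usize,
    pub found: usize,
    pub expected: usize,
}

/// `partitions[p][l]` holds the `n_cols` value of layer `l` in partition `p`.
pub fn check_column_invariant(
    partitions: &[Vec<usize>],
    n_genomes: usize,
) -> Result<(), ColumnCountViolation> {
    for (partition, layers) in partitions.iter().enumerate() {
        for (layer, &found) in layers.iter().enumerate() {
            if found != n_genomes {
                return Err(ColumnCountViolation {
                    partition,
                    layer,
                    found,
                    expected: n_genomes,
                });
            }
        }
    }
    Ok(())
}
```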

---

## New layer construction

All kmers absent from the destination index, across **all** sources being merged, accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.

De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction.

---

## On-disk impact

After merging `G` genomes (1 existing + G-1 new sources):

```
partitions/
  part_00000/
    index/
      meta.json                  ← n_layers updated if new layer added
      layer_0/
        mphf.bin                 ← unchanged (MPHF immutable)
        unitigs.bin              ← unchanged
        evidence.bin             ← unchanged
        presence/                ← created on first merge (Presence mode)
          meta.json              {"n": N, "n_cols": G}
          col_000000.pbiv        ← all-true (genome 0)
          col_000001.pbiv        ← source 1 presence
          ...
        counts/                  ← extended (Count mode)
          meta.json              {"n": N, "n_cols": G}
          col_000000.pciv        ← genome 0 counts (from original build)
          col_000001.pciv        ← source 1 counts
          ...
      layer_1/                   ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json              {"n": N1, "n_cols": G}
          col_000000.pbiv        ← all-false (genome 0 absent from this layer)
          ...
          col_000001.pbiv        ← source 1 presence in this new layer
```

The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.
@@ -255,3 +255,57 @@ Each partition's new layer is built independently; the operation is fully parall
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `ndarray 0.16` | aggregation output arrays |

---

## Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.

### Matrix column append

```rust
impl PersistentCompactIntMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl PersistentBitMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```

Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change.
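The append contract (fixed row count, one closure call per slot, column count incremented by one) can be modelled with an in-memory stand-in. `ToyIntMatrix` and `column_file_name` are hypothetical; the real methods operate on files under `dir`, and the on-disk encoding is not shown here.

```rust
/// In-memory stand-in for a column-oriented integer matrix.
#[derive(Debug)]
pub struct ToyIntMatrix {
    n: usize,               // row count, fixed at creation
    columns: Vec<Vec<u32>>, // one Vec per genome column
}

impl ToyIntMatrix {
    pub fn new(n: usize) -> Self {
        Self { n, columns: Vec::new() }
    }

    pub fn n_cols(&self) -> usize {
        self.columns.len()
    }

    /// Calls `value_of` once per slot in 0..n and stores the result
    /// as a new column; `n` itself never changes.
    pub fn append_column(&mut self, value_of: impl Fn(usize) -> u32) {
        let column: Vec<u32> = (0..self.n).map(value_of).collect();
        self.columns.push(column);
    }

    pub fn get(&self, slot: usize, col: usize) -> u32 {
        self.columns[col][slot]
    }
}

/// Formats a column file name following the `col_NNNNNN.<ext>` pattern.
pub fn column_file_name(col: usize, ext: &str) -> String {
    format!("col_{:06}.{}", col, ext)
}
```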

### Presence matrix initialisation

```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```

Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.
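The effect of the initialisation followed by a first source append can be pictured with a small in-memory model. `ToyPresence` is a hypothetical stand-in that ignores the file layout and only tracks the logical bit columns.

```rust
/// In-memory stand-in for a layer's presence matrix.
#[derive(Debug)]
pub struct ToyPresence {
    n: usize,
    columns: Vec<Vec<bool>>,
}

impl ToyPresence {
    /// Mirrors the initialisation described above: a single all-true
    /// column for genome 0 over `n_kmers` slots.
    pub fn init(n_kmers: usize) -> Self {
        Self { n: n_kmers, columns: vec![vec![true; n_kmers]] }
    }

    /// Appends one genome column, calling `value_of` once per slot.
    pub fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        let column: Vec<bool> = (0..self.n).map(value_of).collect();
        self.columns.push(column);
    }

    pub fn n_cols(&self) -> usize {
        self.columns.len()
    }

    /// Presence of the kmer at `slot` across all genomes.
    pub fn row(&self, slot: usize) -> Vec<bool> {
        self.columns.iter().map(|col| col[slot]).collect()
    }
}
```

Every row starts with `true` for genome 0 because each slot of the layer corresponds to a kmer taken from the original source.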

### Layer-level genome column append

```rust
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> bool,
    ) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> u32,
    ) -> OLMResult<()>
}
```

These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.

### Why the MPHF is never rebuilt

The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set; it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column; no structural change to the layer is required.