feat: add merge command to consolidate k-mer indexes

Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
Eric Coissac
2026-05-21 05:53:55 +02:00
parent bfa436ad15
commit e1d59fde54
17 changed files with 799 additions and 8 deletions
@@ -255,3 +255,57 @@ Each partition's new layer is built independently; the operation is fully parall
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `ndarray 0.16` | aggregation output arrays |
---
## Column append and merge support
These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.
### Matrix column append
```rust
impl PersistentCompactIntMatrix {
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
impl PersistentBitMatrix {
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```
Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot, for slot indices 0 to n - 1, to populate the column. The row count `n` is read from the existing `meta.json` and must not change.
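The append contract can be illustrated with a small in-memory model. This is only a sketch: `ToyIntMatrix` and `column_file_name` are hypothetical names introduced here, the real matrices persist each column to its own file, and the on-disk encoding is not modelled.

```rust
/// Toy in-memory stand-in for a persistent matrix (hypothetical type):
/// `n` rows (slots) and one `Vec` per column.
pub struct ToyIntMatrix {
    n: usize,
    cols: Vec<Vec<u32>>,
}

impl ToyIntMatrix {
    pub fn new(n: usize) -> Self {
        ToyIntMatrix { n, cols: Vec::new() }
    }

    /// Appends one column by calling `value_of` once per slot, in order 0..n.
    /// The row count `n` never changes; only the column count grows.
    pub fn append_column(&mut self, value_of: impl Fn(usize) -> u32) {
        let column: Vec<u32> = (0..self.n).map(value_of).collect();
        self.cols.push(column);
    }

    pub fn n_rows(&self) -> usize {
        self.n
    }

    pub fn n_cols(&self) -> usize {
        self.cols.len()
    }

    pub fn get(&self, slot: usize, col: usize) -> u32 {
        self.cols[col][slot]
    }
}

/// Builds a file name following the `col_NNNNNN.<ext>` pattern
/// (six-digit, zero-padded column index).
pub fn column_file_name(index: usize, ext: &str) -> String {
    format!("col_{:06}.{}", index, ext)
}
```

In this model, appending to a matrix that already holds two columns would correspond to writing `column_file_name(2, "pciv")` and bumping `n_cols` from 2 to 3.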
### Presence matrix initialisation
```rust
impl Layer<()> {
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```
Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.
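A minimal in-memory sketch of this initialisation, followed by one appended source column, is shown below. `ToyPresence`, `init_presence`, `genomes_with` and `meta_json` are hypothetical helpers for illustration; the binary layout of `.pbiv` files is not modelled.

```rust
/// Toy in-memory presence matrix (hypothetical type): one `Vec<bool>` per genome column.
pub struct ToyPresence {
    n: usize,
    cols: Vec<Vec<bool>>,
}

/// Mirrors the described initialisation: a single column, entirely `true`,
/// recording that genome 0 is present in every slot of the layer.
pub fn init_presence(n_kmers: usize) -> ToyPresence {
    ToyPresence {
        n: n_kmers,
        cols: vec![vec![true; n_kmers]],
    }
}

impl ToyPresence {
    /// Appends the presence column of a new source, one call per slot.
    pub fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        self.cols.push((0..self.n).map(value_of).collect());
    }

    pub fn n_cols(&self) -> usize {
        self.cols.len()
    }

    /// Column indices (genomes) in which the kmer at `slot` is present.
    pub fn genomes_with(&self, slot: usize) -> Vec<usize> {
        (0..self.cols.len())
            .filter(|&g| self.cols[g][slot])
            .collect()
    }

    /// Text of the metadata this model would carry, in the shape quoted above.
    pub fn meta_json(&self) -> String {
        format!("{{\"n\": {}, \"n_cols\": {}}}", self.n, self.cols.len())
    }
}
```

After `init_presence(3)` the model reports one column; appending a second source then leaves every slot with genome 0 plus whichever slots the new source covers.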
### Layer-level genome column append
```rust
impl Layer<PersistentBitMatrix> {
pub fn append_genome_column(
layer_dir: &Path,
value_of: impl Fn(usize) -> bool,
) -> OLMResult<()>
}
impl Layer<PersistentCompactIntMatrix> {
pub fn append_genome_column(
layer_dir: &Path,
value_of: impl Fn(usize) -> u32,
) -> OLMResult<()>
}
```
These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.
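The delegation pattern might look roughly like the sketch below. The `Stub*` types are hypothetical stand-ins that only report which directory they would write to, and the `counts` subdirectory name is an assumption (only `presence/` is named in this document).

```rust
use std::marker::PhantomData;
use std::path::{Path, PathBuf};

/// Hypothetical stand-ins for the two persistent matrix types.
pub struct StubBitMatrix;
pub struct StubIntMatrix;

impl StubBitMatrix {
    /// Stub: a real implementation would write a column file.
    /// Here we only return the directory that was targeted.
    pub fn append_column(dir: &Path, _value_of: impl Fn(usize) -> bool) -> PathBuf {
        dir.to_path_buf()
    }
}

impl StubIntMatrix {
    /// Stub: same idea for the integer (count) matrix.
    pub fn append_column(dir: &Path, _value_of: impl Fn(usize) -> u32) -> PathBuf {
        dir.to_path_buf()
    }
}

/// A layer tagged with its payload type; the tag selects the impl block.
pub struct StubLayer<M> {
    _mode: PhantomData<M>,
}

impl StubLayer<StubBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> PathBuf {
        // The caller passes the layer directory; the matrix path is built here.
        StubBitMatrix::append_column(&layer_dir.join("presence"), value_of)
    }
}

impl StubLayer<StubIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> PathBuf {
        // "counts" is an assumed subdirectory name, used for illustration only.
        StubIntMatrix::append_column(&layer_dir.join("counts"), value_of)
    }
}
```

With this shape, a call site names the mode through the type parameter, for example `StubLayer::<StubBitMatrix>::append_genome_column(dir, |_| true)`, and a closure returning the wrong value type fails to compile.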
### Why the MPHF is never rebuilt
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.
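To make the last point concrete, here is a sketch of how a `value_of` closure for a new source could be formed against a fixed slot numbering. The slice `slot_kmers` is a stand-in for the slot-to-kmer mapping that the MPHF and its evidence encode, and `new_count_column` is a hypothetical helper, not part of the described API.

```rust
use std::collections::HashMap;

/// Builds the count column of one new source against a fixed layer.
/// `slot_kmers[i]` is the kmer stored at slot `i` (stand-in for the MPHF mapping);
/// `source_counts` holds the kmer counts observed in the new source.
pub fn new_count_column(slot_kmers: &[&str], source_counts: &HashMap<&str, u32>) -> Vec<u32> {
    // Kmers of the layer that the new source does not contain get the value 0.
    let value_of = |slot: usize| -> u32 {
        source_counts.get(slot_kmers[slot]).copied().unwrap_or(0)
    };
    // One call per slot, in slot order; the slot numbering itself is untouched.
    (0..slot_kmers.len()).map(value_of).collect()
}
```

For a layer holding `["ACG", "CGT", "GTA"]` and a source that contains only `CGT` with count 5, the new column would be `[0, 5, 0]`. Source kmers that have no slot in this layer are simply not recorded by this sketch.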