feat: add merge command to consolidate k-mer indexes
Introduces a new `merge` CLI subcommand and underlying implementation to consolidate multiple pre-indexed k-mer indexes into a single output. Adds `append_column` methods to persistent bit and int matrices to enable incremental genome column expansion without rebuilding the MPHF. Includes new error variants for index readiness and configuration mismatches, adds a `--force` flag to the index command, and updates documentation and navigation structure accordingly.
@@ -255,3 +255,57 @@ Each partition's new layer is built independently; the operation is fully parall
| `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass |
| `ndarray 0.16` | aggregation output arrays |

---

## Column append and merge support

These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.

### Matrix column append

```rust
impl PersistentCompactIntMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl PersistentBitMatrix {
    pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```

Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change.
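
The append sequence described above can be sketched with the standard library alone. This is an illustrative stand-in, not the crate's implementation: the raw little-endian `u32` encoding, the toy `read_field` parser, the use of `io::Result` in place of `OLMResult`, and the choice to rewrite `meta.json` last are all assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extract an unsigned integer field from a flat JSON object such as
/// `{"n": 4, "n_cols": 1}`. Hypothetical toy parser for this sketch only.
fn read_field(meta: &str, key: &str) -> Option<usize> {
    let pat = format!("\"{key}\":");
    let start = meta.find(&pat)? + pat.len();
    let digits: String = meta[start..]
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

/// Append one integer column: write `col_NNNNNN.pciv`, then increment `n_cols`.
fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> io::Result<()> {
    let meta_path = dir.join("meta.json");
    let meta = fs::read_to_string(&meta_path)?;
    let bad = || io::Error::new(io::ErrorKind::InvalidData, "malformed meta.json");
    let n = read_field(&meta, "n").ok_or_else(bad)?;
    let n_cols = read_field(&meta, "n_cols").ok_or_else(bad)?;

    // Assumed encoding: one raw little-endian u32 per slot (the real format is compact).
    let mut bytes = Vec::with_capacity(n * 4);
    for slot in 0..n {
        bytes.extend_from_slice(&value_of(slot).to_le_bytes());
    }
    // The new column's index is the current column count.
    fs::write(dir.join(format!("col_{n_cols:06}.pciv")), bytes)?;

    // `n` is written back unchanged; only `n_cols` grows.
    fs::write(&meta_path, format!("{{\"n\": {n}, \"n_cols\": {}}}", n_cols + 1))
}

/// Run the sketch in a scratch directory and return (n, n_cols, column bytes).
fn demo() -> io::Result<(usize, usize, Vec<u8>)> {
    let dir = std::env::temp_dir().join(format!("append_column_sketch_{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("meta.json"), "{\"n\": 3, \"n_cols\": 1}")?;
    append_column(&dir, |slot| slot as u32 * 10)?;
    let meta = fs::read_to_string(dir.join("meta.json"))?;
    let col = fs::read(dir.join("col_000001.pciv"))?;
    fs::remove_dir_all(&dir)?;
    Ok((read_field(&meta, "n").unwrap(), read_field(&meta, "n_cols").unwrap(), col))
}

fn main() {
    let (n, n_cols, col) = demo().expect("sketch should run");
    println!("n = {n}, n_cols = {n_cols}, column bytes = {}", col.len());
}
```

Writing the column file before touching `meta.json` means an interrupted append leaves at worst an orphan file rather than a column count that points at a missing file; whether the actual implementation orders its writes this way is not stated here.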

### Presence matrix initialisation

```rust
impl Layer<()> {
    pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
}
```

Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.
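
A minimal sketch of that initialisation step, assuming a plain bit-packed column (least significant bit first, padding bits cleared) and `io::Result` in place of `OLMResult`. The actual `.pbiv` layout is not described here, so the `all_true_bits` helper is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Pack `n` `true` bits, LSB first, leaving the padding bits of the last byte clear.
/// Assumed layout for illustration; the real `.pbiv` encoding may differ.
fn all_true_bits(n: usize) -> Vec<u8> {
    let mut bytes = vec![0xFFu8; n / 8];
    let rem = n % 8;
    if rem != 0 {
        bytes.push((1u8 << rem) - 1);
    }
    bytes
}

/// Create `presence/` with a single all-true column for genome 0.
fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> io::Result<()> {
    let presence = layer_dir.join("presence");
    fs::create_dir_all(&presence)?;
    fs::write(presence.join("col_000000.pbiv"), all_true_bits(n_kmers))?;
    fs::write(
        presence.join("meta.json"),
        format!("{{\"n\": {n_kmers}, \"n_cols\": 1}}"),
    )
}

/// Run the sketch in a scratch directory and return (meta.json text, column bytes).
fn demo() -> io::Result<(String, Vec<u8>)> {
    let layer = std::env::temp_dir().join(format!("presence_sketch_{}", std::process::id()));
    fs::create_dir_all(&layer)?;
    init_presence_matrix(&layer, 10)?;
    let meta = fs::read_to_string(layer.join("presence").join("meta.json"))?;
    let col = fs::read(layer.join("presence").join("col_000000.pbiv"))?;
    fs::remove_dir_all(&layer)?;
    Ok((meta, col))
}

fn main() {
    let (meta, col) = demo().expect("sketch should run");
    println!("{meta} / {col:02x?}");
}
```

With ten k-mers the column occupies two bytes: eight set bits, then two set bits followed by six cleared padding bits.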

### Layer-level genome column append

```rust
impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> bool,
    ) -> OLMResult<()>
}

impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(
        layer_dir: &Path,
        value_of: impl Fn(usize) -> u32,
    ) -> OLMResult<()>
}
```

These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.

### Why the MPHF is never rebuilt

The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.
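
To make the slot-indexed view concrete, here is a small in-memory sketch of how a `value_of` closure could fill a new count column. It assumes a hypothetical `slot_kmers` slice standing in for the layer's fixed slot order and a `HashMap` standing in for the new source's counts; neither type is part of the actual API.

```rust
use std::collections::HashMap;

/// Materialise the column a `value_of` closure would produce, one value per slot.
fn new_column(n: usize, value_of: impl Fn(usize) -> u32) -> Vec<u32> {
    (0..n).map(value_of).collect()
}

/// Build the count column for a new source against an existing layer.
/// `slot_kmers[i]` is the k-mer stored at slot `i` (hypothetical stand-in for the MPHF order).
fn merge_counts(slot_kmers: &[&str], source: &HashMap<&str, u32>) -> Vec<u32> {
    new_column(slot_kmers.len(), |slot| {
        // A layer k-mer missing from the new source contributes a zero count.
        source.get(slot_kmers[slot]).copied().unwrap_or(0)
    })
}

/// Three layer slots, one of which also occurs in the new source.
fn demo() -> Vec<u32> {
    let slot_kmers = ["ACGT", "CGTA", "GTAC"];
    let source = HashMap::from([("CGTA", 5u32), ("TTTT", 2u32)]);
    merge_counts(&slot_kmers, &source)
}

fn main() {
    println!("{:?}", demo());
}
```

The slot order and the number of rows are untouched: the new column simply has one entry per existing slot. The k-mer `TTTT`, which has no slot in this layer, does not affect the column.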