# Merge command

## Purpose

`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.

---

## Modes

```rust
pub enum MergeMode { Presence, Count }
```

The default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.

| Mode | Column type | Constraint |
|---|---|---|
| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |

---

## Input / output constraints

All source indexes must satisfy:

- `IndexState::Indexed` (fully built, with the `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_partitions`
- Same evidence kind: all `Exact`, or all `Approx` with identical `(b, z)` parameters
- If `Count` mode: all sources must have `with_counts=true`

`--force`: if the output directory already exists, it is deleted before the merge begins.

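
The state, configuration, and mode checks listed above can be sketched as a single validation pass. This is a minimal illustration under stated assumptions, not the actual `obikmer` code: `SourceSummary`, `CheckError`, and `check_sources` are hypothetical names, the field types are guesses, and the real implementation reports `OKIError` variants. Evidence compatibility is left out here because it has its own section.

```rust
/// Merge mode, mirroring the enum shown in the Modes section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeMode {
    Presence,
    Count,
}

/// Hypothetical summary of the fields that validation inspects.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceSummary {
    pub indexed: bool,
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_partitions: usize,
    pub with_counts: bool,
}

/// Hypothetical error type; each variant carries the offending source position.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckError {
    NotIndexed(usize),
    IncompatibleConfig(usize),
    MismatchedMode(usize),
}

/// Compare every source against the first one and against the requested mode.
pub fn check_sources(sources: &[SourceSummary], mode: MergeMode) -> Result<(), CheckError> {
    let first = match sources.first() {
        Some(first) => first,
        None => return Ok(()), // nothing to validate
    };
    for (i, s) in sources.iter().enumerate() {
        if !s.indexed {
            return Err(CheckError::NotIndexed(i));
        }
        let same_config = s.kmer_size == first.kmer_size
            && s.minimizer_size == first.minimizer_size
            && s.n_partitions == first.n_partitions;
        if !same_config {
            return Err(CheckError::IncompatibleConfig(i));
        }
        // Count mode needs count data in every source, including the first.
        if mode == MergeMode::Count && !s.with_counts {
            return Err(CheckError::MismatchedMode(i));
        }
    }
    Ok(())
}
```
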
---

## Evidence compatibility

`validate_evidence_compat(sources)` is called before any I/O. It compares each source's `EvidenceKind` against `sources[0]`:

- All `Exact` → accepted, output uses `Exact`
- All `Approx { b, z }` with same `(b, z)` → accepted, output uses those parameters
- Any other combination → `OKIError::IncompatibleEvidence`, with a message directing the user to run `reindex` first

Mixed exact/approx is a hard error, not a silent conversion.

```rust
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind>
```

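
A possible shape for this comparison is sketched below. It is illustrative only: the function takes a slice of evidence kinds rather than `&[&KmerIndex]`, the `u8` types for `b` and `z` are assumptions, and `MergeError` is a hypothetical stand-in for `OKIError`. Rejecting an empty source list is also a choice made for this sketch.

```rust
/// Simplified evidence kind; the integer widths of `b` and `z` are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Hypothetical stand-in for `OKIError::IncompatibleEvidence`.
#[derive(Debug, PartialEq, Eq)]
pub enum MergeError {
    IncompatibleEvidence(String),
}

/// Accept only if every source has exactly the same kind as the first one.
pub fn validate_evidence_compat_sketch(
    kinds: &[EvidenceKind],
) -> Result<EvidenceKind, MergeError> {
    let first = match kinds.first() {
        Some(kind) => *kind,
        None => {
            return Err(MergeError::IncompatibleEvidence(
                "no source indexes given".to_string(),
            ))
        }
    };
    for (i, kind) in kinds.iter().enumerate().skip(1) {
        // Derived equality compares the variant and, for Approx, both b and z.
        if *kind != first {
            return Err(MergeError::IncompatibleEvidence(format!(
                "source {i} uses {kind:?} but source 0 uses {first:?}; run `reindex` first"
            )));
        }
    }
    Ok(first)
}
```
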
---

## Genome label deduplication

`compute_labels(sources, rename_duplicates)` assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true, or trigger `OKIError::DuplicateGenomeLabel` otherwise.

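
The renaming rule can be illustrated with a small sketch. It assumes the labels have already been flattened into one list in source order; `compute_labels_sketch` and `LabelError` are hypothetical names, and the sketch does not handle the case where a generated name such as `a.1` collides with a label that already exists.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `OKIError::DuplicateGenomeLabel`.
#[derive(Debug, PartialEq, Eq)]
pub enum LabelError {
    DuplicateGenomeLabel(String),
}

/// Assign final labels: first occurrence unchanged, later ones get `.1`, `.2`, ...
pub fn compute_labels_sketch(
    labels: &[&str],
    rename_duplicates: bool,
) -> Result<Vec<String>, LabelError> {
    // Number of times each label has been seen so far.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(labels.len());
    for &label in labels {
        let count = seen.entry(label).or_insert(0);
        let n = *count;
        *count += 1;
        if n == 0 {
            out.push(label.to_string());
        } else if rename_duplicates {
            out.push(format!("{label}.{n}"));
        } else {
            return Err(LabelError::DuplicateGenomeLabel(label.to_string()));
        }
    }
    Ok(out)
}
```
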
---

## Algorithm

### 1. Validation

Check all sources against the constraints above. Abort on any mismatch.

### 2. Bootstrap output from first source

Recursively copy `sources[0]` → `output`. Immediately after the copy:

- `index.meta` is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
- In `Presence` mode, any `counts/` directories inherited from `sources[0]` are removed.
- `spectrums/` from `sources[0]` is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.

This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … `n_dst_genomes - 1` in the destination.

### 3. For each subsequent source (parallel across partitions)

`KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits)` is called for each partition index `i`. `block_bits` is taken from `dst.meta.config.block_bits`.

Each entry in `sources` is `(&KmerPartition, n_genomes)` where `n_genomes` is the column count that source contributes (> 1 when the source is itself a merged index).

**First merge, Presence mode**: when `n_dst_genomes == 1`, `Layer::<()>::init_presence_matrix` is called on every existing destination layer before any source column is appended. This creates `presence/col_000000.pbiv` set all-true (genome 0 is present in every slot).

**Pass 1: classify kmers**

Iterate all kmers from all source partitions (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

- **Hit**: kmer already in the destination; record for Pass 2.
- **Miss**: push kmer into a `GraphDeBruijn` accumulator.

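
The hit/miss split of Pass 1 can be sketched with standard collections. This is only an illustration: a `HashSet<u64>` stands in for the destination `LayeredMap<()>`, a `Vec<u64>` stands in for the `GraphDeBruijn` accumulator, kmers are assumed to be already canonical and packed into `u64`, and deduplicating misses is an assumption about how the accumulator behaves.

```rust
use std::collections::HashSet;

/// Split source kmers into those already in the destination (hits)
/// and those that must go into a new layer (misses).
pub fn classify_kmers(dst: &HashSet<u64>, source_kmers: &[u64]) -> (Vec<u64>, Vec<u64>) {
    let mut hits = Vec::new();
    let mut misses = Vec::new();
    // Tracks misses already pushed, so a kmer shared by several sources
    // enters the accumulator only once.
    let mut seen_misses = HashSet::new();
    for &kmer in source_kmers {
        if dst.contains(&kmer) {
            hits.push(kmer);
        } else if seen_misses.insert(kmer) {
            misses.push(kmer);
        }
    }
    (hits, misses)
}
```
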
**New layer construction**

If the accumulator is non-empty, compute de Bruijn unitigs and call `Layer::<()>::build(&new_layer_dir, block_bits)`. All kmers absent from the destination, across **all** sources, accumulate into a **single** graph, producing one new layer per merge operation (not one per source).

**Pass 2: fill column builders**

For each source and each of its layers, re-iterate unitigs and look up stored values via `SrcLayerData::lookup(kmer, src_n)`:

- `SrcLayerData::SetMembership`: no data directory exists; every kmer returns `vec![1; n_genomes]`
- `SrcLayerData::Presence`: reads `PersistentBitMatrix` from `presence/`
- `SrcLayerData::Count`: reads `PersistentCompactIntMatrix` from `counts/`

Hits are routed to `exist_builders[dst_layer][src_col]`; misses are routed to `new_src_builders[src_col]`.

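
The three lookup behaviours can be pictured with an in-memory sketch. It is not the real `SrcLayerData`: the persistent matrices are replaced by nested `Vec`s indexed as `[genome][slot]`, the lookup takes a slot index instead of a kmer (the MPHF step is skipped), and returning `u32` values in every mode is an assumption.

```rust
/// Simplified stand-in for `SrcLayerData`; columns are indexed as [genome][slot].
pub enum SrcLayerDataSketch {
    /// No data directory: membership in the layer is the only information.
    SetMembership,
    /// One boolean column per genome.
    Presence(Vec<Vec<bool>>),
    /// One count column per genome.
    Count(Vec<Vec<u32>>),
}

impl SrcLayerDataSketch {
    /// Return one value per source genome for the given slot.
    pub fn lookup(&self, slot: usize, n_genomes: usize) -> Vec<u32> {
        match self {
            // Every kmer in the layer is present in every genome of the source.
            SrcLayerDataSketch::SetMembership => vec![1; n_genomes],
            SrcLayerDataSketch::Presence(cols) => {
                cols.iter().map(|col| u32::from(col[slot])).collect()
            }
            SrcLayerDataSketch::Count(cols) => cols.iter().map(|col| col[slot]).collect(),
        }
    }
}
```
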
**Column prepending for new layers**

Before source columns are written to the new layer, `n_dst_genomes` absent columns (all-zero / all-false) are prepended, one per genome already in the index, so the column count invariant holds immediately after layer creation.

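
In Presence mode, the prepending step amounts to the following, sketched here with in-memory `Vec<bool>` columns rather than the persistent column builders. `new_layer_columns` is a hypothetical helper name.

```rust
/// Build the full column list of a freshly created layer:
/// `n_dst_genomes` all-false columns followed by the source columns.
pub fn new_layer_columns(
    n_dst_genomes: usize,
    n_slots: usize,
    src_columns: Vec<Vec<bool>>,
) -> Vec<Vec<bool>> {
    let mut cols = Vec::with_capacity(n_dst_genomes + src_columns.len());
    // Genomes already in the index cannot contain kmers that were misses,
    // so their columns in the new layer are entirely absent.
    for _ in 0..n_dst_genomes {
        cols.push(vec![false; n_slots]);
    }
    cols.extend(src_columns);
    cols
}
```
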
**Close and update metadata**

Close all builders; update `presence/meta.json` or `counts/meta.json` with `{"n": N, "n_cols": n_dst_genomes + n_src_total}`; increment `PartitionMeta::n_layers` if a new layer was added.

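
The metadata arithmetic and the column file naming used in this document can be sketched as two small helpers. Both function names are hypothetical; the `col_NNNNNN` pattern and the `.pbiv` / `.pciv` extensions are taken from the on-disk layout shown in this document, and the JSON is formatted by hand only to keep the sketch dependency-free.

```rust
/// File name of column `col` in a matrix directory
/// (`.pciv` under `counts/`, `.pbiv` under `presence/`).
pub fn column_file_name(col: usize, count_mode: bool) -> String {
    let ext = if count_mode { "pciv" } else { "pbiv" };
    format!("col_{:06}.{}", col, ext)
}

/// Contents of `meta.json` after the merge: `n` slots and the new column total.
pub fn matrix_meta_json(n: usize, n_dst_genomes: usize, n_src_total: usize) -> String {
    format!(
        "{{\"n\": {}, \"n_cols\": {}}}",
        n,
        n_dst_genomes + n_src_total
    )
}
```
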
### 4. Update index metadata

`index.meta` was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.

---

## `append_genome_column`

Defined on two concrete specialisations of `Layer<D>`:

```rust
impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```

Each appends one column file to the matrix subdirectory (`counts/` or `presence/`). In `merge_partition`, columns are written directly via `PersistentBitVecBuilder` / `PersistentCompactIntVecBuilder` rather than through these helpers, but the invariant they enforce is the same.

---

## Column count invariant

After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.

Maintained by three mechanisms:

1. **Existing layers**: `n_src_total` columns appended (one per source genome).
2. **New layers created during merge**: `n_dst_genomes` absent columns prepended before source columns.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0.

The invariant is a precondition of `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.

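
A check of this invariant could look like the sketch below. It assumes the per-layer column counts (the `n_cols` values from each `meta.json`) have already been collected into a nested list indexed as `[partition][layer]`; `find_column_mismatch` is a hypothetical helper, not part of the documented API.

```rust
/// Return the first `(partition, layer, n_cols)` whose column count differs
/// from `n_genomes`, or `None` if the invariant holds everywhere.
pub fn find_column_mismatch(
    n_cols: &[Vec<usize>],
    n_genomes: usize,
) -> Option<(usize, usize, usize)> {
    for (partition, layers) in n_cols.iter().enumerate() {
        for (layer, &cols) in layers.iter().enumerate() {
            if cols != n_genomes {
                return Some((partition, layer, cols));
            }
        }
    }
    None
}
```
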
---

## Error variants relevant to merge

| Variant | Condition |
|---|---|
| `OKIError::NotIndexed(path)` | Source not in `Indexed` state |
| `OKIError::IncompatibleConfig` | Mismatched `kmer_size`, `minimizer_size`, or `n_partitions` |
| `OKIError::MismatchedMode` | Count mode but a source has `with_counts=false` |
| `OKIError::IncompatibleEvidence(msg)` | Mixed exact/approx or different approx `(b, z)` |
| `OKIError::DuplicateGenomeLabel(label)` | Duplicate label and `rename_duplicates=false` |

---

## On-disk impact

After merging `G` genomes (`sources[0]` contributes `G0`, subsequent sources contribute the rest):

```
partitions/
  part_00000/
    index/
      meta.json                 ← n_layers updated if new layer added
      layer_0/
        mphf.bin                ← unchanged
        unitigs.bin             ← unchanged
        evidence.bin            ← unchanged
        presence/               ← created on first merge (Presence mode)
          meta.json             {"n": N, "n_cols": G}
          col_000000.pbiv       ← all-true (genome 0 … G0-1)
          col_000001.pbiv       ← next source
          ...
        counts/                 ← extended (Count mode)
          meta.json             {"n": N, "n_cols": G}
          col_000000.pciv       ← genome 0 counts (from original build)
          col_000001.pciv       ← next source
          ...
      layer_N/                  ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json             {"n": N1, "n_cols": G}
          col_000000.pbiv       ← all-false (absent for existing genomes)
          ...
spectrums/
  <label>.json                  ← one file per genome, rebuilt from all sources
index.meta                      ← complete genome list + evidence kind written at bootstrap
```