refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
@@ -26,13 +26,36 @@ Default mode is `Presence`. `Count` mode requires **all** source indexes to have
All source indexes must satisfy:

- `IndexState::Indexed` (fully built — `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_bits`
- Same `kmer_size`, `minimizer_size`, `n_partitions`
- Same evidence kind: all `Exact`, or all `Approx` with identical `(b, z)` parameters
- If `Count` mode: all sources must have `with_counts=true`

`--force`: if the output directory already exists, it is deleted before the merge begins.

---

## Evidence compatibility

`validate_evidence_compat(sources)` is called before any I/O. It compares each source's `EvidenceKind` against `sources[0]`:

- All `Exact` → accepted, output uses `Exact`
- All `Approx { b, z }` with same `(b, z)` → accepted, output uses those parameters
- Any other combination → `OKIError::IncompatibleEvidence`, with a message directing the user to run `reindex` first

Mixed exact/approx is a hard error, not a silent conversion.

```rust
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind>
```

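The compatibility rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `EvidenceKind` enum below is a stand-in with guessed field types (`u8`), the function takes the kinds directly rather than `&[&KmerIndex]`, `IncompatibleEvidence` stands in for `OKIError::IncompatibleEvidence`, and the behaviour for an empty source list is an assumption, not something the document specifies.

```rust
/// Stand-in for the document's `EvidenceKind`; the `u8` field types are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Hypothetical stand-in for `OKIError::IncompatibleEvidence(msg)`.
#[derive(Debug, PartialEq, Eq)]
pub struct IncompatibleEvidence(pub String);

/// Compare every source's evidence kind against the first one.
/// Any difference (exact vs approx, or different `(b, z)`) is a hard error.
pub fn validate_evidence_compat(
    kinds: &[EvidenceKind],
) -> Result<EvidenceKind, IncompatibleEvidence> {
    // Assumption: an empty source list is rejected.
    let first = *kinds
        .first()
        .ok_or_else(|| IncompatibleEvidence("no sources given".to_string()))?;
    for (i, kind) in kinds.iter().enumerate().skip(1) {
        if *kind != first {
            return Err(IncompatibleEvidence(format!(
                "source {i} uses {kind:?} but source 0 uses {first:?}; run `reindex` first"
            )));
        }
    }
    Ok(first)
}
```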
---

## Genome label deduplication

`compute_labels(sources, rename_duplicates)` assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true, or trigger `OKIError::DuplicateGenomeLabel` otherwise.

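A minimal sketch of the renaming rule, assuming the labels of all sources have already been flattened into one slice in merge order (the real function takes `sources`). `DuplicateGenomeLabel` is a hypothetical stand-in for the `OKIError` variant. The sketch does not guard against a generated name such as `a.1` colliding with a label that already exists; whether the real code does is not stated here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `OKIError::DuplicateGenomeLabel(label)`.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateGenomeLabel(pub String);

/// First occurrence keeps its name; later ones get `.1`, `.2`, ... or fail.
pub fn compute_labels(
    labels: &[&str],
    rename_duplicates: bool,
) -> Result<Vec<String>, DuplicateGenomeLabel> {
    // Number of times each original label has been seen so far.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(labels.len());
    for &label in labels {
        let count = seen.entry(label).or_insert(0);
        if *count == 0 {
            out.push(label.to_string());
        } else if rename_duplicates {
            out.push(format!("{label}.{count}"));
        } else {
            return Err(DuplicateGenomeLabel(label.to_string()));
        }
        *count += 1;
    }
    Ok(out)
}
```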
---

## Algorithm

### 1. Validation
@@ -41,36 +64,72 @@ Check all sources against the constraints above. Abort on any mismatch.
### 2. Bootstrap output from first source

Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
Recursive file copy of `sources[0]` → `output`. Immediately after the copy:

- `index.meta` is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
- In `Presence` mode, any `counts/` directories inherited from source_0 are removed.
- `spectrums/` from source_0 is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.

This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … `n_dst_genomes - 1` in the destination.

### 3. For each subsequent source (parallel across partitions)

For each partition, process one source at a time sequentially:
`KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits)` is called for each partition index `i`. `block_bits` is taken from `dst.meta.config.block_bits`.

**a. Classify kmers**
Each entry in `sources` is `(&KmerPartition, n_genomes)` where `n_genomes` is the column count that source contributes (> 1 when the source is itself a merged index).

Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:
**First merge, Presence mode**: when `n_dst_genomes == 1`, `Layer::<()>::init_presence_matrix` is called on every existing destination layer before any source column is appended. This creates `presence/col_000000.pbiv` set all-true (genome 0 is present in every slot).

- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
**Pass 1 — classify kmers**

**b. Extend existing layers**
Iterate all kmers from all source partitions (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).
- **Hit**: kmer already in the destination; record for Pass 2.
- **Miss**: push kmer into a `GraphDeBruijn` accumulator.

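The hit/miss split of Pass 1 can be illustrated with plain standard-library containers. In this sketch a `u64` stands in for a canonical kmer, a `HashSet<u64>` stands in for the destination `LayeredMap<()>`, and a second `HashSet<u64>` stands in for the `GraphDeBruijn` accumulator; none of these are the project's real types. Note that misses from all sources land in one shared accumulator.

```rust
use std::collections::HashSet;

/// Result of Pass 1 for one partition (illustrative types only).
#[derive(Debug, Default)]
pub struct Classified {
    /// `(source index, kmer)` pairs already present in the destination.
    pub hits: Vec<(usize, u64)>,
    /// Kmers absent from the destination, pooled across all sources.
    pub misses: HashSet<u64>,
}

/// Probe the destination for every kmer of every source.
pub fn classify_kmers(dst: &HashSet<u64>, sources: &[Vec<u64>]) -> Classified {
    let mut out = Classified::default();
    for (src_idx, source) in sources.iter().enumerate() {
        for &kmer in source {
            if dst.contains(&kmer) {
                // Hit: remembered so Pass 2 can fill the existing layers.
                out.hits.push((src_idx, kmer));
            } else {
                // Miss: goes into the single shared accumulator.
                out.misses.insert(kmer);
            }
        }
    }
    out
}
```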
If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).
**New layer construction**

**c. Build new layer for new kmers**
If the accumulator is non-empty, compute de Bruijn unitigs and call `Layer::<()>::build(&new_layer_dir, block_bits)`. All kmers absent from the destination — across **all** sources — accumulate into a **single** graph, producing one new layer per merge operation (not one per source).

If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation.
**Pass 2 — fill column builders**

**d. Update partition metadata**
For each source and each of its layers, re-iterate unitigs and look up stored values via `SrcLayerData::lookup(kmer, src_n)`:

Write updated `meta.json` with the incremented `n_layers` if a new layer was added.
- `SrcLayerData::SetMembership` — no data directory exists; every kmer returns `vec![1; n_genomes]`
- `SrcLayerData::Presence` — reads `PersistentBitMatrix` from `presence/`
- `SrcLayerData::Count` — reads `PersistentCompactIntMatrix` from `counts/`

Hits are routed to `exist_builders[dst_layer][src_col]`; misses are routed to `new_src_builders[src_col]`.

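The three `SrcLayerData` variants can be pictured as one enum with a uniform `lookup`. The sketch below is hypothetical: in-memory `HashMap`s stand in for `PersistentBitMatrix` and `PersistentCompactIntMatrix`, a `u64` stands in for a kmer, and returning all zeros for an unknown kmer is an assumption made only to keep the example total.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the per-layer source data described above.
pub enum SrcLayerData {
    /// No data directory: membership alone means "present in every genome".
    SetMembership,
    /// Stand-in for a bit matrix read from `presence/`.
    Presence(HashMap<u64, Vec<bool>>),
    /// Stand-in for a compact integer matrix read from `counts/`.
    Count(HashMap<u64, Vec<u32>>),
}

impl SrcLayerData {
    /// Return one value per source genome column for `kmer`.
    pub fn lookup(&self, kmer: u64, src_n: usize) -> Vec<u32> {
        match self {
            SrcLayerData::SetMembership => vec![1; src_n],
            SrcLayerData::Presence(rows) => match rows.get(&kmer) {
                // Presence bits are widened to 0/1 so both modes share a type.
                Some(row) => row.iter().map(|&p| u32::from(p)).collect(),
                None => vec![0; src_n], // assumption: unknown kmer is absent
            },
            SrcLayerData::Count(rows) => rows
                .get(&kmer)
                .cloned()
                .unwrap_or_else(|| vec![0; src_n]), // same assumption
        }
    }
}
```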
**Column prepending for new layers**

Before source columns are written to the new layer, `n_dst_genomes` absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation.

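As a small illustration of the prepending step in Presence mode, the sketch below builds the column list of a new layer in memory. `new_layer_columns` and its `Vec<Vec<bool>>` representation are hypothetical; the real code writes column files through builders rather than holding columns in memory.

```rust
/// Build the column list of a freshly created layer (Presence mode).
/// `n_slots` is the number of kmers in the new layer.
pub fn new_layer_columns(
    n_slots: usize,
    n_dst_genomes: usize,
    src_columns: Vec<Vec<bool>>,
) -> Vec<Vec<bool>> {
    assert!(
        src_columns.iter().all(|c| c.len() == n_slots),
        "every source column must have one entry per slot"
    );
    // One all-false column per genome already in the index: none of them
    // contains any kmer of the new layer.
    let mut columns = vec![vec![false; n_slots]; n_dst_genomes];
    // Source columns follow, so the layer has n_dst_genomes + n_src columns.
    columns.extend(src_columns);
    columns
}
```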
**Close and update metadata**

Close all builders; update `presence/meta.json` or `counts/meta.json` with `{"n": N, "n_cols": n_dst_genomes + n_src_total}`; increment `PartitionMeta::n_layers` if a new layer was added.

### 4. Update index metadata

Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.
`index.meta` was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.

---

## `append_genome_column`

Defined on two concrete specialisations of `Layer<D>`:

```rust
impl Layer<PersistentCompactIntMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}

impl Layer<PersistentBitMatrix> {
    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```

Each appends one column file to the matrix subdirectory (`counts/` or `presence/`). In `merge_partition`, columns are written directly via `PersistentBitVecBuilder` / `PersistentCompactIntVecBuilder` rather than through these helpers, but the invariant they enforce is the same.

---

@@ -78,27 +137,31 @@ Append the merged source's genome label(s) to `index.meta.genomes`. Write update
After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.

This is maintained by three mechanisms:
Maintained by three mechanisms:

1. **Existing layers**: one column appended per source genome (`append_genome_column`).
2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0 before any source column is appended.
1. **Existing layers**: `n_src_total` columns appended (one per source genome).
2. **New layers created during merge**: `n_dst_genomes` absent columns prepended before source columns.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0.

The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
The invariant is a precondition of `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.

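The invariant lends itself to a simple post-merge check. The helpers below are hypothetical and not part of the described API: `expected_columns` computes `n_dst_genomes + n_src_total`, and `check_column_invariant` scans per-layer column counts (given here as plain numbers, one `Vec` per partition) and reports the first layer that deviates.

```rust
/// Column count every layer should have after the merge:
/// existing genomes plus all genomes contributed by the sources.
pub fn expected_columns(n_dst_genomes: usize, src_genomes: &[usize]) -> usize {
    n_dst_genomes + src_genomes.iter().sum::<usize>()
}

/// `layer_cols[p][l]` is the column count of layer `l` in partition `p`.
/// Returns the first `(partition, layer, n_cols)` that breaks the invariant.
pub fn check_column_invariant(
    layer_cols: &[Vec<usize>],
    n_genomes: usize,
) -> Result<(), (usize, usize, usize)> {
    for (p, layers) in layer_cols.iter().enumerate() {
        for (l, &n_cols) in layers.iter().enumerate() {
            if n_cols != n_genomes {
                return Err((p, l, n_cols));
            }
        }
    }
    Ok(())
}
```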
---

## New layer construction
## Error variants relevant to merge

All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.

De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction.
| Variant | Condition |
|---|---|
| `OKIError::NotIndexed(path)` | Source not in `Indexed` state |
| `OKIError::IncompatibleConfig` | Mismatched `kmer_size`, `minimizer_size`, or `n_partitions` |
| `OKIError::MismatchedMode` | Count mode but a source has `with_counts=false` |
| `OKIError::IncompatibleEvidence(msg)` | Mixed exact/approx or different approx `(b, z)` |
| `OKIError::DuplicateGenomeLabel(label)` | Duplicate label and `rename_duplicates=false` |

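To show how the first three rows of the table map onto the validation step, here is a rough sketch. `SourceInfo` and `MergeError` are hypothetical stand-ins (the real code works on `KmerIndex` and `OKIError`), the payload types are guesses, and the order in which the checks fire is an assumption.

```rust
/// Hypothetical summary of the fields the validation step inspects.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceInfo {
    pub path: String,
    pub indexed: bool,
    pub kmer_size: usize,
    pub minimizer_size: usize,
    pub n_partitions: usize,
    pub with_counts: bool,
}

/// Stand-in for the matching `OKIError` variants.
#[derive(Debug, PartialEq, Eq)]
pub enum MergeError {
    NotIndexed(String),
    IncompatibleConfig,
    MismatchedMode,
}

/// Check every source against the first one; abort on the first mismatch.
pub fn validate_sources(sources: &[SourceInfo], count_mode: bool) -> Result<(), MergeError> {
    let first = match sources.first() {
        Some(s) => s,
        None => return Ok(()), // assumption: nothing to validate
    };
    for s in sources {
        if !s.indexed {
            return Err(MergeError::NotIndexed(s.path.clone()));
        }
        let same_config = (s.kmer_size, s.minimizer_size, s.n_partitions)
            == (first.kmer_size, first.minimizer_size, first.n_partitions);
        if !same_config {
            return Err(MergeError::IncompatibleConfig);
        }
        if count_mode && !s.with_counts {
            return Err(MergeError::MismatchedMode);
        }
    }
    Ok(())
}
```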
---

## On-disk impact

After merging `G` genomes (1 existing + G-1 new sources):
After merging `G` genomes (sources_0 contributes `G0`, subsequent sources the rest):

```
partitions/
@@ -106,28 +169,28 @@ partitions/
    index/
      meta.json ← n_layers updated if new layer added
      layer_0/
        mphf.bin ← unchanged (MPHF immutable)
        mphf.bin ← unchanged
        unitigs.bin ← unchanged
        evidence.bin ← unchanged
        presence/ ← created on first merge (Presence mode)
          meta.json {"n": N, "n_cols": G}
          col_000000.pbiv ← all-true (genome 0)
          col_000001.pbiv ← source 1 presence
          col_000000.pbiv ← all-true (genome 0 … G0-1)
          col_000001.pbiv ← next source
          ...
        counts/ ← extended (Count mode)
          meta.json {"n": N, "n_cols": G}
          col_000000.pciv ← genome 0 counts (from original build)
          col_000001.pciv ← source 1 counts
          col_000001.pciv ← next source
          ...
      layer_1/ ← new layer (if new kmers found)
      layer_N/ ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json {"n": N1, "n_cols": G}
          col_000000.pbiv ← all-false (genome 0 absent from this layer)
          col_000000.pbiv ← all-false (absent for existing genomes)
          ...
          col_000001.pbiv ← source 1 presence in this new layer
spectrums/
  <label>.json ← one file per genome, rebuilt from all sources
index.meta ← complete genome list + evidence kind written at bootstrap
```

The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.
