refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and the chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside the existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
Eric Coissac
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
+97 -34
@@ -26,13 +26,36 @@ Default mode is `Presence`. `Count` mode requires **all** source indexes to have
All source indexes must satisfy:
- `IndexState::Indexed` (fully built — `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_bits`
- Same `kmer_size`, `minimizer_size`, `n_partitions`
- Same evidence kind: all `Exact`, or all `Approx` with identical `(b, z)` parameters
- If `Count` mode: all sources must have `with_counts=true`
`--force`: if the output directory already exists, it is deleted before the merge begins.
---
## Evidence compatibility
`validate_evidence_compat(sources)` is called before any I/O. It compares each source's `EvidenceKind` against `sources[0]`:
- All `Exact` → accepted, output uses `Exact`
- All `Approx { b, z }` with same `(b, z)` → accepted, output uses those parameters
- Any other combination → `OKIError::IncompatibleEvidence`, with a message directing the user to run `reindex` first
Mixed exact/approx is a hard error, not a silent conversion.
```rust
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind>
```
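The acceptance rules above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the `EvidenceKind` stand-in, the `u8` types of `b` and `z`, the `CompatError` enum, and the function name `check_evidence_compat` are all assumptions made for the example, and it works on a slice of kinds rather than on `&[&KmerIndex]`.

```rust
/// Hypothetical stand-in for the real `EvidenceKind`; the integer types
/// of `b` and `z` are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Hypothetical error type standing in for `OKIError`.
#[derive(Debug, PartialEq, Eq)]
enum CompatError {
    NoSources,
    IncompatibleEvidence(String),
}

/// Compare every source's evidence kind against the first one and
/// return the kind the output index would use.
fn check_evidence_compat(kinds: &[EvidenceKind]) -> Result<EvidenceKind, CompatError> {
    let first = *kinds.first().ok_or(CompatError::NoSources)?;
    for (i, kind) in kinds.iter().enumerate().skip(1) {
        // Derived equality covers both cases: Exact vs Approx, and
        // Approx with differing (b, z).
        if *kind != first {
            return Err(CompatError::IncompatibleEvidence(format!(
                "source {i} has {kind:?} but source 0 has {first:?}; run `reindex` first"
            )));
        }
    }
    Ok(first)
}
```

Because the comparison is plain equality against `sources[0]`, a mixed exact/approx set and an approx set with differing `(b, z)` fail through the same path.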
---
## Genome label deduplication
`compute_labels(sources, rename_duplicates)` assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true, or trigger `OKIError::DuplicateGenomeLabel` otherwise.
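A rough sketch of the renaming rule, assuming labels arrive as a flat list in source order. The function name `dedup_labels` and the error struct are hypothetical, and the sketch does not handle a generated name such as `a.1` colliding with a label that already exists; whether the real `compute_labels` guards against that is not stated here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `OKIError::DuplicateGenomeLabel`.
#[derive(Debug, PartialEq, Eq)]
struct DuplicateGenomeLabel(String);

/// First occurrence keeps its name; later occurrences get `.1`, `.2`, ...
/// when `rename_duplicates` is true, otherwise the first duplicate is an error.
fn dedup_labels(
    labels: &[&str],
    rename_duplicates: bool,
) -> Result<Vec<String>, DuplicateGenomeLabel> {
    // Number of times each label has been seen so far.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(labels.len());
    for &label in labels {
        let count = seen.entry(label).or_insert(0);
        let n = *count;
        *count += 1;
        if n == 0 {
            out.push(label.to_string());
        } else if rename_duplicates {
            out.push(format!("{label}.{n}"));
        } else {
            return Err(DuplicateGenomeLabel(label.to_string()));
        }
    }
    Ok(out)
}
```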
---
## Algorithm
### 1. Validation
@@ -41,36 +64,72 @@ Check all sources against the constraints above. Abort on any mismatch.
### 2. Bootstrap output from first source
Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
Recursive file copy of `sources[0]` → `output`. Immediately after the copy:
- `index.meta` is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
- In `Presence` mode, any `counts/` directories inherited from `sources[0]` are removed.
- `spectrums/` from `sources[0]` is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.
This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … `n_dst_genomes - 1` in the destination.
### 3. For each subsequent source (parallel across partitions)
For each partition, process one source at a time sequentially:
`KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits)` is called for each partition index `i`. `block_bits` is taken from `dst.meta.config.block_bits`.
**a. Classify kmers**
Each entry in `sources` is `(&KmerPartition, n_genomes)` where `n_genomes` is the column count that source contributes (> 1 when the source is itself a merged index).
Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:
**First merge, Presence mode**: when `n_dst_genomes == 1`, `Layer::<()>::init_presence_matrix` is called on every existing destination layer before any source column is appended. This creates `presence/col_000000.pbiv` set all-true (genome 0 is present in every slot).
- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
**Pass 1 — classify kmers**
**b. Extend existing layers**
Iterate all kmers from all source partitions (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:
For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).
- **Hit**: kmer already in the destination; record for Pass 2.
- **Miss**: push kmer into a `GraphDeBruijn` accumulator.
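The hit/miss split of Pass 1 can be illustrated with in-memory stand-ins. In this sketch a `HashMap` plays the role of the destination `LayeredMap<()>` probe, a `BTreeSet` plays the role of the `GraphDeBruijn` accumulator, and kmers are plain `u64` values; none of these choices reflect the actual types, and `classify_kmers` is a hypothetical name.

```rust
use std::collections::{BTreeSet, HashMap};

/// Outcome of Pass 1 for one partition (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq)]
struct Classified {
    /// (kmer, destination layer, slot) for kmers already in the destination.
    hits: Vec<(u64, usize, usize)>,
    /// Kmers absent from the destination, deduplicated across sources.
    misses: BTreeSet<u64>,
}

/// Probe the destination for every source kmer and split hits from misses.
fn classify_kmers(dst: &HashMap<u64, (usize, usize)>, source_kmers: &[u64]) -> Classified {
    let mut out = Classified::default();
    for &kmer in source_kmers {
        match dst.get(&kmer) {
            // Hit: remember where the kmer lives for Pass 2.
            Some(&(layer, slot)) => out.hits.push((kmer, layer, slot)),
            // Miss: accumulate into the single shared set.
            None => {
                out.misses.insert(kmer);
            }
        }
    }
    out
}
```

Using one shared set for misses mirrors the point made in the text: a kmer missing from the destination but present in several sources ends up in the new layer once.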
If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).
**New layer construction**
**c. Build new layer for new kmers**
If the accumulator is non-empty, compute de Bruijn unitigs and call `Layer::<()>::build(&new_layer_dir, block_bits)`. All kmers absent from the destination — across **all** sources — accumulate into a **single** graph, producing one new layer per merge operation (not one per source).
If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation.
**Pass 2 — fill column builders**
**d. Update partition metadata**
For each source and each of its layers, re-iterate unitigs and look up stored values via `SrcLayerData::lookup(kmer, src_n)`:
Write updated `meta.json` with the incremented `n_layers` if a new layer was added.
- `SrcLayerData::SetMembership` — no data directory exists; every kmer returns `vec![1; n_genomes]`
- `SrcLayerData::Presence` — reads `PersistentBitMatrix` from `presence/`
- `SrcLayerData::Count` — reads `PersistentCompactIntMatrix` from `counts/`
Hits are routed to `exist_builders[dst_layer][src_col]`; misses are routed to `new_src_builders[src_col]`.
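The routing rule can be pictured with nested vectors in place of the on-disk builders. This is only a sketch for Presence mode: the `Probe` enum, the `ColumnBuilders` struct, and the idea that each builder is a pre-sized `Vec<bool>` indexed by slot are assumptions for illustration, not the behaviour of `PersistentBitVecBuilder`.

```rust
/// Outcome of probing the destination for one source kmer (hypothetical type).
#[derive(Debug, Clone, Copy)]
enum Probe {
    Hit { dst_layer: usize, slot: usize },
    Miss { new_slot: usize },
}

/// In-memory stand-ins for the column builders.
struct ColumnBuilders {
    /// Indexed as exist[dst_layer][src_col][slot].
    exist: Vec<Vec<Vec<bool>>>,
    /// Indexed as new_src[src_col][slot] within the new layer.
    new_src: Vec<Vec<bool>>,
}

impl ColumnBuilders {
    /// One all-false column per source column, for each existing layer
    /// and for the new layer.
    fn new(layer_sizes: &[usize], new_layer_size: usize, n_src_cols: usize) -> Self {
        Self {
            exist: layer_sizes
                .iter()
                .map(|&n| vec![vec![false; n]; n_src_cols])
                .collect(),
            new_src: vec![vec![false; new_layer_size]; n_src_cols],
        }
    }

    /// Write one source value into the builder selected by the probe result.
    fn route(&mut self, probe: Probe, src_col: usize, present: bool) {
        match probe {
            Probe::Hit { dst_layer, slot } => self.exist[dst_layer][src_col][slot] = present,
            Probe::Miss { new_slot } => self.new_src[src_col][new_slot] = present,
        }
    }
}
```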
**Column prepending for new layers**
Before source columns are written to the new layer, `n_dst_genomes` absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation.
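As a small illustration of the prepending step in Presence mode, assuming columns are modelled as `Vec<bool>` and `new_layer_columns` is a hypothetical helper rather than part of the crate:

```rust
/// Build the column list of a freshly created layer: `n_dst_genomes`
/// all-false columns first, then the columns contributed by the sources.
fn new_layer_columns(
    n_dst_genomes: usize,
    n_slots: usize,
    src_columns: Vec<Vec<bool>>,
) -> Vec<Vec<bool>> {
    let mut cols = Vec::with_capacity(n_dst_genomes + src_columns.len());
    // Genomes already in the index cannot contain any kmer of a new layer,
    // so their columns are absent everywhere.
    cols.extend(std::iter::repeat(vec![false; n_slots]).take(n_dst_genomes));
    cols.extend(src_columns);
    cols
}
```

The resulting length is `n_dst_genomes + src_columns.len()`, which matches the `n_cols` value the existing layers reach after their own columns are appended.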
**Close and update metadata**
Close all builders; update `presence/meta.json` or `counts/meta.json` with `{"n": N, "n_cols": n_dst_genomes + n_src_total}`; increment `PartitionMeta::n_layers` if a new layer was added.
### 4. Update index metadata
Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.
`index.meta` was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.
---
## `append_genome_column`
Defined on two concrete specialisations of `Layer<D>`:
```rust
impl Layer<PersistentCompactIntMatrix> {
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
}
impl Layer<PersistentBitMatrix> {
pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
}
```
Each appends one column file to the matrix subdirectory (`counts/` or `presence/`). In `merge_partition`, columns are written directly via `PersistentBitVecBuilder` / `PersistentCompactIntVecBuilder` rather than through these helpers, but the invariant they enforce is the same.
---
@@ -78,27 +137,31 @@ Append the merged source's genome label(s) to `index.meta.genomes`. Write update
After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.
This is maintained by three mechanisms:
Maintained by three mechanisms:
1. **Existing layers**: one column appended per source genome (`append_genome_column`).
2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0 before any source column is appended.
1. **Existing layers**: `n_src_total` columns appended (one per source genome).
2. **New layers created during merge**: `n_dst_genomes` absent columns prepended before source columns.
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0.
The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
The invariant is a precondition of `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
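A check of the invariant could look like the sketch below. It assumes the per-layer column counts have already been read (for example from each `meta.json`) into a nested slice; `check_column_invariant` and the `Violation` alias are hypothetical names, not part of the crate.

```rust
/// (partition index, layer index, observed column count) of the first violation.
type Violation = (usize, usize, usize);

/// `partitions[p][l]` holds the column count of layer `l` in partition `p`.
/// Returns the first layer whose column count differs from `n_genomes`.
fn check_column_invariant(partitions: &[Vec<usize>], n_genomes: usize) -> Result<(), Violation> {
    for (p, layers) in partitions.iter().enumerate() {
        for (l, &n_cols) in layers.iter().enumerate() {
            if n_cols != n_genomes {
                return Err((p, l, n_cols));
            }
        }
    }
    Ok(())
}
```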
---
## New layer construction
## Error variants relevant to merge
All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.
De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction.
| Variant | Condition |
|---|---|
| `OKIError::NotIndexed(path)` | Source not in `Indexed` state |
| `OKIError::IncompatibleConfig` | Mismatched `kmer_size`, `minimizer_size`, or `n_partitions` |
| `OKIError::MismatchedMode` | Count mode but a source has `with_counts=false` |
| `OKIError::IncompatibleEvidence(msg)` | Mixed exact/approx or different approx `(b, z)` |
| `OKIError::DuplicateGenomeLabel(label)` | Duplicate label and `rename_duplicates=false` |
---
## On-disk impact
After merging `G` genomes (1 existing + G-1 new sources):
After merging `G` genomes (`sources[0]` contributes `G0`, subsequent sources contribute the remaining `G - G0`):
```
partitions/
@@ -106,28 +169,28 @@ partitions/
index/
meta.json ← n_layers updated if new layer added
layer_0/
mphf.bin ← unchanged (MPHF immutable)
mphf.bin ← unchanged
unitigs.bin ← unchanged
evidence.bin ← unchanged
presence/ ← created on first merge (Presence mode)
meta.json {"n": N, "n_cols": G}
col_000000.pbiv ← all-true (genome 0)
col_000001.pbiv ← source 1 presence
col_000000.pbiv ← all-true (genome 0 … G0-1)
col_000001.pbiv ← next source
...
counts/ ← extended (Count mode)
meta.json {"n": N, "n_cols": G}
col_000000.pciv ← genome 0 counts (from original build)
col_000001.pciv ← source 1 counts
col_000001.pciv ← next source
...
layer_1/ ← new layer (if new kmers found)
layer_N/ ← new layer (if new kmers found)
mphf.bin
unitigs.bin
evidence.bin
presence/ or counts/
meta.json {"n": N1, "n_cols": G}
col_000000.pbiv ← all-false (genome 0 absent from this layer)
col_000000.pbiv ← all-false (absent for existing genomes)
...
col_000001.pbiv ← source 1 presence in this new layer
spectrums/
<label>.json ← one file per genome, rebuilt from all sources
index.meta ← complete genome list + evidence kind written at bootstrap
```
The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.