refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
@@ -26,13 +26,36 @@ Default mode is `Presence`. `Count` mode requires **all** source indexes to have
 All source indexes must satisfy:

 - `IndexState::Indexed` (fully built — `index.done` sentinel present)
- Same `kmer_size`, `minimizer_size`, `n_bits`
+- Same `kmer_size`, `minimizer_size`, `n_partitions`
+- Same evidence kind: all `Exact`, or all `Approx` with identical `(b, z)` parameters
 - If `Count` mode: all sources must have `with_counts=true`

 `--force`: if the output directory already exists, it is deleted before the merge begins.

 ---

+## Evidence compatibility
+
+`validate_evidence_compat(sources)` is called before any I/O. It compares each source's `EvidenceKind` against `sources[0]`:
+
+- All `Exact` → accepted, output uses `Exact`
+- All `Approx { b, z }` with same `(b, z)` → accepted, output uses those parameters
+- Any other combination → `OKIError::IncompatibleEvidence`, with a message directing the user to run `reindex` first
+
+Mixed exact/approx is a hard error, not a silent conversion.
+
+```rust
+fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind>
+```
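
The rule above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the field types of `Approx { b, z }`, the `MergeError` enum, and the function name `check_evidence_compat` are assumptions, and the sketch takes bare `EvidenceKind` values instead of `&[&KmerIndex]`.

```rust
/// Hypothetical stand-in for the evidence descriptor; the `u8` field
/// types are an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvidenceKind {
    Exact,
    Approx { b: u8, z: u8 },
}

/// Hypothetical error type; the real `OKIError` has more variants.
#[derive(Debug, PartialEq, Eq)]
pub enum MergeError {
    NoSources,
    IncompatibleEvidence(String),
}

/// Compare every source's evidence kind against the first one.
/// Derived equality on `Approx` also compares `(b, z)`.
pub fn check_evidence_compat(kinds: &[EvidenceKind]) -> Result<EvidenceKind, MergeError> {
    let (first, rest) = kinds.split_first().ok_or(MergeError::NoSources)?;
    for (i, kind) in rest.iter().enumerate() {
        if kind != first {
            return Err(MergeError::IncompatibleEvidence(format!(
                "source {} uses {:?} but source 0 uses {:?}; run `reindex` first",
                i + 1,
                kind,
                first
            )));
        }
    }
    Ok(*first)
}
```

Because the check is a plain equality against `sources[0]`, mixed exact/approx inputs and approx inputs with differing `(b, z)` fail through the same path.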
+
+---
+
+## Genome label deduplication
+
+`compute_labels(sources, rename_duplicates)` assigns final genome labels across all sources before any file is written. The first occurrence of a label keeps the original name. Subsequent occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true, or trigger `OKIError::DuplicateGenomeLabel` otherwise.
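
A minimal sketch of the suffixing rule, assuming labels arrive as a flat list in source order. The name `dedup_labels` and the `DuplicateGenomeLabel` struct are hypothetical stand-ins for `compute_labels` and the `OKIError` variant. The sketch does not handle the case where a generated name such as `a.1` collides with a label that already exists; how the real code treats that case is not stated here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `OKIError::DuplicateGenomeLabel`.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateGenomeLabel(pub String);

/// Assign final labels: the first occurrence keeps its name, later
/// occurrences get `.1`, `.2`, ... or cause an error.
pub fn dedup_labels(
    labels: &[&str],
    rename_duplicates: bool,
) -> Result<Vec<String>, DuplicateGenomeLabel> {
    // Number of times each original label has been seen so far.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(labels.len());
    for &label in labels {
        let count = seen.entry(label).or_insert(0);
        if *count == 0 {
            out.push(label.to_string());
        } else if rename_duplicates {
            out.push(format!("{}.{}", label, *count));
        } else {
            return Err(DuplicateGenomeLabel(label.to_string()));
        }
        *count += 1;
    }
    Ok(out)
}
```

Resolving every label before any file is written means a duplicate is reported before the output directory has been touched.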
+
+---
+
 ## Algorithm

 ### 1. Validation
@@ -41,36 +64,72 @@ Check all sources against the constraints above. Abort on any mismatch.

 ### 2. Bootstrap output from first source

-Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
+Recursive file copy of `sources[0]` → `output`. Immediately after the copy:
+
+- `index.meta` is rewritten with the final genome list (all sources, possibly renamed) and the effective evidence kind.
+- In `Presence` mode, any `counts/` directories inherited from `sources[0]` are removed.
+- `spectrums/` from `sources[0]` is removed and rebuilt from scratch across all sources, applying the (possibly renamed) labels.
+
+The copy establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genomes occupy columns 0 … `n_dst_genomes - 1` in the destination.

 ### 3. For each subsequent source (parallel across partitions)

-For each partition, process one source at a time sequentially:
+`KmerPartition::merge_partition(i, sources, mode, n_dst_genomes, block_bits)` is called for each partition index `i`. `block_bits` is taken from `dst.meta.config.block_bits`.

-**a. Classify kmers**
+Each entry in `sources` is `(&KmerPartition, n_genomes)` where `n_genomes` is the column count that source contributes (> 1 when the source is itself a merged index).

-Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:
+**First merge, Presence mode**: when `n_dst_genomes == 1`, `Layer::<()>::init_presence_matrix` is called on every existing destination layer before any source column is appended. This creates `presence/col_000000.pbiv` set all-true (genome 0 is present in every slot).

- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
+**Pass 1 — classify kmers**

-**b. Extend existing layers**
+Iterate all kmers from all source partitions (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:

-For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).
+- **Hit**: kmer already in the destination; record for Pass 2.
+- **Miss**: push kmer into a `GraphDeBruijn` accumulator.
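
The hit/miss split can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: a `HashMap` replaces the destination `LayeredMap<()>`, a `BTreeSet` replaces the `GraphDeBruijn` accumulator, and k-mers are plain `u64` values.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical packed canonical k-mer.
pub type Kmer = u64;
/// Hypothetical destination address: (layer index, slot in that layer).
pub type Slot = (usize, usize);

#[derive(Debug, Default)]
pub struct Pass1 {
    /// K-mers already in the destination, kept for Pass 2.
    pub hits: Vec<(Kmer, Slot)>,
    /// K-mers absent from the destination, pooled across all sources.
    pub misses: BTreeSet<Kmer>,
}

/// Probe the destination once per source k-mer and split hits from misses.
pub fn classify(dst: &HashMap<Kmer, Slot>, sources: &[Vec<Kmer>]) -> Pass1 {
    let mut out = Pass1::default();
    for kmers in sources {
        for &kmer in kmers {
            match dst.get(&kmer) {
                Some(&slot) => out.hits.push((kmer, slot)),
                None => {
                    // A miss shared by several sources is stored only once.
                    out.misses.insert(kmer);
                }
            }
        }
    }
    out
}
```

Pooling misses into one set mirrors the single accumulator per partition: a k-mer missing from the destination but present in two sources contributes one entry to the new layer.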

-If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).
+**New layer construction**

-**c. Build new layer for new kmers**
+If the accumulator is non-empty, compute de Bruijn unitigs and call `Layer::<()>::build(&new_layer_dir, block_bits)`. All kmers absent from the destination — across **all** sources — accumulate into a **single** graph, producing one new layer per merge operation (not one per source).

-If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation.
+**Pass 2 — fill column builders**

-**d. Update partition metadata**
+For each source and each of its layers, re-iterate unitigs and look up stored values via `SrcLayerData::lookup(kmer, src_n)`:

-Write updated `meta.json` with the incremented `n_layers` if a new layer was added.
+- `SrcLayerData::SetMembership` — no data directory exists; every kmer returns `vec![1; n_genomes]`
+- `SrcLayerData::Presence` — reads `PersistentBitMatrix` from `presence/`
+- `SrcLayerData::Count` — reads `PersistentCompactIntMatrix` from `counts/`
+
+Hits are routed to `exist_builders[dst_layer][src_col]`; misses are routed to `new_src_builders[src_col]`.
+
+**Column prepending for new layers**
+
+Before source columns are written to the new layer, `n_dst_genomes` absent columns (all-zero / all-false) are prepended — one per genome already in the index — so the column count invariant holds immediately after layer creation.
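
As a rough in-memory illustration of the prepending step, the sketch below models each column as a `Vec<bool>` (Presence mode). The function name `new_layer_columns` is hypothetical; per this document, the real code writes column files through `PersistentBitVecBuilder` instead of holding them in memory.

```rust
/// Build the column list for a freshly created layer: `n_dst_genomes`
/// all-false columns first, then one column per source genome.
pub fn new_layer_columns(
    n_dst_genomes: usize,
    n_slots: usize,
    src_cols: Vec<Vec<bool>>,
) -> Vec<Vec<bool>> {
    let mut cols = Vec::with_capacity(n_dst_genomes + src_cols.len());
    // Genomes already in the index cannot contain any k-mer of the new layer.
    for _ in 0..n_dst_genomes {
        cols.push(vec![false; n_slots]);
    }
    for col in src_cols {
        assert_eq!(col.len(), n_slots, "every column must cover every slot");
        cols.push(col);
    }
    cols
}
```

The result always has `n_dst_genomes + n_src_total` columns, the same count that existing layers reach after their source columns are appended.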
+
+**Close and update metadata**
+
+Close all builders; update `presence/meta.json` or `counts/meta.json` with `{"n": N, "n_cols": n_dst_genomes + n_src_total}`; increment `PartitionMeta::n_layers` if a new layer was added.

 ### 4. Update index metadata

-Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.
+`index.meta` was already written during bootstrap with the complete genome list and evidence kind. No further update is needed after the partition loop.
+
+---
+
+## `append_genome_column`
+
+Defined on two concrete specialisations of `Layer<D>`:
+
+```rust
+impl Layer<PersistentCompactIntMatrix> {
+    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
+}
+
+impl Layer<PersistentBitMatrix> {
+    pub fn append_genome_column(layer_dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
+}
+```
+
+Each appends one column file to the matrix subdirectory (`counts/` or `presence/`). In `merge_partition`, columns are written directly via `PersistentBitVecBuilder` / `PersistentCompactIntVecBuilder` rather than through these helpers, but the invariant they enforce is the same.

 ---

@@ -78,27 +137,31 @@ Append the merged source's genome label(s) to `index.meta.genomes`. Write update

 After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.

-This is maintained by three mechanisms:
+Maintained by three mechanisms:

-1. **Existing layers**: one column appended per source genome (`append_genome_column`).
-2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
-3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0 before any source column is appended.
+1. **Existing layers**: `n_src_total` columns appended (one per source genome).
+2. **New layers created during merge**: `n_dst_genomes` absent columns prepended before source columns.
+3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_000000.pbiv` all-true for genome 0.

-The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
+The invariant is a precondition of `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.

 ---

-## New layer construction
+## Error variants relevant to merge

-All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.
-
-De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction.
+| Variant | Condition |
+|---|---|
+| `OKIError::NotIndexed(path)` | Source not in `Indexed` state |
+| `OKIError::IncompatibleConfig` | Mismatched `kmer_size`, `minimizer_size`, or `n_partitions` |
+| `OKIError::MismatchedMode` | Count mode but a source has `with_counts=false` |
+| `OKIError::IncompatibleEvidence(msg)` | Mixed exact/approx or different approx `(b, z)` |
+| `OKIError::DuplicateGenomeLabel(label)` | Duplicate label and `rename_duplicates=false` |

 ---

 ## On-disk impact

-After merging `G` genomes (1 existing + G-1 new sources):
+After merging `G` genomes (`sources[0]` contributes `G0`; subsequent sources contribute the remaining `G - G0`):

 ```
 partitions/
@@ -106,28 +169,28 @@ partitions/
    index/
      meta.json              ← n_layers updated if new layer added
      layer_0/
-        mphf.bin             ← unchanged (MPHF immutable)
+        mphf.bin             ← unchanged
        unitigs.bin          ← unchanged
        evidence.bin         ← unchanged
        presence/            ← created on first merge (Presence mode)
          meta.json            {"n": N, "n_cols": G}
-          col_000000.pbiv      ← all-true (genome 0)
-          col_000001.pbiv      ← source 1 presence
+          col_000000.pbiv      ← all-true (genome 0 … G0-1)
+          col_000001.pbiv      ← next source
          ...
        counts/              ← extended (Count mode)
          meta.json            {"n": N, "n_cols": G}
          col_000000.pciv      ← genome 0 counts (from original build)
-          col_000001.pciv      ← source 1 counts
+          col_000001.pciv      ← next source
          ...
-      layer_1/               ← new layer (if new kmers found)
+      layer_N/               ← new layer (if new kmers found)
        mphf.bin
        unitigs.bin
        evidence.bin
        presence/ or counts/
          meta.json            {"n": N1, "n_cols": G}
-          col_000000.pbiv      ← all-false (genome 0 absent from this layer)
+          col_000000.pbiv      ← all-false (absent for existing genomes)
          ...
-          col_000001.pbiv      ← source 1 presence in this new layer
+spectrums/
+  <label>.json               ← one file per genome, rebuilt from all sources
+index.meta                   ← complete genome list + evidence kind written at bootstrap
 ```
-
-The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.