refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
@@ -88,9 +88,7 @@ Implemented as a three-step pipeline in `count_partition()`:
2. **Provisional MPHF** (ptr_hash): built from `sorted_unique.bin` via `new_from_par_iter(f0, ...)`. Stored to `mphf1.bin`; `sorted_unique.bin` deleted immediately.
3. **Accumulation pass**: re-read dereplicated superkmers; for each kmer, `slot = mphf.index(kmer.raw())`, increment `counts1[slot]` by the superkmer COUNT. Stored in a `PersistentCompactIntVec` (`counts1.bin`).

At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (`kmer_spectrum_raw.json`) is written.

Abundance filter applied here: kmers with `total_count < q` are discarded. `q` is a collection parameter (0 = keep all, including singletons for ≤1x data).
At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (`spectrums/{label}.json`) is written to the index root.

No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.

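The accumulation pass (step 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `ToyMphf` is a hypothetical stand-in for the ptr_hash MPHF (it uses the rank in a sorted key list), kmers are assumed to fit in a `u64`, and counts live in a plain `Vec<u32>` rather than a `PersistentCompactIntVec`.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the provisional MPHF: the rank of a key in the
/// sorted list of distinct kmers is a minimal perfect hash over that set.
struct ToyMphf {
    keys: Vec<u64>,
}

impl ToyMphf {
    fn build(distinct: &BTreeSet<u64>) -> Self {
        // BTreeSet iterates in ascending order, so `keys` is sorted.
        ToyMphf { keys: distinct.iter().copied().collect() }
    }

    /// Slot in [0, n) for a key that belongs to the build set.
    fn index(&self, key: u64) -> usize {
        self.keys.binary_search(&key).expect("key must be in the build set")
    }
}

/// A dereplicated super-kmer: the raw codes of its kmers and its COUNT.
struct SuperKmer {
    kmers: Vec<u64>,
    count: u32,
}

/// Add each super-kmer's COUNT to the slot of every kmer it contains.
fn accumulate(mphf: &ToyMphf, superkmers: &[SuperKmer]) -> Vec<u32> {
    let mut counts = vec![0u32; mphf.keys.len()];
    for sk in superkmers {
        for &kmer in &sk.kmers {
            let slot = mphf.index(kmer);
            counts[slot] = counts[slot].saturating_add(sk.count);
        }
    }
    counts
}
```

With super-kmers `{[10, 20], COUNT=1}` and `{[20, 30], COUNT=5}`, kmer 20 ends with a total of 6 although one of its super-kmers has COUNT=1, which illustrates why a pre-filter on super-kmer COUNT would be unsafe.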
@@ -140,19 +138,70 @@ Output: `unitigs.bin` — the permanent evidence structure for the partition. Ea
## Phase 6 — MPHF construction and index finalisation

Built once on the definitive kmer set (all kmers in all unitigs of the partition). See [obilayeredmap](obilayeredmap.md) and [MPHF selection](mphf.md) for the current implementation.
`build_index_layer` is called per partition (in parallel via `build_layers`) with the following parameters sourced from `IndexConfig`:

- `block_bits`: from `IndexConfig::block_bits`; controls the `.idx` block size (2^block_bits unitig chunks per block) for exact evidence
- `evidence`: `EvidenceKind::Exact` or `EvidenceKind::Approx { b, z }`; propagated unchanged from `IndexConfig::evidence`
- `min_ab` / `max_ab`: abundance bounds applied before graph construction
- `with_counts`: whether to store kmer counts alongside set membership

**Abundance filtering:** when `min_ab > 1` or `max_ab.is_some()`, the provisional `mphf1.bin` and `counts1.bin` produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in `counts1` satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).

```
kmers from unitigs → MPHF → mphf.bin
  → evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
  → payload : counts/ (mode 2) or presence/ (mode 3)
for each kmer in dereplicated super-kmer:
    ab = counts1[mphf1.index(kmer.raw())]
    if ab < min_ab || ab > max_ab: skip
    graph.push(kmer)
```
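The filter in the pseudocode above reduces to a bounds check with an optional upper limit. A minimal Rust sketch, assuming abundances are `u32`; `passes_abundance`, `filter_kmers`, and the `lookup` closure are hypothetical names introduced for illustration, with `lookup` standing in for `counts1[mphf1.index(kmer.raw())]`:

```rust
/// True when abundance `ab` lies within the configured bounds.
/// `max_ab` is optional, mirroring the `max_ab.is_some()` test in the text.
fn passes_abundance(ab: u32, min_ab: u32, max_ab: Option<u32>) -> bool {
    ab >= min_ab && max_ab.map_or(true, |max| ab <= max)
}

/// Keep only the kmers whose looked-up count satisfies the bounds.
/// `lookup` is a hypothetical stand-in for the mphf1/counts1 lookup.
fn filter_kmers<F>(kmers: &[u64], lookup: F, min_ab: u32, max_ab: Option<u32>) -> Vec<u64>
where
    F: Fn(u64) -> u32,
{
    kmers
        .iter()
        .copied()
        .filter(|&k| passes_abundance(lookup(k), min_ab, max_ab))
        .collect()
}
```

Modelling the upper bound as an `Option` keeps the "no upper bound" case explicit, so no sentinel value is needed.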

The MPHF is built in two passes over `unitigs.bin`: parallel pass for `mphf.bin`, sequential pass for `evidence.bin` and payload. The exact kmer count is available from the unitig index (`unitigs.bin.idx`) before the passes begin.
**Graph build and unitig write:**

**Exact verification via unitig evidence:**
The surviving kmers are fed into `GraphDeBruijn`, which computes degrees and yields unitigs. Unitigs are written to `layer_0/unitigs.bin` via a `UnitigFileWriter`.

`unitigs.bin` serves as the evidence structure. The MPHF maps every input to `[0, N)`, including absent kmers, so the unitig read-back (via `evidence.bin`) is the only correct membership test.
**MPHF and evidence build:**

`Layer::build` (membership-only) or `Layer::<PersistentCompactIntMatrix>::build` (with counts) is called next. Internally, `MphfLayer::build` performs two passes:

1. **Pass 1 (parallel):** build `unitigs.bin.idx` (block size = 2^`block_bits`), then construct the MPHF from all canonical kmers in `unitigs.bin`; store to `mphf.bin`.
2. **Pass 2 (sequential):** for each kmer in `unitigs.bin`, compute its slot and write `evidence.bin` (`chunk_id: 25 bits | rank: 7 bits` packed into a `u32`); also invoke the payload callback (`fill_slot`) to populate `counts/` if `with_counts`.

After `Layer::build` completes, `layer_meta.json` records `EvidenceKind::Exact`.

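The `evidence.bin` entry format described in pass 2 can be illustrated with a small pack/unpack pair. This is a sketch only: the text gives the field widths (25 and 7 bits) but not their order, so placing `chunk_id` in the high bits is an assumption, and `pack_evidence` / `unpack_evidence` are hypothetical names.

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F, largest rank
const MAX_CHUNK_ID: u32 = (1 << 25) - 1; // largest 25-bit chunk id

/// Pack (chunk_id, rank) into one u32; None if either field overflows.
/// Assumed layout: chunk_id in the high 25 bits, rank in the low 7 bits.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (chunk_id, rank).
fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

Under this layout a 7-bit rank addresses at most 128 kmer positions within a chunk, and 25 bits address up to 2^25 chunks per layer.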
**Approximate evidence override:**

If `evidence` is `EvidenceKind::Approx { b, z }`, `build_approx_evidence` is called immediately after `Layer::build`. It supersedes the exact evidence bundle by writing `fingerprint.bin` (one b-bit hash per slot) and rewrites `layer_meta.json` with `EvidenceKind::Approx { b, z }`. No `.idx` file is needed at query time in this mode.

```
// Exact path  → evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
// Approx path → fingerprint.bin + layer_meta.json(Approx{b,z})
//               (evidence.bin left on disk but not used)
```

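To make the approximate mode concrete, the sketch below builds one b-bit fingerprint per MPHF slot and checks a query against it. Everything here is an assumption for illustration: `seq_hash` is replaced by the standard library's `DefaultHasher`, fingerprints are stored unpacked in a `Vec<u32>` instead of a bit-packed `fingerprint.bin`, and the `slot_of` closure stands in for the MPHF.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `seq_hash`: any 64-bit hash of the kmer code.
fn seq_hash(kmer: u64) -> u64 {
    let mut h = DefaultHasher::new();
    kmer.hash(&mut h);
    h.finish()
}

/// Keep the low `b` bits of the hash (1 <= b <= 32 assumed).
fn fingerprint(kmer: u64, b: u32) -> u32 {
    assert!((1..=32).contains(&b));
    let mask = (1u64 << b) - 1;
    (seq_hash(kmer) & mask) as u32
}

/// Build step: store one fingerprint per slot.
fn build_fingerprints(kmers: &[u64], slot_of: impl Fn(u64) -> usize, b: u32) -> Vec<u32> {
    let mut table = vec![0u32; kmers.len()];
    for &k in kmers {
        table[slot_of(k)] = fingerprint(k, b);
    }
    table
}

/// Query step: a match is a probable hit, a mismatch proves absence.
fn probably_present(table: &[u32], slot: usize, kmer: u64, b: u32) -> bool {
    table[slot] == fingerprint(kmer, b)
}
```

Every indexed kmer matches its own fingerprint, so this scheme has no false negatives; an absent kmer is wrongly accepted only when its fingerprint happens to equal the one stored in the slot the MPHF assigns to it.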
**Partition metadata:**

After all layer files are written, `PartitionMeta { n_layers: 1 }` is serialised to `index/meta.json` inside the partition directory. This file is required by `LayeredMap::open` for subsequent merge operations.

**File layout per partition after phase 6:**

```
part_XXXXX/
  index/
    meta.json            ← PartitionMeta { n_layers: 1 }
    layer_0/
      unitigs.bin        ← permanent evidence (all modes)
      unitigs.bin.idx    ← block index (exact mode only)
      mphf.bin           ← MPHF
      evidence.bin       ← exact evidence (exact mode)
      fingerprint.bin    ← b-bit fingerprints (approx mode)
      layer_meta.json    ← EvidenceKind tag
      counts/            ← PersistentCompactIntMatrix (with_counts only)
```

**Cleanup:** unless `--keep-intermediate` is set, `remove_build_artifacts` deletes `dereplicated.skmer.zst`, `mphf1.bin`, and `counts1.bin` after all partitions are indexed.

See [obilayeredmap](obilayeredmap.md) and [MPHF selection](mphf.md) for data structure details.

**Query path (exact evidence):**

```
query kmer q
@@ -164,4 +213,12 @@ query kmer q
→ no match: kmer absent ← MPHF collision on absent kmer
```

`superkmers.bin.gz` is no longer needed at this point and can be deleted.
**Query path (approximate evidence):**

```
query kmer q
  → MPHF(q) → slot s
  → fingerprint[s] matches seq_hash(q)?
      → yes : probable hit (FP rate = 1/2^b per kmer, 1/2^(b·z) per z-window)
      → no  : kmer absent
```

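The two false-positive rates quoted in the approximate query path can be evaluated directly. A small sketch, taking the formulas as stated there (`1/2^b` per kmer, `1/2^(b·z)` per window of `z` kmers); `fp_rates` is a hypothetical helper, not part of the project:

```rust
/// False-positive rates for b-bit fingerprints, using the formulas from the
/// query path: 2^-b for one absent kmer, 2^-(b*z) for a window of z kmers.
fn fp_rates(b: u32, z: u32) -> (f64, f64) {
    let per_kmer = 0.5f64.powi(b as i32);
    let per_window = 0.5f64.powi((b * z) as i32);
    (per_kmer, per_window)
}
```

For example, `b = 8` gives 1/256 (about 0.39%) per kmer, and with `z = 3` the per-window rate drops to 2^-24, roughly 6 × 10^-8.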