refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
@@ -27,10 +27,10 @@ part_XXXXX/
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
`MphfLayer::build` is called on the unitig file:
`MphfLayer::build(dir, block_bits, fill_slot)` is called on the unitig directory:
1. **Pass 1**: iterate all canonical kmers from `unitigs.bin` in parallel, build and store `mphf.bin` (ptr_hash).
2. **Pass 2**: iterate sequentially, fill `evidence.bin`, call the mode-specific `fill_slot` callback.
1. **Pass 1**: build `.idx` via `build_unitig_idx(unitig_path, block_bits)`, then iterate all canonical kmers in parallel over chunks using `(0..unitigs.len()).into_par_iter()` + `unitigs.unitig(ci).into_canonical_kmers()`; build and store `mphf.bin` (ptr_hash, `new_from_par_iter`).
2. **Pass 2**: iterate sequentially with `iter_indexed_canonical_kmers`; fill `evidence.bin`; call `fill_slot(slot, kmer)` callback once per kmer for DataStore population.
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
@@ -105,9 +105,12 @@ Each layer is a self-contained unit. See [obilayeredmap](obilayeredmap.md) for t
```
layer_i/
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence)
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence source)
unitigs.bin.idx — random-access block index (block_bits controls granularity)
mphf.bin — ptr_hash phase-2 MPHF
evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
layer_meta.json — evidence kind, recorded at build time
```
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
@@ -116,9 +119,45 @@ Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0
2. Collect kmers of B not present in any layer → set `B \ A`.
3. Build layer 1 from `B \ A` (dereplicate → count → De Bruijn → unitigs → `MphfLayer::build`).
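
Step 2 above amounts to a set difference against every existing layer. The following is a minimal sketch of that filtering only, assuming kmers are represented as plain `u64` values and each layer's kmer set is available as a `HashSet`; the function name `new_layer_kmers` is hypothetical, and the real pipeline presumably probes the on-disk layers rather than in-memory sets.

```rust
use std::collections::HashSet;

/// Hypothetical helper: keep only the kmers of dataset B that no
/// existing layer already holds (the set `B \ A` for a single layer A).
/// Assumption: kmers are canonical and encoded as u64.
fn new_layer_kmers(existing_layers: &[HashSet<u64>], b: &[u64]) -> HashSet<u64> {
    b.iter()
        .copied()
        // A kmer is kept only if every existing layer lacks it.
        .filter(|kmer| !existing_layers.iter().any(|layer| layer.contains(kmer)))
        .collect()
}
```

Because a kmer is dropped as soon as any existing layer contains it, the resulting layer stays disjoint from all earlier ones, which is the invariant stated above.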
### Evidence modes
Two evidence modes are supported, selected at build time via `EvidenceKind` and recorded in `layer_meta.json`.
**Exact** (`EvidenceKind::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot, encoding the position of the corresponding kmer in `unitigs.bin`. Membership verification reconstructs the kmer from `(chunk_id, rank)` and compares it to the query. Zero false positives. Requires `.idx` for random access.
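
As an illustration of the exact-mode entry, the sketch below packs a `(chunk_id, rank)` pair into one `u32` using the 25-bit and 7-bit widths given in the layer layout. The bit order (chunk_id in the high bits, rank in the low bits) and the function names are assumptions made for this example, not taken from the source.

```rust
/// Assumed layout: chunk_id in the high 25 bits, rank in the low 7 bits.
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const CHUNK_ID_MAX: u32 = (1 << 25) - 1;

/// Hypothetical packer: returns None when a field does not fit its width.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > CHUNK_ID_MAX || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Hypothetical unpacker: recovers (chunk_id, rank) from one entry.
fn unpack_evidence(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MASK)
}
```

With these widths an entry can address up to 2^25 chunks and 128 kmer ranks per chunk.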
**Approx** (`EvidenceKind::Approx { b, z }`): `fingerprint.bin` stores a `b`-bit hash of the kmer at each MPHF slot. Membership check compares `kmer.seq_hash()` against the stored fingerprint. False-positive rate: 1/2^b per query. No `.idx` is written or needed.
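
The 1/2^b rate follows from comparing only `b` bits: an absent kmer whose hash is uniformly distributed matches the stored fingerprint by chance with probability 2^-b. A minimal sketch, assuming a 64-bit hash from which the low `b` bits are kept; which bits the real code keeps, and the return type of `seq_hash()`, are not stated in the source, so all names below are hypothetical.

```rust
/// Hypothetical fingerprint: keep the low `b` bits of a 64-bit hash.
/// Assumption: 1 <= b <= 64.
fn fingerprint(hash: u64, b: u32) -> u64 {
    assert!((1..=64).contains(&b));
    if b == 64 {
        hash
    } else {
        hash & ((1u64 << b) - 1)
    }
}

/// Chance that an absent key matches a stored b-bit fingerprint,
/// assuming uniformly distributed hashes: 2^-b.
fn false_positive_rate(b: u32) -> f64 {
    0.5f64.powi(b as i32)
}

/// Approx-mode check: compare the query's fingerprint to the stored one.
fn approx_matches(stored: u64, query_hash: u64, b: u32) -> bool {
    stored == fingerprint(query_hash, b)
}
```

For example, `b = 8` costs one byte per slot and gives a false-positive rate of 1/256 per probed layer.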
### Build functions
```
MphfLayer::build(dir, block_bits, fill_slot)
Pass 1: par_iter over chunks via .idx → build mphf.bin
Pass 2: sequential iter → fill evidence.bin + call fill_slot
MphfLayer::build_exact_evidence(dir, block_bits)
Standalone post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
Uses open_sequential(); no .idx required on entry
MphfLayer::build_approx_evidence(dir, b, z)
Standalone post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
Uses open_sequential(); never writes .idx
MphfLayer::build_evidence(dir, kind, block_bits)
Dispatch wrapper: routes to build_exact_evidence or build_approx_evidence
```
`build` always produces exact evidence. When approximate evidence is required (`EvidenceKind::Approx`), the caller invokes `build_approx_evidence` after `build` to replace the evidence bundle.
In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
### Membership verification
ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from `(chunk_id, rank)` and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.
ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:
- **Exact**: decode `(chunk_id, rank)` from `evidence.bin`; reconstruct the kmer via `unitigs.verify_canonical_kmer`; compare to query.
- **Approx**: compare `kmer.seq_hash()` to the b-bit fingerprint stored at the slot.
A mismatch in either mode means the kmer is absent from this layer; probe the next layer.
### Query algorithm
@@ -126,12 +165,12 @@ ptr_hash maps any input to a valid slot — it does not natively detect absent k
fn query(kmer) → Option<(layer_index, slot)>:
for (i, layer) in layers.iter().enumerate():
slot = layer.mphf.index(kmer)
if layer.evidence.decode(slot) matches kmer:
if layer.evidence.matches(slot, kmer): // exact or approx dispatch
return Some((i, slot))
return None
```
Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.
`MphfLayer::find` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
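
The probe loop can be tried out on a toy model. In the sketch below, `ToyLayer` is a hypothetical stand-in for a real layer: `kmer % n` plays the role of the MPHF (like ptr_hash, it returns a valid slot for any input, present or not), and the key stored at each slot plays the role of exact evidence. None of these names come from the actual crate.

```rust
/// Toy stand-in for one layer (assumption, not the real `MphfLayer`).
/// Invariant: every stored key sits at slot `key % keys.len()`.
struct ToyLayer {
    keys: Vec<u64>,
}

impl ToyLayer {
    /// Like an MPHF, this yields a valid slot even for absent kmers.
    fn index(&self, kmer: u64) -> usize {
        (kmer % self.keys.len() as u64) as usize
    }

    /// Evidence check: the slot only counts if it really holds this kmer.
    fn matches(&self, slot: usize, kmer: u64) -> bool {
        self.keys[slot] == kmer
    }
}

/// Probe layers in order; return the first (layer_index, slot) that verifies.
fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, usize)> {
    for (i, layer) in layers.iter().enumerate() {
        if layer.keys.is_empty() {
            continue; // nothing to probe in an empty layer
        }
        let slot = layer.index(kmer);
        if layer.matches(slot, kmer) {
            return Some((i, slot));
        }
    }
    None
}

/// Two disjoint example layers that satisfy the slot invariant.
fn example_layers() -> Vec<ToyLayer> {
    vec![
        ToyLayer { keys: vec![8, 5, 2, 7] }, // 4 slots
        ToyLayer { keys: vec![9, 4, 11] },   // 3 slots
    ]
}
```

Querying `5` verifies in layer 0 after a single probe, `11` fails the evidence check in layer 0 and is found in layer 1, and `6` fails in both layers and is reported absent.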
### Merging layers