refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
+81
-48
@@ -19,34 +19,80 @@ Given a set of query sequences, determine for each sequence how many of its k-me
The query follows the same superkmer-based partitioning strategy used at indexing time.

```
for each query sequence:
    decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing)
    for each superkmer:
        route to partition p via minimiser hash
        for each kmer in the superkmer:
            lookup kmer in partition p (MPHF → evidence check → matrix row)
            accumulate result into per-sequence accumulators
    emit annotated sequence
for each batch of sequences:
    build QueryBatch: decompose all sequences into superkmers, deduplicate
    split superkmers by partition via minimiser hash
    for each partition p:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each kmer in each superkmer: MphfLayer::find(kmer)
    broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer
    emit annotated sequences
```

Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.

Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.

Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration.
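
The deduplicate-then-broadcast step described above can be sketched as follows. This is only an illustration under stated assumptions: superkmers are plain strings, `Occurrence` is a hypothetical stand-in for an `SKDesc` entry, and `deduplicate` / `query_and_broadcast` are invented names, not functions from the codebase.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one reference to a superkmer in the batch:
/// the sequence it came from and the offset of its first k-mer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Occurrence {
    pub seq_idx: usize,
    pub kmer_offset: usize,
}

/// Group identical superkmers. Returns the unique superkmers in first-seen
/// order and, for each of them, every occurrence that references it.
pub fn deduplicate(
    superkmers: &[(String, Occurrence)],
) -> (Vec<String>, Vec<Vec<Occurrence>>) {
    let mut index_of: HashMap<&str, usize> = HashMap::new();
    let mut unique: Vec<String> = Vec::new();
    let mut refs: Vec<Vec<Occurrence>> = Vec::new();
    for (sk, occ) in superkmers {
        let i = *index_of.entry(sk.as_str()).or_insert_with(|| {
            unique.push(sk.clone());
            refs.push(Vec::new());
            unique.len() - 1
        });
        refs[i].push(*occ);
    }
    (unique, refs)
}

/// Run `lookup` once per unique superkmer, then copy its result to every
/// occurrence that referenced that superkmer.
pub fn query_and_broadcast<R: Clone>(
    superkmers: &[(String, Occurrence)],
    lookup: impl Fn(&str) -> R,
) -> Vec<(Occurrence, R)> {
    let (unique, refs) = deduplicate(superkmers);
    let mut out = Vec::new();
    for (sk, occs) in unique.iter().zip(&refs) {
        let result = lookup(sk); // exactly one lookup per unique superkmer
        for occ in occs {
            out.push((*occ, result.clone()));
        }
    }
    out
}
```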

---

## Exact vs. approximate matching
## Layer lookup: `MphfLayer::find`

### Exact (default)
`MphfLayer::open` reads `layer_meta.json` and loads either exact or approximate evidence. The caller (`QueryLayer::find`) never chooses the dispatch path; it is fixed at open time by `LayerEvidence`:

Standard MPHF lookup followed by evidence check. O(1) per k-mer.
```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
    match &self.ev {
        LayerEvidence::Exact { .. } => self.find_exact(kmer),
        LayerEvidence::Approx { .. } => self.find_approx(kmer),
    }
}
```

### 1-mismatch (`--mismatch` flag)
### Exact layers

For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.

`find_exact` maps the k-mer through the MPHF to a slot, then calls `UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer)` to confirm the stored k-mer matches. Zero false positives. Requires `UnitigFileReader::open()` (random-access via `.idx`); `open_sequential()` cannot serve random-access verification.
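
To show why the verification step is needed, here is a toy sketch of the map-then-verify pattern. It is not the real `MphfLayer`: `ToyExactLayer` is hypothetical, k-mers are plain `u64` values, the modulo in `slot_of` only stands in for an MPHF, and the stored table stands in for the unitig data that `verify_canonical_kmer` reads.

```rust
/// Toy exact layer: `slots[i]` holds the k-mer that the stand-in MPHF maps to
/// slot `i`. Assumption: the caller built `slots` so that
/// `slots[i] % slots.len() == i` for every stored k-mer.
pub struct ToyExactLayer {
    slots: Vec<u64>,
}

impl ToyExactLayer {
    pub fn new(slots: Vec<u64>) -> Self {
        Self { slots }
    }

    /// Stand-in for the MPHF. Like a real MPHF, it returns *some* slot for
    /// any input, including k-mers that were never indexed.
    fn slot_of(&self, kmer: u64) -> usize {
        (kmer % self.slots.len() as u64) as usize
    }

    /// Map the k-mer to a slot, then confirm the stored k-mer is the same.
    /// Without the comparison, absent k-mers would be reported as hits.
    pub fn find_exact(&self, kmer: u64) -> Option<usize> {
        if self.slots.is_empty() {
            return None;
        }
        let slot = self.slot_of(kmer);
        (self.slots[slot] == kmer).then_some(slot)
    }
}
```

In this toy layout, `9` maps to the same slot as the stored k-mer `5`, and only the comparison rejects it.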

- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
- Exact and approximate matches are tracked separately in the output (see annotation schema below).
- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.
### Approximate layers

`find_approx` maps the k-mer through the MPHF, then checks a stored `b`-bit fingerprint of the canonical hash. False-positive rate: **1/2^b per k-mer query**. No `.idx` file is needed; the layer carries only `fingerprint.bin`.

For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is **1/2^(b·z)**. The `z` parameter is recorded in `layer_meta.json` at build time but is not enforced during querying; the caller is responsible for interpreting window-level results accordingly.
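
The two rates above can be made concrete with a short sketch. The function names are illustrative, and taking the low `b` bits of a 64-bit hash is an assumption: the text does not say which bits of the canonical hash are stored.

```rust
/// Keep the low `b` bits of a 64-bit hash as the fingerprint.
/// Assumption: 1 <= b <= 32, so the result fits in a `u32`.
pub fn fingerprint(hash: u64, b: u32) -> u32 {
    debug_assert!((1..=32).contains(&b));
    (hash & ((1u64 << b) - 1)) as u32
}

/// A query passes the check when its fingerprint equals the stored one.
/// An absent k-mer passes by chance with probability 1/2^b.
pub fn fingerprint_matches(stored: u32, query_hash: u64, b: u32) -> bool {
    stored == fingerprint(query_hash, b)
}

/// False-positive rate for a window of `z` consecutive k-mers, each checked
/// against a `b`-bit fingerprint: 1/2^(b*z). Use `z = 1` for a single k-mer.
pub fn false_positive_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

For example, `b = 8` gives a per-k-mer rate of 1/256, and a window of `z = 3` k-mers gives 1/2^24.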

### `QueryLayer` variant selection

`QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`:

| Condition | Variant | Data returned per k-mer |
|---|---|---|
| `with_counts=true` and `counts/` exists | `Count` | raw count per genome |
| `presence/` exists | `Presence` | 0/1 per genome (bit matrix) |
| only `counts/` exists | `Count` | counts used as-is |
| neither exists | `SetOnly` | 1 for every genome |
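
Read as a decision procedure, the table above could look like the sketch below. It assumes the rows are tested top to bottom, which the table suggests but does not state; `Variant`, `select_variant`, and the boolean parameters are hypothetical names, not the signature of `QueryLayer::open`.

```rust
/// Hypothetical mirror of the three variants named in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Variant {
    Count,
    Presence,
    SetOnly,
}

/// Pick the data matrix, testing the table rows in order.
/// `counts_exists` / `presence_exists` say whether the `counts/` and
/// `presence/` directories were found for the layer.
pub fn select_variant(with_counts: bool, counts_exists: bool, presence_exists: bool) -> Variant {
    if with_counts && counts_exists {
        Variant::Count // counts were requested and are available
    } else if presence_exists {
        Variant::Presence // 0/1 bit matrix
    } else if counts_exists {
        Variant::Count // only counts on disk: use them as-is
    } else {
        Variant::SetOnly // no matrix: every genome reports 1
    }
}
```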

---

## `open()` vs `open_sequential()`

`UnitigFileReader::open()` loads the `.idx` block-offset table, enabling random access to individual unitig chunks. It is required whenever `verify_canonical_kmer` is called (exact layers at query time).

`UnitigFileReader::open_sequential()` skips the `.idx` and supports only forward iteration. It is sufficient for:
- build passes that scan all unitigs sequentially (`build_exact_evidence`, `build_approx_evidence`);
- the `unitig` subcommand, which iterates and prints unitigs without random access.

`KmerIndex::open()` (called by `query::run`) triggers `MphfLayer::open` for each layer, which calls `UnitigFileReader::open()` for exact layers. Approximate layers do not open a unitig reader at all.

---

## Presence / count mode at query time

The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:

```
genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
```

`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match.
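
A minimal runnable version of the accumulation rule above, assuming each matched k-mer yields one row of `u32` values (one per genome). The function name `accumulate` and its slice-based signature are illustrative, not taken from the codebase.

```rust
/// Add one matched k-mer's row into the per-genome totals.
/// In presence mode a genome gains 1 when its value reaches `threshold`;
/// in count mode the raw value is added.
pub fn accumulate(genome_totals: &mut [u32], row: &[u32], presence: bool, threshold: u32) {
    for (total, &v) in genome_totals.iter_mut().zip(row) {
        *total += if presence { u32::from(v >= threshold) } else { v };
    }
}
```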

---

@@ -55,57 +101,44 @@ For each k-mer of the query, generate all `3·k` single-substitution variants. E
Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.

```
>read_id {"kmer_total":150,"kmer_found":59,...}
>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}}
ATCGATCG...
```

Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.

Genome keys in `kmer_strict_matches` are genome labels from `index.meta`. Key order follows iteration order of `meta.genomes`.
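
As an illustration of the `kmer_count` / `kmer_strict_matches` title line shown above, here is a hand-rolled formatter. It is a sketch only: `title_line` is a hypothetical name, the real writer presumably uses a JSON library, and this version assumes genome labels contain no characters that need JSON escaping.

```rust
/// Build an OBITools4-style title line with the two always-present keys.
/// `matches` lists (genome label, accumulated value) pairs in the order in
/// which they should appear. Assumption: labels need no JSON escaping.
pub fn title_line(read_id: &str, kmer_count: usize, matches: &[(&str, u32)]) -> String {
    let per_genome: Vec<String> = matches
        .iter()
        .map(|(label, value)| format!("\"{}\":{}", label, value))
        .collect();
    // `{{` and `}}` are literal braces in a Rust format string.
    format!(
        ">{} {{\"kmer_count\":{},\"kmer_strict_matches\":{{{}}}}}",
        read_id,
        kmer_count,
        per_genome.join(",")
    )
}
```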

---

## Annotation schema

### Summary mode (default)
## Annotation schema (current implementation)

| Key | Type | Condition | Semantics |
|---|---|---|---|
| `kmer_total` | int | always | total k-mers in the (masked) sequence |
| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |
| `kmer_count` | int | always | k-mers with at least one match |
| `kmer_missing` | int | `--count-missing` | k-mers absent from every layer |
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |

`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.

`kmer_count` counts matched k-mer positions (incremented once per `Some(row)` hit regardless of how many genomes are covered). `kmer_missing` counts `None` hits.
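
The counting rule for `kmer_count` and `kmer_missing` can be sketched as below, assuming each k-mer position has already been resolved to `Some(row)` or `None`. The function name `tally` and the `Vec<u32>` row type are illustrative.

```rust
/// Count matched and missing k-mer positions for one sequence.
/// A position counts once toward `kmer_count` when its lookup returned
/// `Some(row)`, however many genomes the row covers; `None` counts as missing.
/// Returns `(kmer_count, kmer_missing)`.
pub fn tally(hits: &[Option<Vec<u32>>]) -> (usize, usize) {
    let kmer_count = hits.iter().filter(|hit| hit.is_some()).count();
    (kmer_count, hits.len() - kmer_count)
}
```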

`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.

### Detail mode (`--detail`)

All summary keys, plus per-position coverage vectors: one list per genome, length `len(sequence) − k + 1`:

| Key | Type | Condition | Semantics |
|---|---|---|---|
| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |

Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.

**Note on doc/impl divergence:** the doc previously used keys `kmer_total`, `kmer_found`, and `kmer_match` (list). The implementation uses `kmer_count` (int, matched only) and `kmer_strict_matches` (object keyed by genome label). `--mismatch` and `--detail` are parsed but not yet implemented and emit a warning.

---

## CLI

```
obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
obikmer query -i <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
              [-T <threads>] <query.fa> [<query2.fa> ...]
```

`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).

`--mismatch` and `--detail` are accepted but currently ignored with a warning on stderr.

---

## Future work

- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.
- **`--mismatch`**: 1-mismatch approximate matching: generate `3·k` single-substitution variants per k-mer, look each up independently.
- **`--detail`**: per-position coverage vectors (`cov_<i>`) per genome.
- **Read classification** (`--classify`): assign each read to the genome with the highest match score.
- **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.
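
For the planned `--mismatch` mode, the `3·k` variant generation mentioned in the list above could be sketched as follows. This is not the project's implementation: it works on uppercase ACGT strings rather than the packed `Kmer` type, and it assumes the canonical form is the lexicographically smaller of a k-mer and its reverse complement.

```rust
/// All k-mers that differ from `kmer` by exactly one substitution.
/// Assumption: `kmer` contains only uppercase A, C, G, T, so there are
/// exactly 3 * k variants.
pub fn single_substitution_variants(kmer: &str) -> Vec<String> {
    let bases = ['A', 'C', 'G', 'T'];
    let chars: Vec<char> = kmer.chars().collect();
    let mut out: Vec<String> = Vec::with_capacity(3 * chars.len());
    for i in 0..chars.len() {
        for &b in &bases {
            if b != chars[i] {
                let mut variant = chars.clone();
                variant[i] = b;
                out.push(variant.into_iter().collect::<String>());
            }
        }
    }
    out
}

/// Reverse complement of an ACGT string; other characters pass through.
fn reverse_complement(kmer: &str) -> String {
    kmer.chars()
        .rev()
        .map(|c| match c {
            'A' => 'T',
            'C' => 'G',
            'G' => 'C',
            'T' => 'A',
            other => other,
        })
        .collect()
}

/// Canonical form, assumed here to be the lexicographic minimum of the
/// k-mer and its reverse complement.
pub fn canonical(kmer: &str) -> String {
    let rc = reverse_complement(kmer);
    if rc.as_str() < kmer { rc } else { kmer.to_string() }
}
```

Each variant would then be canonicalised with `canonical` before its lookup, which is where the cost of up to `3·k` probes per k-mer position comes from.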