8a0b898b4b
Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.
190 lines
8.1 KiB
Markdown
190 lines
8.1 KiB
Markdown
# Query system
|
||
|
||
## Goal
|
||
|
||
Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
|
||
|
||
---
|
||
|
||
## Input
|
||
|
||
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). GenBank flat files are not supported at query time (only at index time).
|
||
- Sequences shorter than k bases are silently skipped.
|
||
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
|
||
|
||
---
|
||
|
||
## Algorithm
|
||
|
||
The query follows the same superkmer-based partitioning strategy used at indexing time.
|
||
|
||
```
|
||
for each chunk of sequences (parallel workers via obipipeline):
|
||
build QueryBatch: decompose all sequences into s-mers via superkmers, deduplicate
|
||
allocate seq_results[seq_idx][smer_pos] = None ← per-sequence s-mer result vectors
|
||
split superkmers by partition via minimiser hash
|
||
for each partition p:
|
||
query_partition(p, superkmers_routed_to_p)
|
||
→ load QueryLayer(s) for p
|
||
→ for each s-mer in each superkmer: MphfLayer::find(smer)
|
||
fill seq_results[seq_idx][kmer_offset + j] from partition results
|
||
for each sequence:
|
||
apply_findere(seq_results[seq_idx], effective_z) ← per full sequence
|
||
accumulate confirmed k-mer results into acc and cov
|
||
emit annotated sequences
|
||
```
|
||
|
||
Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.
|
||
|
||
**Findere requires full-sequence aggregation.** `apply_findere` is applied once per sequence on the complete s-mer result vector, after all partitions have contributed. Applying it per superkmer would produce false negatives at superkmer boundaries, where the z-window spans two superkmers.
|
||
|
||
Batches are processed in parallel via `obipipeline` workers; the `--threads` flag controls the number of worker threads.
|
||
|
||
---
|
||
|
||
## Findere z-window filter
|
||
|
||
For approximate index modes, the index physically stores s-mers of size `s = k_user − z + 1`. At query time, `set_k(s)` is in effect, so queries naturally produce s-mer results. `apply_findere` then aggregates z consecutive s-mer results into one k_user-mer answer:
|
||
|
||
```rust
|
||
fn apply_findere(
|
||
results: &[Option<Box<[u32]>>], // N s-mer results
|
||
z: usize,
|
||
n_genomes: usize,
|
||
) -> Vec<Option<Box<[u32]>>> // N − z + 1 k_user-mer results
|
||
```
|
||
|
||
Input length N (s-mers), output length N − z + 1 (k_user-mers).
|
||
|
||
For each genome g independently, a sliding window of size z scans the input. Output position i is confirmed for genome g iff all z values `results[i..i+z][g]` are nonzero (`None` counts as zero for all genomes). The scan is O(n) per genome.
|
||
|
||
Output values come from `results[i]` (leftmost s-mer of each window); genomes not confirmed are zeroed. If all genomes are zero, the position is returned as `None`.
|
||
|
||
**Short sequences**: when the s-mer count is less than z, no complete window can form — `apply_findere` returns an empty vector. K-mers from sequences shorter than k_user are not emitted.
|
||
|
||
**Exact indexes**: `z = 1`, `apply_findere` is a passthrough (output length = input length).
|
||
|
||
### Effective z at query time
|
||
|
||
`effective_z` is resolved at the start of `run()`:
|
||
|
||
```rust
|
||
let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence {
|
||
IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
|
||
IndexMode::Exact => 1,
|
||
});
|
||
```
|
||
|
||
The `-z` CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.
|
||
|
||
---
|
||
|
||
## Layer lookup: `MphfLayer::find`
|
||
|
||
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read. The caller (`QueryLayer`) never chooses the dispatch path: it is fixed at open time by `LayerEvidence`. See [obilayeredmap](../implementation/obilayeredmap.md) for the full `find` / `find_strict` API.
|
||
|
||
### `QueryLayer` variant selection
|
||
|
||
`QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`:
|
||
|
||
| Condition | Variant | Data returned per k-mer |
|
||
|---|---|---|
|
||
| `with_counts=true` and `counts/` exists | `Count` | raw count per genome |
|
||
| `presence/` exists | `Presence` | 0/1 per genome (bit matrix) |
|
||
| only `counts/` exists | `Count` | counts used as-is |
|
||
| neither exists | `SetOnly` | 1 for every genome |
|
||
|
||
---
|
||
|
||
## Presence / count mode at query time
|
||
|
||
The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:
|
||
|
||
```
|
||
genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
|
||
```
|
||
|
||
`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match.
|
||
|
||
---
|
||
|
||
## Coverage vectors (`--detail`)
|
||
|
||
When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated after all partitions are queried, with dimensions derived from `n_kmers_out = n_smers − z + 1` (k_user-mer positions, not s-mer positions):
|
||
|
||
```
|
||
cov[seq_idx][g][pos] += contribution
|
||
where pos is the k_user-mer index in the filtered (post-Findere) vector
|
||
```
|
||
|
||
Coverage reflects confirmed k_user-mers only. The vectors are emitted in the JSON annotation under the key `"coverage"`.
|
||
|
||
---
|
||
|
||
## `kmer_missing` semantics
|
||
|
||
`kmer_missing` counts k_user-mer positions where the first s-mer (`seq_results[seq_idx][pos]`) is `None` — i.e. absent from the index entirely. K-mers where the z-window fails because a later s-mer is absent or zero are not counted as missing (the first s-mer being present is used as proxy for index membership).
|
||
|
||
---
|
||
|
||
## Output format
|
||
|
||
Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
|
||
|
||
```
|
||
>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}
|
||
ATCGATCG...
|
||
```
|
||
|
||
With `--detail`:
|
||
|
||
```
|
||
>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}
|
||
ATCGATCG...
|
||
```
|
||
|
||
Genome keys follow the iteration order of `meta.genomes`.
|
||
|
||
---
|
||
|
||
## Annotation schema
|
||
|
||
| Key | Type | Condition | Semantics |
|
||
|---|---|---|---|
|
||
| `kmer_count` | int | always | k-mers confirmed (post-Findere) with at least one genome match |
|
||
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index entirely (pre-Findere None) |
|
||
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |
|
||
| `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) |
|
||
|
||
`kmer_count + kmer_missing` ≤ total k_user-mers in the sequence. The gap corresponds to k_user-mers whose z-window was not fully confirmed (at least one s-mer absent or zero for all genomes) but whose first s-mer was present in the index.
|
||
|
||
---
|
||
|
||
## CLI
|
||
|
||
```
|
||
obikmer query <index> [--detail] [--mismatch] [--count-missing]
|
||
[--force-presence] [--presence-threshold <n>]
|
||
[-z <z>] [-T <threads>]
|
||
<query.fa> [<query2.fa> ...]
|
||
```
|
||
|
||
| Option | Default | Semantics |
|
||
|---|---|---|
|
||
| `-z` / `--findere-z` | from index metadata | Override Findere z parameter |
|
||
| `--detail` | off | Emit per-position coverage vectors in JSON |
|
||
| `--count-missing` | off | Add `kmer_missing` field to JSON |
|
||
| `--force-presence` | off | Report 0/1 per genome regardless of index counts |
|
||
| `--presence-threshold` | 1 | Minimum count to declare genome present |
|
||
| `-T` / `--threads` | all CPUs | Worker threads |
|
||
|
||
`--mismatch` is accepted but currently ignored with a warning on stderr.
|
||
|
||
---
|
||
|
||
## Future work
|
||
|
||
- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
|
||
- **Read classification** (`--classify`): assign each read to the genome with the highest match score.
|
||
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.
|