112 lines
4.8 KiB
Markdown
112 lines
4.8 KiB
Markdown
|
|
# Query system
|
|||
|
|
|
|||
|
|
## Goal
|
|||
|
|
|
|||
|
|
Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Input
|
|||
|
|
|
|||
|
|
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
|
|||
|
|
- Sequences shorter than k bases are silently skipped.
|
|||
|
|
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Algorithm
|
|||
|
|
|
|||
|
|
The query follows the same superkmer-based partitioning strategy used at indexing time.
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
for each query sequence:
|
|||
|
|
decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing)
|
|||
|
|
for each superkmer:
|
|||
|
|
route to partition p via minimiser hash
|
|||
|
|
for each kmer in the superkmer:
|
|||
|
|
lookup kmer in partition p (MPHF → evidence check → matrix row)
|
|||
|
|
accumulate result into per-sequence accumulators
|
|||
|
|
emit annotated sequence
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Exact vs. approximate matching
|
|||
|
|
|
|||
|
|
### Exact (default)
|
|||
|
|
|
|||
|
|
Standard MPHF lookup followed by evidence check. O(1) per k-mer.
|
|||
|
|
|
|||
|
|
### 1-mismatch (`--mismatch` flag)
|
|||
|
|
|
|||
|
|
For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.
|
|||
|
|
|
|||
|
|
- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
|
|||
|
|
- Exact and approximate matches are tracked separately in the output (see annotation schema below).
|
|||
|
|
- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
|
|||
|
|
- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Output format
|
|||
|
|
|
|||
|
|
Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
>read_id {"kmer_total":150,"kmer_found":59,...}
|
|||
|
|
ATCGATCG...
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Annotation schema
|
|||
|
|
|
|||
|
|
### Summary mode (default)
|
|||
|
|
|
|||
|
|
| Key | Type | Condition | Semantics |
|
|||
|
|
|---|---|---|---|
|
|||
|
|
| `kmer_total` | int | always | total k-mers in the (masked) sequence |
|
|||
|
|
| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
|
|||
|
|
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
|
|||
|
|
| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
|
|||
|
|
| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
|
|||
|
|
| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
|
|||
|
|
| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |
|
|||
|
|
|
|||
|
|
`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.
|
|||
|
|
|
|||
|
|
`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.
|
|||
|
|
|
|||
|
|
### Detail mode (`--detail`)
|
|||
|
|
|
|||
|
|
All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) − k + 1`:
|
|||
|
|
|
|||
|
|
| Key | Type | Condition | Semantics |
|
|||
|
|
|---|---|---|---|
|
|||
|
|
| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
|
|||
|
|
| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
|
|||
|
|
| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |
|
|||
|
|
|
|||
|
|
Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## CLI
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Future work
|
|||
|
|
|
|||
|
|
- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
|
|||
|
|
- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.
|