feat: Implement query subcommand for sequence-to-genome mapping

This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.
Eric Coissac
2026-05-21 13:23:05 +02:00
parent c8e591fc78
commit 13599dd444
13 changed files with 762 additions and 19 deletions
@@ -0,0 +1,111 @@
# Query system
## Goal
Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
---
## Input
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
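The effect of the hard breaks on the k-mer count can be illustrated with a small Python sketch. It is not the project's Rust code: the function names are hypothetical, and upper-casing the input is an assumption (the text does not say how lowercase bases are treated).

```python
import re

def acgt_segments(seq: str, k: int) -> list[str]:
    """Split on runs of non-ACGT characters; keep segments holding at least one k-mer."""
    # Upper-casing is an assumption of this sketch, not a documented behaviour.
    return [s for s in re.split(r"[^ACGT]+", seq.upper()) if len(s) >= k]

def count_kmers(seq: str, k: int) -> int:
    """Number of k-mers that survive once ambiguous bases have broken the sequence."""
    return sum(len(s) - k + 1 for s in acgt_segments(seq, k))

print(acgt_segments("ACGTNNACGTACG", 4))  # ['ACGT', 'ACGTACG']
print(count_kmers("ACGTNNACGTACG", 4))    # 5
print(count_kmers("ACG", 4))              # 0: shorter than k, so skipped
```

A 13-base read with two `N`s therefore yields 5 k-mers for k = 4 instead of the 10 an unbroken read of the same length would give.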
---
## Algorithm
The query follows the same superkmer-based partitioning strategy used at indexing time.
```
for each query sequence:
    decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing)
    for each superkmer:
        route to partition p via minimiser hash
        for each kmer in the superkmer:
            lookup kmer in partition p (MPHF → evidence check → matrix row)
            accumulate result into per-sequence accumulators
    emit annotated sequence
```
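The decomposition and routing steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the minimiser is taken as the lexicographically smallest canonical m-mer, the routing hash is CRC32, and the values of `k`, `m` and the partition count are arbitrary. The actual minimiser scheme and hash used by the index are not described in this document.

```python
import zlib

COMP = str.maketrans("ACGT", "TGCA")

def canonical(s: str) -> str:
    """Lexicographically smaller of a sequence and its reverse complement."""
    return min(s, s.translate(COMP)[::-1])

def minimiser(kmer: str, m: int) -> str:
    """Smallest canonical m-mer of a k-mer (the ordering is an assumption)."""
    return min(canonical(kmer[i:i + m]) for i in range(len(kmer) - m + 1))

def superkmers(segment: str, k: int, m: int):
    """Yield (minimiser, superkmer) for maximal runs of k-mers sharing a minimiser."""
    start, current = 0, None
    for i in range(len(segment) - k + 1):
        mini = minimiser(segment[i:i + k], m)
        if current is not None and mini != current:
            # k-mers start..i-1 form one run; the last one ends at i + k - 1.
            yield current, segment[start:i + k - 1]
            start = i
        current = mini
    if current is not None:
        yield current, segment[start:]

def partition_of(mini: str, n_partitions: int) -> int:
    """Hypothetical routing function: hash of the minimiser modulo partition count."""
    return zlib.crc32(mini.encode()) % n_partitions

for mini, sk in superkmers("ACGTACGTTGCAAGT", k=5, m=3):
    print(partition_of(mini, 8), mini, sk)
```

Because every k-mer of a superkmer shares its minimiser, the partition is computed once per superkmer rather than once per k-mer.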
Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.
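The per-sequence scheme can be pictured with the Python sketch below. It is an assumption-laden illustration: the index is a plain dictionary from k-mer to per-genome row, canonicalisation and partition routing are left out, and a thread pool stands in for the worker threads. The point is only that each call owns its accumulator, so nothing has to be merged across threads.

```python
from concurrent.futures import ThreadPoolExecutor

def query_one(record, index, k, n_genomes):
    """Process one sequence end to end; the accumulator is private to this call."""
    name, seq = record
    match = [0] * n_genomes
    for i in range(len(seq) - k + 1):
        row = index.get(seq[i:i + k])
        if row is not None:
            for g, value in enumerate(row):
                if value:
                    match[g] += 1
    return name, match

def query_all(records, index, k, n_genomes, workers=4):
    """Run one task per sequence; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: query_one(r, index, k, n_genomes), records))

toy_index = {"ACG": [1, 0], "CGT": [1, 1]}   # hypothetical 2-genome presence index
print(query_all([("r1", "ACGT"), ("r2", "TTTT")], toy_index, k=3, n_genomes=2))
# [('r1', [2, 1]), ('r2', [0, 0])]
```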
---
## Exact vs. approximate matching
### Exact (default)
Standard MPHF lookup followed by evidence check. O(1) per k-mer.
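The evidence check is needed because a minimal perfect hash function returns a valid slot for any input, including k-mers that were never indexed. The toy Python class below imitates this behaviour. It is a sketch, not the index's data structure: a dictionary plus a CRC32 fallback stands in for the MPHF, and the evidence is assumed to be the full k-mer stored per slot, which the document does not specify.

```python
import zlib

class ToyPartition:
    """Stand-in for one non-empty index partition: slot function, evidence, matrix."""

    def __init__(self, rows: dict[str, list[int]]):
        self.keys = sorted(rows)                      # evidence: k-mer stored per slot
        self.slot = {kmer: i for i, kmer in enumerate(self.keys)}
        self.matrix = [rows[kmer] for kmer in self.keys]  # one per-genome row per slot

    def mphf(self, kmer: str) -> int:
        """Always returns a slot, even for a k-mer that was never indexed."""
        if kmer in self.slot:
            return self.slot[kmer]
        return zlib.crc32(kmer.encode()) % len(self.keys)

    def lookup(self, kmer: str):
        slot = self.mphf(kmer)
        if self.keys[slot] != kmer:   # evidence check rejects foreign k-mers
            return None
        return self.matrix[slot]

part = ToyPartition({"ACG": [1, 0], "CGT": [0, 1]})
print(part.lookup("ACG"))  # [1, 0]
print(part.lookup("TTT"))  # None: a slot was returned, but the evidence differs
```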
### 1-mismatch (`--mismatch` flag)
For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.
- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
- Exact and approximate matches are tracked separately in the output (see annotation schema below).
- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.
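A minimal Python sketch of the variant enumeration and row summation, assuming a dictionary from canonical k-mer to per-genome row as a stand-in for the index (the function names are hypothetical). Following the text literally, every variant is looked up independently and no deduplication is attempted; whether the implementation deduplicates variants that share a canonical form is not stated here.

```python
COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMP)[::-1])

def variants(kmer: str):
    """Yield the 3*k k-mers that differ from `kmer` by exactly one substitution."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def lookup_1mm(kmer: str, index: dict, n_genomes: int):
    """Return (exact_row, approx_row); approx_row sums the rows of all found variants."""
    zero = [0] * n_genomes
    exact = list(index.get(canonical(kmer), zero))
    approx = list(zero)
    for v in variants(kmer):
        row = index.get(canonical(v))
        if row is not None:
            approx = [a + r for a, r in zip(approx, row)]
    return exact, approx

toy_index = {"ACGTA": [1, 0], "ACGTC": [0, 2]}  # hypothetical 2-genome count index
print(len(list(variants("ACGTA"))))             # 15, i.e. 3*k for k = 5
print(lookup_1mm("ACGTA", toy_index, 2))        # ([1, 0], [0, 2])
```

Keeping the exact and approximate rows apart at this level is what allows them to be reported separately in the annotations.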
---
## Output format
Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
```
>read_id {"kmer_total":150,"kmer_found":59,...}
ATCGATCG...
```
Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.
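As an illustration of the record layout, the Python sketch below writes one annotated record and reads the annotation back. `format_record` and `parse_header` are hypothetical helper names, the compact JSON separators are an assumption made to match the example above, and the values are invented.

```python
import json

def format_record(read_id: str, sequence: str, annotations: dict) -> str:
    """Build one FASTA record with the annotation map serialised in the title line."""
    header = json.dumps(annotations, separators=(",", ":"))
    return f">{read_id} {header}\n{sequence}\n"

def parse_header(title_line: str):
    """Split a title line back into (read_id, annotation dict)."""
    read_id, payload = title_line[1:].split(" ", 1)
    return read_id, json.loads(payload)

record = format_record("r1", "ATCGATCG", {"kmer_total": 4, "kmer_found": 2})
print(record, end="")
# >r1 {"kmer_total":4,"kmer_found":2}
# ATCGATCG
print(parse_header(record.splitlines()[0]))
# ('r1', {'kmer_total': 4, 'kmer_found': 2})
```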
---
## Annotation schema
### Summary mode (default)
| Key | Type | Condition | Semantics |
|---|---|---|---|
| `kmer_total` | int | always | total k-mers in the (masked) sequence |
| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |
`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.
`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.
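The summary keys can be derived from one per-genome row per k-mer position, as in the Python sketch below. It is an illustration only: the input representation (`None` for an absent k-mer) is hypothetical, and computing `kmer_missing` as `kmer_total - kmer_found` is an assumption consistent with the table but not confirmed by it.

```python
def summarise(rows, n_genomes, count_index=True, count_missing=False):
    """rows: one entry per k-mer position, a per-genome list, or None when absent."""
    ann = {"kmer_total": len(rows), "kmer_found": 0}
    kmer_match = [0] * n_genomes
    count_match = [0] * n_genomes
    for row in rows:
        if row is None:
            continue
        ann["kmer_found"] += 1
        for g, value in enumerate(row):
            if value:
                kmer_match[g] += 1        # this position matches genome g
                count_match[g] += value   # raw index count, summed per genome
    if count_missing:
        # Assumption: missing = total - found.
        ann["kmer_missing"] = ann["kmer_total"] - ann["kmer_found"]
    ann["kmer_match"] = kmer_match
    if count_index:
        ann["count_match"] = count_match
    return ann

rows = [[3, 0], None, [1, 2]]   # three k-mer positions, two genomes
print(summarise(rows, 2, count_missing=True))
# {'kmer_total': 3, 'kmer_found': 2, 'kmer_missing': 1,
#  'kmer_match': [2, 1], 'count_match': [4, 2]}
```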
### Detail mode (`--detail`)
All summary keys, plus per-position coverage vectors (one list per genome, each of length `len(sequence) - k + 1`):
| Key | Type | Condition | Semantics |
|---|---|---|---|
| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |
Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.
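The shape of the detail-mode keys can be sketched in Python. The input representation (one per-genome row per k-mer position, `None` when the k-mer is absent) is a hypothetical one chosen for the example; only the `cov_<i>` key names and the 0-based genome indices come from the schema.

```python
def coverage_vectors(rows, n_genomes):
    """Build cov_<i> lists: one value per k-mer position, 0 where the k-mer is absent."""
    cov = {f"cov_{g}": [0] * len(rows) for g in range(n_genomes)}
    for pos, row in enumerate(rows):
        if row is not None:
            for g, value in enumerate(row):
                cov[f"cov_{g}"][pos] = value
    return cov

# A 6-base query with k = 4 has 6 - 4 + 1 = 3 k-mer positions.
rows = [[3, 0], None, [1, 2]]
print(coverage_vectors(rows, 2))
# {'cov_0': [3, 0, 1], 'cov_1': [0, 0, 2]}
```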
---
## CLI
```
obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
```
`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).
---
## Future work
- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.