feat: Implement query subcommand for sequence-to-genome mapping

This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.
2026-05-21 13:23:05 +02:00
parent c8e591fc78
commit 13599dd444
13 changed files with 762 additions and 19 deletions
@@ -0,0 +1,111 @@
+# Query system
+
+## Goal
+
+Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
+
+---
+
+## Input
+
+- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
+- Sequences shorter than k bases are silently skipped.
+- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
+
+---
+
+## Algorithm
+
+The query follows the same superkmer-based partitioning strategy used at indexing time.
+
+```
+for each query sequence:
+    decompose into superkmers  (non-ACGT breaks, same minimiser scheme as indexing)
+    for each superkmer:
+        route to partition p via minimiser hash
+        for each kmer in the superkmer:
+            lookup kmer in partition p  (MPHF → evidence check → matrix row)
+            accumulate result into per-sequence accumulators
+    emit annotated sequence
+```
+
+Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.
+
+---
+
+## Exact vs. approximate matching
+
+### Exact (default)
+
+Standard MPHF lookup followed by evidence check. O(1) per k-mer.
+
+### 1-mismatch (`--mismatch` flag)
+
+For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.
+
+- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
+- Exact and approximate matches are tracked separately in the output (see annotation schema below).
+- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
+- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.
+
+---
+
+## Output format
+
+Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
+
+```
+>read_id {"kmer_total":150,"kmer_found":59,...}
+ATCGATCG...
+```
+
+Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.
+
+---
+
+## Annotation schema
+
+### Summary mode (default)
+
+| Key | Type | Condition | Semantics |
+|---|---|---|---|
+| `kmer_total` | int | always | total k-mers in the (masked) sequence |
+| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
+| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
+| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
+| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
+| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
+| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |
+
+`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.
+
+`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.
+
+### Detail mode (`--detail`)
+
+All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) − k + 1`:
+
+| Key | Type | Condition | Semantics |
+|---|---|---|---|
+| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
+| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
+| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |
+
+Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.
+
+---
+
+## CLI
+
+```
+obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
+```
+
+`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).
+
+---
+
+## Future work
+
+- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
+- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.