This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.
4.8 KiB
Query system
Goal
Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
Input
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
Algorithm
The query follows the same superkmer-based partitioning strategy used at indexing time.
for each query sequence:
decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing)
for each superkmer:
route to partition p via minimiser hash
for each kmer in the superkmer:
lookup kmer in partition p (MPHF → evidence check → matrix row)
accumulate result into per-sequence accumulators
emit annotated sequence
Parallelism is per sequence: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.
Exact vs. approximate matching
Exact (default)
Standard MPHF lookup followed by evidence check. O(1) per k-mer.
1-mismatch (--mismatch flag)
For each k-mer of the query, generate all 3·k single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are summed into the result for that k-mer position.
- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
- Exact and approximate matches are tracked separately in the output (see annotation schema below).
- The superkmer routing optimisation is not used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
- Cost: up to
3·kMPHF probes per k-mer position vs. 1 in exact mode.
Output format
Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.
>read_id {"kmer_total":150,"kmer_found":59,...}
ATCGATCG...
Genome order in all list-valued annotations follows the genome order recorded in index.meta.
Annotation schema
Summary mode (default)
| Key | Type | Condition | Semantics |
|---|---|---|---|
kmer_total |
int | always | total k-mers in the (masked) sequence |
kmer_found |
int | always | k-mers with at least one match (exact or approx) |
kmer_missing |
int | --count-missing |
k-mers absent from the index |
kmer_match |
list[int] | always | per-genome matched k-mer count (exact + approx) |
kmer_match_exact |
list[int] | --mismatch |
per-genome exact match count |
kmer_match_approx |
list[int] | --mismatch |
per-genome approx match count |
count_match |
list[int] | count index | per-genome sum of index counts for matched k-mers |
kmer_match[i] is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.
count_match[i] sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.
Detail mode (--detail)
All summary keys, plus per-position coverage vectors — one list per genome, length len(sequence) − k + 1:
| Key | Type | Condition | Semantics |
|---|---|---|---|
cov_<i> |
list[int] | --detail |
coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
cov_exact_<i> |
list[int] | --detail + --mismatch |
exact-match contribution per position |
cov_approx_<i> |
list[int] | --detail + --mismatch |
approx-match contribution per position |
Genome indices in key names are 0-based integers matching the index.meta genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.
CLI
obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
--summary is the default; --detail implies --summary (all summary keys are always present).
Future work
- Read classification (
--classify): assign each read to the genome with the highestkmer_matchscore; emit as a single annotation key. - Whitelist / blacklist filtering: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.