# Query system ## Goal Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping. --- ## Input - Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). - Sequences shorter than k bases are silently skipped. - Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time). --- ## Algorithm The query follows the same superkmer-based partitioning strategy used at indexing time. ``` for each query sequence: decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing) for each superkmer: route to partition p via minimiser hash for each kmer in the superkmer: lookup kmer in partition p (MPHF → evidence check → matrix row) accumulate result into per-sequence accumulators emit annotated sequence ``` Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions. --- ## Exact vs. approximate matching ### Exact (default) Standard MPHF lookup followed by evidence check. O(1) per k-mer. ### 1-mismatch (`--mismatch` flag) For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position. - If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated. - Exact and approximate matches are tracked separately in the output (see annotation schema below). - The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser. - Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode. --- ## Output format Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line. ``` >read_id {"kmer_total":150,"kmer_found":59,...} ATCGATCG... ``` Genome order in all list-valued annotations follows the genome order recorded in `index.meta`. --- ## Annotation schema ### Summary mode (default) | Key | Type | Condition | Semantics | |---|---|---|---| | `kmer_total` | int | always | total k-mers in the (masked) sequence | | `kmer_found` | int | always | k-mers with at least one match (exact or approx) | | `kmer_missing` | int | `--count-missing` | k-mers absent from the index | | `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) | | `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count | | `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count | | `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers | `kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index. `count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes. ### Detail mode (`--detail`) All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) − k + 1`: | Key | Type | Condition | Semantics | |---|---|---|---| | `cov_` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent | | `cov_exact_` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position | | `cov_approx_` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position | Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers. --- ## CLI ``` obikmer query -i [--summary | --detail] [--mismatch] [--count-missing] ``` `--summary` is the default; `--detail` implies `--summary` (all summary keys are always present). --- ## Future work - **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key. - **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.