# Query system ## Goal Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping. --- ## Input - Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). - Sequences shorter than k bases are silently skipped. - Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time). --- ## Algorithm The query follows the same superkmer-based partitioning strategy used at indexing time. ``` for each batch of sequences: build QueryBatch: decompose all sequences into superkmers, deduplicate split superkmers by partition via minimiser hash for each partition p: query_partition(p, superkmers_routed_to_p) → load QueryLayer(s) for p → for each kmer in each superkmer: MphfLayer::find(kmer) broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer emit annotated sequences ``` Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it. Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration. --- ## Layer lookup: `MphfLayer::find` `MphfLayer::open` reads `layer_meta.json` and loads either exact or approximate evidence. The caller (`QueryLayer::find`) never chooses the dispatch path — it is fixed at open time by `LayerEvidence`: ```rust pub fn find(&self, kmer: CanonicalKmer) -> Option { match &self.ev { LayerEvidence::Exact { .. } => self.find_exact(kmer), LayerEvidence::Approx { .. } => self.find_approx(kmer), } } ``` ### Exact layers `find_exact` maps the k-mer through the MPHF to a slot, then calls `UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer)` to confirm the stored k-mer matches. Zero false positives. Requires `UnitigFileReader::open()` (random-access via `.idx`); `open_sequential()` cannot serve random-access verification. ### Approximate layers `find_approx` maps the k-mer through the MPHF, then checks a stored `b`-bit fingerprint of the canonical hash. False-positive rate: **1/2^b per k-mer query**. No `.idx` file is needed; the layer carries only `fingerprint.bin`. For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is **1/2^(b·z)**. The `z` parameter is recorded in `layer_meta.json` at build time but is not enforced during querying — the caller is responsible for interpreting window-level results accordingly. ### `QueryLayer` variant selection `QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`: | Condition | Variant | Data returned per k-mer | |---|---|---| | `with_counts=true` and `counts/` exists | `Count` | raw count per genome | | `presence/` exists | `Presence` | 0/1 per genome (bit matrix) | | only `counts/` exists | `Count` | counts used as-is | | neither exists | `SetOnly` | 1 for every genome | --- ## `open()` vs `open_sequential()` `UnitigFileReader::open()` loads the `.idx` block-offset table, enabling random access to individual unitig chunks. It is required whenever `verify_canonical_kmer` is called (exact layers at query time). `UnitigFileReader::open_sequential()` skips the `.idx` and supports only forward iteration. It is sufficient for: - build passes that scan all unitigs sequentially (`build_exact_evidence`, `build_approx_evidence`); - the `unitig` subcommand, which iterates and prints unitigs without random access. `KmerIndex::open()` (called by `query::run`) triggers `MphfLayer::open` for each layer, which calls `UnitigFileReader::open()` for exact layers. Approximate layers do not open a unitig reader at all. --- ## Presence / count mode at query time The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores: ``` genome_totals[g] += if presence { u32::from(v >= threshold) } else { v } ``` `presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match. --- ## Output format Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line. ``` >read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}} ATCGATCG... ``` Genome keys in `kmer_strict_matches` are genome labels from `index.meta`. Key order follows iteration order of `meta.genomes`. --- ## Annotation schema (current implementation) | Key | Type | Condition | Semantics | |---|---|---|---| | `kmer_count` | int | always | k-mers with at least one match | | `kmer_missing` | int | `--count-missing` | k-mers absent from every layer | | `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) | `kmer_count` counts matched k-mer positions (incremented once per `Some(row)` hit regardless of how many genomes are covered). `kmer_missing` counts `None` hits. **Note on doc/impl divergence:** the doc previously used keys `kmer_total`, `kmer_found`, and `kmer_match` (list). The implementation uses `kmer_count` (int, matched only) and `kmer_strict_matches` (object keyed by genome label). `--mismatch` and `--detail` are parsed but not yet implemented and emit a warning. --- ## CLI ``` obikmer query -i [--detail] [--mismatch] [--count-missing] [--force-presence] [--presence-threshold ] [-T ] [ ...] ``` `--mismatch` and `--detail` are accepted but currently ignored with a warning on stderr. --- ## Future work - **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently. - **`--detail`**: per-position coverage vectors (`cov_`) per genome. - **Read classification** (`--classify`): assign each read to the genome with the highest match score. - **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value. - **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.