# Query system ## Goal Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping. --- ## Input - Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). - Sequences shorter than k bases are silently skipped. - Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time). --- ## Algorithm The query follows the same superkmer-based partitioning strategy used at indexing time. ``` for each batch of sequences: build QueryBatch: decompose all sequences into superkmers, deduplicate split superkmers by partition via minimiser hash for each partition p: query_partition(p, superkmers_routed_to_p) → load QueryLayer(s) for p → for each kmer in each superkmer: MphfLayer::find(kmer) apply_findere(sk_kmer_results, effective_z) ← per superkmer broadcast confirmed results back to each (seq_idx, kmer_offset) emit annotated sequences ``` Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it. Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration. --- ## Findere z-window filter For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring **z consecutive k-mers** to all test positive before any of them is counted as a confirmed match. `apply_findere` implements this as a sliding-window confirmation, independently for each genome: ```rust fn apply_findere( results: &[Option>], z: usize, n_genomes: usize, ) -> Vec>> ``` For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions `[j, j+z)` that contains i and where all z positions are present for g (i.e. `results[pos]` is `Some(row)` and `row[g] > 0`). The implementation uses a single O(n) sliding-window scan per genome. Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as `None`. **Short superkmers**: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, `apply_findere` returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends. **Exact indexes**: `z` is effectively 1 (every k-mer is its own window). `apply_findere` is a no-op. ### Effective z at query time `effective_z` is resolved at the start of `run()`: ```rust let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence { IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize, IndexMode::Exact => 1, }); ``` The `-z` CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity. --- ## Layer lookup: `MphfLayer::find` `MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read. The caller (`QueryLayer`) never chooses the dispatch path: it is fixed at open time by `LayerEvidence`. See [obilayeredmap](../implementation/obilayeredmap.md) for the full `find` / `find_strict` API. ### `QueryLayer` variant selection `QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`: | Condition | Variant | Data returned per k-mer | |---|---|---| | `with_counts=true` and `counts/` exists | `Count` | raw count per genome | | `presence/` exists | `Presence` | 0/1 per genome (bit matrix) | | only `counts/` exists | `Count` | counts used as-is | | neither exists | `SetOnly` | 1 for every genome | --- ## Presence / count mode at query time The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores: ``` genome_totals[g] += if presence { u32::from(v >= threshold) } else { v } ``` `presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match. --- ## Coverage vectors (`--detail`) When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated before the partition loop, with dimensions derived from `batch.n_kmers`: ``` cov[seq_idx][g][abs_pos] += contribution where abs_pos = desc.kmer_offset + local_pos (absolute kmer position in the sequence) ``` Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key `"coverage"`. --- ## `kmer_missing` semantics `kmer_missing` counts k-mers that returned `None` from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing. --- ## Output format Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line. ``` >read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}} ATCGATCG... ``` With `--detail`: ``` >read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}} ATCGATCG... ``` Genome keys follow the iteration order of `meta.genomes`. --- ## Annotation schema | Key | Type | Condition | Semantics | |---|---|---|---| | `kmer_count` | int | always | k-mers confirmed (post-Findere) with at least one genome match | | `kmer_missing` | int | `--count-missing` | k-mers absent from the index entirely (pre-Findere None) | | `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) | | `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) | `kmer_count + kmer_missing` ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter. --- ## CLI ``` obikmer query [--detail] [--mismatch] [--count-missing] [--force-presence] [--presence-threshold ] [-z ] [-T ] [ ...] ``` | Option | Default | Semantics | |---|---|---| | `-z` / `--findere-z` | from index metadata | Override Findere z parameter | | `--detail` | off | Emit per-position coverage vectors in JSON | | `--count-missing` | off | Add `kmer_missing` field to JSON | | `--force-presence` | off | Report 0/1 per genome regardless of index counts | | `--presence-threshold` | 1 | Minimum count to declare genome present | | `-T` / `--threads` | all CPUs | Worker threads (parallelism not yet active) | `--mismatch` is accepted but currently ignored with a warning on stderr. --- ## Future work - **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently. - **Read classification** (`--classify`): assign each read to the genome with the highest match score. - **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value. - **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.