Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). GenBank flat files are not supported at query time (only at index time).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
---
## Algorithm
The query follows the same superkmer-based partitioning strategy used at indexing time.
Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.
**Findere requires full-sequence aggregation.**`apply_findere` is applied once per sequence on the complete s-mer result vector, after all partitions have contributed. Applying it per superkmer would produce false negatives at superkmer boundaries, where the z-window spans two superkmers.
Batches are processed in parallel via `obipipeline` workers; the `--threads` flag controls the number of worker threads.
For approximate index modes, the index physically stores s-mers of size `s = k_user − z + 1`. At query time, `set_k(s)` is in effect, so queries naturally produce s-mer results. `apply_findere` then aggregates z consecutive s-mer results into one k_user-mer answer:
Input length N (s-mers), output length N − z + 1 (k_user-mers).
For each genome g independently, a sliding window of size z scans the input. Output position i is confirmed for genome g iff all z values `results[i..i+z][g]` are nonzero (`None` counts as zero for all genomes). The scan is O(n) per genome.
Output values come from `results[i]` (leftmost s-mer of each window); genomes not confirmed are zeroed. If all genomes are zero, the position is returned as `None`.
**Short sequences**: when the s-mer count is less than z, no complete window can form — `apply_findere` returns an empty vector. K-mers from sequences shorter than k_user are not emitted.
The `-z` CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.
---
## Layer lookup: `MphfLayer::find`
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read. The caller (`QueryLayer`) never chooses the dispatch path: it is fixed at open time by `LayerEvidence`. See [obilayeredmap](../implementation/obilayeredmap.md) for the full `find` / `find_strict` API.
`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match.
When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated after all partitions are queried, with dimensions derived from `n_kmers_out = n_smers − z + 1` (k_user-mer positions, not s-mer positions):
`kmer_missing` counts k_user-mer positions where the first s-mer (`seq_results[seq_idx][pos]`) is `None` — i.e. absent from the index entirely. K-mers where the z-window fails because a later s-mer is absent or zero are not counted as missing (the first s-mer being present is used as proxy for index membership).
`kmer_count + kmer_missing` ≤ total k_user-mers in the sequence. The gap corresponds to k_user-mers whose z-window was not fully confirmed (at least one s-mer absent or zero for all genomes) but whose first s-mer was present in the index.