
Query system

Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.


Input

  • Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). GenBank flat files are not supported at query time (only at index time).
  • Sequences shorter than k bases are silently skipped.
  • Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
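The hard-break behaviour can be illustrated with a minimal sketch. The helper name `acgt_segments` is hypothetical and not part of obikmer, and treating lowercase bases as valid is an assumption; the real decomposition happens inside the superkmer layer.

```rust
/// Split a sequence into maximal runs of A/C/G/T and keep only the runs
/// that are long enough to contain at least one k-mer.
/// Hypothetical helper for illustration; case-insensitivity is an assumption.
fn acgt_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|b| !matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        // every non-ACGT byte is a hard break; runs shorter than k yield nothing
        .filter(|run| !run.is_empty() && run.len() >= k)
        .collect()
}
```

With k = 4, the sequence `ACGTNACGTAC` yields the two segments `ACGT` and `ACGTAC`; no k-mer ever spans the `N`.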

Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

for each chunk of sequences (parallel workers via obipipeline):
    build QueryBatch: decompose all sequences into s-mers via superkmers, deduplicate
    allocate seq_results[seq_idx][smer_pos] = None  ← per-sequence s-mer result vectors
    split superkmers by partition via minimiser hash
    for each partition p:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each s-mer in each superkmer: MphfLayer::find(smer)
        fill seq_results[seq_idx][kmer_offset + j] from partition results
    for each sequence:
        apply_findere(seq_results[seq_idx], effective_z)  ← per full sequence
        accumulate confirmed k-mer results into acc and cov
    emit annotated sequences

Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique RoutableSuperKmer is queried once per partition, and the result is broadcast to every SKDesc entry that references it.
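The deduplicate-then-broadcast pattern described above might look roughly like the sketch below. It is not the obikmer implementation: a `String` stands in for RoutableSuperKmer, a `(seq_idx, offset)` tuple stands in for SKDesc, and both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Group every occurrence (seq_idx, offset, superkmer) by superkmer.
/// `String` stands in for RoutableSuperKmer, the tuple for SKDesc (assumptions).
fn dedup_superkmers(
    occurrences: &[(usize, usize, &str)],
) -> HashMap<String, Vec<(usize, usize)>> {
    let mut unique: HashMap<String, Vec<(usize, usize)>> = HashMap::new();
    for &(seq_idx, offset, sk) in occurrences {
        unique.entry(sk.to_string()).or_default().push((seq_idx, offset));
    }
    unique
}

/// Look each unique superkmer up once and copy the answer to all its users.
fn query_and_broadcast<F>(
    unique: &HashMap<String, Vec<(usize, usize)>>,
    mut lookup: F,
) -> Vec<(usize, usize, bool)>
where
    F: FnMut(&str) -> bool,
{
    let mut out = Vec::new();
    for (sk, users) in unique {
        let hit = lookup(sk.as_str()); // exactly one lookup per unique superkmer
        for &(seq_idx, offset) in users {
            out.push((seq_idx, offset, hit));
        }
    }
    out.sort(); // deterministic order for the caller
    out
}
```

The lookup cost then scales with the number of distinct superkmers in the batch rather than with the number of occurrences.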

Findere requires full-sequence aggregation. apply_findere is applied once per sequence on the complete s-mer result vector, after all partitions have contributed. Applying it per superkmer would produce false negatives at superkmer boundaries, where the z-window spans two superkmers.

Batches are processed in parallel via obipipeline workers; the --threads flag controls the number of worker threads.


Findere z-window filter

For approximate index modes, the index physically stores s-mers of size s = k_user − z + 1. At query time, set_k(s) is in effect, so queries naturally produce s-mer results. apply_findere then aggregates z consecutive s-mer results into one k_user-mer answer:

fn apply_findere(
    results: &[Option<Box<[u32]>>],  // N s-mer results
    z: usize,
    n_genomes: usize,
) -> Vec<Option<Box<[u32]>>>         // N − z + 1 k_user-mer results

Input length N (s-mers), output length N − z + 1 (k_user-mers).
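As a consistency check on these lengths, consider a sequence of L bases with no non-ACGT break (L is a symbol introduced here for illustration). Substituting the definition of s shows that the output length is exactly the number of k_user-mers in the sequence:

```latex
\begin{aligned}
s &= k_{\text{user}} - z + 1 \\
N &= L - s + 1 = L - k_{\text{user}} + z \\
N - z + 1 &= L - k_{\text{user}} + 1
\end{aligned}
```

For example, L = 100, k_user = 31 and z = 3 give s = 29, N = 72 s-mers and 70 k_user-mers.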

For each genome g independently, a sliding window of size z scans the input. Output position i is confirmed for genome g iff all z values results[i..i+z][g] are nonzero (None counts as zero for all genomes). The scan is O(N) per genome, where N is the number of input s-mers.

Output values come from results[i] (leftmost s-mer of each window); genomes not confirmed are zeroed. If all genomes are zero, the position is returned as None.

Short sequences: when the s-mer count is less than z, no complete window can form — apply_findere returns an empty vector. K-mers from sequences shorter than k_user are not emitted.

Exact indexes: z = 1, apply_findere is a passthrough (output length = input length).
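The rules above can be put together in a minimal sketch. It follows the signature shown earlier but is only one possible implementation, not necessarily the one in obikmer. It assumes every `Some` row holds exactly `n_genomes` values, and returning an empty vector for z = 0 is an assumption of this sketch.

```rust
/// Sketch of the Findere z-window filter: one linear pass per genome,
/// tracking the length of the current run of nonzero s-mer values.
fn apply_findere(
    results: &[Option<Box<[u32]>>],
    z: usize,
    n_genomes: usize,
) -> Vec<Option<Box<[u32]>>> {
    let n = results.len();
    if z == 0 || n < z {
        return Vec::new(); // no complete window can form
    }
    let mut out = vec![vec![0u32; n_genomes]; n - z + 1];
    for g in 0..n_genomes {
        let mut run = 0usize; // nonzero run length ending at position j
        for (j, r) in results.iter().enumerate() {
            let v = r.as_ref().map_or(0, |row| row[g]); // None counts as zero
            run = if v != 0 { run + 1 } else { 0 };
            if run >= z {
                // the window starting at i = j + 1 - z is fully nonzero for g;
                // the reported value comes from the leftmost s-mer
                let i = j + 1 - z;
                out[i][g] = results[i].as_ref().map_or(0, |row| row[g]);
            }
        }
    }
    out.into_iter()
        .map(|row| {
            if row.iter().all(|&v| v == 0) {
                None // no genome confirmed at this position
            } else {
                Some(row.into_boxed_slice())
            }
        })
        .collect()
}
```

With z = 1 every nonzero value confirms itself, so the output has the same length as the input.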

Effective z at query time

effective_z is resolved at the start of run():

let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence {
    IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
    IndexMode::Exact => 1,
});

The -z CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.
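The resolution logic can be exercised in isolation with a self-contained sketch. The `IndexMode` enum below is a simplified stand-in whose field layout is an assumption, and `window_span` is a hypothetical helper. It only restates that z consecutive overlapping s-mers cover s + z − 1 bases, so overriding z against a fixed stored s also changes the length of the k-mer being confirmed.

```rust
/// Simplified stand-in for the index evidence mode (field layout assumed).
enum IndexMode {
    Exact,
    Approx { z: u8 },
    Hybrid { z: u8 },
}

/// CLI override first, index metadata otherwise, 1 for exact indexes.
fn effective_z(cli_z: Option<usize>, mode: &IndexMode) -> usize {
    cli_z.unwrap_or_else(|| match mode {
        IndexMode::Approx { z } | IndexMode::Hybrid { z } => *z as usize,
        IndexMode::Exact => 1,
    })
}

/// Number of bases covered by a window of `z` consecutive s-mers (z >= 1).
fn window_span(s: usize, z: usize) -> usize {
    s + z - 1
}
```

For an index built with s = 29 and z = 3, the default window confirms 31-mers, whereas `-z 5` would confirm 33-mers.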


Layer lookup: MphfLayer::find

MphfLayer::open(dir, mode: &IndexMode) receives the mode from PartitionMeta — no per-layer file is read. The caller (QueryLayer) never chooses the dispatch path: it is fixed at open time by LayerEvidence. See obilayeredmap for the full find / find_strict API.

QueryLayer variant selection

QueryLayer::open in query_layer.rs selects the data matrix to pair with MphfLayer:

| Condition | Variant | Data returned per k-mer |
| --- | --- | --- |
| with_counts=true and counts/ exists | Count | raw count per genome |
| presence/ exists | Presence | 0/1 per genome (bit matrix) |
| only counts/ exists | Count | counts used as-is |
| neither exists | SetOnly | 1 for every genome |
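Read top to bottom with the first matching row winning (an assumption about how the conditions are evaluated), the selection reduces to a small decision function. The `Variant` enum and `select_variant` are illustrative names, not the actual QueryLayer API.

```rust
/// Illustrative stand-in for the QueryLayer variants.
#[derive(Debug, PartialEq)]
enum Variant {
    Count,
    Presence,
    SetOnly,
}

/// Pick the data matrix to pair with the MPHF, first matching rule wins.
fn select_variant(with_counts: bool, has_counts: bool, has_presence: bool) -> Variant {
    if with_counts && has_counts {
        Variant::Count // raw count per genome
    } else if has_presence {
        Variant::Presence // 0/1 per genome
    } else if has_counts {
        Variant::Count // only counts/ exists: counts used as-is
    } else {
        Variant::SetOnly // 1 for every genome
    }
}
```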

Presence / count mode at query time

The --force-presence flag and --presence-threshold control how per-genome values are accumulated, independently of what the index stores:

genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }

presence is true when --force-presence is set or when the index has no counts (!with_counts). The default presence_threshold is 1, so any nonzero count counts as a match.
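A minimal sketch of this accumulation over one confirmed k-mer is shown below. The function name `accumulate` and its parameter list are hypothetical; only the inner expression comes from the text above.

```rust
/// Add the per-genome values of one confirmed k-mer to the running totals.
/// In presence mode each genome contributes 0 or 1; otherwise the raw value.
fn accumulate(genome_totals: &mut [u32], values: &[u32], presence: bool, threshold: u32) {
    for (total, &v) in genome_totals.iter_mut().zip(values) {
        *total += if presence { u32::from(v >= threshold) } else { v };
    }
}
```

With values `[3, 0, 1]`, count mode adds 3, 0 and 1, while presence mode with a threshold of 2 adds 1, 0 and 0.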


Coverage vectors (--detail)

When --detail is requested, a 3-D accumulator cov[seq_idx][genome][kmer_pos] is allocated after all partitions are queried, with dimensions derived from n_kmers_out = n_smers − z + 1 (k_user-mer positions, not s-mer positions):

cov[seq_idx][g][pos] += contribution
where pos is the k_user-mer index in the filtered (post-Findere) vector

Coverage reflects confirmed k_user-mers only. The vectors are emitted in the JSON annotation under the key "coverage".
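For a single sequence, filling the coverage slice could be sketched as follows. The helper name is hypothetical, and the "contribution" is assumed here to be the raw per-genome value of the filtered result; in presence mode it would presumably be 0/1 instead.

```rust
/// Build cov[g][pos] for one sequence from its post-Findere result vector.
/// `filtered` has one entry per k_user-mer position (not per s-mer).
fn coverage_for_sequence(
    filtered: &[Option<Box<[u32]>>],
    n_genomes: usize,
) -> Vec<Vec<u32>> {
    let mut cov = vec![vec![0u32; filtered.len()]; n_genomes];
    for (pos, r) in filtered.iter().enumerate() {
        if let Some(row) = r {
            for (g, &v) in row.iter().enumerate().take(n_genomes) {
                cov[g][pos] += v; // unconfirmed genomes were already zeroed
            }
        }
    }
    cov
}
```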


kmer_missing semantics

kmer_missing counts k_user-mer positions where the first s-mer (seq_results[seq_idx][pos]) is None, i.e. absent from the index entirely. K-mers where the z-window fails because a later s-mer is absent or zero are not counted as missing: the presence of the first s-mer is used as a proxy for index membership.


Output format

Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.

>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}
ATCGATCG...

With --detail:

>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}
ATCGATCG...

Genome keys follow the iteration order of meta.genomes.
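The title line of the default output can be assembled with plain string formatting, as in the sketch below. `title_line` is a hypothetical helper, not the obikmer writer, and it assumes read identifiers and genome labels contain no characters that need JSON escaping.

```rust
/// Format an OBITools4-style title line with the two always-present keys.
/// Assumes labels need no JSON escaping (illustration only).
fn title_line(read_id: &str, kmer_count: usize, genomes: &[(&str, u32)]) -> String {
    // genomes are emitted in the order given, mirroring meta.genomes order
    let matches: Vec<String> = genomes
        .iter()
        .map(|(label, v)| format!("\"{}\":{}", label, v))
        .collect();
    format!(
        ">{} {{\"kmer_count\":{},\"kmer_strict_matches\":{{{}}}}}",
        read_id,
        kmer_count,
        matches.join(",")
    )
}
```

Calling it with `("read_id", 59, [("genome_a", 42), ("genome_b", 7)])` reproduces the first title line shown above.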


Annotation schema

| Key | Type | Condition | Semantics |
| --- | --- | --- | --- |
| kmer_count | int | always | k-mers confirmed (post-Findere) with at least one genome match |
| kmer_missing | int | --count-missing | k-mers absent from the index entirely (pre-Findere None) |
| kmer_strict_matches | object | always | per-genome accumulated value (label → count or 0/1) |
| coverage | object | --detail | per-genome array of per-position contributions (label → [u32]) |

kmer_count + kmer_missing ≤ total k_user-mers in the sequence. The gap corresponds to k_user-mers whose z-window was not fully confirmed (at least one s-mer absent or zero for all genomes) but whose first s-mer was present in the index.
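The three-way split behind this inequality can be made explicit with a small tally over one sequence. The `Tally` struct and `tally` function are hypothetical names. `pre` is the s-mer result vector before Findere, and `post` is the filtered vector, assumed to be no longer than `pre`.

```rust
/// Per-sequence breakdown of k_user-mer positions (illustrative names).
struct Tally {
    kmer_count: usize,   // confirmed for at least one genome
    kmer_missing: usize, // first s-mer absent from the index
    unconfirmed: usize,  // first s-mer present, window not confirmed
}

/// Classify each k_user-mer position into exactly one of the three bins.
fn tally(pre: &[Option<Box<[u32]>>], post: &[Option<Box<[u32]>>]) -> Tally {
    let mut t = Tally { kmer_count: 0, kmer_missing: 0, unconfirmed: 0 };
    for (pos, confirmed) in post.iter().enumerate() {
        if confirmed.is_some() {
            t.kmer_count += 1;
        } else if pre[pos].is_none() {
            t.kmer_missing += 1;
        } else {
            t.unconfirmed += 1; // the gap between the sum and the total
        }
    }
    t
}
```

The three bins are disjoint and sum to the number of k_user-mer positions, so the gap is exactly `unconfirmed`.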


CLI

obikmer query <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
              [-z <z>] [-T <threads>]
              <query.fa> [<query2.fa> ...]

| Option | Default | Semantics |
| --- | --- | --- |
| -z / --findere-z | from index metadata | Override Findere z parameter |
| --detail | off | Emit per-position coverage vectors in JSON |
| --count-missing | off | Add kmer_missing field to JSON |
| --force-presence | off | Report 0/1 per genome regardless of index counts |
| --presence-threshold | 1 | Minimum count to declare genome present |
| -T / --threads | all CPUs | Worker threads |

--mismatch is accepted but currently ignored with a warning on stderr.


Future work

  • --mismatch: 1-mismatch approximate matching — generate 3·k single-substitution variants per k-mer, look each up independently.
  • Read classification (--classify): assign each read to the genome with the highest match score.
  • Whitelist / blacklist filtering: threshold-based accept/reject on per-genome match scores.
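For the --mismatch item above, enumerating the 3·k single-substitution variants of a k-mer is straightforward. The sketch below is only an illustration of that enumeration, not a planned API. The function name is hypothetical and the input is assumed to be uppercase ACGT.

```rust
/// Enumerate all k-mers that differ from `kmer` by exactly one substitution.
/// Assumes uppercase A/C/G/T input, which gives 3 * k variants.
fn single_substitution_variants(kmer: &[u8]) -> Vec<Vec<u8>> {
    const BASES: [u8; 4] = *b"ACGT";
    let mut variants = Vec::with_capacity(3 * kmer.len());
    for (i, &orig) in kmer.iter().enumerate() {
        // three alternative bases at each of the k positions
        for &b in BASES.iter().filter(|&&b| b != orig) {
            let mut v = kmer.to_vec();
            v[i] = b;
            variants.push(v);
        }
    }
    variants
}
```

Each variant would then be looked up independently, so a query with 1-mismatch matching costs roughly 3·k + 1 lookups per k-mer.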