Files

T

Eric Coissac 13599dd444 feat: Implement query subcommand for sequence-to-genome mapping

This change introduces the `query` CLI command and its supporting infrastructure for sequence-to-genome mapping and k-mer matching. It adds a `QueryLayer` abstraction backed by MPHF and persistent matrices, exposes the index partition for direct querying, and implements `Hash`/`Eq` for `RoutableSuperKmer`. The command ingests sequence batches, deduplicates superkmers, routes them to index partitions for parallel exact or 1-mismatch matching, and outputs results as FASTA records annotated with JSON metadata. Includes `serde_json` dependency addition, module exports, and documentation updates.

2026-05-21 18:56:41 +02:00

4.8 KiB

Raw Blame History

Query system

Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.

Input

Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
Sequences shorter than k bases are silently skipped.
Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).

Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

for each query sequence:
    decompose into superkmers  (non-ACGT breaks, same minimiser scheme as indexing)
    for each superkmer:
        route to partition p via minimiser hash
        for each kmer in the superkmer:
            lookup kmer in partition p  (MPHF → evidence check → matrix row)
            accumulate result into per-sequence accumulators
    emit annotated sequence

Parallelism is per sequence: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.

Exact vs. approximate matching

Exact (default)

Standard MPHF lookup followed by evidence check. O(1) per k-mer.

1-mismatch (`--mismatch` flag)

For each k-mer of the query, generate all 3·k single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are summed into the result for that k-mer position.

If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
Exact and approximate matches are tracked separately in the output (see annotation schema below).
The superkmer routing optimisation is not used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
Cost: up to 3·k MPHF probes per k-mer position vs. 1 in exact mode.

Output format

Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.

>read_id {"kmer_total":150,"kmer_found":59,...}
ATCGATCG...

Genome order in all list-valued annotations follows the genome order recorded in index.meta.

Annotation schema

Summary mode (default)

Key	Type	Condition	Semantics
`kmer_total`	int	always	total k-mers in the (masked) sequence
`kmer_found`	int	always	k-mers with at least one match (exact or approx)
`kmer_missing`	int	`--count-missing`	k-mers absent from the index
`kmer_match`	list[int]	always	per-genome matched k-mer count (exact + approx)
`kmer_match_exact`	list[int]	`--mismatch`	per-genome exact match count
`kmer_match_approx`	list[int]	`--mismatch`	per-genome approx match count
`count_match`	list[int]	count index	per-genome sum of index counts for matched k-mers

kmer_match[i] is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.

count_match[i] sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.

Detail mode (`--detail`)

All summary keys, plus per-position coverage vectors — one list per genome, length len(sequence) − k + 1:

Key	Type	Condition	Semantics
`cov_<i>`	list[int]	`--detail`	coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent
`cov_exact_<i>`	list[int]	`--detail` + `--mismatch`	exact-match contribution per position
`cov_approx_<i>`	list[int]	`--detail` + `--mismatch`	approx-match contribution per position

Genome indices in key names are 0-based integers matching the index.meta genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.

CLI

obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>

--summary is the default; --detail implies --summary (all summary keys are always present).

Future work

Read classification (--classify): assign each read to the genome with the highest kmer_match score; emit as a single annotation key.
Whitelist / blacklist filtering: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.

4.8 KiB Raw Blame History Unescape Escape