Files

T

Eric Coissac 8a0b898b4b docs: clarify query pipeline, Findere trick, and input formats

Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.

2026-05-30 15:59:12 +02:00

8.1 KiB

Raw Blame History

Query system

Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.

Input

Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). GenBank flat files are not supported at query time (only at index time).
Sequences shorter than k bases are silently skipped.
Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).

Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

for each chunk of sequences (parallel workers via obipipeline):
    build QueryBatch: decompose all sequences into s-mers via superkmers, deduplicate
    allocate seq_results[seq_idx][smer_pos] = None  ← per-sequence s-mer result vectors
    split superkmers by partition via minimiser hash
    for each partition p:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each s-mer in each superkmer: MphfLayer::find(smer)
        fill seq_results[seq_idx][kmer_offset + j] from partition results
    for each sequence:
        apply_findere(seq_results[seq_idx], effective_z)  ← per full sequence
        accumulate confirmed k-mer results into acc and cov
    emit annotated sequences

Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique RoutableSuperKmer is queried once per partition, and the result is broadcast to every SKDesc entry that references it.

Findere requires full-sequence aggregation. apply_findere is applied once per sequence on the complete s-mer result vector, after all partitions have contributed. Applying it per superkmer would produce false negatives at superkmer boundaries, where the z-window spans two superkmers.

Batches are processed in parallel via obipipeline workers; the --threads flag controls the number of worker threads.

Findere z-window filter

For approximate index modes, the index physically stores s-mers of size s = k_user − z + 1. At query time, set_k(s) is in effect, so queries naturally produce s-mer results. apply_findere then aggregates z consecutive s-mer results into one k_user-mer answer:

fn apply_findere(
    results: &[Option<Box<[u32]>>],  // N s-mer results
    z: usize,
    n_genomes: usize,
) -> Vec<Option<Box<[u32]>>>         // N − z + 1 k_user-mer results

Input length N (s-mers), output length N − z + 1 (k_user-mers).

For each genome g independently, a sliding window of size z scans the input. Output position i is confirmed for genome g iff all z values results[i..i+z][g] are nonzero (None counts as zero for all genomes). The scan is O(n) per genome.

Output values come from results[i] (leftmost s-mer of each window); genomes not confirmed are zeroed. If all genomes are zero, the position is returned as None.

Short sequences: when the s-mer count is less than z, no complete window can form — apply_findere returns an empty vector. K-mers from sequences shorter than k_user are not emitted.

Exact indexes: z = 1, apply_findere is a passthrough (output length = input length).

Effective z at query time

effective_z is resolved at the start of run():

let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence {
    IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
    IndexMode::Exact => 1,
});

The -z CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.

Layer lookup: `MphfLayer::find`

MphfLayer::open(dir, mode: &IndexMode) receives the mode from PartitionMeta — no per-layer file is read. The caller (QueryLayer) never chooses the dispatch path: it is fixed at open time by LayerEvidence. See obilayeredmap for the full find / find_strict API.

`QueryLayer` variant selection

QueryLayer::open in query_layer.rs selects the data matrix to pair with MphfLayer:

Condition	Variant	Data returned per k-mer
`with_counts=true` and `counts/` exists	`Count`	raw count per genome
`presence/` exists	`Presence`	0/1 per genome (bit matrix)
only `counts/` exists	`Count`	counts used as-is
neither exists	`SetOnly`	1 for every genome

Presence / count mode at query time

The --force-presence flag and --presence-threshold control how per-genome values are accumulated, independently of what the index stores:

genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }

presence is true when --force-presence is set or when the index has no counts (!with_counts). The default presence_threshold is 1, so any nonzero count counts as a match.

Coverage vectors (`--detail`)

When --detail is requested, a 3-D accumulator cov[seq_idx][genome][kmer_pos] is allocated after all partitions are queried, with dimensions derived from n_kmers_out = n_smers − z + 1 (k_user-mer positions, not s-mer positions):

cov[seq_idx][g][pos] += contribution
where pos is the k_user-mer index in the filtered (post-Findere) vector

Coverage reflects confirmed k_user-mers only. The vectors are emitted in the JSON annotation under the key "coverage".

`kmer_missing` semantics

kmer_missing counts k_user-mer positions where the first s-mer (seq_results[seq_idx][pos]) is None — i.e. absent from the index entirely. K-mers where the z-window fails because a later s-mer is absent or zero are not counted as missing (the first s-mer being present is used as proxy for index membership).

Output format

Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.

>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}
ATCGATCG...

With --detail:

>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}
ATCGATCG...

Genome keys follow the iteration order of meta.genomes.

Annotation schema

Key	Type	Condition	Semantics
`kmer_count`	int	always	k-mers confirmed (post-Findere) with at least one genome match
`kmer_missing`	int	`--count-missing`	k-mers absent from the index entirely (pre-Findere None)
`kmer_strict_matches`	object	always	per-genome accumulated value (label → count or 0/1)
`coverage`	object	`--detail`	per-genome array of per-position contributions (label → [u32])

kmer_count + kmer_missing ≤ total k_user-mers in the sequence. The gap corresponds to k_user-mers whose z-window was not fully confirmed (at least one s-mer absent or zero for all genomes) but whose first s-mer was present in the index.

CLI

obikmer query <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
              [-z <z>] [-T <threads>]
              <query.fa> [<query2.fa> ...]

Option	Default	Semantics
`-z` / `--findere-z`	from index metadata	Override Findere z parameter
`--detail`	off	Emit per-position coverage vectors in JSON
`--count-missing`	off	Add `kmer_missing` field to JSON
`--force-presence`	off	Report 0/1 per genome regardless of index counts
`--presence-threshold`	1	Minimum count to declare genome present
`-T` / `--threads`	all CPUs	Worker threads

--mismatch is accepted but currently ignored with a warning on stderr.

Future work

--mismatch: 1-mismatch approximate matching — generate 3·k single-substitution variants per k-mer, look each up independently.
Read classification (--classify): assign each read to the genome with the highest match score.
Whitelist / blacklist filtering: threshold-based accept/reject on per-genome match scores.

8.1 KiB Raw Blame History Unescape Escape