Files

T

Eric Coissac 694da5208e feat: add Findere z-window filtering and detail mode coverage tracking

Introduces the `--findere-z` CLI flag to override the index's sliding window parameter and implements `apply_findere` to filter k-mer hits using a z-consecutive positions window. Adds structural support for `--detail` mode, including per-sequence k-mer offsets, conditional allocation of per-position coverage vectors, and JSON serialization. Updates architecture documentation, CLI references, and annotation schemas to align with the new implementation, resolving prior discrepancies with `--detail` and `--mismatch` flags.

2026-05-26 15:43:17 +02:00

7.8 KiB

Raw Blame History

Query system

Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.

Input

Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
Sequences shorter than k bases are silently skipped.
Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).

Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

for each batch of sequences:
    build QueryBatch: decompose all sequences into superkmers, deduplicate
    split superkmers by partition via minimiser hash
    for each partition p:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each kmer in each superkmer: MphfLayer::find(kmer)
        apply_findere(sk_kmer_results, effective_z)   ← per superkmer
        broadcast confirmed results back to each (seq_idx, kmer_offset)
    emit annotated sequences

Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique RoutableSuperKmer is queried once per partition, and the result is broadcast to every SKDesc entry that references it.

Parallelism is not yet active in the current implementation: batches are processed sequentially on a single thread despite the --threads flag being parsed. The QueryBatch / split_by_partition design is structured to support per-partition parallelism in a future iteration.

Findere z-window filter

For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring z consecutive k-mers to all test positive before any of them is counted as a confirmed match.

apply_findere implements this as a sliding-window confirmation, independently for each genome:

fn apply_findere(
    results: &[Option<Box<[u32]>>],
    z: usize,
    n_genomes: usize,
) -> Vec<Option<Box<[u32]>>>

For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions [j, j+z) that contains i and where all z positions are present for g (i.e. results[pos] is Some(row) and row[g] > 0). The implementation uses a single O(n) sliding-window scan per genome.

Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as None.

Short superkmers: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, apply_findere returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends.

Exact indexes: z is effectively 1 (every k-mer is its own window). apply_findere is a no-op.

Effective z at query time

effective_z is resolved at the start of run():

let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence {
    IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
    IndexMode::Exact => 1,
});

The -z CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.

Layer lookup: `MphfLayer::find`

MphfLayer::open(dir, mode: &IndexMode) receives the mode from PartitionMeta — no per-layer file is read. The caller (QueryLayer) never chooses the dispatch path: it is fixed at open time by LayerEvidence. See obilayeredmap for the full find / find_strict API.

`QueryLayer` variant selection

QueryLayer::open in query_layer.rs selects the data matrix to pair with MphfLayer:

Condition	Variant	Data returned per k-mer
`with_counts=true` and `counts/` exists	`Count`	raw count per genome
`presence/` exists	`Presence`	0/1 per genome (bit matrix)
only `counts/` exists	`Count`	counts used as-is
neither exists	`SetOnly`	1 for every genome

Presence / count mode at query time

The --force-presence flag and --presence-threshold control how per-genome values are accumulated, independently of what the index stores:

genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }

presence is true when --force-presence is set or when the index has no counts (!with_counts). The default presence_threshold is 1, so any nonzero count counts as a match.

Coverage vectors (`--detail`)

When --detail is requested, a 3-D accumulator cov[seq_idx][genome][kmer_pos] is allocated before the partition loop, with dimensions derived from batch.n_kmers:

cov[seq_idx][g][abs_pos] += contribution
where abs_pos = desc.kmer_offset + local_pos  (absolute kmer position in the sequence)

Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key "coverage".

`kmer_missing` semantics

kmer_missing counts k-mers that returned None from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing.

Output format

Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.

>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}
ATCGATCG...

With --detail:

>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}
ATCGATCG...

Genome keys follow the iteration order of meta.genomes.

Annotation schema

Key	Type	Condition	Semantics
`kmer_count`	int	always	k-mers confirmed (post-Findere) with at least one genome match
`kmer_missing`	int	`--count-missing`	k-mers absent from the index entirely (pre-Findere None)
`kmer_strict_matches`	object	always	per-genome accumulated value (label → count or 0/1)
`coverage`	object	`--detail`	per-genome array of per-position contributions (label → [u32])

kmer_count + kmer_missing ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter.

CLI

obikmer query <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
              [-z <z>] [-T <threads>]
              <query.fa> [<query2.fa> ...]

Option	Default	Semantics
`-z` / `--findere-z`	from index metadata	Override Findere z parameter
`--detail`	off	Emit per-position coverage vectors in JSON
`--count-missing`	off	Add `kmer_missing` field to JSON
`--force-presence`	off	Report 0/1 per genome regardless of index counts
`--presence-threshold`	1	Minimum count to declare genome present
`-T` / `--threads`	all CPUs	Worker threads (parallelism not yet active)

--mismatch is accepted but currently ignored with a warning on stderr.

Future work

--mismatch: 1-mismatch approximate matching — generate 3·k single-substitution variants per k-mer, look each up independently.
Read classification (--classify): assign each read to the genome with the highest match score.
Parallelism: activate per-partition or per-sequence worker threads using the already-parsed --threads value.
Whitelist / blacklist filtering: threshold-based accept/reject on per-genome match scores.

7.8 KiB Raw Blame History