Files

T

Eric Coissac 036d044291 refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.

2026-05-26 10:04:25 +02:00

6.6 KiB

Raw Blame History

Query system

Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.

Input

Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
Sequences shorter than k bases are silently skipped.
Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).

Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

for each batch of sequences:
    build QueryBatch: decompose all sequences into superkmers, deduplicate
    split superkmers by partition via minimiser hash
    for each partition p:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each kmer in each superkmer: MphfLayer::find(kmer)
        broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer
    emit annotated sequences

Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique RoutableSuperKmer is queried once per partition, and the result is broadcast to every SKDesc entry that references it.

Parallelism is not yet active in the current implementation: batches are processed sequentially on a single thread despite the --threads flag being parsed. The QueryBatch / split_by_partition design is structured to support per-partition parallelism in a future iteration.

Layer lookup: `MphfLayer::find`

MphfLayer::open reads layer_meta.json and loads either exact or approximate evidence. The caller (QueryLayer::find) never chooses the dispatch path — it is fixed at open time by LayerEvidence:

pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
    match &self.ev {
        LayerEvidence::Exact { .. }  => self.find_exact(kmer),
        LayerEvidence::Approx { .. } => self.find_approx(kmer),
    }
}

Exact layers

find_exact maps the k-mer through the MPHF to a slot, then calls UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer) to confirm the stored k-mer matches. Zero false positives. Requires UnitigFileReader::open() (random-access via .idx); open_sequential() cannot serve random-access verification.

Approximate layers

find_approx maps the k-mer through the MPHF, then checks a stored b-bit fingerprint of the canonical hash. False-positive rate: 1/2^b per k-mer query. No .idx file is needed; the layer carries only fingerprint.bin.

For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is 1/2^(b·z). The z parameter is recorded in layer_meta.json at build time but is not enforced during querying — the caller is responsible for interpreting window-level results accordingly.

`QueryLayer` variant selection

QueryLayer::open in query_layer.rs selects the data matrix to pair with MphfLayer:

Condition	Variant	Data returned per k-mer
`with_counts=true` and `counts/` exists	`Count`	raw count per genome
`presence/` exists	`Presence`	0/1 per genome (bit matrix)
only `counts/` exists	`Count`	counts used as-is
neither exists	`SetOnly`	1 for every genome

`open()` vs `open_sequential()`

UnitigFileReader::open() loads the .idx block-offset table, enabling random access to individual unitig chunks. It is required whenever verify_canonical_kmer is called (exact layers at query time).

UnitigFileReader::open_sequential() skips the .idx and supports only forward iteration. It is sufficient for:

build passes that scan all unitigs sequentially (build_exact_evidence, build_approx_evidence);
the unitig subcommand, which iterates and prints unitigs without random access.

KmerIndex::open() (called by query::run) triggers MphfLayer::open for each layer, which calls UnitigFileReader::open() for exact layers. Approximate layers do not open a unitig reader at all.

Presence / count mode at query time

The --force-presence flag and --presence-threshold control how per-genome values are accumulated, independently of what the index stores:

genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }

presence is true when --force-presence is set or when the index has no counts (!with_counts). The default presence_threshold is 1, so any nonzero count counts as a match.

Output format

Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.

>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}}
ATCGATCG...

Genome keys in kmer_strict_matches are genome labels from index.meta. Key order follows iteration order of meta.genomes.

Annotation schema (current implementation)

Key	Type	Condition	Semantics
`kmer_count`	int	always	k-mers with at least one match
`kmer_missing`	int	`--count-missing`	k-mers absent from every layer
`kmer_strict_matches`	object	always	per-genome accumulated value (label → count or 0/1)

kmer_count counts matched k-mer positions (incremented once per Some(row) hit regardless of how many genomes are covered). kmer_missing counts None hits.

Note on doc/impl divergence: the doc previously used keys kmer_total, kmer_found, and kmer_match (list). The implementation uses kmer_count (int, matched only) and kmer_strict_matches (object keyed by genome label). --mismatch and --detail are parsed but not yet implemented and emit a warning.

CLI

obikmer query -i <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
              [-T <threads>] <query.fa> [<query2.fa> ...]

--mismatch and --detail are accepted but currently ignored with a warning on stderr.

Future work

--mismatch: 1-mismatch approximate matching — generate 3·k single-substitution variants per k-mer, look each up independently.
--detail: per-position coverage vectors (cov_<i>) per genome.
Read classification (--classify): assign each read to the genome with the highest match score.
Parallelism: activate per-partition or per-sequence worker threads using the already-parsed --threads value.
Whitelist / blacklist filtering: threshold-based accept/reject on per-genome match scores.

6.6 KiB Raw Blame History