docmd/architecture/query.md

# Query system

## Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.

---

## Input

- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).

---

## Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

```
for each batch of sequences:
    build QueryBatch: decompose all sequences into superkmers, deduplicate
    split superkmers by partition via minimiser hash
    for each partition p:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each kmer in each superkmer: MphfLayer::find(kmer)
        broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer
    emit annotated sequences
```

Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.

Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration.

---

## Layer lookup: `MphfLayer::find`

`MphfLayer::open` reads `layer_meta.json` and loads either exact or approximate evidence. The caller (`QueryLayer::find`) never chooses the dispatch path — it is fixed at open time by `LayerEvidence`:

```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
    match &self.ev {
        LayerEvidence::Exact { .. }  => self.find_exact(kmer),
        LayerEvidence::Approx { .. } => self.find_approx(kmer),
    }
}
```

### Exact layers

`find_exact` maps the k-mer through the MPHF to a slot, then calls `UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer)` to confirm the stored k-mer matches. Zero false positives. Requires `UnitigFileReader::open()` (random-access via `.idx`); `open_sequential()` cannot serve random-access verification.

### Approximate layers

`find_approx` maps the k-mer through the MPHF, then checks a stored `b`-bit fingerprint of the canonical hash. False-positive rate: **1/2^b per k-mer query**. No `.idx` file is needed; the layer carries only `fingerprint.bin`.

For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is **1/2^(b·z)**. The `z` parameter is recorded in `layer_meta.json` at build time but is not enforced during querying — the caller is responsible for interpreting window-level results accordingly.

### `QueryLayer` variant selection

`QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`:

| Condition | Variant | Data returned per k-mer |
|---|---|---|
| `with_counts=true` and `counts/` exists | `Count` | raw count per genome |
| `presence/` exists | `Presence` | 0/1 per genome (bit matrix) |
| only `counts/` exists | `Count` | counts used as-is |
| neither exists | `SetOnly` | 1 for every genome |

---

## `open()` vs `open_sequential()`

`UnitigFileReader::open()` loads the `.idx` block-offset table, enabling random access to individual unitig chunks. It is required whenever `verify_canonical_kmer` is called (exact layers at query time).

`UnitigFileReader::open_sequential()` skips the `.idx` and supports only forward iteration. It is sufficient for:
- build passes that scan all unitigs sequentially (`build_exact_evidence`, `build_approx_evidence`);
- the `unitig` subcommand, which iterates and prints unitigs without random access.

`KmerIndex::open()` (called by `query::run`) triggers `MphfLayer::open` for each layer, which calls `UnitigFileReader::open()` for exact layers. Approximate layers do not open a unitig reader at all.

---

## Presence / count mode at query time

The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:

```
genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
```

`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match.

---

## Output format

Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.

```
>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}}
ATCGATCG...
```

Genome keys in `kmer_strict_matches` are genome labels from `index.meta`. Key order follows iteration order of `meta.genomes`.

---

## Annotation schema (current implementation)

| Key | Type | Condition | Semantics |
|---|---|---|---|
| `kmer_count` | int | always | k-mers with at least one match |
| `kmer_missing` | int | `--count-missing` | k-mers absent from every layer |
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |

`kmer_count` counts matched k-mer positions (incremented once per `Some(row)` hit regardless of how many genomes are covered). `kmer_missing` counts `None` hits.

**Note on doc/impl divergence:** the doc previously used keys `kmer_total`, `kmer_found`, and `kmer_match` (list). The implementation uses `kmer_count` (int, matched only) and `kmer_strict_matches` (object keyed by genome label). `--mismatch` and `--detail` are parsed but not yet implemented and emit a warning.

---

## CLI

```
obikmer query -i <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
              [-T <threads>] <query.fa> [<query2.fa> ...]
```

`--mismatch` and `--detail` are accepted but currently ignored with a warning on stderr.

---

## Future work

- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
- **`--detail`**: per-position coverage vectors (`cov_<i>`) per genome.
- **Read classification** (`--classify`): assign each read to the genome with the highest match score.
- **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			`# Query system`

			`## Goal`

			`Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.`

			`---`

			`## Input`

			`- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).`
			`- Sequences shorter than k bases are silently skipped.`
			`- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).`

			`---`

			`## Algorithm`

			`The query follows the same superkmer-based partitioning strategy used at indexing time.`

			```
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`for each batch of sequences:`
			`build QueryBatch: decompose all sequences into superkmers, deduplicate`
			`split superkmers by partition via minimiser hash`
			`for each partition p:`
			`query_partition(p, superkmers_routed_to_p)`
			`→ load QueryLayer(s) for p`
			`→ for each kmer in each superkmer: MphfLayer::find(kmer)`
			`broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer`
			`emit annotated sequences`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			```

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.

			Parallelism is not yet active in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			## Layer lookup: `MphfLayer::find`

			`MphfLayer::open` reads `layer_meta.json` and loads either exact or approximate evidence. The caller (`QueryLayer::find`) never chooses the dispatch path — it is fixed at open time by `LayerEvidence`:

			```rust
			`pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {`
			`match &self.ev {`
			`LayerEvidence::Exact { .. } => self.find_exact(kmer),`
			`LayerEvidence::Approx { .. } => self.find_approx(kmer),`
			`}`
			`}`
			```

			`### Exact layers`

			`find_exact` maps the k-mer through the MPHF to a slot, then calls `UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer)` to confirm the stored k-mer matches. Zero false positives. Requires `UnitigFileReader::open()` (random-access via `.idx`); `open_sequential()` cannot serve random-access verification.

			`### Approximate layers`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`find_approx` maps the k-mer through the MPHF, then checks a stored `b`-bit fingerprint of the canonical hash. False-positive rate: 1/2^b per k-mer query. No `.idx` file is needed; the layer carries only `fingerprint.bin`.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is 1/2^(b·z). The `z` parameter is recorded in `layer_meta.json` at build time but is not enforced during querying — the caller is responsible for interpreting window-level results accordingly.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			### `QueryLayer` variant selection
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`:
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`\| Condition \| Variant \| Data returned per k-mer \|`
			`\|---\|---\|---\|`
			\| `with_counts=true` and `counts/` exists \| `Count` \| raw count per genome \|
			\| `presence/` exists \| `Presence` \| 0/1 per genome (bit matrix) \|
			\| only `counts/` exists \| `Count` \| counts used as-is \|
			\| neither exists \| `SetOnly` \| 1 for every genome \|
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			## `open()` vs `open_sequential()`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`UnitigFileReader::open()` loads the `.idx` block-offset table, enabling random access to individual unitig chunks. It is required whenever `verify_canonical_kmer` is called (exact layers at query time).

			`UnitigFileReader::open_sequential()` skips the `.idx` and supports only forward iteration. It is sufficient for:
			- build passes that scan all unitigs sequentially (`build_exact_evidence`, `build_approx_evidence`);
			- the `unitig` subcommand, which iterates and prints unitigs without random access.

			`KmerIndex::open()` (called by `query::run`) triggers `MphfLayer::open` for each layer, which calls `UnitigFileReader::open()` for exact layers. Approximate layers do not open a unitig reader at all.

			`---`

			`## Presence / count mode at query time`

			The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			```
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			```

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`## Output format`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			```
			`>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}}`
			`ATCGATCG...`
			```
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			Genome keys in `kmer_strict_matches` are genome labels from `index.meta`. Key order follows iteration order of `meta.genomes`.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`---`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`## Annotation schema (current implementation)`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`\| Key \| Type \| Condition \| Semantics \|`
			`\|---\|---\|---\|---\|`
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			\| `kmer_count` \| int \| always \| k-mers with at least one match \|
			\| `kmer_missing` \| int \| `--count-missing` \| k-mers absent from every layer \|
			\| `kmer_strict_matches` \| object \| always \| per-genome accumulated value (label → count or 0/1) \|

			`kmer_count` counts matched k-mer positions (incremented once per `Some(row)` hit regardless of how many genomes are covered). `kmer_missing` counts `None` hits.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			Note on doc/impl divergence: the doc previously used keys `kmer_total`, `kmer_found`, and `kmer_match` (list). The implementation uses `kmer_count` (int, matched only) and `kmer_strict_matches` (object keyed by genome label). `--mismatch` and `--detail` are parsed but not yet implemented and emit a warning.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

			`## CLI`

			```
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`obikmer query -i <index> [--detail] [--mismatch] [--count-missing]`
			`[--force-presence] [--presence-threshold <n>]`
			`[-T <threads>] <query.fa> [<query2.fa> ...]`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			```

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`--mismatch` and `--detail` are accepted but currently ignored with a warning on stderr.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

			`## Future work`

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			- `--mismatch`: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
			- `--detail`: per-position coverage vectors (`cov_<i>`) per genome.
			- Read classification (`--classify`): assign each read to the genome with the highest match score.
			- Parallelism: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
			`- Whitelist / blacklist filtering: threshold-based accept/reject on per-genome match scores.`