docmd/architecture/query.md

# Query system

## Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.

---

## Input

- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).

---

## Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

```
for each batch of sequences:
    build QueryBatch: decompose all sequences into superkmers, deduplicate
    split superkmers by partition via minimiser hash
    for each partition p:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each kmer in each superkmer: MphfLayer::find(kmer)
        apply_findere(sk_kmer_results, effective_z)   ← per superkmer
        broadcast confirmed results back to each (seq_idx, kmer_offset)
    emit annotated sequences
```

Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.

Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration.

---

## Findere z-window filter

For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring **z consecutive k-mers** to all test positive before any of them is counted as a confirmed match.

`apply_findere` implements this as a sliding-window confirmation, independently for each genome:

```rust
fn apply_findere(
    results: &[Option<Box<[u32]>>],
    z: usize,
    n_genomes: usize,
) -> Vec<Option<Box<[u32]>>>
```

For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions `[j, j+z)` that contains i and where all z positions are present for g (i.e. `results[pos]` is `Some(row)` and `row[g] > 0`). The implementation uses a single O(n) sliding-window scan per genome.

Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as `None`.

**Short superkmers**: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, `apply_findere` returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends.

**Exact indexes**: `z` is effectively 1 (every k-mer is its own window). `apply_findere` is a no-op.

### Effective z at query time

`effective_z` is resolved at the start of `run()`:

```rust
let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence {
    IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
    IndexMode::Exact => 1,
});
```

The `-z` CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.

---

## Layer lookup: `MphfLayer::find`

`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read. The caller (`QueryLayer`) never chooses the dispatch path: it is fixed at open time by `LayerEvidence`. See [obilayeredmap](../implementation/obilayeredmap.md) for the full `find` / `find_strict` API.

### `QueryLayer` variant selection

`QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`:

| Condition | Variant | Data returned per k-mer |
|---|---|---|
| `with_counts=true` and `counts/` exists | `Count` | raw count per genome |
| `presence/` exists | `Presence` | 0/1 per genome (bit matrix) |
| only `counts/` exists | `Count` | counts used as-is |
| neither exists | `SetOnly` | 1 for every genome |

---

## Presence / count mode at query time

The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:

```
genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
```

`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match.

---

## Coverage vectors (`--detail`)

When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated before the partition loop, with dimensions derived from `batch.n_kmers`:

```
cov[seq_idx][g][abs_pos] += contribution
where abs_pos = desc.kmer_offset + local_pos  (absolute kmer position in the sequence)
```

Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key `"coverage"`.

---

## `kmer_missing` semantics

`kmer_missing` counts k-mers that returned `None` from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing.

---

## Output format

Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.

```
>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}
ATCGATCG...
```

With `--detail`:

```
>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}
ATCGATCG...
```

Genome keys follow the iteration order of `meta.genomes`.

---

## Annotation schema

| Key | Type | Condition | Semantics |
|---|---|---|---|
| `kmer_count` | int | always | k-mers confirmed (post-Findere) with at least one genome match |
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index entirely (pre-Findere None) |
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |
| `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) |

`kmer_count + kmer_missing` ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter.

---

## CLI

```
obikmer query <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
              [-z <z>] [-T <threads>]
              <query.fa> [<query2.fa> ...]
```

| Option | Default | Semantics |
|---|---|---|
| `-z` / `--findere-z` | from index metadata | Override Findere z parameter |
| `--detail` | off | Emit per-position coverage vectors in JSON |
| `--count-missing` | off | Add `kmer_missing` field to JSON |
| `--force-presence` | off | Report 0/1 per genome regardless of index counts |
| `--presence-threshold` | 1 | Minimum count to declare genome present |
| `-T` / `--threads` | all CPUs | Worker threads (parallelism not yet active) |

`--mismatch` is accepted but currently ignored with a warning on stderr.

---

## Future work

- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
- **Read classification** (`--classify`): assign each read to the genome with the highest match score.
- **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			`# Query system`

			`## Goal`

			`Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.`

			`---`

			`## Input`

			`- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).`
			`- Sequences shorter than k bases are silently skipped.`
			`- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).`

			`---`

			`## Algorithm`

			`The query follows the same superkmer-based partitioning strategy used at indexing time.`

			```
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`for each batch of sequences:`
			`build QueryBatch: decompose all sequences into superkmers, deduplicate`
			`split superkmers by partition via minimiser hash`
			`for each partition p:`
			`query_partition(p, superkmers_routed_to_p)`
			`→ load QueryLayer(s) for p`
			`→ for each kmer in each superkmer: MphfLayer::find(kmer)`
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`apply_findere(sk_kmer_results, effective_z) ← per superkmer`
			`broadcast confirmed results back to each (seq_idx, kmer_offset)`
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`emit annotated sequences`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			```

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.

			Parallelism is not yet active in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`## Findere z-window filter`

			`For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring z consecutive k-mers to all test positive before any of them is counted as a confirmed match.`
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`apply_findere` implements this as a sliding-window confirmation, independently for each genome:
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
			```rust
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`fn apply_findere(`
			`results: &[Option<Box<[u32]>>],`
			`z: usize,`
			`n_genomes: usize,`
			`) -> Vec<Option<Box<[u32]>>>`
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			```

feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions `[j, j+z)` that contains i and where all z positions are present for g (i.e. `results[pos]` is `Some(row)` and `row[g] > 0`). The implementation uses a single O(n) sliding-window scan per genome.

			Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as `None`.
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			Short superkmers: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, `apply_findere` returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends.
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			Exact indexes: `z` is effectively 1 (every k-mer is its own window). `apply_findere` is a no-op.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`### Effective z at query time`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`effective_z` is resolved at the start of `run()`:

			```rust
			`let effective_z = args.findere_z.unwrap_or_else(\|\| match idx.meta().config.evidence {`
			`IndexMode::Approx { z, .. } \| IndexMode::Hybrid { z, .. } => z as usize,`
			`IndexMode::Exact => 1,`
			`});`
			```

			The `-z` CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.

			`---`

			## Layer lookup: `MphfLayer::find`

			`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read. The caller (`QueryLayer`) never chooses the dispatch path: it is fixed at open time by `LayerEvidence`. See [obilayeredmap](../implementation/obilayeredmap.md) for the full `find` / `find_strict` API.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			### `QueryLayer` variant selection
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`QueryLayer::open` in `query_layer.rs` selects the data matrix to pair with `MphfLayer`:
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`\| Condition \| Variant \| Data returned per k-mer \|`
			`\|---\|---\|---\|`
			\| `with_counts=true` and `counts/` exists \| `Count` \| raw count per genome \|
			\| `presence/` exists \| `Presence` \| 0/1 per genome (bit matrix) \|
			\| only `counts/` exists \| `Count` \| counts used as-is \|
			\| neither exists \| `SetOnly` \| 1 for every genome \|
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`## Presence / count mode at query time`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			```
			`genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }`
			```
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`presence` is true when `--force-presence` is set or when the index has no counts (`!with_counts`). The default `presence_threshold` is 1, so any nonzero count counts as a match.
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
			`---`

feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			## Coverage vectors (`--detail`)
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated before the partition loop, with dimensions derived from `batch.n_kmers`:
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			```
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`cov[seq_idx][g][abs_pos] += contribution`
			`where abs_pos = desc.kmer_offset + local_pos (absolute kmer position in the sequence)`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			```

feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key `"coverage"`.

			`---`

			## `kmer_missing` semantics

			`kmer_missing` counts k-mers that returned `None` from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`## Output format`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`Output sequences are written in OBITools4 format: the original sequence with a JSON annotation map in the title line.`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			```
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}`
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`ATCGATCG...`
			```
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			With `--detail`:

			```
			`>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}`
			`ATCGATCG...`
			```

			Genome keys follow the iteration order of `meta.genomes`.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`---`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`## Annotation schema`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`\| Key \| Type \| Condition \| Semantics \|`
			`\|---\|---\|---\|---\|`
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			\| `kmer_count` \| int \| always \| k-mers confirmed (post-Findere) with at least one genome match \|
			\| `kmer_missing` \| int \| `--count-missing` \| k-mers absent from the index entirely (pre-Findere None) \|
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			\| `kmer_strict_matches` \| object \| always \| per-genome accumulated value (label → count or 0/1) \|
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			\| `coverage` \| object \| `--detail` \| per-genome array of per-position contributions (label → [u32]) \|
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`kmer_count + kmer_missing` ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

			`## CLI`

			```
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`obikmer query <index> [--detail] [--mismatch] [--count-missing]`
refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`[--force-presence] [--presence-threshold <n>]`
feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`[-z <z>] [-T <threads>]`
			`<query.fa> [<query2.fa> ...]`
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00			```

feat: add Findere z-window filtering and detail mode coverage tracking 2026-05-26 15:25:57 +02:00			`\| Option \| Default \| Semantics \|`
			`\|---\|---\|---\|`
			\| `-z` / `--findere-z` \| from index metadata \| Override Findere z parameter \|
			\| `--detail` \| off \| Emit per-position coverage vectors in JSON \|
			\| `--count-missing` \| off \| Add `kmer_missing` field to JSON \|
			\| `--force-presence` \| off \| Report 0/1 per genome regardless of index counts \|
			\| `--presence-threshold` \| 1 \| Minimum count to declare genome present \|
			\| `-T` / `--threads` \| all CPUs \| Worker threads (parallelism not yet active) \|

			`--mismatch` is accepted but currently ignored with a warning on stderr.
feat: Implement query subcommand for sequence-to-genome mapping 2026-05-21 13:23:05 +02:00
			`---`

			`## Future work`

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			- `--mismatch`: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
			- Read classification (`--classify`): assign each read to the genome with the highest match score.
			- Parallelism: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
			`- Whitelist / blacklist filtering: threshold-based accept/reject on per-genome match scores.`