docmd/architecture/query.md

# Query system

## Goal

Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.

---

## Input

- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).

---

## Algorithm

The query follows the same superkmer-based partitioning strategy used at indexing time.

```
for each query sequence:
    decompose into superkmers  (non-ACGT breaks, same minimiser scheme as indexing)
    for each superkmer:
        route to partition p via minimiser hash
        for each kmer in the superkmer:
            lookup kmer in partition p  (MPHF → evidence check → matrix row)
            accumulate result into per-sequence accumulators
    emit annotated sequence
```

Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.

---

## Exact vs. approximate matching

### Exact (default)

Standard MPHF lookup followed by evidence check. O(1) per k-mer.

### 1-mismatch (`--mismatch` flag)

For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.

- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
- Exact and approximate matches are tracked separately in the output (see annotation schema below).
- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.

---

## Output format

Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.

```
>read_id {"kmer_total":150,"kmer_found":59,...}
ATCGATCG...
```

Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.

---

## Annotation schema

### Summary mode (default)

| Key | Type | Condition | Semantics |
|---|---|---|---|
| `kmer_total` | int | always | total k-mers in the (masked) sequence |
| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |

`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.

`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.

### Detail mode (`--detail`)

All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) − k + 1`:

| Key | Type | Condition | Semantics |
|---|---|---|---|
| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |

Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.

---

## CLI

```
obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
```

`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).

---

## Future work

- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.