docs: clarify query pipeline, Findere trick, and input formats

Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.
This commit is contained in:
Eric Coissac
2026-05-30 15:54:13 +02:00
parent 708b0abf9b
commit 8a0b898b4b
4 changed files with 150 additions and 36 deletions
+28 -24
View File
@@ -8,7 +8,7 @@ Given a set of query sequences, determine for each sequence how many of its k-me
## Input
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). GenBank flat files are not supported at query time (only at index time).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
@@ -19,45 +19,50 @@ Given a set of query sequences, determine for each sequence how many of its k-me
The query follows the same superkmer-based partitioning strategy used at indexing time.
```
for each batch of sequences:
build QueryBatch: decompose all sequences into superkmers, deduplicate
for each chunk of sequences (parallel workers via obipipeline):
build QueryBatch: decompose all sequences into s-mers via superkmers, deduplicate
allocate seq_results[seq_idx][smer_pos] = None ← per-sequence s-mer result vectors
split superkmers by partition via minimiser hash
for each partition p:
query_partition(p, superkmers_routed_to_p)
→ load QueryLayer(s) for p
→ for each kmer in each superkmer: MphfLayer::find(kmer)
apply_findere(sk_kmer_results, effective_z) ← per superkmer
broadcast confirmed results back to each (seq_idx, kmer_offset)
→ for each s-mer in each superkmer: MphfLayer::find(smer)
fill seq_results[seq_idx][kmer_offset + j] from partition results
for each sequence:
apply_findere(seq_results[seq_idx], effective_z) ← per full sequence
accumulate confirmed k-mer results into acc and cov
emit annotated sequences
```
Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.
Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration.
**Findere requires full-sequence aggregation.** `apply_findere` is applied once per sequence on the complete s-mer result vector, after all partitions have contributed. Applying it per superkmer would produce false negatives at superkmer boundaries, where the z-window spans two superkmers.
Batches are processed in parallel via `obipipeline` workers; the `--threads` flag controls the number of worker threads.
---
## Findere z-window filter
For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring **z consecutive k-mers** to all test positive before any of them is counted as a confirmed match.
`apply_findere` implements this as a sliding-window confirmation, independently for each genome:
For approximate index modes, the index physically stores s-mers of size `s = k_user z + 1`. At query time, `set_k(s)` is in effect, so queries naturally produce s-mer results. `apply_findere` then aggregates z consecutive s-mer results into one k_user-mer answer:
```rust
fn apply_findere(
results: &[Option<Box<[u32]>>],
results: &[Option<Box<[u32]>>], // N s-mer results
z: usize,
n_genomes: usize,
) -> Vec<Option<Box<[u32]>>>
) -> Vec<Option<Box<[u32]>>> // N z + 1 k_user-mer results
```
For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions `[j, j+z)` that contains i and where all z positions are present for g (i.e. `results[pos]` is `Some(row)` and `row[g] > 0`). The implementation uses a single O(n) sliding-window scan per genome.
Input length N (s-mers), output length N z + 1 (k_user-mers).
Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as `None`.
For each genome g independently, a sliding window of size z scans the input. Output position i is confirmed for genome g iff all z values `results[i..i+z][g]` are nonzero (`None` counts as zero for all genomes). The scan is O(n) per genome.
**Short superkmers**: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, `apply_findere` returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends.
Output values come from `results[i]` (leftmost s-mer of each window); genomes not confirmed are zeroed. If all genomes are zero, the position is returned as `None`.
**Exact indexes**: `z` is effectively 1 (every k-mer is its own window). `apply_findere` is a no-op.
**Short sequences**: when the s-mer count is less than z, no complete window can form — `apply_findere` returns an empty vector. K-mers from sequences shorter than k_user are not emitted.
**Exact indexes**: `z = 1`, `apply_findere` is a passthrough (output length = input length).
### Effective z at query time
@@ -105,20 +110,20 @@ genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
## Coverage vectors (`--detail`)
When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated before the partition loop, with dimensions derived from `batch.n_kmers`:
When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated after all partitions are queried, with dimensions derived from `n_kmers_out = n_smers z + 1` (k_user-mer positions, not s-mer positions):
```
cov[seq_idx][g][abs_pos] += contribution
where abs_pos = desc.kmer_offset + local_pos (absolute kmer position in the sequence)
cov[seq_idx][g][pos] += contribution
where pos is the k_user-mer index in the filtered (post-Findere) vector
```
Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key `"coverage"`.
Coverage reflects confirmed k_user-mers only. The vectors are emitted in the JSON annotation under the key `"coverage"`.
---
## `kmer_missing` semantics
`kmer_missing` counts k-mers that returned `None` from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing.
`kmer_missing` counts k_user-mer positions where the first s-mer (`seq_results[seq_idx][pos]`) is `None` — i.e. absent from the index entirely. K-mers where the z-window fails because a later s-mer is absent or zero are not counted as missing (the first s-mer being present is used as proxy for index membership).
---
@@ -151,7 +156,7 @@ Genome keys follow the iteration order of `meta.genomes`.
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |
| `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) |
`kmer_count + kmer_missing` ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter.
`kmer_count + kmer_missing` ≤ total k_user-mers in the sequence. The gap corresponds to k_user-mers whose z-window was not fully confirmed (at least one s-mer absent or zero for all genomes) but whose first s-mer was present in the index.
---
@@ -171,7 +176,7 @@ obikmer query <index> [--detail] [--mismatch] [--count-missing]
| `--count-missing` | off | Add `kmer_missing` field to JSON |
| `--force-presence` | off | Report 0/1 per genome regardless of index counts |
| `--presence-threshold` | 1 | Minimum count to declare genome present |
| `-T` / `--threads` | all CPUs | Worker threads (parallelism not yet active) |
| `-T` / `--threads` | all CPUs | Worker threads |
`--mismatch` is accepted but currently ignored with a warning on stderr.
@@ -181,5 +186,4 @@ obikmer query <index> [--detail] [--mismatch] [--count-missing]
- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
- **Read classification** (`--classify`): assign each read to the genome with the highest match score.
- **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.