feat: add Findere z-window filtering and detail mode coverage tracking

Introduces the `--findere-z` CLI flag to override the index's sliding-window parameter z and implements `apply_findere`, which filters k-mer hits by requiring a window of z consecutive positive positions. Adds structural support for `--detail` mode: per-sequence k-mer offsets, conditional allocation of per-position coverage vectors, and JSON serialization. Updates the architecture documentation, CLI reference, and annotation schema to match the new implementation; `--detail` is now documented as implemented, while `--mismatch` remains accepted but ignored.
2026-05-26 15:25:57 +02:00
parent 26ab165807
commit 694da5208e
2 changed files with 218 additions and 73 deletions
@@ -26,7 +26,8 @@ for each batch of sequences:
        query_partition(p, superkmers_routed_to_p)
            → load QueryLayer(s) for p
            → for each kmer in each superkmer: MphfLayer::find(kmer)
-        broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer
+        apply_findere(sk_kmer_results, effective_z)   ← per superkmer
+        broadcast confirmed results back to each (seq_idx, kmer_offset)
    emit annotated sequences
 ```

@@ -36,28 +37,46 @@ Parallelism is **not yet active** in the current implementation: batches are pro

 ---

-## Layer lookup: `MphfLayer::find`
+## Findere z-window filter

-`MphfLayer::open` reads `layer_meta.json` and loads either exact or approximate evidence. The caller (`QueryLayer::find`) never chooses the dispatch path — it is fixed at open time by `LayerEvidence`:
+For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring **z consecutive k-mers** to all test positive before any of them is counted as a confirmed match.
+
+`apply_findere` implements this as a sliding-window confirmation, independently for each genome:

 ```rust
-pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
-    match &self.ev {
-        LayerEvidence::Exact { .. }  => self.find_exact(kmer),
-        LayerEvidence::Approx { .. } => self.find_approx(kmer),
-    }
-}
+fn apply_findere(
+    results: &[Option<Box<[u32]>>],
+    z: usize,
+    n_genomes: usize,
+) -> Vec<Option<Box<[u32]>>>
 ```

-### Exact layers
+For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions `[j, j+z)` that contains i and where all z positions are present for g (i.e. `results[pos]` is `Some(row)` and `row[g] > 0`). The implementation uses a single O(n) sliding-window scan per genome.

-`find_exact` maps the k-mer through the MPHF to a slot, then calls `UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer)` to confirm the stored k-mer matches. Zero false positives. Requires `UnitigFileReader::open()` (random-access via `.idx`); `open_sequential()` cannot serve random-access verification.
+Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as `None`.

-### Approximate layers
+**Short superkmers**: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, `apply_findere` returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends.

-`find_approx` maps the k-mer through the MPHF, then checks a stored `b`-bit fingerprint of the canonical hash. False-positive rate: **1/2^b per k-mer query**. No `.idx` file is needed; the layer carries only `fingerprint.bin`.
+**Exact indexes**: `z` is effectively 1 (every k-mer is its own window). `apply_findere` is a no-op.
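The confirmation rule above can be sketched as follows. This is an illustrative reconstruction, not the code from this commit: the name `apply_findere_sketch` is hypothetical, and it assumes every `Some(row)` holds exactly `n_genomes` entries. It relies on the observation that a position lies in some fully positive z-window exactly when it belongs to a run of at least z consecutive positive positions.

```rust
/// Hypothetical sketch of the z-window confirmation rule (not the commit's code).
/// Assumes each `Some(row)` has exactly `n_genomes` entries.
fn apply_findere_sketch(
    results: &[Option<Box<[u32]>>],
    z: usize,
    n_genomes: usize,
) -> Vec<Option<Box<[u32]>>> {
    let n = results.len();
    // Exact indexes (z <= 1) and short superkmers (n < z): return results unchanged.
    if z <= 1 || n < z {
        return results.to_vec();
    }
    // confirmed[i][g] becomes true when position i is confirmed for genome g.
    let mut confirmed = vec![vec![false; n_genomes]; n];
    for g in 0..n_genomes {
        let mut run_start = 0;
        // Iterate one step past the end so the final run is closed too.
        for i in 0..=n {
            let present = i < n && results[i].as_ref().map_or(false, |row| row[g] > 0);
            if !present {
                // The run [run_start, i) just ended; confirm it if it holds a full window.
                if i - run_start >= z {
                    for c in &mut confirmed[run_start..i] {
                        c[g] = true;
                    }
                }
                run_start = i + 1;
            }
        }
    }
    results
        .iter()
        .zip(&confirmed)
        .map(|(hit, ok_row)| {
            let row = hit.as_ref()?;
            // Zero every genome that was not confirmed at this position.
            let filtered: Box<[u32]> = row
                .iter()
                .zip(ok_row)
                .map(|(&v, &ok)| if ok { v } else { 0 })
                .collect();
            // A row with no remaining genome is reported as None.
            if filtered.iter().any(|&v| v > 0) {
                Some(filtered)
            } else {
                None
            }
        })
        .collect()
}
```

Each genome is scanned once, and every position is marked at most once per genome, so the cost stays linear in the superkmer length per genome.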

-For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is **1/2^(b·z)**. The `z` parameter is recorded in `layer_meta.json` at build time but is not enforced during querying — the caller is responsible for interpreting window-level results accordingly.
+### Effective z at query time
+
+`effective_z` is resolved at the start of `run()`:
+
+```rust
+let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence {
+    IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
+    IndexMode::Exact => 1,
+});
+```
+
+The `-z` CLI option overrides the index metadata value. A higher z increases stringency: the FP rate drops, but some true positives may be discarded at sequence ends. A lower z increases sensitivity at the cost of a higher FP rate.
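To make the trade-off concrete, the per-window rate 1/2^(b·z) can be evaluated directly. The helper below is a hypothetical illustration (the name `window_fp_rate` does not come from the codebase), and the formula assumes that fingerprint collisions on consecutive k-mers are independent.

```rust
/// Hypothetical helper: false-positive rate of one z-window, 1 / 2^(b*z).
/// Assumes independent fingerprint collisions across the z k-mers.
fn window_fp_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

With illustrative values b = 8 and z = 3, a single k-mer query has rate 1/256 (about 0.39%), while a confirmed window has rate 1/2^24 (about 6.0e-8).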
+
+---
+
+## Layer lookup: `MphfLayer::find`
+
+`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read. The caller (`QueryLayer`) never chooses the dispatch path: it is fixed at open time by `LayerEvidence`. See [obilayeredmap](../implementation/obilayeredmap.md) for the full `find` / `find_strict` API.

 ### `QueryLayer` variant selection

@@ -72,18 +91,6 @@ For a query window of z consecutive k-mers (Findere scheme), the false-positive

 ---

-## `open()` vs `open_sequential()`
-
-`UnitigFileReader::open()` loads the `.idx` block-offset table, enabling random access to individual unitig chunks. It is required whenever `verify_canonical_kmer` is called (exact layers at query time).
-
-`UnitigFileReader::open_sequential()` skips the `.idx` and supports only forward iteration. It is sufficient for:
- build passes that scan all unitigs sequentially (`build_exact_evidence`, `build_approx_evidence`);
- the `unitig` subcommand, which iterates and prints unitigs without random access.
-
-`KmerIndex::open()` (called by `query::run`) triggers `MphfLayer::open` for each layer, which calls `UnitigFileReader::open()` for exact layers. Approximate layers do not open a unitig reader at all.
-
---
-
 ## Presence / count mode at query time

 The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:
@@ -96,49 +103,83 @@ genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }

 ---

+## Coverage vectors (`--detail`)
+
+When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated before the partition loop, with dimensions derived from `batch.n_kmers`:
+
+```
+cov[seq_idx][g][abs_pos] += contribution
+where abs_pos = desc.kmer_offset + local_pos  (absolute kmer position in the sequence)
+```
+
+Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key `"coverage"`.
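The accumulation step can be sketched as below. This is a minimal illustration under stated assumptions, not the commit's implementation: the function names are hypothetical, the accumulator is modelled as nested `Vec`s, and the contribution is taken to be the raw confirmed value (in presence mode it would presumably be 0/1 instead).

```rust
/// Hypothetical allocation: one [genome][kmer_pos] table per sequence,
/// sized from the per-sequence k-mer counts of the batch.
fn alloc_coverage(n_kmers: &[usize], n_genomes: usize) -> Vec<Vec<Vec<u32>>> {
    n_kmers
        .iter()
        .map(|&n| vec![vec![0u32; n]; n_genomes])
        .collect()
}

/// Hypothetical accumulation of one superkmer's confirmed (post-Findere) results.
/// `kmer_offset` is the absolute position of the superkmer's first k-mer in the sequence.
fn accumulate_coverage(
    cov: &mut [Vec<Vec<u32>>],
    seq_idx: usize,
    kmer_offset: usize,
    confirmed: &[Option<Box<[u32]>>],
) {
    for (local_pos, hit) in confirmed.iter().enumerate() {
        if let Some(row) = hit {
            let abs_pos = kmer_offset + local_pos;
            for (g, &v) in row.iter().enumerate() {
                // Assumption: the contribution is the confirmed per-genome value.
                cov[seq_idx][g][abs_pos] += v;
            }
        }
    }
}
```

Positions that were rejected by the filter or absent from the index contribute nothing, so they stay at 0 in the emitted vectors.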
+
+---
+
+## `kmer_missing` semantics
+
+`kmer_missing` counts k-mers that returned `None` from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing.
+
+---
+
 ## Output format

 Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.

 ```
->read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}}
+>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}
 ATCGATCG...
 ```

-Genome keys in `kmer_strict_matches` are genome labels from `index.meta`. Key order follows iteration order of `meta.genomes`.
+With `--detail`:
+
+```
+>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}
+ATCGATCG...
+```
+
+Genome keys follow the iteration order of `meta.genomes`.

 ---

-## Annotation schema (current implementation)
+## Annotation schema

 | Key | Type | Condition | Semantics |
 |---|---|---|---|
-| `kmer_count` | int | always | k-mers with at least one match |
-| `kmer_missing` | int | `--count-missing` | k-mers absent from every layer |
+| `kmer_count` | int | always | k-mers confirmed (post-Findere) with at least one genome match |
+| `kmer_missing` | int | `--count-missing` | k-mers absent from the index entirely (pre-Findere None) |
 | `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |
+| `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) |

-`kmer_count` counts matched k-mer positions (incremented once per `Some(row)` hit regardless of how many genomes are covered). `kmer_missing` counts `None` hits.
-
-**Note on doc/impl divergence:** the doc previously used keys `kmer_total`, `kmer_found`, and `kmer_match` (list). The implementation uses `kmer_count` (int, matched only) and `kmer_strict_matches` (object keyed by genome label). `--mismatch` and `--detail` are parsed but not yet implemented and emit a warning.
+`kmer_count + kmer_missing` ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter.
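The relation between the three quantities can be illustrated with a small tally. The sketch below is hypothetical (neither `KmerTally` nor `tally_kmers` exists in the codebase) and assumes the raw and filtered result slices are position-aligned, with the filter never turning a `None` into a `Some`.

```rust
/// Hypothetical tally of the per-sequence counters discussed above.
#[derive(Debug, PartialEq)]
struct KmerTally {
    count: usize,    // confirmed k-mers (post-Findere Some)
    missing: usize,  // absent from the index (pre-Findere None)
    rejected: usize, // found in the index but dropped by the z-window filter
}

fn tally_kmers(
    raw: &[Option<Box<[u32]>>],
    confirmed: &[Option<Box<[u32]>>],
) -> KmerTally {
    assert_eq!(raw.len(), confirmed.len(), "slices must be position-aligned");
    let missing = raw.iter().filter(|r| r.is_none()).count();
    let count = confirmed.iter().filter(|r| r.is_some()).count();
    // Whatever is neither confirmed nor missing was rejected by the filter.
    KmerTally { count, missing, rejected: raw.len() - count - missing }
}
```

Under these assumptions, `count + missing + rejected` always equals the number of k-mers in the sequence, which is why `kmer_count + kmer_missing` can fall short of the total.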

 ---

 ## CLI

 ```
-obikmer query -i <index> [--detail] [--mismatch] [--count-missing]
+obikmer query <index> [--detail] [--mismatch] [--count-missing]
              [--force-presence] [--presence-threshold <n>]
-              [-T <threads>] <query.fa> [<query2.fa> ...]
+              [-z <z>] [-T <threads>]
+              <query.fa> [<query2.fa> ...]
 ```

-`--mismatch` and `--detail` are accepted but currently ignored with a warning on stderr.
+| Option | Default | Semantics |
+|---|---|---|
+| `-z` / `--findere-z` | from index metadata | Override Findere z parameter |
+| `--detail` | off | Emit per-position coverage vectors in JSON |
+| `--count-missing` | off | Add `kmer_missing` field to JSON |
+| `--force-presence` | off | Report 0/1 per genome regardless of index counts |
+| `--presence-threshold` | 1 | Minimum count to declare genome present |
+| `-T` / `--threads` | all CPUs | Worker threads (parallelism not yet active) |
+
+`--mismatch` is accepted but currently ignored with a warning on stderr.

 ---

 ## Future work

 - **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
- **`--detail`**: per-position coverage vectors (`cov_<i>`) per genome.
 - **Read classification** (`--classify`): assign each read to the genome with the highest match score.
 - **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
 - **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.