docs: clarify query pipeline, Findere trick, and input formats

Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.
Eric Coissac
2026-05-30 15:54:13 +02:00
parent 708b0abf9b
commit 8a0b898b4b
4 changed files with 150 additions and 36 deletions
+13 -10
@@ -35,15 +35,18 @@ stored at `s` belongs to the legitimate k-mer at that slot. The FP event is:
P(FP per k-mer) = 1 / 2^b
```
The Findere trick raises the effective window to z consecutive k-mers. A query
succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:
The Findere trick reduces the indexed k-mer size. When the user specifies k_user
and z, the index physically stores k-mers of size `s = k_user - z + 1`. At query
time, the same s-mer size is used. After collecting per-position s-mer results
over the full query sequence, a sliding window of size z aggregates z consecutive
s-mer hits into one confirmed k_user-mer hit, reducing the per-window FP rate:
```
P(FP per z-window) = 1 / 2^(b·z)
P(FP per k_user-mer) = 1 / 2^(b·z)
```
The effective indexed k-mer length is `k - z + 1`: a query for a (k+z-1)-mer
decomposes into z overlapping k-mers, all of which must match.
`IndexConfig::kmer_size` stores `s = k_user - z + 1`, not k_user. Both indexing
and querying use this stored size via `set_k(idx.kmer_size())`.
Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
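The z-window aggregation described in this hunk can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `indexed_kmer_size` and `confirm_windows` are hypothetical helper names, and the per-position s-mer results are assumed to be available as a boolean slice once the whole query sequence has been processed.

```rust
/// Size of the s-mers physically stored in the index: s = k_user - z + 1.
/// Returns None when z is 0 or larger than k_user (no valid s exists).
/// Hypothetical helper, named for illustration only.
fn indexed_kmer_size(k_user: usize, z: usize) -> Option<usize> {
    if z == 0 || z > k_user {
        return None;
    }
    Some(k_user - z + 1)
}

/// Collapse per-position s-mer hits into per-position k_user-mer hits.
///
/// `smer_hits[i]` is true when the s-mer starting at position i passed its
/// fingerprint check. Output element j is true only when all z s-mers
/// starting at positions j..j+z passed, i.e. the k_user-mer starting at j
/// is confirmed. A sequence with n s-mers yields n - z + 1 k_user-mers.
fn confirm_windows(smer_hits: &[bool], z: usize) -> Vec<bool> {
    // `windows(0)` would panic, and fewer than z hits cannot form a window.
    if z == 0 || smer_hits.len() < z {
        return Vec::new();
    }
    smer_hits
        .windows(z)
        .map(|w| w.iter().all(|&hit| hit))
        .collect()
}
```

With `k_user = 31` and `z = 5`, `indexed_kmer_size` returns `Some(27)`. A single failed s-mer check invalidates every window that covers it, which is why one isolated false positive cannot produce a confirmed k_user-mer hit when `z > 1`.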
@@ -167,12 +170,12 @@ any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
additionally accepts `-k` to display the effective indexed k-mer length:
```
k (query): 31
k (indexed): 31
z: 1
k (user): 31
k (indexed, s=k-z+1): 27
z: 5
evidence bits (b): 8
FP per k-mer: 3.906e-3 (1/2^8)
FP per z-window: 3.906e-3 (1/2^8)
FP per s-mer: 3.906e-3 (1/2^8)
FP per k-mer window: 9.095e-13 (1/2^(8·5))
```
Useful for choosing parameters before committing to an index build.
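The two false positive formulas can be evaluated directly when choosing b and z. The sketch below is a minimal illustration, assuming only the formulas `1 / 2^b` and `1 / 2^(b·z)` stated in the documentation; the function names `fp_per_smer`, `fp_per_window`, and `sci` are hypothetical and are not part of the tool.

```rust
/// FP probability of a single s-mer fingerprint check with b evidence bits:
/// 1 / 2^b. Hypothetical helper, named for illustration only.
fn fp_per_smer(b: u32) -> f64 {
    (-(b as f64)).exp2()
}

/// FP probability of a k_user-mer window that requires z consecutive
/// s-mer checks to pass: 1 / 2^(b * z).
fn fp_per_window(b: u32, z: u32) -> f64 {
    (-((b as f64) * (z as f64))).exp2()
}

/// Format a probability in scientific notation with three decimals.
fn sci(p: f64) -> String {
    format!("{:.3e}", p)
}
```

For `b = 8`, `fp_per_smer` evaluates 1/2^8; with `z = 5`, `fp_per_window` evaluates 1/2^40. With `z = 1` the two rates coincide, since a window then contains a single s-mer.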