docs: document conservative rounding strategy for filtering thresholds

Specifies that minimum bounds use ceiling and maximum bounds use floor to enforce strictness. Clarifies that the implementation avoids explicit rounding by directly comparing integer counts against floating-point fractions, which is mathematically equivalent.
Eric Coissac
2026-06-09 10:24:25 +02:00
parent ce45e2fbe1
commit 7dd8db1409
@@ -140,6 +140,29 @@
Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.
### Conservative rounding of fraction thresholds
When a fraction threshold `F` is applied to a group of size `N`, the effective
integer threshold is determined by the direction of the bound:
| Bound | Effective count | Rounding | Rationale |
|-------|----------------|----------|-----------|
| `--min-frac F` | k-mer in ≥ ⌈F·N⌉ genomes | **ceil** | stricter: when F·N is not an integer, a k-mer present in only ⌊F·N⌋ genomes falls short of the fraction |
| `--max-frac F` | k-mer in ≤ ⌊F·N⌋ genomes | **floor** | stricter: when F·N is not an integer, a k-mer present in ⌈F·N⌉ genomes already exceeds the fraction |
The same rule applies symmetrically to `--min-outgroup-frac` (ceil) and
`--max-outgroup-frac` (floor). The outgroup direction is not inverted: the
conservative rounding depends only on whether the bound is a minimum or a
maximum, not on which group it applies to.
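
The rounding rule in the table above can be sketched in Python as follows. This is only an illustration: the function names `min_count` and `max_count` are hypothetical and not part of the tool, and fractions are passed as exact `Fraction` values so that the sketch is not affected by floating-point error.

```python
import math
from fractions import Fraction


def min_count(frac: Fraction, group_size: int) -> int:
    """Smallest genome count that satisfies a minimum-fraction bound (ceil)."""
    return math.ceil(frac * group_size)


def max_count(frac: Fraction, group_size: int) -> int:
    """Largest genome count that satisfies a maximum-fraction bound (floor)."""
    return math.floor(frac * group_size)


half = Fraction(1, 2)
print(min_count(half, 3))  # 2: at least 2 of 3 genomes
print(max_count(half, 3))  # 1: at most 1 of 3 genomes
```

When `F·N` is an integer, for instance `F = 0.5` and `N = 4`, both functions return the same count and no rounding takes place.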
**Example:** `--min-frac 0.5` with an ingroup of 3 genomes:
`⌈0.5 × 3⌉ = ⌈1.5⌉ = 2` → at least 2 of 3 ingroup genomes must carry the k-mer.
**Implementation note:** the filter evaluates `n / denom < min_frac` directly
(integer `n`, floating-point comparison) rather than pre-computing `⌈F·N⌉`.
For an integer count `n` and a non-empty group this is mathematically
equivalent: `n / N < F` ⟺ `n < F·N` ⟺ `n ≤ ⌈F·N⌉ − 1` ⟺ `n < ⌈F·N⌉`.
A k-mer therefore misses the minimum bound exactly when its count is below
`⌈F·N⌉`, and no explicit rounding is needed.
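
The equivalence stated in the implementation note can be checked by brute force over small groups. The sketch below is a hedged illustration, not the tool's code: the helper names are hypothetical, and it uses exact `Fraction` arithmetic instead of the floating-point comparison mentioned above, so that it tests the mathematical claim itself.

```python
import math
from fractions import Fraction


def fails_min_direct(n: int, group_size: int, frac: Fraction) -> bool:
    """Direct comparison: the observed fraction is below the minimum."""
    return Fraction(n, group_size) < frac


def fails_min_rounded(n: int, group_size: int, frac: Fraction) -> bool:
    """Pre-rounded comparison: the count is below ceil(F * N)."""
    return n < math.ceil(frac * group_size)


def count_disagreements(max_group_size: int = 30, steps: int = 20) -> int:
    """Count (n, N, F) triples on which the two checks disagree."""
    disagreements = 0
    for group_size in range(1, max_group_size + 1):
        for k in range(steps + 1):
            frac = Fraction(k, steps)
            for n in range(group_size + 1):
                direct = fails_min_direct(n, group_size, frac)
                rounded = fails_min_rounded(n, group_size, frac)
                if direct != rounded:
                    disagreements += 1
    return disagreements


print(count_disagreements())  # 0: the two checks always agree
```

Empty groups are left out of the loop (group sizes start at 1), in line with the rule that an empty group never triggers a filter failure.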
## Examples
Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genomes