diff --git a/docmd/implementation/filtering.md b/docmd/implementation/filtering.md
index 3ca4ca6..795d8fd 100644
--- a/docmd/implementation/filtering.md
+++ b/docmd/implementation/filtering.md
@@ -140,6 +140,29 @@
 Fractions are computed over the size of the classified group, not over total genome count.
 An empty group (no genome classified as ingroup/outgroup) never triggers a filter failure.
 
+### Conservative rounding of fraction thresholds
+
+When a fraction threshold `F` is applied to a group of size `N`, the effective
+integer threshold is determined by the direction of the bound:
+
+| Bound | Effective count | Rounding | Rationale |
+|-------|----------------|----------|-----------|
+| `--min-frac F` | k-mer in ≥ ⌈F·N⌉ genomes | **ceil** | stricter: when F·N is not an integer, a k-mer present in only ⌊F·N⌋ genomes falls short of the fraction |
+| `--max-frac F` | k-mer in ≤ ⌊F·N⌋ genomes | **floor** | stricter: when F·N is not an integer, a k-mer present in ⌈F·N⌉ genomes already exceeds the fraction |
+
+The same rule applies symmetrically to `--min-outgroup-frac` (ceil) and
+`--max-outgroup-frac` (floor). The outgroup direction is not inverted: the
+conservative rounding depends only on whether the bound is a minimum or a
+maximum, not on which group it applies to.
+
+**Example** — `--min-frac 0.5` with an ingroup of 3 genomes:
+`⌈0.5 × 3⌉ = ⌈1.5⌉ = 2` → at least 2 of 3 ingroup genomes must carry the k-mer.
+
+**Implementation note** — the filter evaluates `n / denom < min_frac` directly
+(integer `n`, float comparison) rather than pre-computing `⌈F·N⌉`. This is
+mathematically equivalent for integer counts: `n / N < F` ↔ `n < F·N` ↔
+`n ≤ ⌈F·N⌉ − 1` ↔ `n < ⌈F·N⌉`. No explicit rounding is needed.
+
 ## Examples
 
 Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genomes