From 7dd8db1409f32a587c5eac7d09148a38b6aa459e Mon Sep 17 00:00:00 2001 From: Eric Coissac Date: Tue, 9 Jun 2026 10:24:25 +0200 Subject: [PATCH] docs: document conservative rounding strategy for filtering thresholds Specifies that minimum bounds use ceiling and maximum bounds use floor to enforce strictness. Clarifies that the implementation avoids explicit rounding by directly comparing integer counts against floating-point fractions, which is mathematically equivalent. --- docmd/implementation/filtering.md | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/docmd/implementation/filtering.md b/docmd/implementation/filtering.md index 3ca4ca6..795d8fd 100644 --- a/docmd/implementation/filtering.md +++ b/docmd/implementation/filtering.md @@ -140,6 +140,29 @@ Fractions are computed over the size of the classified group, not over total genome count. An empty group (no genome classified as ingroup/outgroup) never triggers a filter failure. +### Conservative rounding of fraction thresholds + +When a fraction threshold `F` is applied to a group of size `N`, the effective +integer threshold is determined by the direction of the bound: + +| Bound | Effective count | Rounding | Rationale | +|-------|----------------|----------|-----------| +| `--min-frac F` | k-mer in ≥ ⌈F·N⌉ genomes | **ceil** | stricter — a kmer present in exactly ⌊F·N⌋ genomes does not meet the fraction | +| `--max-frac F` | k-mer in ≤ ⌊F·N⌋ genomes | **floor** | stricter — a kmer present in ⌈F·N⌉ genomes already exceeds the fraction | + +The same rule applies symmetrically to `--min-outgroup-frac` (ceil) and +`--max-outgroup-frac` (floor). The outgroup direction is not inverted: the +conservative rounding depends only on whether the bound is a minimum or a +maximum, not on which group it applies to. + +**Example** — `--min-frac 0.5` with an ingroup of 3 genomes: +`⌈0.5 × 3⌉ = ⌈1.5⌉ = 2` → at least 2 of 3 ingroup genomes must carry the k-mer. + +**Implementation note** — the filter evaluates `n / denom < min_frac` directly +(integer `n`, float comparison) rather than pre-computing `⌈F·N⌉`. This is +mathematically equivalent for integer counts: `n / N < F` ↔ `n < F·N` ↔ +`n ≤ ⌈F·N⌉ − 1` ↔ `n < ⌈F·N⌉`. No explicit rounding is needed. + ## Examples Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genomes