refactor: centralize k-mer filtering logic and add validation
Refactor shared `FilterArgs` and `build_group_filter` to return a `Result` with explicit validation for fraction bounds, min/max ordering, and count constraints. Update conditional defaults for `--min-frac` and `--max-outgroup-count` to depend on explicit quorum flags, preventing silent configuration conflicts. Update documentation and MkDocs navigation to reflect the new centralized k-mer filtering system across `rebuild`, `dump`, and `unitig` commands.
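The default-resolution and validation rules summarized above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `IngroupQuorum`, `ResolvedQuorum`, `resolve_ingroup`, and `default_max_outgroup_count` are hypothetical names, the clap wiring is omitted, and errors are plain strings.

```rust
/// Hypothetical stand-in for the ingroup quorum flags.
/// `None` means the flag was not given on the command line.
#[derive(Debug, Default, Clone, Copy)]
struct IngroupQuorum {
    min_count: Option<usize>,
    max_count: Option<usize>,
    min_frac: Option<f64>,
    max_frac: Option<f64>,
}

/// Effective bounds once defaults have been resolved.
#[derive(Debug, PartialEq)]
struct ResolvedQuorum {
    min_count: usize,
    max_count: usize,
    min_frac: f64,
    max_frac: f64,
}

/// Outgroup counterpart of the smart default: 0 only when the group is
/// declared and no outgroup quorum flag is set, otherwise the no-op value.
fn default_max_outgroup_count(declared: bool, any_explicit: bool, outgroup_size: usize) -> usize {
    if declared && !any_explicit { 0 } else { outgroup_size }
}

/// Resolve defaults, then validate the effective values.
fn resolve_ingroup(
    args: IngroupQuorum,
    declared: bool,
    group_size: usize,
) -> Result<ResolvedQuorum, String> {
    // Any explicit quorum flag disables the smart default for the group.
    let any_explicit = args.min_count.is_some()
        || args.max_count.is_some()
        || args.min_frac.is_some()
        || args.max_frac.is_some();
    let default_min_frac = if declared && !any_explicit { 1.0 } else { 0.0 };

    let resolved = ResolvedQuorum {
        min_count: args.min_count.unwrap_or(0),
        max_count: args.max_count.unwrap_or(group_size),
        min_frac: args.min_frac.unwrap_or(default_min_frac),
        max_frac: args.max_frac.unwrap_or(1.0),
    };

    // Fractions must lie in [0.0, 1.0]; NaN fails this check as well.
    for (name, value) in [("--min-frac", resolved.min_frac), ("--max-frac", resolved.max_frac)] {
        if !(0.0..=1.0).contains(&value) {
            return Err(format!("invalid value: {} = {} is outside [0.0, 1.0]", name, value));
        }
    }
    // Min/max ordering is checked on the effective values.
    if resolved.min_count > resolved.max_count {
        return Err("incoherent bounds: --min-count > --max-count".to_string());
    }
    if resolved.min_frac > resolved.max_frac {
        return Err("incoherent bounds: --min-frac > --max-frac".to_string());
    }
    Ok(resolved)
}
```

In this sketch, `resolve_ingroup(IngroupQuorum::default(), true, 4)` yields an effective `min_frac` of 1.0, while passing only `max_frac: Some(0.5)` yields a `min_frac` of 0.0, so the explicit upper bound does not collide with an implicit lower one.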
@@ -1,9 +1,10 @@
 # Kmer filtering and ingroup/outgroup predicates
 
-The `rebuild`, `dump`, and `unitig` commands all share the same filtering
-system. Filters can select k-mers based on per-genome quorum counts, optionally
-restricted to **ingroup** and **outgroup** genome sets derived from genome
-metadata.
+The `rebuild`, `dump`, and `unitig` commands share the same filtering system,
+implemented as a shared `FilterArgs` clap argument group embedded in each command
+via `#[command(flatten)]`. Filters select k-mers based on per-genome quorum
+counts, optionally restricted to **ingroup** and **outgroup** genome sets derived
+from genome metadata. All rules described here apply identically to all three commands.
 
 `rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
 that operate on the sum of counts across all genomes.
@@ -96,16 +97,44 @@ For each genome:
 | `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
 | `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
 
-**Conditional defaults** — the default for `--min-frac` and `--max-outgroup-count` depends on whether the corresponding group was explicitly declared:
+**Conditional defaults** — the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions:
+whether the corresponding group was declared, **and** whether any quorum flag for that group was explicitly set.
 
-| Situation | `--min-frac` default | `--max-outgroup-count` default |
-|-----------|----------------------|-------------------------------|
-| Neither `--ingroup` nor `--outgroup` | 0.0 (no-op) | no constraint (no-op) |
-| `--ingroup` only | **1.0** — all ingroup genomes must carry the k-mer | no constraint |
-| `--outgroup` only | 0.0 | **0** — no outgroup genome may carry the k-mer |
-| Both declared | **1.0** | **0** |
+> **Rule**: declaring a group activates the smart default **only if no quorum flag for that group is explicitly set**.
+> As soon as any quorum flag for a group is present on the command line, all defaults for that group revert to no-op values.
 
-Explicit flags always override these defaults. All other bounds (`--min-count`, `--max-count`, `--max-frac`, `--min-outgroup-*`) default to 0 / group size / 1.0 regardless of whether groups are declared.
+| `--ingroup` | Any ingroup quorum flag? | `--min-frac` default |
+|-------------|--------------------------|----------------------|
+| not set | — | 0.0 (no-op) |
+| set | no | **1.0** — all ingroup genomes must carry the k-mer |
+| set | yes | 0.0 — user controls quorum explicitly |
+
+| `--outgroup` | Any outgroup quorum flag? | `--max-outgroup-count` default |
+|--------------|---------------------------|-------------------------------|
+| not set | — | outgroup size (no-op) |
+| set | no | **0** — no outgroup genome may carry the k-mer |
+| set | yes | outgroup size — user controls quorum explicitly |
+
+"Any ingroup quorum flag" means any of: `--min-count`, `--max-count`, `--min-frac`, `--max-frac`.
+"Any outgroup quorum flag" means any of: `--min-outgroup-count`, `--max-outgroup-count`, `--min-outgroup-frac`, `--max-outgroup-frac`.
+
+**Why this rule?** Setting any quorum flag signals explicit intent — the defaults are there to help when the user omits quorum entirely, not to interfere with deliberate constraints. Mixing implicit and explicit quorum on the same group would risk silent incoherence (e.g. `--max-count 0` with an implicit `--min-frac 1.0`).
+
+All other bounds default to 0 / group size / 0.0 / 1.0 regardless of whether groups are declared.
+
+### Validation
+
+After resolving defaults, the following are checked and cause an immediate error:
+
+| Condition | Error |
+|-----------|-------|
+| `--min-count > --max-count` | incoherent bounds |
+| `--min-frac > --max-frac` | incoherent bounds |
+| `--min-outgroup-count > --max-outgroup-count` | incoherent bounds |
+| `--min-outgroup-frac > --max-outgroup-frac` | incoherent bounds |
+| any fraction outside `[0.0, 1.0]` | invalid value |
+
+The check applies to the **effective** values (after defaults are resolved), so an explicit `--max-frac 0.5` with an implicit `--min-frac 1.0` would have been caught — but the rule above prevents that situation from arising in the first place.
 
 Fractions are computed over the size of the classified group, not over total
 genome count. An empty group (no genome classified as ingroup/outgroup) never
@@ -156,8 +185,7 @@ obikmer rebuild src --output dst \
 --max-outgroup-count 0
 ```
 
-The same flags work identically for `dump` and `unitig`. To dump only k-mers
-specific to *Betula nana*:
+To dump only k-mers specific to *Betula nana*:
 
 ```sh
 obikmer dump myindex \