refactor: centralize k-mer filtering logic and add validation

Refactor shared `FilterArgs` and `build_group_filter` to return a `Result` with explicit validation for fraction bounds, min/max ordering, and count constraints. Update conditional defaults for `--min-frac` and `--max-outgroup-count` to depend on explicit quorum flags, preventing silent configuration conflicts. Update documentation and MkDocs navigation to reflect the new centralized k-mer filtering system across `rebuild`, `dump`, and `unitig` commands.
This commit is contained in:
Eric Coissac
2026-06-09 09:57:38 +02:00
parent 2465cfbc4b
commit ce45e2fbe1
4 changed files with 98 additions and 34 deletions
@@ -1,9 +1,10 @@
# Kmer filtering and ingroup/outgroup predicates # Kmer filtering and ingroup/outgroup predicates
The `rebuild`, `dump`, and `unitig` commands all share the same filtering The `rebuild`, `dump`, and `unitig` commands share the same filtering system,
system. Filters can select k-mers based on per-genome quorum counts, optionally implemented as a shared `FilterArgs` clap argument group embedded in each command
restricted to **ingroup** and **outgroup** genome sets derived from genome via `#[command(flatten)]`. Filters select k-mers based on per-genome quorum
metadata. counts, optionally restricted to **ingroup** and **outgroup** genome sets derived
from genome metadata. All rules described here apply identically to all three commands.
`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters `rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
that operate on the sum of counts across all genomes. that operate on the sum of counts across all genomes.
@@ -96,16 +97,44 @@ For each genome:
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) | | `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) | | `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
**Conditional defaults** — the default for `--min-frac` and `--max-outgroup-count` depends on whether the corresponding group was explicitly declared: **Conditional defaults** — the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions:
whether the corresponding group was declared, **and** whether any quorum flag for that group was explicitly set.
| Situation | `--min-frac` default | `--max-outgroup-count` default | > **Rule**: declaring a group activates the smart default **only if no quorum flag for that group is explicitly set**.
|-----------|----------------------|-------------------------------| > As soon as any quorum flag for a group is present on the command line, all defaults for that group revert to no-op values.
| Neither `--ingroup` nor `--outgroup` | 0.0 (no-op) | no constraint (no-op) |
| `--ingroup` only | **1.0** — all ingroup genomes must carry the k-mer | no constraint |
| `--outgroup` only | 0.0 | **0** — no outgroup genome may carry the k-mer |
| Both declared | **1.0** | **0** |
Explicit flags always override these defaults. All other bounds (`--min-count`, `--max-count`, `--max-frac`, `--min-outgroup-*`) default to 0 / group size / 1.0 regardless of whether groups are declared. | `--ingroup` | Any ingroup quorum flag? | `--min-frac` default |
|-------------|--------------------------|----------------------|
| not set | — | 0.0 (no-op) |
| set | no | **1.0** — all ingroup genomes must carry the k-mer |
| set | yes | 0.0 — user controls quorum explicitly |
| `--outgroup` | Any outgroup quorum flag? | `--max-outgroup-count` default |
|--------------|---------------------------|-------------------------------|
| not set | — | outgroup size (no-op) |
| set | no | **0** — no outgroup genome may carry the k-mer |
| set | yes | outgroup size — user controls quorum explicitly |
"Any ingroup quorum flag" means any of: `--min-count`, `--max-count`, `--min-frac`, `--max-frac`.
"Any outgroup quorum flag" means any of: `--min-outgroup-count`, `--max-outgroup-count`, `--min-outgroup-frac`, `--max-outgroup-frac`.
**Why this rule?** Setting any quorum flag signals explicit intent — the defaults are there to help when the user omits quorum entirely, not to interfere with deliberate constraints. Mixing implicit and explicit quorum on the same group would risk silent incoherence (e.g. `--max-count 0` with an implicit `--min-frac 1.0`).
All other bounds default to 0 / group size / 0.0 / 1.0 regardless of whether groups are declared.
### Validation
After resolving defaults, the following are checked and cause an immediate error:
| Condition | Error |
|-----------|-------|
| `--min-count > --max-count` | incoherent bounds |
| `--min-frac > --max-frac` | incoherent bounds |
| `--min-outgroup-count > --max-outgroup-count` | incoherent bounds |
| `--min-outgroup-frac > --max-outgroup-frac` | incoherent bounds |
| any fraction outside `[0.0, 1.0]` | invalid value |
The check applies to the **effective** values (after defaults are resolved), so an explicit `--max-frac 0.5` with an implicit `--min-frac 1.0` would have been caught — but the rule above prevents that situation from arising in the first place.
Fractions are computed over the size of the classified group, not over total Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never genome count. An empty group (no genome classified as ingroup/outgroup) never
@@ -156,8 +185,7 @@ obikmer rebuild src --output dst \
--max-outgroup-count 0 --max-outgroup-count 0
``` ```
The same flags work identically for `dump` and `unitig`. To dump only k-mers To dump only k-mers specific to *Betula nana*:
specific to *Betula nana*:
```sh ```sh
obikmer dump myindex \ obikmer dump myindex \
+3 -3
View File
@@ -9,12 +9,12 @@
| `superkmer` | Extract super-kmers from a sequence file and write to stdout | | `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) | | `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one | | `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata | | `rebuild` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system |
| `query` | Query an index with sequences and annotate matches | | `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as `rebuild`; `--head N` limits output to the first N k-mers | | `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV | | `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) | | `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as `rebuild` | | `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared [kmer filtering](implementation/filtering.md) system |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing | | `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx | | `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes | | `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
+1 -1
View File
@@ -49,7 +49,7 @@ nav:
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md - PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
- PersistentBitVec: implementation/persistent_bit_vec.md - PersistentBitVec: implementation/persistent_bit_vec.md
- Merge command: implementation/merge.md - Merge command: implementation/merge.md
- Kmer filtering (rebuild/dump/unitig): implementation/rebuild_filter.md - Kmer filtering: implementation/filtering.md
- Architecture: - Architecture:
- Sequences: architecture/sequences/invariant.md - Sequences: architecture/sequences/invariant.md
- Kmer index: architecture/index_architecture.md - Kmer index: architecture/index_architecture.md
+51 -15
View File
@@ -207,7 +207,7 @@ impl FilterArgs {
std::process::exit(1); std::process::exit(1);
})) }))
.collect(); .collect();
vec![Box::new(build_group_filter( let filter = build_group_filter(
genomes, genomes,
&ingroup_preds, &ingroup_preds,
&outgroup_preds, &outgroup_preds,
@@ -222,7 +222,11 @@ impl FilterArgs {
min_outgroup_frac: self.min_outgroup_frac, min_outgroup_frac: self.min_outgroup_frac,
max_outgroup_frac: self.max_outgroup_frac, max_outgroup_frac: self.max_outgroup_frac,
}, },
))] ).unwrap_or_else(|e| {
eprintln!("error in filter parameters: {e}");
std::process::exit(1);
});
vec![Box::new(filter)]
} }
} }
@@ -243,7 +247,7 @@ pub fn build_group_filter(
ingroup_preds: &[MetaPred], ingroup_preds: &[MetaPred],
outgroup_preds: &[MetaPred], outgroup_preds: &[MetaPred],
p: GroupFilterParams, p: GroupFilterParams,
) -> GroupQuorumFilter { ) -> Result<GroupQuorumFilter, String> {
let (ingroup_idx, outgroup_idx) = if ingroup_preds.is_empty() && outgroup_preds.is_empty() { let (ingroup_idx, outgroup_idx) = if ingroup_preds.is_empty() && outgroup_preds.is_empty() {
((0..genomes.len()).collect(), vec![]) ((0..genomes.len()).collect(), vec![])
} else { } else {
@@ -260,20 +264,52 @@ pub fn build_group_filter(
let in_size = ingroup_idx.len(); let in_size = ingroup_idx.len();
let out_size = outgroup_idx.len(); let out_size = outgroup_idx.len();
let default_min_frac = if !ingroup_preds.is_empty() { 1.0 } else { 0.0 }; let ingroup_quorum_explicit = p.min_count.is_some() || p.max_count.is_some()
let default_max_outgroup_count = if !outgroup_preds.is_empty() { 0 } else { out_size }; || p.min_frac.is_some() || p.max_frac.is_some();
let outgroup_quorum_explicit = p.min_outgroup_count.is_some() || p.max_outgroup_count.is_some()
|| p.min_outgroup_frac.is_some() || p.max_outgroup_frac.is_some();
GroupQuorumFilter { let default_min_frac = if !ingroup_preds.is_empty() && !ingroup_quorum_explicit { 1.0 } else { 0.0 };
let default_max_outgroup_count = if !outgroup_preds.is_empty() && !outgroup_quorum_explicit { 0 } else { out_size };
let min_count = p.min_count.unwrap_or(0);
let max_count = p.max_count.unwrap_or(in_size);
let min_frac = p.min_frac.unwrap_or(default_min_frac);
let max_frac = p.max_frac.unwrap_or(1.0);
let min_outgroup_count = p.min_outgroup_count.unwrap_or(0);
let max_outgroup_count = p.max_outgroup_count.unwrap_or(default_max_outgroup_count);
let min_outgroup_frac = p.min_outgroup_frac.unwrap_or(0.0);
let max_outgroup_frac = p.max_outgroup_frac.unwrap_or(1.0);
for (v, lo, hi) in [
("--min-frac/--max-frac", min_frac, max_frac),
("--min-outgroup-frac/--max-outgroup-frac", min_outgroup_frac, max_outgroup_frac),
] {
if !(0.0..=1.0).contains(&lo) || !(0.0..=1.0).contains(&hi) {
return Err(format!("{v}: fraction values must be in [0.0, 1.0]"));
}
if lo > hi {
return Err(format!("{v}: min ({lo}) is greater than max ({hi})"));
}
}
if min_count > max_count {
return Err(format!("--min-count/--max-count: min ({min_count}) is greater than max ({max_count})"));
}
if min_outgroup_count > max_outgroup_count {
return Err(format!("--min-outgroup-count/--max-outgroup-count: min ({min_outgroup_count}) is greater than max ({max_outgroup_count})"));
}
Ok(GroupQuorumFilter {
ingroup_idx, ingroup_idx,
outgroup_idx, outgroup_idx,
threshold: p.threshold, threshold: p.threshold,
min_count: p.min_count.unwrap_or(0), min_count,
max_count: p.max_count.unwrap_or(in_size), max_count,
min_frac: p.min_frac.unwrap_or(default_min_frac), min_frac,
max_frac: p.max_frac.unwrap_or(1.0), max_frac,
min_outgroup_count: p.min_outgroup_count.unwrap_or(0), min_outgroup_count,
max_outgroup_count: p.max_outgroup_count.unwrap_or(default_max_outgroup_count), max_outgroup_count,
min_outgroup_frac: p.min_outgroup_frac.unwrap_or(0.0), min_outgroup_frac,
max_outgroup_frac: p.max_outgroup_frac.unwrap_or(1.0), max_outgroup_frac,
} })
} }