refactor: centralize k-mer filtering logic and add validation

Refactor the shared `FilterArgs` and make `build_group_filter` return a `Result`, with explicit validation of fraction bounds, min/max ordering, and count constraints. Make the conditional defaults for `--min-frac` and `--max-outgroup-count` depend on explicit quorum flags, preventing silent configuration conflicts. Update the documentation and MkDocs navigation to reflect the new centralized k-mer filtering system shared by the `rebuild`, `dump`, and `unitig` commands.
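
The validation described above can be pictured with a minimal sketch. This is not the code from this commit: the field names, the `GroupFilter` type, the `String` error type, and the fallback defaults are all assumptions made for illustration. Only the names `FilterArgs` and `build_group_filter`, and the three kinds of checks, come from the message.

```rust
/// Hypothetical stand-in for the shared filter arguments.
/// All field names here are assumptions, not the real CLI surface.
#[derive(Debug, Clone, Default)]
pub struct FilterArgs {
    pub min_frac: Option<f64>,
    pub max_frac: Option<f64>,
    pub min_count: Option<u32>,
    pub max_count: Option<u32>,
}

/// Hypothetical validated filter produced from `FilterArgs`.
#[derive(Debug, Clone, PartialEq)]
pub struct GroupFilter {
    pub min_frac: f64,
    pub max_frac: f64,
    pub min_count: u32,
    pub max_count: u32,
}

/// Sketch of a builder that reports bad input as an `Err`
/// instead of silently accepting a conflicting configuration.
pub fn build_group_filter(args: &FilterArgs) -> Result<GroupFilter, String> {
    // Assumed fallback defaults; the real defaults are conditional on other flags.
    let min_frac = args.min_frac.unwrap_or(0.0);
    let max_frac = args.max_frac.unwrap_or(1.0);

    // Fraction bounds: each value must lie in [0, 1]. NaN fails this check too.
    for (name, value) in [("min-frac", min_frac), ("max-frac", max_frac)] {
        if !(0.0..=1.0).contains(&value) {
            return Err(format!("--{name} must be in [0, 1], got {value}"));
        }
    }

    // Min/max ordering for fractions.
    if min_frac > max_frac {
        return Err(format!("min fraction {min_frac} exceeds max fraction {max_frac}"));
    }

    // Count constraints (assumed): a positive minimum that does not exceed the maximum.
    let min_count = args.min_count.unwrap_or(1);
    let max_count = args.max_count.unwrap_or(u32::MAX);
    if min_count == 0 {
        return Err("min count must be at least 1".to_string());
    }
    if min_count > max_count {
        return Err(format!("min count {min_count} exceeds max count {max_count}"));
    }

    Ok(GroupFilter { min_frac, max_frac, min_count, max_count })
}
```

Returning a `Result` lets each command that shares the filter surface the same error message to the user, rather than each one re-implementing (or omitting) the checks.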
2026-06-09 09:57:38 +02:00
parent 2465cfbc4b
commit ce45e2fbe1
4 changed files with 98 additions and 34 deletions
@@ -9,12 +9,12 @@
 | `superkmer` | Extract super-kmers from a sequence file and write to stdout |
 | `index`     | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
 | `merge`     | Merge multiple built indexes into one |
-| `rebuild`   | Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata |
+| `rebuild`   | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system |
 | `query`     | Query an index with sequences and annotate matches |
-| `dump`      | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as `rebuild`; `--head N` limits output to the first N k-mers |
+| `dump`      | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
 | `annotate`  | Add or update genome metadata from a CSV file; or dump metadata as CSV |
 | `distance`  | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
-| `unitig`    | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as `rebuild` |
+| `unitig`    | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared [kmer filtering](implementation/filtering.md) system |
 | `estimate`  | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
 | `reindex`   | Convert an index's evidence in-place: exact ↔ approx |
 | `utils`     | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |