diff --git a/docmd/implementation/filtering.md b/docmd/implementation/filtering.md
index 795d8fd..4dfab31 100644
--- a/docmd/implementation/filtering.md
+++ b/docmd/implementation/filtering.md
@@ -1,12 +1,12 @@
 # Kmer filtering and ingroup/outgroup predicates
 
-The `rebuild`, `dump`, and `unitig` commands share the same filtering system,
+The `filter`, `dump`, and `unitig` commands share the same filtering system,
 implemented as a shared `FilterArgs` clap argument group embedded in each
 command via `#[command(flatten)]`. Filters select k-mers based on per-genome
 quorum counts, optionally restricted to **ingroup** and **outgroup** genome sets
 derived from genome metadata.
 All rules described here apply identically to all three commands.
-`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
+`filter` additionally accepts `--min-total-count` / `--max-total-count` filters
 that operate on the sum of counts across all genomes.
 
 ## Predicate syntax
@@ -93,8 +93,8 @@ For each genome:
 | `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
 | `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
 | `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
-| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
-| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
+| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`filter` only) |
+| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`filter` only) |
 | `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
 
 **Conditional defaults** — the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions:
@@ -169,7 +169,7 @@ Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genome
 and absent from every other genome in the index:
 
 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
   --ingroup "species=Betula_nana" \
   --outgroup "*" \
   --min-count 2 \
@@ -180,7 +180,7 @@ Keep k-mers found in at least 2 *Betula nana* genomes and absent from all other
 *Betula*:
 
 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
   --ingroup "species=Betula_nana" \
   --outgroup "genus=Betula" \
   --min-count 2 \
@@ -191,7 +191,7 @@ Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade and
 in fewer than 10 % of everything outside *Betulaceae*:
 
 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
   --ingroup "taxon~/Betulaceae/Betula" \
   --outgroup "taxon!~/Betulaceae" \
   --min-frac 0.5 \
@@ -201,7 +201,7 @@ obikmer rebuild src --output dst \
 Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
 
 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
   --ingroup "genus=Betula" \
   --outgroup "genus=Alnus" \
   --outgroup "genus=Carpinus" \
@@ -244,7 +244,7 @@ obikmer dump myindex --head 20 --ingroup "species=Betula_nana" --min-count 1
 ### `distance --presence-threshold N`
 
 When computing Jaccard distance on a **count index**, a k-mer is considered present in a genome if its count is ≥ N (default 1).
-This option is independent of the `--presence-threshold` used in `rebuild`/`query` filtering.
+This option is independent of the `--presence-threshold` used in filtering.
 
 ```sh
 # Jaccard treating kmers with count ≥ 2 as present
@@ -262,11 +262,11 @@ This parameter has no effect on presence/absence indexes (where values are alrea
   lookup and counter.
 
 - **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
-  embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
+  embedded via `#[command(flatten)]` in `FilterArgs`, `DumpArgs`, and
   `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
   list.
 
 - **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
   `filters: &[Box]` and applies them per-kmer before invoking
-  the callback. `rebuild`, `dump`, and `unitig` all go through this single
+  the callback. `filter`, `dump`, and `unitig` all go through this single
   entry point.
diff --git a/docmd/implementation/select.md b/docmd/implementation/select.md
index 8fbb022..d78a3fd 100644
--- a/docmd/implementation/select.md
+++ b/docmd/implementation/select.md
@@ -56,8 +56,8 @@ first, then renames atomically, so an interrupted run leaves the index intact.
 --group :
 ```
 
-Defines a named group of genomes using the same predicate syntax as
-`filter`/`rebuild`. Repeatable; a genome can belong to several groups.
+Defines a named group of genomes using the same predicate syntax as `filter`.
+Repeatable; a genome can belong to several groups.
 
 ```sh
 --group "pub:species=Betula_pubescens"
diff --git a/docmd/index.md b/docmd/index.md
index e2d96ca..9f6d650 100644
--- a/docmd/index.md
+++ b/docmd/index.md
@@ -9,7 +9,7 @@
 | `superkmer` | Extract super-kmers from a sequence file and write to stdout |
 | `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
 | `merge` | Merge multiple built indexes into one |
-| `rebuild` / `filter` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system (`filter` is an alias for `rebuild`) |
+| `filter` | Apply row-level selection (σ) to an index: retain only k-mers matching the ingroup/outgroup predicates. Output is a new single-layer index — compaction is a consequence, not the goal. Supports the shared [kmer filtering](implementation/filtering.md) system |
 | `query` | Query an index with sequences and annotate matches |
 | `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
 | `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
diff --git a/src/obikmer/src/cmd/rebuild.rs b/src/obikmer/src/cmd/filter.rs
similarity index 84%
rename from src/obikmer/src/cmd/rebuild.rs
rename to src/obikmer/src/cmd/filter.rs
index 594d5f4..993f80a 100644
--- a/src/obikmer/src/cmd/rebuild.rs
+++ b/src/obikmer/src/cmd/filter.rs
@@ -6,10 +6,10 @@ use obikpartitionner::filter::{MaxTotalCount, MinTotalCount};
 use obisys::Reporter;
 use tracing::info;
 
-use super::predicate::FilterArgs;
+use super::predicate::FilterArgs as KmerFilterArgs;
 
 #[derive(Args)]
-pub struct RebuildArgs {
+pub struct FilterArgs {
     /// Source index directory
     pub source: PathBuf,
 
@@ -18,7 +18,7 @@ pub struct RebuildArgs {
     pub output: PathBuf,
 
     #[command(flatten)]
-    pub filter: FilterArgs,
+    pub filter: KmerFilterArgs,
 
     /// Minimum total count across all genomes (count index only)
     #[arg(long)]
@@ -37,7 +37,7 @@ pub struct RebuildArgs {
     pub force: bool,
 }
 
-pub fn run(args: RebuildArgs) {
+pub fn run(args: FilterArgs) {
     let src = KmerIndex::open(&args.source).unwrap_or_else(|e| {
         eprintln!("error opening source index: {e}");
         std::process::exit(1);
@@ -50,7 +50,7 @@ pub fn run(args: RebuildArgs) {
     };
 
     info!(
-        "rebuild: {} genome(s), mode={:?}, source={}",
+        "filter: {} genome(s), mode={:?}, source={}",
         &src.meta().genomes.len(), mode, args.source.display()
     );
 
@@ -66,10 +66,10 @@ pub fn run(args: RebuildArgs) {
     let mut rep = Reporter::new();
     KmerIndex::rebuild(&args.output, &src, &filters, mode, args.force, &mut rep)
         .unwrap_or_else(|e| {
-            eprintln!("error rebuilding index: {e}");
+            eprintln!("error filtering index: {e}");
             std::process::exit(1);
         });
 
     rep.print();
-    info!("rebuilt index → {}", args.output.display());
+    info!("filtered index → {}", args.output.display());
 }
diff --git a/src/obikmer/src/cmd/mod.rs b/src/obikmer/src/cmd/mod.rs
index 1c1c0a6..eba674b 100644
--- a/src/obikmer/src/cmd/mod.rs
+++ b/src/obikmer/src/cmd/mod.rs
@@ -1,4 +1,5 @@
 pub mod annotate;
+pub mod filter;
 pub mod pack;
 pub(crate) mod predicate;
 pub mod select;
@@ -9,7 +10,6 @@ pub mod estimate;
 pub mod index;
 pub mod merge;
 pub mod query;
-pub mod rebuild;
 pub mod reindex;
 pub mod superkmer;
 pub mod unitig;
diff --git a/src/obikmer/src/main.rs b/src/obikmer/src/main.rs
index 224dd92..fdcf69c 100644
--- a/src/obikmer/src/main.rs
+++ b/src/obikmer/src/main.rs
@@ -20,10 +20,8 @@ enum Commands {
     Index(cmd::index::IndexArgs),
     /// Merge multiple built indexes into one
     Merge(cmd::merge::MergeArgs),
-    /// Filter and compact an existing index into a new single-layer index
-    Rebuild(cmd::rebuild::RebuildArgs),
-    /// Alias for rebuild
-    Filter(cmd::rebuild::RebuildArgs),
+    /// Apply row-level selection (σ) to an index: retain only k-mers matching the predicates
+    Filter(cmd::filter::FilterArgs),
     /// Project and/or aggregate genome columns into a new or in-place index
     Select(cmd::select::SelectArgs),
     /// Query an index with sequences and annotate matches
@@ -69,8 +67,7 @@ fn main() {
         Commands::Index(args) => cmd::index::run(args),
         Commands::Merge(args) => cmd::merge::run(args),
         Commands::Dump(args) => cmd::dump::run(args),
-        Commands::Rebuild(args) => cmd::rebuild::run(args),
-        Commands::Filter(args) => cmd::rebuild::run(args),
+        Commands::Filter(args) => cmd::filter::run(args),
         Commands::Select(args) => cmd::select::run(args),
         Commands::Query(args) => cmd::query::run(args),
         Commands::Annotate(args) => cmd::annotate::run(args),
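
Note on the rename: after this patch the crate has two types named `FilterArgs`, the command's own argument struct in `cmd::filter` and the shared predicate group in `cmd::predicate`, which is presumably why the latter is imported as `KmerFilterArgs`. Below is a minimal, dependency-free sketch of that aliasing pattern. The module layout mirrors the patch, but the fields (`min_count`, `min_total_count`) and the `describe` helper are hypothetical stand-ins, and the clap derives are left out so the snippet compiles on its own.

```rust
// Sketch only: clap derives omitted; field names and `describe` are
// illustrative assumptions, not the real obikmer definitions.
mod predicate {
    /// Stand-in for the shared filter argument group.
    #[derive(Debug, Default)]
    pub struct FilterArgs {
        pub min_count: Option<usize>, // hypothetical field
    }
}

mod filter {
    // Alias the shared group so it does not clash with the local `FilterArgs`.
    use super::predicate::FilterArgs as KmerFilterArgs;

    /// Stand-in for the `filter` command's own arguments.
    #[derive(Debug, Default)]
    pub struct FilterArgs {
        pub filter: KmerFilterArgs,
        pub min_total_count: Option<u64>, // hypothetical field
    }

    /// Hypothetical helper that reads both types without ambiguity.
    pub fn describe(args: &FilterArgs) -> String {
        format!(
            "min_count={:?}, min_total_count={:?}",
            args.filter.min_count, args.min_total_count
        )
    }
}

fn main() {
    let args = filter::FilterArgs {
        filter: predicate::FilterArgs { min_count: Some(2) },
        min_total_count: Some(10),
    };
    println!("{}", filter::describe(&args));
}
```

Without the `as KmerFilterArgs` alias, the `use` and the local struct definition would both bind the name `FilterArgs` in the same module and the compiler would reject the duplicate definition. Documentation that mentions `FilterArgs` may therefore need the full path to say which of the two is meant.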