Push tmpsxsztwpxl #21

Merged
coissac merged 2 commits from push-tmpsxsztwpxl into main 2026-06-09 13:31:25 +00:00
6 changed files with 25 additions and 28 deletions
Showing only changes of commit 970460be42
+11 -11
@@ -1,12 +1,12 @@
# Kmer filtering and ingroup/outgroup predicates # Kmer filtering and ingroup/outgroup predicates
The `rebuild`, `dump`, and `unitig` commands share the same filtering system, The `filter`, `dump`, and `unitig` commands share the same filtering system,
implemented as a shared `FilterArgs` clap argument group embedded in each command implemented as a shared `FilterArgs` clap argument group embedded in each command
via `#[command(flatten)]`. Filters select k-mers based on per-genome quorum via `#[command(flatten)]`. Filters select k-mers based on per-genome quorum
counts, optionally restricted to **ingroup** and **outgroup** genome sets derived counts, optionally restricted to **ingroup** and **outgroup** genome sets derived
from genome metadata. All rules described here apply identically to all three commands. from genome metadata. All rules described here apply identically to all three commands.
`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters `filter` additionally accepts `--min-total-count` / `--max-total-count` filters
that operate on the sum of counts across all genomes. that operate on the sum of counts across all genomes.
## Predicate syntax ## Predicate syntax
@@ -93,8 +93,8 @@ For each genome:
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes | | `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes | | `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes | | `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) | | `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`filter` only) |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) | | `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`filter` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) | | `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
**Conditional defaults** — the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions: **Conditional defaults** — the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions:
@@ -169,7 +169,7 @@ Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genome
and absent from every other genome in the index: and absent from every other genome in the index:
```sh ```sh
obikmer rebuild src --output dst \ obikmer filter src --output dst \
--ingroup "species=Betula_nana" \ --ingroup "species=Betula_nana" \
--outgroup "*" \ --outgroup "*" \
--min-count 2 \ --min-count 2 \
@@ -180,7 +180,7 @@ Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
other *Betula*: other *Betula*:
```sh ```sh
obikmer rebuild src --output dst \ obikmer filter src --output dst \
--ingroup "species=Betula_nana" \ --ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \ --outgroup "genus=Betula" \
--min-count 2 \ --min-count 2 \
@@ -191,7 +191,7 @@ Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
and in fewer than 10 % of everything outside *Betulaceae*: and in fewer than 10 % of everything outside *Betulaceae*:
```sh ```sh
obikmer rebuild src --output dst \ obikmer filter src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \ --ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \ --outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \ --min-frac 0.5 \
@@ -201,7 +201,7 @@ obikmer rebuild src --output dst \
Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*: Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
```sh ```sh
obikmer rebuild src --output dst \ obikmer filter src --output dst \
--ingroup "genus=Betula" \ --ingroup "genus=Betula" \
--outgroup "genus=Alnus" \ --outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \ --outgroup "genus=Carpinus" \
@@ -244,7 +244,7 @@ obikmer dump myindex --head 20 --ingroup "species=Betula_nana" --min-count 1
### `distance --presence-threshold N` ### `distance --presence-threshold N`
When computing Jaccard distance on a **count index**, a k-mer is considered present in a genome if its count is ≥ N (default 1). When computing Jaccard distance on a **count index**, a k-mer is considered present in a genome if its count is ≥ N (default 1).
This option is independent of the `--presence-threshold` used in `rebuild`/`query` filtering. This option is independent of the `--presence-threshold` used in filtering.
```sh ```sh
# Jaccard treating kmers with count ≥ 2 as present # Jaccard treating kmers with count ≥ 2 as present
@@ -262,11 +262,11 @@ This parameter has no effect on presence/absence indexes (where values are alrea
lookup and counter. lookup and counter.
- **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group - **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and embedded via `#[command(flatten)]` in `FilterArgs`, `DumpArgs`, and
`UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
list. list.
- **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts - **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
`filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking `filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking
the callback. `rebuild`, `dump`, and `unitig` all go through this single the callback. `filter`, `dump`, and `unitig` all go through this single
entry point. entry point.
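Reviewer note: the documentation above describes a list of boxed filters applied per k-mer, with quorum rules over ingroup and outgroup genomes and a total-count bound. The sketch below illustrates that idea only. The diff does not show the real `KmerFilter` trait, so the `keep` method, the `Quorum` struct, its fields, and the `passes` helper are all assumptions made for illustration, not the obikmer API.

```rust
// Hypothetical sketch: these definitions are NOT copied from obikmer.
// Only the idea (a slice of boxed filters applied to each k-mer) comes from the docs.

/// A filter decides, from the per-genome counts of one k-mer, whether to keep it.
/// The method name and signature are assumptions.
pub trait KmerFilter {
    fn keep(&self, counts: &[u32]) -> bool;
}

/// Keeps k-mers whose summed count across all genomes is at least the given value.
pub struct MinTotalCount(pub u64);

impl KmerFilter for MinTotalCount {
    fn keep(&self, counts: &[u32]) -> bool {
        counts.iter().map(|&c| u64::from(c)).sum::<u64>() >= self.0
    }
}

/// Quorum rule over two genome sets, given as column indices into `counts`.
pub struct Quorum {
    pub ingroup: Vec<usize>,
    pub outgroup: Vec<usize>,
    /// A genome "has" the k-mer when its count is strictly greater than this.
    pub presence_threshold: u32,
    /// Minimum number of ingroup genomes that must have the k-mer.
    pub min_count: usize,
    /// Maximum number of outgroup genomes allowed to have the k-mer.
    pub max_outgroup_count: usize,
}

impl Quorum {
    /// Number of genomes in `group` where the k-mer counts as present.
    fn present_in(&self, group: &[usize], counts: &[u32]) -> usize {
        group
            .iter()
            .filter(|&&g| counts[g] > self.presence_threshold)
            .count()
    }
}

impl KmerFilter for Quorum {
    fn keep(&self, counts: &[u32]) -> bool {
        self.present_in(&self.ingroup, counts) >= self.min_count
            && self.present_in(&self.outgroup, counts) <= self.max_outgroup_count
    }
}

/// A k-mer is kept only if every filter in the list accepts it.
pub fn passes(filters: &[Box<dyn KmerFilter>], counts: &[u32]) -> bool {
    filters.iter().all(|f| f.keep(counts))
}
```

With genomes 0 and 1 as ingroup, genome 2 as outgroup, `min_count = 2`, and `max_outgroup_count = 0`, a k-mer with counts `[2, 1, 0]` is kept while `[2, 1, 1]` is rejected because it appears in the outgroup.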
+2 -2
@@ -56,8 +56,8 @@ first, then renames atomically, so an interrupted run leaves the index intact.
--group <name>:<pred> --group <name>:<pred>
``` ```
Defines a named group of genomes using the same predicate syntax as Defines a named group of genomes using the same predicate syntax as `filter`.
`filter`/`rebuild`. Repeatable; a genome can belong to several groups. Repeatable; a genome can belong to several groups.
```sh ```sh
--group "pub:species=Betula_pubescens" --group "pub:species=Betula_pubescens"
+1 -1
@@ -9,7 +9,7 @@
| `superkmer` | Extract super-kmers from a sequence file and write to stdout | | `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) | | `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one | | `merge` | Merge multiple built indexes into one |
| `rebuild` / `filter` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system (`filter` is an alias for `rebuild`) | | `filter` | Apply row-level selection (σ) to an index: retain only k-mers matching the ingroup/outgroup predicates. Output is a new single-layer index — compaction is a consequence, not the goal. Supports the shared [kmer filtering](implementation/filtering.md) system |
| `query` | Query an index with sequences and annotate matches | | `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers | | `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV | | `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
@@ -6,10 +6,10 @@ use obikpartitionner::filter::{MaxTotalCount, MinTotalCount};
use obisys::Reporter; use obisys::Reporter;
use tracing::info; use tracing::info;
use super::predicate::FilterArgs; use super::predicate::FilterArgs as KmerFilterArgs;
#[derive(Args)] #[derive(Args)]
pub struct RebuildArgs { pub struct FilterArgs {
/// Source index directory /// Source index directory
pub source: PathBuf, pub source: PathBuf,
@@ -18,7 +18,7 @@ pub struct RebuildArgs {
pub output: PathBuf, pub output: PathBuf,
#[command(flatten)] #[command(flatten)]
pub filter: FilterArgs, pub filter: KmerFilterArgs,
/// Minimum total count across all genomes (count index only) /// Minimum total count across all genomes (count index only)
#[arg(long)] #[arg(long)]
@@ -37,7 +37,7 @@ pub struct RebuildArgs {
pub force: bool, pub force: bool,
} }
pub fn run(args: RebuildArgs) { pub fn run(args: FilterArgs) {
let src = KmerIndex::open(&args.source).unwrap_or_else(|e| { let src = KmerIndex::open(&args.source).unwrap_or_else(|e| {
eprintln!("error opening source index: {e}"); eprintln!("error opening source index: {e}");
std::process::exit(1); std::process::exit(1);
@@ -50,7 +50,7 @@ pub fn run(args: RebuildArgs) {
}; };
info!( info!(
"rebuild: {} genome(s), mode={:?}, source={}", "filter: {} genome(s), mode={:?}, source={}",
&src.meta().genomes.len(), mode, args.source.display() &src.meta().genomes.len(), mode, args.source.display()
); );
@@ -66,10 +66,10 @@ pub fn run(args: RebuildArgs) {
let mut rep = Reporter::new(); let mut rep = Reporter::new();
KmerIndex::rebuild(&args.output, &src, &filters, mode, args.force, &mut rep) KmerIndex::rebuild(&args.output, &src, &filters, mode, args.force, &mut rep)
.unwrap_or_else(|e| { .unwrap_or_else(|e| {
eprintln!("error rebuilding index: {e}"); eprintln!("error filtering index: {e}");
std::process::exit(1); std::process::exit(1);
}); });
rep.print(); rep.print();
info!("rebuilt index → {}", args.output.display()); info!("filtered index → {}", args.output.display());
} }
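Reviewer note: renaming the command struct to `FilterArgs` makes it collide with the shared argument group of the same name, which is why the import gains an `as KmerFilterArgs` alias. The std-only sketch below shows that aliasing pattern in isolation. The module layout mirrors the diff, but the fields (`min_count`, `force`) and the `describe` function are simplified stand-ins chosen for illustration, and the clap derives are omitted.

```rust
// Hypothetical, simplified stand-ins; the real structs carry clap attributes.

mod predicate {
    /// Stand-in for the shared filter argument group.
    #[derive(Debug, Default, PartialEq)]
    pub struct FilterArgs {
        pub min_count: Option<usize>,
    }
}

mod filter {
    // Alias the shared group so this module can define its own `FilterArgs`
    // without a name clash.
    use super::predicate::FilterArgs as KmerFilterArgs;

    /// Stand-in for the `filter` command's arguments.
    #[derive(Debug, Default)]
    pub struct FilterArgs {
        pub filter: KmerFilterArgs,
        pub force: bool,
    }

    /// Formats the arguments, reaching through the embedded shared group.
    pub fn describe(args: &FilterArgs) -> String {
        format!("min_count={:?}, force={}", args.filter.min_count, args.force)
    }
}
```

Outside the `filter` module the two types remain distinguishable by path (`predicate::FilterArgs` versus `filter::FilterArgs`), so the alias is only needed where both are in scope at once.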
+1 -1
@@ -1,4 +1,5 @@
pub mod annotate; pub mod annotate;
pub mod filter;
pub mod pack; pub mod pack;
pub(crate) mod predicate; pub(crate) mod predicate;
pub mod select; pub mod select;
@@ -9,7 +10,6 @@ pub mod estimate;
pub mod index; pub mod index;
pub mod merge; pub mod merge;
pub mod query; pub mod query;
pub mod rebuild;
pub mod reindex; pub mod reindex;
pub mod superkmer; pub mod superkmer;
pub mod unitig; pub mod unitig;
+3 -6
@@ -20,10 +20,8 @@ enum Commands {
Index(cmd::index::IndexArgs), Index(cmd::index::IndexArgs),
/// Merge multiple built indexes into one /// Merge multiple built indexes into one
Merge(cmd::merge::MergeArgs), Merge(cmd::merge::MergeArgs),
/// Filter and compact an existing index into a new single-layer index /// Apply row-level selection (σ) to an index: retain only k-mers matching the predicates
Rebuild(cmd::rebuild::RebuildArgs), Filter(cmd::filter::FilterArgs),
/// Alias for rebuild
Filter(cmd::rebuild::RebuildArgs),
/// Project and/or aggregate genome columns into a new or in-place index /// Project and/or aggregate genome columns into a new or in-place index
Select(cmd::select::SelectArgs), Select(cmd::select::SelectArgs),
/// Query an index with sequences and annotate matches /// Query an index with sequences and annotate matches
@@ -69,8 +67,7 @@ fn main() {
Commands::Index(args) => cmd::index::run(args), Commands::Index(args) => cmd::index::run(args),
Commands::Merge(args) => cmd::merge::run(args), Commands::Merge(args) => cmd::merge::run(args),
Commands::Dump(args) => cmd::dump::run(args), Commands::Dump(args) => cmd::dump::run(args),
Commands::Rebuild(args) => cmd::rebuild::run(args), Commands::Filter(args) => cmd::filter::run(args),
Commands::Filter(args) => cmd::rebuild::run(args),
Commands::Select(args) => cmd::select::run(args), Commands::Select(args) => cmd::select::run(args),
Commands::Query(args) => cmd::query::run(args), Commands::Query(args) => cmd::query::run(args),
Commands::Annotate(args) => cmd::annotate::run(args), Commands::Annotate(args) => cmd::annotate::run(args),