refactor: rename rebuild subcommand to filter
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
@@ -1,12 +1,12 @@
 # Kmer filtering and ingroup/outgroup predicates

-The `rebuild`, `dump`, and `unitig` commands share the same filtering system,
+The `filter`, `dump`, and `unitig` commands share the same filtering system,
 implemented as a shared `FilterArgs` clap argument group embedded in each command
 via `#[command(flatten)]`. Filters select k-mers based on per-genome quorum
 counts, optionally restricted to **ingroup** and **outgroup** genome sets derived
 from genome metadata. All rules described here apply identically to all three commands.

-`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
+`filter` additionally accepts `--min-total-count` / `--max-total-count` filters
 that operate on the sum of counts across all genomes.

 ## Predicate syntax
@@ -93,8 +93,8 @@ For each genome:
 | `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
 | `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
 | `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
-| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
-| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
+| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`filter` only) |
+| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`filter` only) |
 | `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |

 **Conditional defaults** — the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions:
@@ -169,7 +169,7 @@ Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genome
 and absent from every other genome in the index:

 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
 --ingroup "species=Betula_nana" \
 --outgroup "*" \
 --min-count 2 \
@@ -180,7 +180,7 @@ Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
 other *Betula*:

 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
 --ingroup "species=Betula_nana" \
 --outgroup "genus=Betula" \
 --min-count 2 \
@@ -191,7 +191,7 @@ Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
 and in fewer than 10 % of everything outside *Betulaceae*:

 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
 --ingroup "taxon~/Betulaceae/Betula" \
 --outgroup "taxon!~/Betulaceae" \
 --min-frac 0.5 \
@@ -201,7 +201,7 @@ obikmer rebuild src --output dst \
 Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:

 ```sh
-obikmer rebuild src --output dst \
+obikmer filter src --output dst \
 --ingroup "genus=Betula" \
 --outgroup "genus=Alnus" \
 --outgroup "genus=Carpinus" \
@@ -244,7 +244,7 @@ obikmer dump myindex --head 20 --ingroup "species=Betula_nana" --min-count 1
 ### `distance --presence-threshold N`

 When computing Jaccard distance on a **count index**, a k-mer is considered present in a genome if its count is ≥ N (default 1).
-This option is independent of the `--presence-threshold` used in `rebuild`/`query` filtering.
+This option is independent of the `--presence-threshold` used in filtering.

 ```sh
 # Jaccard treating kmers with count ≥ 2 as present
@@ -262,11 +262,11 @@ This parameter has no effect on presence/absence indexes (where values are alrea
 lookup and counter.

 - **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
-embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
+embedded via `#[command(flatten)]` in `FilterArgs`, `DumpArgs`, and
 `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
 list.

 - **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
 `filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking
-the callback. `rebuild`, `dump`, and `unitig` all go through this single
+the callback. `filter`, `dump`, and `unitig` all go through this single
 entry point.
@@ -56,8 +56,8 @@ first, then renames atomically, so an interrupted run leaves the index intact.
 --group <name>:<pred>
 ```

-Defines a named group of genomes using the same predicate syntax as
-`filter`/`rebuild`. Repeatable; a genome can belong to several groups.
+Defines a named group of genomes using the same predicate syntax as `filter`.
+Repeatable; a genome can belong to several groups.

 ```sh
 --group "pub:species=Betula_pubescens"
@@ -9,7 +9,7 @@
 | `superkmer` | Extract super-kmers from a sequence file and write to stdout |
 | `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
 | `merge` | Merge multiple built indexes into one |
-| `rebuild` / `filter` | Filter and compact an existing index into a new single-layer index; supports the shared [kmer filtering](implementation/filtering.md) system (`filter` is an alias for `rebuild`) |
+| `filter` | Apply row-level selection (σ) to an index: retain only k-mers matching the ingroup/outgroup predicates. Output is a new single-layer index — compaction is a consequence, not the goal. Supports the shared [kmer filtering](implementation/filtering.md) system |
 | `query` | Query an index with sequences and annotate matches |
 | `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared [kmer filtering](implementation/filtering.md) system; `--head N` limits output to the first N k-mers |
 | `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |