Eric Coissac 970460be42 refactor: rename rebuild subcommand to filter
Rename the `rebuild` CLI subcommand to `filter` to better reflect its primary purpose of row-level selection and k-mer filtering. Update all associated CLI arguments, logging, error messages, and module registrations accordingly. Introduce a dedicated `Rebuild` subcommand for index compaction, fully decoupling it from the filtering logic. Also refine related documentation to align with the new naming and semantics.
2026-06-09 15:26:37 +02:00


Kmer filtering and ingroup/outgroup predicates

The filter, dump, and unitig commands share the same filtering system, implemented as a shared FilterArgs clap argument group embedded in each command via #[command(flatten)]. Filters select k-mers based on per-genome quorum counts, optionally restricted to ingroup and outgroup genome sets derived from genome metadata. All rules described here apply identically to all three commands.

filter additionally accepts --min-total-count / --max-total-count filters that operate on the sum of counts across all genomes.

Predicate syntax

Each --ingroup and --outgroup flag takes a predicate of the form:

key OP value1|value2|…
| Operator | Meaning |
|---|---|
| * or all | wildcard: every genome matches unconditionally |
| key=v1\|v2 | exact match: genome's key equals v1 or v2 |
| key!=v | negation: genome's key equals none of the values |
| key~path | path ancestry: genome's key is path or a descendant |
| key!~path | not a descendant |

Multiple values separated by | are always OR-ed within the predicate.
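The syntax above can be sketched as a small parser. This is an illustration only, not the parser used by obikmer: the Op and Predicate types and the parse_predicate function are hypothetical names, and tolerating whitespace around the key and values is an assumption.

```rust
/// Comparison operator of a predicate (illustrative names, not the real types).
#[derive(Debug, PartialEq)]
enum Op {
    Eq,      // key=v1|v2
    Ne,      // key!=v
    Path,    // key~path
    NotPath, // key!~path
}

/// Parsed form of one --ingroup / --outgroup argument.
#[derive(Debug, PartialEq)]
enum Predicate {
    /// `*` or `all`: every genome matches.
    Wildcard,
    /// `key OP value1|value2|...`
    Test { key: String, op: Op, values: Vec<String> },
}

fn parse_predicate(text: &str) -> Result<Predicate, String> {
    let text = text.trim();
    if text == "*" || text == "all" {
        return Ok(Predicate::Wildcard);
    }
    // The operator starts at the first '=', '~' or '!' character.
    let pos = text
        .find(|c: char| c == '=' || c == '~' || c == '!')
        .ok_or_else(|| format!("no operator in predicate '{}'", text))?;
    let (key, rest) = text.split_at(pos);
    // Two-character operators are tested before their one-character suffixes.
    let (op, values) = if let Some(v) = rest.strip_prefix("!=") {
        (Op::Ne, v)
    } else if let Some(v) = rest.strip_prefix("!~") {
        (Op::NotPath, v)
    } else if let Some(v) = rest.strip_prefix('=') {
        (Op::Eq, v)
    } else if let Some(v) = rest.strip_prefix('~') {
        (Op::Path, v)
    } else {
        return Err(format!("unknown operator in predicate '{}'", text));
    };
    let key = key.trim();
    if key.is_empty() || values.trim().is_empty() {
        return Err(format!("incomplete predicate '{}'", text));
    }
    Ok(Predicate::Test {
        key: key.to_string(),
        op,
        // Values separated by '|' are alternatives (OR-ed at evaluation time).
        values: values.split('|').map(|v| v.trim().to_string()).collect(),
    })
}
```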

Path matching (~ and !~)

Metadata values can represent hierarchical taxonomic paths such as /Eukaryota/Viridiplantae/Streptophyta/Betulaceae/Betula/nana.

  • Absolute pattern (starts with /): the value must start with the pattern at a segment boundary. taxon~/Betulaceae/Betula matches /Betulaceae/Betula/nana and /Betulaceae/Betula but not /Betulaceae/Betuloides/….
  • Bare segment (no leading /): the value must contain the pattern as an exact path component anywhere. taxon~Betula matches any path that has Betula as one of its segments.
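The two matching modes can be written as one short function. This is a minimal sketch of the rules as stated above, not the project's code; path_matches is a hypothetical name, and a bare pattern is assumed to be a single segment.

```rust
/// Returns true when `value` is `pattern` or one of its descendants.
fn path_matches(value: &str, pattern: &str) -> bool {
    if pattern.starts_with('/') {
        // Absolute pattern: prefix match that must end on a segment boundary.
        let pattern = pattern.trim_end_matches('/');
        match value.strip_prefix(pattern) {
            Some(rest) => rest.is_empty() || rest.starts_with('/'),
            None => false,
        }
    } else {
        // Bare segment: exact match against any path component.
        value.split('/').any(|segment| segment == pattern)
    }
}
```

The segment-boundary test is what keeps /Betulaceae/Betuloides/… from matching /Betulaceae/Betula even though it shares the same string prefix.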

Missing metadata key → NA

If a genome does not carry the queried metadata key, the predicate returns NA. NA propagates through the group evaluation logic (see below), and genomes that cannot be classified are ignored in all quorum counts.

Group semantics

Multiple predicates

| Flag | Combination rule |
|---|---|
| --ingroup (repeated) | AND: genome must satisfy all predicates |
| --outgroup (repeated) | OR: genome satisfies any predicate |

Three-value logic

Each predicate returns true, false, or NA (absent key).

  • AND: false absorbs everything; NA propagates unless already false.
  • OR: true absorbs everything; NA propagates unless already true.
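The absorption rules can be illustrated with a small three-valued type. This is a sketch with hypothetical names (Tri, and_all, or_any); the results for an empty predicate list (True for AND, False for OR) are an assumption, not documented behaviour.

```rust
/// Result of evaluating one predicate on one genome.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tri {
    True,
    False,
    Na, // the genome lacks the queried metadata key
}

/// AND over all predicates: False absorbs everything, otherwise NA propagates.
fn and_all(results: &[Tri]) -> Tri {
    let mut acc = Tri::True;
    for &r in results {
        match r {
            Tri::False => return Tri::False,
            Tri::Na => acc = Tri::Na,
            Tri::True => {}
        }
    }
    acc
}

/// OR over all predicates: True absorbs everything, otherwise NA propagates.
fn or_any(results: &[Tri]) -> Tri {
    let mut acc = Tri::False;
    for &r in results {
        match r {
            Tri::True => return Tri::True,
            Tri::Na => acc = Tri::Na,
            Tri::False => {}
        }
    }
    acc
}
```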

Classification and priority

For each genome:

  1. Evaluate AND(ingroup predicates) → in_result
  2. Evaluate OR(outgroup predicates) → out_result
  3. If in_result = true → Ingroup (ingroup wins over outgroup)
  4. Else if out_result = true → Outgroup
  5. Otherwise → Uncategorized (ignored in all quorum counts)
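The priority order above could be sketched as follows. Group, classify, and group_indices are hypothetical names, and NA is modelled here as None in an Option<bool>; the real implementation may represent it differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Group {
    Ingroup,
    Outgroup,
    Uncategorized,
}

/// Classifies one genome from its two group results (`None` stands for NA).
fn classify(in_result: Option<bool>, out_result: Option<bool>) -> Group {
    if in_result == Some(true) {
        Group::Ingroup // ingroup wins over outgroup
    } else if out_result == Some(true) {
        Group::Outgroup
    } else {
        Group::Uncategorized // false or NA on both sides
    }
}

/// Builds the genome index vectors of both groups, computed once up front.
fn group_indices(results: &[(Option<bool>, Option<bool>)]) -> (Vec<usize>, Vec<usize>) {
    let mut ingroup = Vec::new();
    let mut outgroup = Vec::new();
    for (i, &(in_r, out_r)) in results.iter().enumerate() {
        match classify(in_r, out_r) {
            Group::Ingroup => ingroup.push(i),
            Group::Outgroup => outgroup.push(i),
            Group::Uncategorized => {}
        }
    }
    (ingroup, outgroup)
}
```

A genome that satisfies both sides, for example with --outgroup "*", lands in the ingroup, which is why a wildcard outgroup effectively means "every other genome".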

Implicit groups

| --ingroup | --outgroup | Effective behaviour |
|---|---|---|
| not set | not set | all genomes form the ingroup |
| set | not set | only ingroup quorum flags apply |
| not set | set | only outgroup quorum flags apply |
| set | set | both constraints apply simultaneously |

Quorum flags

| Flag | Applies to | Meaning |
|---|---|---|
| --min-count N | ingroup | k-mer present in at least N ingroup genomes |
| --max-count N | ingroup | k-mer present in at most N ingroup genomes |
| --min-frac F | ingroup | k-mer present in at least fraction F of ingroup genomes |
| --max-frac F | ingroup | k-mer present in at most fraction F of ingroup genomes |
| --min-outgroup-count N | outgroup | k-mer present in at least N outgroup genomes |
| --max-outgroup-count N | outgroup | k-mer present in at most N outgroup genomes |
| --min-outgroup-frac F | outgroup | k-mer present in at least fraction F of outgroup genomes |
| --max-outgroup-frac F | outgroup | k-mer present in at most fraction F of outgroup genomes |
| --min-total-count N | all genomes | sum of per-genome counts ≥ N (filter only) |
| --max-total-count N | all genomes | sum of per-genome counts ≤ N (filter only) |
| --presence-threshold N | all | per-genome count > N to be considered "present" (default 0) |
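To make the count-based flags concrete, here is a minimal sketch of how one k-mer row might be tested against a group. The function names and the representation of a row as a slice of per-genome counts are assumptions for illustration, not the obikmer API.

```rust
/// Number of genomes in `members` whose count exceeds the presence threshold.
fn genomes_with_kmer(counts: &[u32], members: &[usize], presence_threshold: u32) -> usize {
    members
        .iter()
        .filter(|&&g| counts[g] > presence_threshold)
        .count()
}

/// Count quorum for one group: min_count <= n <= max_count.
fn passes_count_quorum(
    counts: &[u32],
    members: &[usize],
    min_count: usize,
    max_count: usize,
    presence_threshold: u32,
) -> bool {
    let n = genomes_with_kmer(counts, members, presence_threshold);
    min_count <= n && n <= max_count
}

/// Sum of counts across all genomes, as used by the total-count bounds.
fn total_count(counts: &[u32]) -> u64 {
    counts.iter().map(|&c| u64::from(c)).sum()
}
```

With counts [3, 0, 1, 5] and an ingroup made of genomes 0 and 2, both genomes carry the k-mer at the default threshold 0, but only genome 0 does with --presence-threshold 1.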

Conditional defaults — the defaults for --min-frac and --max-outgroup-count depend on two conditions: whether the corresponding group was declared, and whether any quorum flag for that group was explicitly set.

Rule: declaring a group activates the smart default only if no quorum flag for that group is explicitly set. As soon as any quorum flag for a group is present on the command line, all defaults for that group revert to no-op values.

| --ingroup | Any ingroup quorum flag? | --min-frac default |
|---|---|---|
| not set | | 0.0 (no-op) |
| set | no | 1.0: all ingroup genomes must carry the k-mer |
| set | yes | 0.0: user controls quorum explicitly |

| --outgroup | Any outgroup quorum flag? | --max-outgroup-count default |
|---|---|---|
| not set | | outgroup size (no-op) |
| set | no | 0: no outgroup genome may carry the k-mer |
| set | yes | outgroup size: user controls quorum explicitly |

"Any ingroup quorum flag" means any of: --min-count, --max-count, --min-frac, --max-frac.
"Any outgroup quorum flag" means any of: --min-outgroup-count, --max-outgroup-count, --min-outgroup-frac, --max-outgroup-frac.

Why this rule? Setting any quorum flag signals explicit intent — the defaults are there to help when the user omits quorum entirely, not to interfere with deliberate constraints. Mixing implicit and explicit quorum on the same group would risk silent incoherence (e.g. --max-count 0 with an implicit --min-frac 1.0).

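All other bounds default to their no-op values regardless of whether groups are declared: 0 for minimum counts, the group size for maximum counts, 0.0 for minimum fractions, and 1.0 for maximum fractions.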
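The default-resolution rule can be sketched as two small functions. The Quorum struct and both function names are hypothetical; the sketch only encodes the two conditional-defaults tables and assumes unset flags arrive as None.

```rust
/// Quorum flags of one group as they might arrive from the command line.
#[derive(Default)]
struct Quorum {
    min_count: Option<usize>,
    max_count: Option<usize>,
    min_frac: Option<f64>,
    max_frac: Option<f64>,
}

impl Quorum {
    /// True as soon as any quorum flag of the group was given explicitly.
    fn any_set(&self) -> bool {
        self.min_count.is_some()
            || self.max_count.is_some()
            || self.min_frac.is_some()
            || self.max_frac.is_some()
    }
}

/// Effective --min-frac for the ingroup.
fn effective_min_frac(ingroup_declared: bool, ingroup: &Quorum) -> f64 {
    match ingroup.min_frac {
        Some(f) => f,
        // Smart default: group declared and no quorum flag set for it.
        None if ingroup_declared && !ingroup.any_set() => 1.0,
        None => 0.0, // no-op
    }
}

/// Effective --max-outgroup-count for the outgroup.
fn effective_max_outgroup_count(
    outgroup_declared: bool,
    outgroup: &Quorum,
    outgroup_size: usize,
) -> usize {
    match outgroup.max_count {
        Some(n) => n,
        None if outgroup_declared && !outgroup.any_set() => 0,
        None => outgroup_size, // no-op
    }
}
```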

Validation

After resolving defaults, the following are checked and cause an immediate error:

| Condition | Error |
|---|---|
| --min-count > --max-count | incoherent bounds |
| --min-frac > --max-frac | incoherent bounds |
| --min-outgroup-count > --max-outgroup-count | incoherent bounds |
| --min-outgroup-frac > --max-outgroup-frac | incoherent bounds |
| any fraction outside [0.0, 1.0] | invalid value |

The check applies to the effective values (after defaults are resolved), so an explicit --max-frac 0.5 with an implicit --min-frac 1.0 would have been caught — but the rule above prevents that situation from arising in the first place.
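A possible shape for these checks on one group, applied to the effective values, is shown below. validate_bounds is a hypothetical function, the error strings simply reuse the labels from the list of conditions, and checking the fraction range before bound coherence is an arbitrary choice.

```rust
/// Validates the effective bounds of one group.
fn validate_bounds(
    min_count: usize,
    max_count: usize,
    min_frac: f64,
    max_frac: f64,
) -> Result<(), String> {
    // Any fraction outside [0.0, 1.0] is rejected (NaN is rejected as well).
    for &f in [min_frac, max_frac].iter() {
        if !(0.0..=1.0).contains(&f) {
            return Err("invalid value".to_string());
        }
    }
    // A lower bound above its upper bound can never be satisfied.
    if min_count > max_count || min_frac > max_frac {
        return Err("incoherent bounds".to_string());
    }
    Ok(())
}
```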

Fractions are computed over the size of the classified group, not over total genome count. An empty group (no genome classified as ingroup/outgroup) never triggers a filter failure.

Conservative rounding of fraction thresholds

When a fraction threshold F is applied to a group of size N, the effective integer threshold is determined by the direction of the bound:

| Bound | Effective count | Rounding | Rationale |
|---|---|---|---|
| --min-frac F | k-mer in ≥ ⌈F·N⌉ genomes | ceil | stricter: a k-mer present in exactly ⌊F·N⌋ genomes does not meet the fraction |
| --max-frac F | k-mer in ≤ ⌊F·N⌋ genomes | floor | stricter: a k-mer present in ⌈F·N⌉ genomes already exceeds the fraction |

The same rule applies symmetrically to --min-outgroup-frac (ceil) and --max-outgroup-frac (floor). The outgroup direction is not inverted: the conservative rounding depends only on whether the bound is a minimum or a maximum, not on which group it applies to.

Example: --min-frac 0.5 with an ingroup of 3 genomes: ⌈0.5 × 3⌉ = ⌈1.5⌉ = 2, so at least 2 of 3 ingroup genomes must carry the k-mer.

Implementation note: the filter evaluates n / denom < min_frac directly (integer n, float comparison) rather than pre-computing ⌈F·N⌉. This is mathematically equivalent for integer counts: n / N < F ⟺ n < F·N ⟺ n ≤ ⌈F·N⌉ − 1 ⟺ n < ⌈F·N⌉. No explicit rounding is needed.
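The rounding rule and the direct comparison can be checked against each other on the 3-genome example. The function names below are hypothetical, and the sketch does not address floating-point edge cases where F·N falls very close to an integer.

```rust
/// Smallest genome count that satisfies a minimum fraction (ceil).
fn min_count_for_frac(frac: f64, group_size: usize) -> usize {
    (frac * group_size as f64).ceil() as usize
}

/// Largest genome count that satisfies a maximum fraction (floor).
fn max_count_for_frac(frac: f64, group_size: usize) -> usize {
    (frac * group_size as f64).floor() as usize
}

/// Direct form of the minimum check: the k-mer is rejected when n / N < F.
fn fails_min_frac(n: usize, group_size: usize, min_frac: f64) -> bool {
    (n as f64) / (group_size as f64) < min_frac
}
```

For F = 0.5 and N = 3, the direct comparison rejects n = 0 and n = 1 and accepts n = 2 and n = 3, which matches the pre-computed threshold of 2.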

Examples

Keep k-mers specific to Betula nana — present in at least 2 B. nana genomes and absent from every other genome in the index:

obikmer filter src --output dst \
  --ingroup  "species=Betula_nana" \
  --outgroup "*" \
  --min-count 2 \
  --max-outgroup-count 0

Keep k-mers found in at least 2 Betula nana genomes and absent from all other Betula:

obikmer filter src --output dst \
  --ingroup  "species=Betula_nana" \
  --outgroup "genus=Betula" \
  --min-count 2 \
  --max-outgroup-count 0

Use taxonomic paths: keep k-mers present in ≥ 50 % of the Betula clade and in at most 10 % of everything outside Betulaceae:

obikmer filter src --output dst \
  --ingroup  "taxon~/Betulaceae/Betula" \
  --outgroup "taxon!~/Betulaceae" \
  --min-frac 0.5 \
  --max-outgroup-frac 0.1

Multiple outgroup predicates (OR): exclude k-mers present in Alnus or Carpinus:

obikmer filter src --output dst \
  --ingroup  "genus=Betula" \
  --outgroup "genus=Alnus" \
  --outgroup "genus=Carpinus" \
  --max-outgroup-count 0

To dump only k-mers specific to Betula nana:

obikmer dump myindex \
  --ingroup  "species=Betula_nana" \
  --outgroup "*" \
  --min-count 1 \
  --max-outgroup-count 0

To enumerate unitigs of the Betula-specific subgraph:

obikmer unitig myindex \
  --ingroup  "genus=Betula" \
  --outgroup "*" \
  --min-count 2 \
  --max-outgroup-count 0

Command-specific options

dump --head N

Stops output after the first N k-mers that pass all active filters. Iteration terminates immediately — subsequent partitions and layers are not scanned. Useful for quick inspection of large indexes without loading the entire dataset.

obikmer dump myindex --head 100
obikmer dump myindex --head 20 --ingroup "species=Betula_nana" --min-count 1
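The early-termination behaviour can be pictured with lazy iterators: filtering happens first, and iteration stops as soon as N k-mers have passed. This is a sketch under the assumption that k-mers are visited through an iterator; head_filtered is a hypothetical helper, and the real command iterates partitions and layers rather than a flat sequence.

```rust
/// Collects the first `head` items that pass `keep`, without scanning the rest.
fn head_filtered<I, F>(kmers: I, keep: F, head: usize) -> Vec<u64>
where
    I: Iterator<Item = u64>,
    F: Fn(&u64) -> bool,
{
    // `filter` is lazy and `take` stops pulling items once `head` have passed.
    kmers.filter(keep).take(head).collect()
}
```

Because nothing is pulled after the N-th match, the helper returns even on an unbounded input such as 0u64.. as long as enough items pass the filter.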

distance --presence-threshold N

When computing Jaccard distance on a count index, a k-mer is considered present in a genome if its count is ≥ N (default 1). This option is independent of the --presence-threshold used in filtering.

# Jaccard treating kmers with count ≥ 2 as present
obikmer distance myindex --metric jaccard --presence-threshold 2

This parameter has no effect on presence/absence indexes (where values are already 0/1) or on metrics other than Jaccard.
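As an illustration of how the threshold changes the result, here is a sketch of the standard Jaccard distance, 1 − |A ∩ B| / |A ∪ B|, over two per-genome count vectors. The function name, the data layout, and the value returned for an empty union are assumptions; the exact formula used by obikmer is not stated here.

```rust
/// Jaccard distance between two genomes given their per-k-mer counts.
/// A k-mer is present in a genome when its count is >= `presence_threshold`.
fn jaccard_distance(a: &[u32], b: &[u32], presence_threshold: u32) -> f64 {
    let mut intersection = 0usize;
    let mut union = 0usize;
    for (&x, &y) in a.iter().zip(b.iter()) {
        let in_a = x >= presence_threshold;
        let in_b = y >= presence_threshold;
        if in_a && in_b {
            intersection += 1;
        }
        if in_a || in_b {
            union += 1;
        }
    }
    if union == 0 {
        0.0 // assumption: two empty sets are treated as identical
    } else {
        1.0 - intersection as f64 / union as f64
    }
}
```

With counts [1, 2, 0, 3] and [2, 2, 1, 0], the distance is 0.5 at the default threshold 1 and rises to 2/3 at threshold 2, because low-count k-mers stop counting as shared.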

Implementation

  • obikpartitionner::filter::GroupQuorumFilter — implements KmerFilter using pre-computed ingroup and outgroup index vectors. The heavy logic (predicate parsing, three-value evaluation, genome classification) happens once before any iteration; each k-mer row evaluation is a simple index lookup and counter.

  • obikmer::cmd::predicate::FilterArgs — shared clap argument group embedded via #[command(flatten)] in FilterArgs, DumpArgs, and UnitigArgs. FilterArgs::build_filters() returns a ready-to-use filter list.

  • obikpartitionner::KmerPartition::iter_partition_kmers — accepts filters: &[Box<dyn KmerFilter>] and applies them per-kmer before invoking the callback. filter, dump, and unitig all go through this single entry point.
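The single-entry-point design can be sketched as follows. Only the filters: &[Box<dyn KmerFilter>] parameter comes from the description above; the accept method, the MinTotal and MaxTotal filters, the row representation, and the rule that a k-mer must pass every filter are assumptions made for the example.

```rust
/// Hypothetical shape of the filter trait: one decision per k-mer row.
trait KmerFilter {
    fn accept(&self, counts: &[u32]) -> bool;
}

/// Example filter: sum of per-genome counts must be at least the bound.
struct MinTotal(u32);
impl KmerFilter for MinTotal {
    fn accept(&self, counts: &[u32]) -> bool {
        counts.iter().sum::<u32>() >= self.0
    }
}

/// Example filter: sum of per-genome counts must be at most the bound.
struct MaxTotal(u32);
impl KmerFilter for MaxTotal {
    fn accept(&self, counts: &[u32]) -> bool {
        counts.iter().sum::<u32>() <= self.0
    }
}

/// Applies every filter to each row and invokes the callback on survivors.
fn iter_rows<F: FnMut(&[u32])>(
    rows: &[Vec<u32>],
    filters: &[Box<dyn KmerFilter>],
    mut callback: F,
) {
    for row in rows {
        if filters.iter().all(|f| f.accept(row.as_slice())) {
            callback(row.as_slice());
        }
    }
}
```

Keeping the filter list behind a trait object means each command only has to build its list once and pass it to the shared iteration routine.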