K-mer filtering and ingroup/outgroup predicates
The rebuild, dump, and unitig commands all share the same filtering
system. Filters can select k-mers based on per-genome quorum counts, optionally
restricted to ingroup and outgroup genome sets derived from genome
metadata.
rebuild additionally accepts --min-total-count / --max-total-count filters
that operate on the sum of counts across all genomes.
Predicate syntax
Each --ingroup and --outgroup flag takes a predicate of the form:
key OP value1|value2|…
| Operator | Meaning |
|---|---|
| `*` or `all` | wildcard: every genome matches unconditionally |
| `key=v1\|v2` | exact match: genome's key equals v1 or v2 |
| `key!=v` | negation: genome's key equals none of the values |
| `key~path` | path ancestry: genome's key is path or a descendant |
| `key!~path` | negated path ancestry: genome's key is neither path nor a descendant |
Multiple values separated by | are always OR-ed within the predicate.
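
As an illustration of the syntax above, here is a minimal parsing sketch. The names `Op`, `Predicate`, and `parse_predicate` are hypothetical and chosen for this example only; they are not taken from the obikmer sources, and the real parser may treat whitespace or malformed input differently.

```rust
/// Comparison operators from the table above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Op {
    Eq,      // key=v1|v2
    NotEq,   // key!=v
    Path,    // key~path
    NotPath, // key!~path
}

/// Hypothetical parsed predicate; not obikmer's actual type.
#[derive(Debug, PartialEq)]
enum Predicate {
    All, // "*" or "all"
    Test { key: String, op: Op, values: Vec<String> },
}

fn parse_predicate(input: &str) -> Result<Predicate, String> {
    let s = input.trim();
    if s == "*" || s == "all" {
        return Ok(Predicate::All);
    }
    // The key ends at the first operator character.
    let pos = s
        .find(|c: char| c == '!' || c == '=' || c == '~')
        .ok_or_else(|| format!("no operator in predicate: {}", s))?;
    let (key, rest) = s.split_at(pos);
    // `rest` starts at the operator; a '!' must be followed by '=' or '~'.
    let (op, raw_values) = if let Some(r) = rest.strip_prefix("!=") {
        (Op::NotEq, r)
    } else if let Some(r) = rest.strip_prefix("!~") {
        (Op::NotPath, r)
    } else if let Some(r) = rest.strip_prefix('=') {
        (Op::Eq, r)
    } else if let Some(r) = rest.strip_prefix('~') {
        (Op::Path, r)
    } else {
        return Err(format!("unknown operator in predicate: {}", s));
    };
    // Values separated by '|' are alternatives (OR-ed within the predicate).
    let values: Vec<String> = raw_values
        .split('|')
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
        .collect();
    if key.trim().is_empty() || values.is_empty() {
        return Err(format!("incomplete predicate: {}", s));
    }
    Ok(Predicate::Test { key: key.trim().to_string(), op, values })
}
```

Under these assumptions, `parse_predicate("genus=Alnus|Carpinus")` yields an `Op::Eq` test on `genus` with two alternative values, while a string with no operator is rejected.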
Path matching (~ and !~)
Metadata values can represent hierarchical taxonomic paths such as
/Eukaryota/Viridiplantae/Streptophyta/Betulaceae/Betula/nana.
- Absolute pattern (starts with `/`): the value must start with the pattern at a segment boundary. `taxon~/Betulaceae/Betula` matches `/Betulaceae/Betula/nana` and `/Betulaceae/Betula` but not `/Betulaceae/Betuloides/…`.
- Bare segment (no leading `/`): the value must contain the pattern as an exact path component anywhere. `taxon~Betula` matches any path that has `Betula` as one of its segments.
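
The two matching rules can be sketched as a single function. `path_matches` is a hypothetical helper written for this example; ignoring a trailing `/` in the pattern and rejecting an empty bare pattern are assumptions, not documented behaviour.

```rust
/// Returns true when `value` (a metadata path such as
/// "/Betulaceae/Betula/nana") matches `pattern` under the two rules above.
fn path_matches(value: &str, pattern: &str) -> bool {
    if pattern.starts_with('/') {
        // Absolute pattern: prefix match that must end on a segment boundary.
        // Assumption: a trailing '/' in the pattern is not significant.
        let pattern = pattern.trim_end_matches('/');
        match value.strip_prefix(pattern) {
            Some(rest) => rest.is_empty() || rest.starts_with('/'),
            None => false,
        }
    } else {
        // Bare segment: must equal one complete path component.
        // Assumption: an empty pattern never matches.
        !pattern.is_empty() && value.split('/').any(|segment| segment == pattern)
    }
}
```

The segment-boundary check is what keeps `/Betulaceae/Betula` from matching `/Betulaceae/Betuloides/…`: after stripping the prefix, the remainder must be empty or start with `/`.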
Missing metadata key → NA
If a genome does not carry the queried metadata key, the predicate returns NA. NA propagates through the group evaluation logic (see below), and genomes that cannot be classified are ignored in all quorum counts.
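
One natural way to model NA in Rust is `Option<bool>`, with `None` standing for NA. The sketch below assumes genome metadata is available as a `HashMap<String, String>`; that representation and the function names `eval_eq` and `eval_not_eq` are illustrative assumptions, not obikmer's API.

```rust
use std::collections::HashMap;

/// Exact-match predicate: Some(true/false) when the key exists,
/// None (NA) when the genome does not carry the key.
fn eval_eq(meta: &HashMap<String, String>, key: &str, values: &[&str]) -> Option<bool> {
    // `?` turns a missing key into None, i.e. NA.
    let actual = meta.get(key)?;
    Some(values.iter().any(|v| *v == actual.as_str()))
}

/// Negated predicate: NA stays NA, otherwise the result is inverted.
fn eval_not_eq(meta: &HashMap<String, String>, key: &str, values: &[&str]) -> Option<bool> {
    eval_eq(meta, key, values).map(|matched| !matched)
}
```

Note that negation does not turn NA into `true`: a genome without a `species` key is NA for both `species=X` and `species!=X`.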
Group semantics
Multiple predicates
| Flag | Combination rule |
|---|---|
| `--ingroup` (repeated) | AND: genome must satisfy all predicates |
| `--outgroup` (repeated) | OR: genome satisfies any predicate |
Three-value logic
Each predicate returns true, false, or NA (absent key).
- AND: `false` absorbs everything; `NA` propagates unless already `false`.
- OR: `true` absorbs everything; `NA` propagates unless already `true`.
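
These two rules can be written down directly, again modelling NA as `None`. The helper names `and3` and `or3` are hypothetical. The results for an empty slice (`true` for AND, `false` for OR) are the conventional identities and an assumption of this sketch; how unset flags are actually handled is described under Implicit groups.

```rust
/// Three-valued truth value: Some(true), Some(false), or None for NA.
type Tri = Option<bool>;

/// AND over predicate results: false absorbs, otherwise NA propagates.
fn and3(results: &[Tri]) -> Tri {
    let mut saw_na = false;
    for r in results {
        match r {
            Some(false) => return Some(false),
            None => saw_na = true,
            Some(true) => {}
        }
    }
    if saw_na { None } else { Some(true) }
}

/// OR over predicate results: true absorbs, otherwise NA propagates.
fn or3(results: &[Tri]) -> Tri {
    let mut saw_na = false;
    for r in results {
        match r {
            Some(true) => return Some(true),
            None => saw_na = true,
            Some(false) => {}
        }
    }
    if saw_na { None } else { Some(false) }
}
```

For example, `and3(&[Some(true), None])` is NA, whereas `and3(&[None, Some(false)])` is `false` because a single `false` decides the conjunction regardless of order.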
Classification and priority
For each genome:
- Evaluate `AND(ingroup predicates)` → `in_result`
- Evaluate `OR(outgroup predicates)` → `out_result`
- If `in_result = true` → Ingroup (ingroup wins over outgroup)
- Else if `out_result = true` → Outgroup
- Otherwise → Uncategorized (ignored in all quorum counts)
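
The priority rules above amount to a short decision function. In this sketch `Group`, `classify`, and `split_indices` are hypothetical names; the index vectors are only meant to illustrate the kind of pre-computed lookup a filter could use, not the actual data layout.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Group {
    Ingroup,
    Outgroup,
    Uncategorized,
}

/// Applies the priority rules: ingroup first, then outgroup.
/// NA (None) never counts as true, so it falls through.
fn classify(in_result: Option<bool>, out_result: Option<bool>) -> Group {
    if in_result == Some(true) {
        Group::Ingroup
    } else if out_result == Some(true) {
        Group::Outgroup
    } else {
        Group::Uncategorized
    }
}

/// Collects the genome indices of each group; uncategorized genomes
/// appear in neither vector and are therefore never counted.
fn split_indices(groups: &[Group]) -> (Vec<usize>, Vec<usize>) {
    let mut ingroup = Vec::new();
    let mut outgroup = Vec::new();
    for (i, g) in groups.iter().enumerate() {
        match g {
            Group::Ingroup => ingroup.push(i),
            Group::Outgroup => outgroup.push(i),
            Group::Uncategorized => {}
        }
    }
    (ingroup, outgroup)
}
```

This is why `--outgroup "*"` can be read as "every genome that is not in the ingroup": a genome matching both sets is classified as ingroup before the outgroup result is consulted.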
Implicit groups
| `--ingroup` | `--outgroup` | Effective behaviour |
|---|---|---|
| not set | not set | all genomes form the ingroup |
| set | not set | only ingroup quorum flags apply |
| not set | set | only outgroup quorum flags apply |
| set | set | both constraints apply simultaneously |
Quorum flags
| Flag | Applies to | Meaning |
|---|---|---|
| `--min-count N` | ingroup | k-mer present in at least N ingroup genomes |
| `--max-count N` | ingroup | k-mer present in at most N ingroup genomes |
| `--min-frac F` | ingroup | k-mer present in at least fraction F of ingroup genomes |
| `--max-frac F` | ingroup | k-mer present in at most fraction F of ingroup genomes |
| `--min-outgroup-count N` | outgroup | k-mer present in at least N outgroup genomes |
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
Defaults: mins = 0 (no lower bound), max counts = group size, max fracs = 1.0 (no upper bound). A filter with all defaults is a no-op.
Fractions are computed over the size of the classified group, not over total genome count. An empty group (no genome classified as ingroup/outgroup) never triggers a filter failure.
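
To make the counting rules concrete, the sketch below evaluates the quorum bounds for one group on a single k-mer row. The `Quorum` struct, its field names, and the `passes` function are hypothetical; representing the default "max count = group size" as `None` is a simplification made for this example.

```rust
/// Hypothetical bounds for one group (ingroup or outgroup).
struct Quorum {
    min_count: usize,
    max_count: Option<usize>, // None = no upper bound (group size)
    min_frac: f64,
    max_frac: f64,
}

impl Quorum {
    /// All-default bounds: a no-op filter.
    fn unbounded() -> Self {
        Quorum { min_count: 0, max_count: None, min_frac: 0.0, max_frac: 1.0 }
    }
}

/// `counts` holds one count per genome for the current k-mer;
/// `members` holds the genome indices classified into the group.
fn passes(counts: &[u32], members: &[usize], threshold: u32, q: &Quorum) -> bool {
    // An empty group never triggers a filter failure.
    if members.is_empty() {
        return true;
    }
    // A genome counts as "present" when its count exceeds the threshold.
    let present = members.iter().filter(|&&g| counts[g] > threshold).count();
    // Fractions are relative to the classified group size.
    let frac = present as f64 / members.len() as f64;
    present >= q.min_count
        && q.max_count.map_or(true, |max| present <= max)
        && frac >= q.min_frac
        && frac <= q.max_frac
}
```

With per-genome counts `[3, 0, 1, 0, 5]`, ingroup indices `[0, 2]`, and outgroup indices `[1, 3, 4]`, a `min_count` of 2 passes for the ingroup, while a `max_count` of 0 fails for the outgroup because the genome at index 4 carries the k-mer.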
Examples
Keep k-mers specific to Betula nana — present in at least 2 B. nana genomes and absent from every other genome in the index:
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
Keep k-mers found in at least 2 Betula nana genomes and absent from all other Betula:
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \
--min-count 2 \
--max-outgroup-count 0
Use taxonomic paths: keep k-mers present in at least 50 % of the Betula clade and in at most 10 % of the genomes outside Betulaceae:
obikmer rebuild src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \
--max-outgroup-frac 0.1
Multiple outgroup predicates (OR): exclude k-mers present in Alnus or Carpinus:
obikmer rebuild src --output dst \
--ingroup "genus=Betula" \
--outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \
--max-outgroup-count 0
The same flags work identically for dump and unitig. To dump only k-mers
specific to Betula nana:
obikmer dump myindex \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 1 \
--max-outgroup-count 0
To enumerate unitigs of the Betula-specific subgraph:
obikmer unitig myindex \
--ingroup "genus=Betula" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
Implementation
- `obikpartitionner::filter::GroupQuorumFilter`: implements `KmerFilter` using pre-computed ingroup and outgroup index vectors. The heavy logic (predicate parsing, three-value evaluation, genome classification) happens once before any iteration; each k-mer row evaluation is a simple index lookup and counter.
- `obikmer::cmd::predicate::FilterArgs`: shared `clap` argument group embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter list.
- `obikpartitionner::KmerPartition::iter_partition_kmers`: accepts `filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking the callback. `rebuild`, `dump`, and `unitig` all go through this single entry point.