# Kmer filtering and ingroup/outgroup predicates
The `rebuild`, `dump`, and `unitig` commands share the same filtering system,
implemented as a shared `FilterArgs` clap argument group embedded in each command
via `#[command(flatten)]`. Filters select k-mers based on per-genome quorum
counts, optionally restricted to **ingroup** and **outgroup** genome sets derived
from genome metadata. Unless noted otherwise, all rules described here apply identically to all three commands.
`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
that operate on the sum of counts across all genomes.
## Predicate syntax
Each `--ingroup` and `--outgroup` flag takes a predicate of the form:
```
key OP value1|value2|…
```
| Operator | Meaning |
|----------|---------|
| `*` or `all` | wildcard — every genome matches unconditionally |
| `key=v1\|v2` | exact match — genome's `key` equals `v1` or `v2` |
| `key!=v` | negation — genome's `key` equals none of the values |
| `key~path` | path ancestry — genome's `key` is `path` or a descendant |
| `key!~path` | not a descendant |
Multiple values separated by `|` are always OR-ed within the predicate.
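
The syntax above can be parsed in a few lines. The sketch below is only an illustration: the names `Op`, `Predicate`, and `parse_predicate`, and the error messages, are hypothetical and are not taken from the `obikmer` sources. It shows one way to detect the operator (two-character operators first) and to split the `|`-separated values.

```rust
/// Comparison operators from the table above (type names are illustrative).
#[derive(Debug, PartialEq)]
enum Op {
    Eq,       // key=v1|v2
    NotEq,    // key!=v
    Under,    // key~path
    NotUnder, // key!~path
}

/// A parsed predicate; `Wildcard` covers both `*` and `all`.
#[derive(Debug, PartialEq)]
enum Predicate {
    Wildcard,
    Cmp { key: String, op: Op, values: Vec<String> },
}

fn parse_predicate(input: &str) -> Result<Predicate, String> {
    let s = input.trim();
    if s == "*" || s == "all" {
        return Ok(Predicate::Wildcard);
    }
    // The first operator character splits the key from the values.
    let pos = s
        .find(|c: char| c == '=' || c == '~' || c == '!')
        .ok_or_else(|| format!("no operator in predicate '{}'", s))?;
    let rest = &s[pos..];
    // Two-character operators are tested first so `!=` is not read as `!` then `=`.
    let (op, op_len) = if rest.starts_with("!=") {
        (Op::NotEq, 2)
    } else if rest.starts_with("!~") {
        (Op::NotUnder, 2)
    } else if rest.starts_with('=') {
        (Op::Eq, 1)
    } else if rest.starts_with('~') {
        (Op::Under, 1)
    } else {
        return Err(format!("invalid operator in predicate '{}'", s));
    };
    let key = s[..pos].trim();
    if key.is_empty() {
        return Err(format!("empty key in predicate '{}'", s));
    }
    // `|` separates alternative values, OR-ed within the predicate.
    let values: Vec<String> = rest[op_len..]
        .split('|')
        .map(|v| v.trim().to_string())
        .collect();
    if values.iter().any(|v| v.is_empty()) {
        return Err(format!("empty value in predicate '{}'", s));
    }
    Ok(Predicate::Cmp { key: key.to_string(), op, values })
}
```

With this sketch, `parse_predicate("genus=Alnus|Carpinus")` yields a `Cmp` with operator `Eq` and two values, while a string without any operator is rejected.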
### Path matching (`~` and `!~`)
Metadata values can represent hierarchical taxonomic paths such as
`/Eukaryota/Viridiplantae/Streptophyta/Betulaceae/Betula/nana`.
- **Absolute pattern** (starts with `/`): the value must start with the pattern
  at a segment boundary.
  `taxon~/Betulaceae/Betula` matches `/Betulaceae/Betula/nana` and
  `/Betulaceae/Betula` but not `/Betulaceae/Betuloides/…`.
- **Bare segment** (no leading `/`): the value must contain the pattern as an
  exact path component anywhere.
  `taxon~Betula` matches any path that has `Betula` as one of its segments.
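
The two matching modes can be sketched as a single function. This is a minimal illustration of the rules as stated above, not the actual implementation; the name `path_matches` is hypothetical, and the handling of a trailing `/` in the pattern and of an empty pattern are assumptions.

```rust
/// Returns true when `value` is `pattern` itself or one of its descendants.
fn path_matches(value: &str, pattern: &str) -> bool {
    if pattern.starts_with('/') {
        // Absolute pattern: prefix match that must stop at a segment boundary.
        // Assumption: a trailing '/' in the pattern is ignored.
        let prefix = pattern.trim_end_matches('/');
        match value.strip_prefix(prefix) {
            Some(rest) => rest.is_empty() || rest.starts_with('/'),
            None => false,
        }
    } else {
        // Bare segment: exact match against any single path component.
        // Assumption: an empty pattern never matches.
        !pattern.is_empty() && value.split('/').any(|segment| segment == pattern)
    }
}
```

Under the rule as written, an absolute pattern is anchored at the start of the value, so `/Betulaceae/Betula` matches `/Betulaceae/Betula/nana` but would not match a longer path that only contains `/Betulaceae/Betula` further down; the bare segment `Betula` matches in both cases.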
### Missing metadata key → NA
If a genome does not carry the queried metadata key, the predicate returns **NA**.
NA propagates through the group evaluation logic (see below), and genomes that
cannot be classified are **ignored** in all quorum counts.
## Group semantics
### Multiple predicates
| Flag | Combination rule |
|------|-----------------|
| `--ingroup` (repeated) | **AND**: genome must satisfy all predicates |
| `--outgroup` (repeated) | **OR**: genome satisfies any predicate |
### Three-value logic
Each predicate returns `true`, `false`, or `NA` (absent key).
- AND: `false` absorbs everything; `NA` propagates unless already `false` .
- OR: `true` absorbs everything; `NA` propagates unless already `true` .
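
The two combination rules can be sketched as folds over a three-valued type. The names `Tri`, `and_all`, and `or_any` are hypothetical, and the result for an empty predicate list (`True` for AND, `False` for OR) is an assumption of this sketch, not something stated above.

```rust
/// Three-valued predicate result; `Na` means the metadata key is absent.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tri {
    True,
    False,
    Na,
}

/// AND: `False` absorbs everything; `Na` propagates unless a `False` is seen.
fn and_all(results: &[Tri]) -> Tri {
    let mut acc = Tri::True; // assumption: an empty list evaluates to True
    for &r in results {
        match r {
            Tri::False => return Tri::False,
            Tri::Na => acc = Tri::Na,
            Tri::True => {}
        }
    }
    acc
}

/// OR: `True` absorbs everything; `Na` propagates unless a `True` is seen.
fn or_any(results: &[Tri]) -> Tri {
    let mut acc = Tri::False; // assumption: an empty list evaluates to False
    for &r in results {
        match r {
            Tri::True => return Tri::True,
            Tri::Na => acc = Tri::Na,
            Tri::False => {}
        }
    }
    acc
}
```

For instance, `and_all(&[Tri::True, Tri::Na])` is `Na`, whereas `and_all(&[Tri::Na, Tri::False])` is `False` regardless of order.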
### Classification and priority
For each genome:
1. Evaluate `AND(ingroup predicates)` → `in_result`
2. Evaluate `OR(outgroup predicates)` → `out_result`
3. If `in_result = true` → **Ingroup** (ingroup wins over outgroup)
4. Else if `out_result = true` → **Outgroup**
5. Otherwise → **Uncategorized** (ignored in all quorum counts)
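
The priority rule above can be sketched as follows. The types `Tri` and `Group` and the functions `classify` and `split_groups` are hypothetical names used for illustration; the sketch assumes the two group results have already been computed for each genome.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tri {
    True,
    False,
    Na,
}

#[derive(Debug, PartialEq)]
enum Group {
    Ingroup,
    Outgroup,
    Uncategorized,
}

/// Steps 3 to 5: the ingroup test takes priority over the outgroup test.
fn classify(in_result: Tri, out_result: Tri) -> Group {
    if in_result == Tri::True {
        Group::Ingroup
    } else if out_result == Tri::True {
        Group::Outgroup
    } else {
        Group::Uncategorized // covers both `False` and `Na`
    }
}

/// Splits genome indices into ingroup and outgroup index vectors;
/// uncategorized genomes appear in neither.
fn split_groups(results: &[(Tri, Tri)]) -> (Vec<usize>, Vec<usize>) {
    let mut ingroup = Vec::new();
    let mut outgroup = Vec::new();
    for (i, &(in_r, out_r)) in results.iter().enumerate() {
        match classify(in_r, out_r) {
            Group::Ingroup => ingroup.push(i),
            Group::Outgroup => outgroup.push(i),
            Group::Uncategorized => {}
        }
    }
    (ingroup, outgroup)
}
```

Because the ingroup wins, a wildcard outgroup (`--outgroup "*"`) effectively means "every genome that is not in the ingroup".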
### Implicit groups
| `--ingroup` | `--outgroup` | Effective behaviour |
|-------------|--------------|---------------------|
| not set | not set | all genomes form the ingroup |
| set | not set | only ingroup quorum flags apply |
| not set | set | only outgroup quorum flags apply |
| set | set | both constraints apply simultaneously |
## Quorum flags
| Flag | Applies to | Meaning |
|------|-----------|---------|
| `--min-count N` | ingroup | k-mer present in at least N ingroup genomes |
| `--max-count N` | ingroup | k-mer present in at most N ingroup genomes |
| `--min-frac F` | ingroup | k-mer present in at least fraction F of ingroup genomes |
| `--max-frac F` | ingroup | k-mer present in at most fraction F of ingroup genomes |
| `--min-outgroup-count N` | outgroup | k-mer present in at least N outgroup genomes |
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |

**Conditional defaults**: the defaults for `--min-frac` and `--max-outgroup-count` depend on two conditions:
whether the corresponding group was declared, **and** whether any quorum flag for that group was explicitly set.
> **Rule**: declaring a group activates the smart default **only if no quorum flag for that group is explicitly set**.
> As soon as any quorum flag for a group is present on the command line, all defaults for that group revert to no-op values.

| `--ingroup` | Any ingroup quorum flag? | `--min-frac` default |
|-------------|--------------------------|----------------------|
| not set | n/a | 0.0 (no-op) |
| set | no | **1.0**: all ingroup genomes must carry the k-mer |
| set | yes | 0.0: user controls quorum explicitly |

| `--outgroup` | Any outgroup quorum flag? | `--max-outgroup-count` default |
|--------------|---------------------------|-------------------------------|
| not set | n/a | outgroup size (no-op) |
| set | no | **0**: no outgroup genome may carry the k-mer |
| set | yes | outgroup size: user controls quorum explicitly |

"Any ingroup quorum flag" means any of: `--min-count`, `--max-count`, `--min-frac`, `--max-frac`.
"Any outgroup quorum flag" means any of: `--min-outgroup-count`, `--max-outgroup-count`, `--min-outgroup-frac`, `--max-outgroup-frac`.

**Why this rule?** Setting any quorum flag signals explicit intent: the defaults are there to help when the user omits quorum entirely, not to interfere with deliberate constraints. Mixing implicit and explicit quorum on the same group would risk silent incoherence (e.g. `--max-count 0` with an implicit `--min-frac 1.0`).

All other bounds default to their no-op values regardless of whether groups are declared: `0` for minimum counts, the group size for maximum counts, `0.0` for minimum fractions, and `1.0` for maximum fractions.
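
The default resolution in the two tables above can be sketched as follows. The struct `QuorumFlags`, its field names, and the two functions are hypothetical and do not reflect the real `FilterArgs` layout; `None` stands for a flag that was not given on the command line.

```rust
/// Quorum flags of one group as given on the command line (`None` = not set).
/// Field names are illustrative, not the real `FilterArgs` fields.
#[derive(Default)]
struct QuorumFlags {
    min_count: Option<usize>,
    max_count: Option<usize>,
    min_frac: Option<f64>,
    max_frac: Option<f64>,
}

impl QuorumFlags {
    /// True as soon as any quorum flag of the group is explicit.
    fn any_set(&self) -> bool {
        self.min_count.is_some()
            || self.max_count.is_some()
            || self.min_frac.is_some()
            || self.max_frac.is_some()
    }
}

/// Effective `--min-frac` for the ingroup.
fn effective_min_frac(ingroup_declared: bool, flags: &QuorumFlags) -> f64 {
    match flags.min_frac {
        Some(f) => f,
        // Smart default: group declared and no explicit ingroup quorum flag.
        None if ingroup_declared && !flags.any_set() => 1.0,
        None => 0.0, // no-op
    }
}

/// Effective `--max-outgroup-count` for the outgroup.
fn effective_max_outgroup_count(
    outgroup_declared: bool,
    flags: &QuorumFlags,
    outgroup_size: usize,
) -> usize {
    match flags.max_count {
        Some(n) => n,
        // Smart default: group declared and no explicit outgroup quorum flag.
        None if outgroup_declared && !flags.any_set() => 0,
        None => outgroup_size, // no-op
    }
}
```

For example, declaring `--ingroup` alone gives an effective `--min-frac` of 1.0, while adding `--max-count 3` brings it back to 0.0.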
### Validation
After resolving defaults, the following are checked and cause an immediate error:
| Condition | Error |
|-----------|-------|
| `--min-count > --max-count` | incoherent bounds |
| `--min-frac > --max-frac` | incoherent bounds |
| `--min-outgroup-count > --max-outgroup-count` | incoherent bounds |
| `--min-outgroup-frac > --max-outgroup-frac` | incoherent bounds |
| any fraction outside `[0.0, 1.0]` | invalid value |

The check applies to the **effective** values (after defaults are resolved). An explicit `--max-frac 0.5` combined with an implicit `--min-frac 1.0` would therefore be caught, but the rule above prevents that situation from arising in the first place: setting `--max-frac` resets the `--min-frac` default to 0.0.
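
A minimal sketch of these checks for one group is shown below. The `Bounds` struct, the `validate` function, and the wording of the error messages are hypothetical; only the conditions themselves come from the table above.

```rust
/// Effective bounds of one group after defaults are resolved (illustrative type).
struct Bounds {
    min_count: usize,
    max_count: usize,
    min_frac: f64,
    max_frac: f64,
}

fn validate(b: &Bounds) -> Result<(), String> {
    // Fractions must lie in [0.0, 1.0]; NaN fails this test as well.
    for f in [b.min_frac, b.max_frac] {
        if !(0.0..=1.0).contains(&f) {
            return Err(format!("invalid value: fraction {} is outside [0.0, 1.0]", f));
        }
    }
    if b.min_count > b.max_count {
        return Err(format!(
            "incoherent bounds: min count {} > max count {}",
            b.min_count, b.max_count
        ));
    }
    if b.min_frac > b.max_frac {
        return Err(format!(
            "incoherent bounds: min fraction {} > max fraction {}",
            b.min_frac, b.max_frac
        ));
    }
    Ok(())
}
```

The same function would be called once with the ingroup bounds and once with the outgroup bounds.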
Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.
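
As an illustration of how a fraction is evaluated over the classified group, the sketch below counts the group members in which a k-mer is present and applies the fraction bounds. The function names and the `u32` count type are assumptions; the strict `>` presence test follows the `--presence-threshold` row of the quorum table.

```rust
/// Number of genomes in `group` (indices into `row`) where the k-mer is
/// present, i.e. where its count is strictly greater than `presence_threshold`.
fn present_in(row: &[u32], group: &[usize], presence_threshold: u32) -> usize {
    group.iter().filter(|&&g| row[g] > presence_threshold).count()
}

/// Fraction test over the classified group size, not the total genome count.
/// An empty group always passes.
fn passes_frac(n: usize, group_size: usize, min_frac: f64, max_frac: f64) -> bool {
    if group_size == 0 {
        return true;
    }
    let frac = n as f64 / group_size as f64;
    frac >= min_frac && frac <= max_frac
}
```

With per-genome counts `[3, 0, 1, 0, 2]` and an ingroup made of genomes 0, 2, and 4, the k-mer is present in 3 of 3 ingroup genomes at the default threshold, and in 2 of 3 when `--presence-threshold 1` is used.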
### Conservative rounding of fraction thresholds
When a fraction threshold `F` is applied to a group of size `N`, the effective
integer threshold is determined by the direction of the bound:
| Bound | Effective count | Rounding | Rationale |
|-------|----------------|----------|-----------|
| `--min-frac F` | k-mer in ≥ ⌈F·N⌉ genomes | **ceil** | stricter: when F·N is not an integer, a k-mer present in exactly ⌊F·N⌋ genomes does not meet the fraction |
| `--max-frac F` | k-mer in ≤ ⌊F·N⌋ genomes | **floor** | stricter: when F·N is not an integer, a k-mer present in ⌈F·N⌉ genomes already exceeds the fraction |
The same rule applies symmetrically to `--min-outgroup-frac` (ceil) and
`--max-outgroup-frac` (floor). The outgroup direction is not inverted: the
conservative rounding depends only on whether the bound is a minimum or a
maximum, not on which group it applies to.

**Example**: with `--min-frac 0.5` and an ingroup of 3 genomes,
`⌈0.5 × 3⌉ = ⌈1.5⌉ = 2`, so at least 2 of 3 ingroup genomes must carry the k-mer.

**Implementation note**: the filter evaluates `n / denom < min_frac` directly
(integer `n`, float comparison) rather than pre-computing `⌈F·N⌉`. This is
mathematically equivalent for integer counts: `n / N < F` ↔ `n < F·N` ↔
`n ≤ ⌈F·N⌉ − 1` ↔ `n < ⌈F·N⌉`. No explicit rounding is needed.
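
The equivalence can be checked on small cases with the sketch below, which compares the direct test with an explicit `⌈F·N⌉` threshold. Both function names are hypothetical. Note that the explicit form multiplies `F` by `N` in floating point, so the two forms are only guaranteed to agree when that product is computed exactly, as with the values used here.

```rust
/// Direct form from the note above: the k-mer is rejected when n / N < F.
fn fails_min_frac(n: usize, group_size: usize, min_frac: f64) -> bool {
    (n as f64) / (group_size as f64) < min_frac
}

/// Explicit form: the k-mer is rejected when n < ceil(F * N).
fn fails_min_count(n: usize, group_size: usize, min_frac: f64) -> bool {
    let threshold = (min_frac * group_size as f64).ceil() as usize;
    n < threshold
}
```

With `F = 0.5` and `N = 3`, both functions reject `n = 0` and `n = 1` and accept `n = 2` and `n = 3`, which matches the example above.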
## Examples
Keep k-mers specific to *Betula nana*, i.e. present in at least 2 *B. nana* genomes
and absent from every other genome in the index:
``` sh
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
```
Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
other *Betula*:
``` sh
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \
--min-count 2 \
--max-outgroup-count 0
```
Use taxonomic paths to keep k-mers present in ≥ 50 % of the *Betula* clade
and in at most 10 % of everything outside *Betulaceae*:
``` sh
obikmer rebuild src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \
--max-outgroup-frac 0.1
```
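Multiple outgroup predicates (OR): keep k-mers carried by every *Betula* genome (implicit `--min-frac 1.0`, since no ingroup quorum flag is set) and absent from every *Alnus* and *Carpinus* genome: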
``` sh
obikmer rebuild src --output dst \
--ingroup "genus=Betula" \
--outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \
--max-outgroup-count 0
```
To dump only k-mers specific to *Betula nana*:
``` sh
obikmer dump myindex \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 1 \
--max-outgroup-count 0
```
To enumerate unitigs of the *Betula*-specific subgraph:
``` sh
obikmer unitig myindex \
--ingroup "genus=Betula" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
```
## Command-specific options
### `dump --head N`
Stops output after the first N k-mers that pass all active filters.
Iteration terminates immediately — subsequent partitions and layers are not scanned.
Useful for quick inspection of large indexes without loading the entire dataset.
``` sh
obikmer dump myindex --head 100
obikmer dump myindex --head 20 --ingroup "species=Betula_nana" --min-count 1
```
### `distance --presence-threshold N`
When computing Jaccard distance on a **count index**, a k-mer is considered present in a genome if its count is ≥ N (default 1).
This option is independent of the `--presence-threshold` used in `rebuild`/`query` filtering, which applies a strict `> N` test with a default of 0 (see Quorum flags).
``` sh
# Jaccard treating k-mers with count ≥ 2 as present
obikmer distance myindex --metric jaccard --presence-threshold 2
```
This parameter has no effect on presence/absence indexes (where values are already 0/1) or on metrics other than Jaccard.
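
The effect of the threshold on a Jaccard distance can be sketched as follows. The function name, the `u32` count type, and the choice to return `0.0` when neither genome carries any k-mer are assumptions of this sketch; it is not the `obikmer distance` implementation.

```rust
/// Jaccard distance between two genomes given their per-k-mer counts.
/// A k-mer is present in a genome when its count is >= `presence_threshold`.
fn jaccard_distance(a: &[u32], b: &[u32], presence_threshold: u32) -> f64 {
    let mut shared = 0usize; // k-mers present in both genomes
    let mut either = 0usize; // k-mers present in at least one genome
    for (&ca, &cb) in a.iter().zip(b.iter()) {
        let pa = ca >= presence_threshold;
        let pb = cb >= presence_threshold;
        if pa && pb {
            shared += 1;
        }
        if pa || pb {
            either += 1;
        }
    }
    if either == 0 {
        0.0 // assumption: two empty sets are treated as identical
    } else {
        1.0 - shared as f64 / either as f64
    }
}
```

With counts `[3, 1, 0, 2]` and `[1, 2, 0, 0]`, the genomes share 2 of 3 k-mers at threshold 1, but none at threshold 2, so raising the threshold increases the distance.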
## Implementation
- **`obikpartitionner::filter::GroupQuorumFilter`**: implements `KmerFilter`
  using pre-computed ingroup and outgroup index vectors. The heavy logic
  (predicate parsing, three-value evaluation, genome classification) happens
  once before any iteration; each k-mer row evaluation is a simple index
  lookup and counter.
- **`obikmer::cmd::predicate::FilterArgs`**: shared `clap` argument group
  embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
  `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
  list.
- **`obikpartitionner::KmerPartition::iter_partition_kmers`**: accepts
  `filters: &[Box<dyn KmerFilter>]` and applies them per k-mer before invoking
  the callback. `rebuild`, `dump`, and `unitig` all go through this single
  entry point.