feat: add metadata-driven k-mer filtering for rebuild command
Introduces a metadata-driven filtering system for the rebuild command, classifying genomes into ingroup and outgroup categories using exact, inequality, and hierarchical path predicates. Implements a GroupQuorumFilter to enforce configurable presence thresholds and fraction constraints per group. Refactors the command to replace global quorum filters with this unified approach, converts the presence flag to a threshold parameter, and adds corresponding documentation and MkDocs navigation.
@@ -0,0 +1,154 @@
# Rebuild filters and ingroup/outgroup predicates

The `rebuild` command compacts an existing index into a new single-layer index,
optionally keeping only k-mers that satisfy a set of filters.
Filters can operate on raw quorum counts over all genomes, or on pre-defined
**ingroup** and **outgroup** genome sets derived from genome metadata.

## Predicate syntax

Each `--ingroup` and `--outgroup` flag takes a predicate of the form:

```
key OP value1|value2|…
```

| Operator | Meaning |
|----------|---------|
| `*` or `all` | wildcard: every genome matches unconditionally |
| `key=v1\|v2` | exact match: genome's `key` equals `v1` or `v2` |
| `key!=v` | negation: genome's `key` equals none of the values |
| `key~path` | path ancestry: genome's `key` is `path` or a descendant |
| `key!~path` | negated path ancestry: genome's `key` is neither `path` nor a descendant |

Multiple values separated by `|` are always OR-ed within the predicate.

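The syntax above can be sketched as a small parser. This is a minimal illustration, not the actual `MetaPred` parser: the `Op` and `Predicate` types and the `parse_predicate` function are hypothetical names, and tolerating whitespace around the operator is an assumption.

```rust
/// Operators from the table above.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Eq,      // key=v1|v2
    Ne,      // key!=v
    Path,    // key~path
    NotPath, // key!~path
}

/// A parsed predicate (hypothetical type, not the real `MetaPred`).
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    Wildcard,
    Compare { key: String, op: Op, values: Vec<String> },
}

fn parse_predicate(text: &str) -> Result<Predicate, String> {
    let text = text.trim();
    if text == "*" || text == "all" {
        return Ok(Predicate::Wildcard);
    }
    // Locate the first operator character; two-character operators start with '!'.
    let pos = text
        .find(|c: char| c == '=' || c == '~' || c == '!')
        .ok_or_else(|| format!("no operator in predicate: {}", text))?;
    let rest = &text[pos..];
    let (op, op_len) = if rest.starts_with("!=") {
        (Op::Ne, 2)
    } else if rest.starts_with("!~") {
        (Op::NotPath, 2)
    } else if rest.starts_with('=') {
        (Op::Eq, 1)
    } else if rest.starts_with('~') {
        (Op::Path, 1)
    } else {
        return Err(format!("unknown operator in predicate: {}", text));
    };
    let key = text[..pos].trim();
    if key.is_empty() {
        return Err(format!("missing key in predicate: {}", text));
    }
    // Values separated by '|' are alternatives (OR-ed at evaluation time).
    let values: Vec<String> = rest[op_len..]
        .split('|')
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
        .collect();
    if values.is_empty() {
        return Err(format!("missing value in predicate: {}", text));
    }
    Ok(Predicate::Compare { key: key.to_string(), op, values })
}
```

With this sketch, `parse_predicate("genus=Alnus|Carpinus")` yields a `Compare` predicate with key `genus`, operator `Eq`, and two alternative values, while a string without any operator is rejected.
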
### Path matching (`~` and `!~`)

Metadata values can represent hierarchical taxonomic paths such as
`/Eukaryota/Viridiplantae/Streptophyta/Betulaceae/Betula/nana`.

- **Absolute pattern** (starts with `/`): the value must start with the pattern
  at a segment boundary.
  `taxon~/Betulaceae/Betula` matches `/Betulaceae/Betula/nana` and
  `/Betulaceae/Betula` but not `/Betulaceae/Betuloides/…`.
- **Bare segment** (no leading `/`): the value must contain the pattern as an
  exact path component anywhere.
  `taxon~Betula` matches any path that has `Betula` as one of its segments.

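The two matching modes can be sketched as follows. The function name `path_matches_sketch` and the helper `segments` are hypothetical; this is not the real `path_matches`, and it assumes that empty segments (for example from a trailing `/`) are ignored and that a bare pattern is a single segment.

```rust
/// Splits a `/`-separated path into its non-empty segments.
fn segments(path: &str) -> Vec<&str> {
    path.split('/').filter(|s| !s.is_empty()).collect()
}

/// Returns true when `value` matches `pattern` under the rules above.
fn path_matches_sketch(value: &str, pattern: &str) -> bool {
    let value_segments = segments(value);
    if pattern.starts_with('/') {
        // Absolute pattern: compare whole segments from the start of the value,
        // so `/Betulaceae/Betula` cannot match `/Betulaceae/Betuloides`.
        let pattern_segments = segments(pattern);
        !pattern_segments.is_empty() && value_segments.starts_with(&pattern_segments)
    } else {
        // Bare segment: the pattern must equal one full path component.
        value_segments.iter().any(|s| *s == pattern)
    }
}
```

Comparing segment lists rather than raw strings is what enforces the segment boundary: a plain string prefix test would wrongly accept `/Betulaceae/Betuloides` for the pattern `/Betulaceae/Betula`.
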
### Missing metadata key → NA

If a genome does not carry the queried metadata key, the predicate returns **NA**.
NA propagates through the group evaluation logic (see below), and genomes that
cannot be classified are **ignored** in all quorum counts.

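One way to picture the NA result is an `Option<bool>` where `None` stands for NA. The sketch below assumes genome metadata is a string-to-string map and uses hypothetical function names; the real data structures may differ.

```rust
use std::collections::HashMap;

/// Exact-match evaluation (`key=v1|v2`); `None` stands for NA.
fn eval_exact(metadata: &HashMap<String, String>, key: &str, values: &[&str]) -> Option<bool> {
    // A missing key short-circuits to NA instead of `false`.
    let actual = metadata.get(key)?;
    Some(values.iter().any(|v| *v == actual.as_str()))
}

/// Negated match (`key!=v`); NA stays NA, it is not flipped to `true`.
fn eval_not_equal(metadata: &HashMap<String, String>, key: &str, values: &[&str]) -> Option<bool> {
    eval_exact(metadata, key, values).map(|matched| !matched)
}
```

The point of keeping NA distinct from `false` shows up with negation: a genome without a `species` key does not satisfy `species!=Betula_nana`, it simply cannot be evaluated.
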
## Group semantics

### Multiple predicates

| Flag | Combination rule |
|------|-----------------|
| `--ingroup` (repeated) | **AND**: genome must satisfy all predicates |
| `--outgroup` (repeated) | **OR**: genome satisfies any predicate |

### Three-value logic

Each predicate returns `true`, `false`, or `NA` (absent key).

- AND: any `false` makes the result `false`; otherwise any `NA` makes it `NA`;
  otherwise the result is `true`.
- OR: any `true` makes the result `true`; otherwise any `NA` makes it `NA`;
  otherwise the result is `false`.

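These two rules can be sketched with `Option<bool>`, where `None` stands for NA. The function names are hypothetical, and the results for an empty predicate list (`true` for AND, `false` for OR) are a convention of this sketch only.

```rust
/// Three-valued AND: `false` absorbs everything, NA wins over `true`.
fn and_all(results: &[Option<bool>]) -> Option<bool> {
    let mut acc = Some(true);
    for r in results {
        match r {
            Some(false) => return Some(false), // absorbing element
            None => acc = None,                // remember NA, keep looking for a `false`
            Some(true) => {}
        }
    }
    acc
}

/// Three-valued OR: `true` absorbs everything, NA wins over `false`.
fn or_any(results: &[Option<bool>]) -> Option<bool> {
    let mut acc = Some(false);
    for r in results {
        match r {
            Some(true) => return Some(true), // absorbing element
            None => acc = None,              // remember NA, keep looking for a `true`
            Some(false) => {}
        }
    }
    acc
}
```

For instance, `and_all(&[Some(true), None, Some(false)])` is `Some(false)` because the `false` absorbs the earlier NA, whereas `and_all(&[Some(true), None])` stays `None`.
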
### Classification and priority

For each genome:

1. Evaluate `AND(ingroup predicates)` → `in_result`
2. Evaluate `OR(outgroup predicates)` → `out_result`
3. If `in_result = true` → **Ingroup** (ingroup wins over outgroup)
4. Else if `out_result = true` → **Outgroup**
5. Otherwise → **Uncategorized** (ignored in all quorum counts)

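The priority rule in steps 3 to 5 can be sketched as a small function over the two three-valued results (`None` standing for NA). The `Group` enum, `classify`, and `split_indices` are hypothetical names used for illustration only.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Group {
    Ingroup,
    Outgroup,
    Uncategorized,
}

/// Applies steps 3 to 5: ingroup first, then outgroup, else uncategorized.
fn classify(in_result: Option<bool>, out_result: Option<bool>) -> Group {
    if in_result == Some(true) {
        Group::Ingroup
    } else if out_result == Some(true) {
        Group::Outgroup
    } else {
        // `false` and NA on both sides end up here and are ignored by the quorum counts.
        Group::Uncategorized
    }
}

/// Turns per-genome results into ingroup and outgroup index vectors.
fn split_indices(results: &[(Option<bool>, Option<bool>)]) -> (Vec<usize>, Vec<usize>) {
    let mut ingroup = Vec::new();
    let mut outgroup = Vec::new();
    for (i, (in_r, out_r)) in results.iter().enumerate() {
        match classify(*in_r, *out_r) {
            Group::Ingroup => ingroup.push(i),
            Group::Outgroup => outgroup.push(i),
            Group::Uncategorized => {}
        }
    }
    (ingroup, outgroup)
}
```

A genome matching both sides, such as a *Betula nana* genome under `--ingroup "species=Betula_nana"` and `--outgroup "*"`, lands in the ingroup only, so it is never counted against the outgroup constraints.
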
### Implicit groups

| `--ingroup` | `--outgroup` | Effective behaviour |
|-------------|--------------|---------------------|
| not set | not set | all genomes form the ingroup |
| set | not set | only `--min-count`/`--min-frac` apply to matched genomes |
| not set | set | only `--max-count`/`--max-frac` apply to matched genomes |
| set | set | both constraints apply simultaneously |

## Quorum flags

| Flag | Applies to | Meaning |
|------|-----------|---------|
| `--min-count N` | ingroup | k-mer present in at least N ingroup genomes |
| `--max-count N` | ingroup | k-mer present in at most N ingroup genomes |
| `--min-frac F` | ingroup | k-mer present in at least fraction F of ingroup genomes |
| `--max-frac F` | ingroup | k-mer present in at most fraction F of ingroup genomes |
| `--min-outgroup-count N` | outgroup | k-mer present in at least N outgroup genomes |
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N |
| `--presence-threshold N` | all genomes | per-genome count > N to be considered "present" (default 0) |

Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.

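The presence and fraction rules can be sketched for a single group as below. This is an illustration under assumptions, not the actual `GroupQuorumFilter`: the `GroupBounds` struct, the `group_passes` function, and the `u32` count type are hypothetical.

```rust
/// Bounds for one group; `None` means the corresponding flag was not given.
#[derive(Default)]
struct GroupBounds {
    min_count: Option<usize>,
    max_count: Option<usize>,
    min_frac: Option<f64>,
    max_frac: Option<f64>,
}

/// Checks one k-mer row against the bounds of one group.
/// `counts[g]` is the k-mer count in genome `g`; `members` holds the group's genome indices.
fn group_passes(
    counts: &[u32],
    members: &[usize],
    presence_threshold: u32,
    bounds: &GroupBounds,
) -> bool {
    if members.is_empty() {
        return true; // an empty group never triggers a filter failure
    }
    // A genome counts as "present" only when its count is strictly above the threshold.
    let present = members
        .iter()
        .filter(|&&g| counts[g] > presence_threshold)
        .count();
    // The fraction is relative to the group size, not to the total genome count.
    let frac = present as f64 / members.len() as f64;
    bounds.min_count.map_or(true, |n| present >= n)
        && bounds.max_count.map_or(true, |n| present <= n)
        && bounds.min_frac.map_or(true, |f| frac >= f)
        && bounds.max_frac.map_or(true, |f| frac <= f)
}
```

With counts `[3, 0, 1, 5]` and a group made of genomes 0 and 2, a minimum count of 2 passes at presence threshold 0 but fails at threshold 1, because the count of 1 in genome 2 is no longer strictly above the threshold.
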
## Examples

Keep k-mers specific to *Betula nana*, that is, present in at least 2 *B. nana*
genomes and absent from every other genome in the index:

```sh
obikmer rebuild src --output dst \
  --ingroup "species=Betula_nana" \
  --outgroup "*" \
  --min-count 2 \
  --max-count 0
```

Keep k-mers found in at least 2 *Betula nana* genomes and absent from all other
*Betula* genomes (ingroup classification takes priority, so *B. nana* genomes are
not counted in the outgroup):

```sh
obikmer rebuild src --output dst \
  --ingroup "species=Betula_nana" \
  --outgroup "genus=Betula" \
  --min-count 2 \
  --max-count 0
```

Use taxonomic paths: keep k-mers present in ≥ 50 % of the *Betula* clade
and in at most 10 % of everything outside *Betulaceae*:

```sh
obikmer rebuild src --output dst \
  --ingroup "taxon~/Betulaceae/Betula" \
  --outgroup "taxon!~/Betulaceae" \
  --min-frac 0.5 \
  --max-frac 0.1
```

Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:

```sh
obikmer rebuild src --output dst \
  --ingroup "genus=Betula" \
  --outgroup "genus=Alnus" \
  --outgroup "genus=Carpinus" \
  --max-count 0
```

## Implementation

- **`obikpartitionner::filter::GroupQuorumFilter`**: implements `KmerFilter`
  using pre-computed ingroup and outgroup index vectors. The heavy logic
  (predicate parsing, three-value evaluation, genome classification) happens
  once before the rebuild loop; evaluating each k-mer row then reduces to
  index lookups and a counter.

- **`obikmer::cmd::predicate`**: predicate parsing (`MetaPred`), path matching
  (`path_matches`), three-value AND/OR evaluation, and `build_group_filter`,
  which returns a ready-to-use `GroupQuorumFilter`.