feat: add metadata-driven k-mer filtering for rebuild command

Introduces a metadata-driven filtering system for the rebuild command, classifying genomes into ingroup and outgroup categories using exact, inequality, and hierarchical path predicates. Implements a GroupQuorumFilter to enforce configurable presence thresholds and fraction constraints per group. Refactors the command to replace global quorum filters with this unified approach, converts the presence flag to a threshold parameter, and adds corresponding documentation and MkDocs navigation.
Eric Coissac
2026-06-04 20:26:53 +02:00
parent edc18b4908
commit 476c7a6394
7 changed files with 470 additions and 33 deletions
@@ -0,0 +1,154 @@
# Rebuild filters and ingroup/outgroup predicates
The `rebuild` command compacts an existing index into a new single-layer index,
optionally keeping only k-mers that satisfy a set of filters.
Filters can operate on raw quorum counts over all genomes, or on pre-defined
**ingroup** and **outgroup** genome sets derived from genome metadata.
## Predicate syntax
Each `--ingroup` and `--outgroup` flag takes a predicate of the form:
```
key OP value1|value2|…
```
| Form | Meaning |
|----------|---------|
| `*` or `all` | wildcard — every genome matches unconditionally |
| `key=v1\|v2` | exact match — genome's `key` equals `v1` or `v2` |
| `key!=v` | negation — genome's `key` equals none of the values |
| `key~path` | path ancestry — genome's `key` is `path` or a descendant |
| `key!~path` | negated path ancestry: genome's `key` is neither `path` nor one of its descendants |
Multiple values separated by `|` are always OR-ed within the predicate.
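As an illustration of the grammar above, here is a minimal parsing sketch. The names `Op`, `Pred` and `parse_pred` are hypothetical and are not the actual `MetaPred` API; the sketch also assumes that keys never contain `=` or `~`, so the first such character is taken as the operator.

```rust
/// Comparison operator of a predicate (illustrative names, not the real API).
#[derive(Debug, PartialEq)]
enum Op {
    Eq,      // key=v1|v2
    Ne,      // key!=v
    Path,    // key~path
    NotPath, // key!~path
}

/// A parsed predicate: the wildcard, or `key OP value1|value2|...`.
#[derive(Debug, PartialEq)]
enum Pred {
    All,
    Test { key: String, op: Op, values: Vec<String> },
}

fn parse_pred(input: &str) -> Result<Pred, String> {
    let s = input.trim();
    if s == "*" || s == "all" {
        return Ok(Pred::All);
    }
    // Assumption: the first '=' or '~' is the operator character.
    let pos = match s.find(|c: char| c == '=' || c == '~') {
        Some(p) => p,
        None => return Err(format!("no operator in predicate '{}'", s)),
    };
    // A '!' immediately before the operator character negates it.
    let negated = s[..pos].ends_with('!');
    let key_end = if negated { pos - 1 } else { pos };
    let key = s[..key_end].trim();
    if key.is_empty() {
        return Err(format!("missing key in predicate '{}'", s));
    }
    let op = match (&s[pos..pos + 1], negated) {
        ("=", false) => Op::Eq,
        ("=", true) => Op::Ne,
        (_, false) => Op::Path,
        (_, true) => Op::NotPath,
    };
    // Values are separated by '|' and OR-ed at evaluation time.
    let values: Vec<String> = s[pos + 1..]
        .split('|')
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
        .collect();
    if values.is_empty() {
        return Err(format!("missing value in predicate '{}'", s));
    }
    Ok(Pred::Test { key: key.to_string(), op, values })
}
```

With this sketch, `parse_pred("genus=Alnus|Carpinus")` would yield an `Op::Eq` test on `genus` with two values, and `parse_pred("species")` would be rejected because it has no operator.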
### Path matching (`~` and `!~`)
Metadata values can represent hierarchical taxonomic paths such as
`/Eukaryota/Viridiplantae/Streptophyta/Betulaceae/Betula/nana`.
- **Absolute pattern** (starts with `/`): the value must start with the pattern
at a segment boundary.
`taxon~/Betulaceae/Betula` matches `/Betulaceae/Betula/nana` and
`/Betulaceae/Betula` but not `/Betulaceae/Betuloides/…`.
- **Bare segment** (no leading `/`): the value must contain the pattern as an
exact path component anywhere.
`taxon~Betula` matches any path that has `Betula` as one of its segments.
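The two matching rules can be sketched as follows. This is an illustrative reimplementation under the rules stated above, not the actual `path_matches` function; the name `path_matches_sketch` is hypothetical, and treating a trailing `/` in the pattern as insignificant is an assumption.

```rust
/// Returns true when `value` is `pattern` or one of its descendants.
/// Hypothetical sketch; the real `path_matches` may differ in edge cases.
fn path_matches_sketch(value: &str, pattern: &str) -> bool {
    if pattern.starts_with('/') {
        // Absolute pattern: prefix match that must end on a segment boundary.
        // Assumption: a trailing '/' in the pattern is ignored.
        let pattern = pattern.trim_end_matches('/');
        match value.strip_prefix(pattern) {
            Some(rest) => rest.is_empty() || rest.starts_with('/'),
            None => false,
        }
    } else {
        // Bare segment: must equal one whole path component, anywhere.
        !pattern.is_empty() && value.split('/').any(|segment| segment == pattern)
    }
}
```

The segment-boundary check is what keeps `/Betulaceae/Betula` from matching `/Betulaceae/Betuloides/…`: after stripping the prefix, the remainder starts with `oides`, not with `/`.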
### Missing metadata key → NA
If a genome does not carry the queried metadata key, the predicate returns **NA**.
NA propagates through the group evaluation logic (see below), and genomes that
cannot be classified are **ignored** in all quorum counts.
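One way to picture the NA case is to model a predicate result as `Option<bool>`, with `None` standing for NA. The sketch below assumes genome metadata is a plain string map; `eval_exact` is a hypothetical helper, not part of the actual code.

```rust
use std::collections::HashMap;

/// Evaluates an exact-match predicate (`key=v1|v2`) for one genome.
/// Returns `None` (NA) when the genome does not carry `key`.
/// Hypothetical helper; the metadata representation is an assumption.
fn eval_exact(meta: &HashMap<String, String>, key: &str, values: &[&str]) -> Option<bool> {
    meta.get(key).map(|v| values.contains(&v.as_str()))
}
```

A genome annotated only with `species` therefore yields `None` for a `genus=…` predicate rather than `false`.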
## Group semantics
### Multiple predicates
| Flag | Combination rule |
|------|-----------------|
| `--ingroup` (repeated) | **AND** — genome must satisfy all predicates |
| `--outgroup` (repeated) | **OR** — genome satisfies any predicate |
### Three-value logic
Each predicate returns `true`, `false`, or `NA` (absent key).
- AND: any `false` yields `false`; otherwise any `NA` yields `NA`; otherwise the result is `true`.
- OR: any `true` yields `true`; otherwise any `NA` yields `NA`; otherwise the result is `false`.
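The AND/OR rules above can be sketched with a small three-valued type. The names `Tri`, `and_all` and `or_any` are hypothetical, and the results for an empty predicate list (`True` for AND, `False` for OR) are an assumption of this sketch.

```rust
/// Three-valued predicate result (illustrative type, not the real API).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tri {
    True,
    False,
    Na,
}

/// AND over predicate results: `False` absorbs; otherwise any `Na` wins.
fn and_all(results: &[Tri]) -> Tri {
    let mut acc = Tri::True; // assumption: an empty list evaluates to True
    for &r in results {
        match r {
            Tri::False => return Tri::False,
            Tri::Na => acc = Tri::Na,
            Tri::True => {}
        }
    }
    acc
}

/// OR over predicate results: `True` absorbs; otherwise any `Na` wins.
fn or_any(results: &[Tri]) -> Tri {
    let mut acc = Tri::False; // assumption: an empty list evaluates to False
    for &r in results {
        match r {
            Tri::True => return Tri::True,
            Tri::Na => acc = Tri::Na,
            Tri::False => {}
        }
    }
    acc
}
```

The result does not depend on the order of the predicates: `and_all(&[Tri::Na, Tri::False])` and `and_all(&[Tri::False, Tri::Na])` both give `Tri::False`.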
### Classification and priority
For each genome:
1. Evaluate `AND(ingroup predicates)` → `in_result`
2. Evaluate `OR(outgroup predicates)` → `out_result`
3. If `in_result = true` → **Ingroup** (ingroup wins over outgroup)
4. Else if `out_result = true` → **Outgroup**
5. Otherwise → **Uncategorized** (ignored in all quorum counts)
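The priority rule above, followed by the construction of per-group genome index lists, might look like the sketch below. `Group`, `classify` and `group_indices` are hypothetical names, and `Option<bool>` (with `None` for NA) is an assumed encoding of `in_result` and `out_result`.

```rust
#[derive(Debug, PartialEq)]
enum Group {
    Ingroup,
    Outgroup,
    Uncategorized,
}

/// Applies the priority rule: ingroup first, then outgroup, else uncategorized.
/// `None` stands for NA; both NA and `false` fall through to the next rule.
fn classify(in_result: Option<bool>, out_result: Option<bool>) -> Group {
    if in_result == Some(true) {
        Group::Ingroup
    } else if out_result == Some(true) {
        Group::Outgroup
    } else {
        Group::Uncategorized
    }
}

/// Turns per-genome results into (ingroup, outgroup) genome index vectors.
/// Uncategorized genomes appear in neither vector.
fn group_indices(results: &[(Option<bool>, Option<bool>)]) -> (Vec<usize>, Vec<usize>) {
    let mut ingroup = Vec::new();
    let mut outgroup = Vec::new();
    for (i, &(in_r, out_r)) in results.iter().enumerate() {
        match classify(in_r, out_r) {
            Group::Ingroup => ingroup.push(i),
            Group::Outgroup => outgroup.push(i),
            Group::Uncategorized => {}
        }
    }
    (ingroup, outgroup)
}
```

Because the ingroup test comes first, a genome matching both `--ingroup "species=Betula_nana"` and `--outgroup "*"` lands in the ingroup only.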
### Implicit groups
| `--ingroup` | `--outgroup` | Effective behaviour |
|-------------|--------------|---------------------|
| not set | not set | all genomes form the ingroup |
| set | not set | only the ingroup flags (`--min-count`, `--max-count`, `--min-frac`, `--max-frac`) constrain the matched genomes |
| not set | set | only the outgroup flags (`--min-outgroup-*`, `--max-outgroup-*`) constrain the matched genomes |
| set | set | both constraints apply simultaneously |
## Quorum flags
| Flag | Applies to | Meaning |
|------|-----------|---------|
| `--min-count N` | ingroup | k-mer present in at least N ingroup genomes |
| `--max-count N` | ingroup | k-mer present in at most N ingroup genomes |
| `--min-frac F` | ingroup | k-mer present in at least fraction F of ingroup genomes |
| `--max-frac F` | ingroup | k-mer present in at most fraction F of ingroup genomes |
| `--min-outgroup-count N` | outgroup | k-mer present in at least N outgroup genomes |
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N |
| `--presence-threshold N` | all genomes | a genome counts as "present" for a k-mer when its per-genome count is > N (default 0) |
Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.
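The per-group check described above can be sketched as a function over one k-mer row. This is not the actual `GroupQuorumFilter` code: `Bounds` and `group_passes` are hypothetical names, the row is assumed to be a slice of per-genome counts, and `None` is assumed to mean that a flag was not given.

```rust
/// Count and fraction bounds for one group; `None` means "no constraint".
/// Hypothetical type mirroring the min/max count and fraction flags.
#[derive(Default)]
struct Bounds {
    min_count: Option<usize>,
    max_count: Option<usize>,
    min_frac: Option<f64>,
    max_frac: Option<f64>,
}

/// Checks one k-mer row against one group.
/// `counts[g]` is the k-mer count in genome `g`; `members` lists the genome
/// indices classified into the group; `threshold` is the presence threshold.
fn group_passes(counts: &[u32], members: &[usize], threshold: u32, b: &Bounds) -> bool {
    if members.is_empty() {
        return true; // an empty group never triggers a filter failure
    }
    // A genome counts as "present" when its count is strictly above the threshold.
    let present = members.iter().filter(|&&g| counts[g] > threshold).count();
    // Fractions are relative to the group size, not to the total genome count.
    let frac = present as f64 / members.len() as f64;
    b.min_count.map_or(true, |n| present >= n)
        && b.max_count.map_or(true, |n| present <= n)
        && b.min_frac.map_or(true, |f| frac >= f)
        && b.max_frac.map_or(true, |f| frac <= f)
}
```

Under these assumptions a k-mer would be kept when `group_passes` holds for both the ingroup and the outgroup index lists, each with its own `Bounds`; the total-count flags are left out of the sketch.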
## Examples
Keep k-mers specific to *Betula nana* — present in at least 2 *B. nana* genomes
and absent from every other genome in the index:
```sh
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
```
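Keep k-mers found in at least 2 *Betula nana* genomes and absent from all other *Betula* genomes (ingroup classification takes priority, so *B. nana* genomes are not counted in the outgroup):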
```sh
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \
--min-count 2 \
--max-outgroup-count 0
```
Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
and in at most 10 % of everything outside *Betulaceae*:
```sh
obikmer rebuild src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \
--max-outgroup-frac 0.1
```
Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
```sh
obikmer rebuild src --output dst \
--ingroup "genus=Betula" \
--outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \
--max-outgroup-count 0
```
## Implementation
- **`obikpartitionner::filter::GroupQuorumFilter`** — implements `KmerFilter`
using pre-computed ingroup and outgroup index vectors. The heavy logic
(predicate parsing, three-value evaluation, genome classification) happens
once before the rebuild loop; evaluating a k-mer row then reduces to index
lookups and counter increments.
- **`obikmer::cmd::predicate`** — predicate parsing (`MetaPred`), path matching
(`path_matches`), three-value AND/OR evaluation, and `build_group_filter`
which returns a ready-to-use `GroupQuorumFilter`.