docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
Eric Coissac
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
+52 -18
@@ -1,9 +1,12 @@
-# Rebuild filters and ingroup/outgroup predicates
+# Kmer filtering and ingroup/outgroup predicates
-The `rebuild` command compacts an existing index into a new single-layer index,
-optionally keeping only k-mers that satisfy a set of filters.
-Filters can operate on raw quorum counts over all genomes, or on pre-defined
-**ingroup** and **outgroup** genome sets derived from genome metadata.
+The `rebuild`, `dump`, and `unitig` commands all share the same filtering
+system. Filters can select k-mers based on per-genome quorum counts, optionally
+restricted to **ingroup** and **outgroup** genome sets derived from genome
+metadata.
+
+`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
+that operate on the sum of counts across all genomes.
## Predicate syntax
@@ -73,8 +76,8 @@ For each genome:
| `--ingroup` | `--outgroup` | Effective behaviour |
|-------------|--------------|---------------------|
| not set | not set | all genomes form the ingroup |
-| set | not set | only `--min-count`/`--min-frac` apply to matched genomes |
-| not set | set | only `--max-count`/`--max-frac` apply to matched genomes |
+| set | not set | only ingroup quorum flags apply |
+| not set | set | only outgroup quorum flags apply |
| set | set | both constraints apply simultaneously |
## Quorum flags
@@ -89,10 +92,13 @@ For each genome:
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
-| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N |
-| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N |
+| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
+| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
+Defaults: mins = 0 (no lower bound), max counts = group size, max fracs = 1.0
+(no upper bound). A filter with all defaults is a no-op.
Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.
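
As a rough illustration of the quorum rules above (not the actual `GroupQuorumFilter` code), the following Rust sketch assumes each k-mer row is a slice of per-genome counts and each group is a list of genome indices. `GroupBounds` and `group_passes` are hypothetical names introduced only for this example.

```rust
/// Hypothetical bounds for one genome group (ingroup or outgroup).
/// Field names are illustrative; they are not taken from the real code base.
pub struct GroupBounds {
    pub min_count: usize,
    pub max_count: usize,
    pub min_frac: f64,
    pub max_frac: f64,
}

impl GroupBounds {
    /// Documented defaults: mins = 0, max count = group size, max frac = 1.0.
    pub fn no_op(group_size: usize) -> Self {
        Self {
            min_count: 0,
            max_count: group_size,
            min_frac: 0.0,
            max_frac: 1.0,
        }
    }
}

/// `row[g]` is the count of the k-mer in genome `g`;
/// `group` holds the indices of the genomes classified into this group.
pub fn group_passes(
    row: &[u32],
    group: &[usize],
    presence_threshold: u32,
    bounds: &GroupBounds,
) -> bool {
    // An empty group never triggers a filter failure.
    if group.is_empty() {
        return true;
    }
    // A genome counts as "present" when its count is strictly above the threshold.
    let present = group
        .iter()
        .filter(|&&g| row[g] > presence_threshold)
        .count();
    // Fractions are taken over the group size, not the total genome count.
    let frac = present as f64 / group.len() as f64;
    present >= bounds.min_count
        && present <= bounds.max_count
        && frac >= bounds.min_frac
        && frac <= bounds.max_frac
}
```

Under these assumptions, `--max-outgroup-count 0` would correspond to `max_count: 0` on the outgroup bounds, and raising `--presence-threshold` makes low-count genomes stop counting as present.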
@@ -107,17 +113,18 @@ obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 2 \
-  --max-count 0
+  --max-outgroup-count 0
```
-Keep k-mers found in at least 2 *Betula nana* genomes and absent from all *Betula*:
+Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
+other *Betula*:
```sh
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \
--min-count 2 \
-  --max-count 0
+  --max-outgroup-count 0
```
Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
@@ -128,7 +135,7 @@ obikmer rebuild src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \
-  --max-frac 0.1
+  --max-outgroup-frac 0.1
```
Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
@@ -138,7 +145,28 @@ obikmer rebuild src --output dst \
--ingroup "genus=Betula" \
--outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \
-  --max-count 0
+  --max-outgroup-count 0
```
+The same flags work identically for `dump` and `unitig`. To dump only k-mers
+specific to *Betula nana*:
+```sh
+obikmer dump myindex \
+  --ingroup "species=Betula_nana" \
+  --outgroup "*" \
+  --min-count 1 \
+  --max-outgroup-count 0
+```
+To enumerate unitigs of the *Betula*-specific subgraph:
+```sh
+obikmer unitig myindex \
+  --ingroup "genus=Betula" \
+  --outgroup "*" \
+  --min-count 2 \
+  --max-outgroup-count 0
+```
## Implementation
@@ -146,9 +174,15 @@ obikmer rebuild src --output dst \
- **`obikpartitionner::filter::GroupQuorumFilter`** — implements `KmerFilter`
using pre-computed ingroup and outgroup index vectors. The heavy logic
(predicate parsing, three-value evaluation, genome classification) happens
-  once before the rebuild loop; each k-mer row evaluation is a simple index
+  once before any iteration; each k-mer row evaluation is a simple index
lookup and counter.
- **`obikmer::cmd::predicate`** — predicate parsing (`MetaPred`), path matching
(`path_matches`), three-value AND/OR evaluation, and `build_group_filter`
which returns a ready-to-use `GroupQuorumFilter`.
+- **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
+  embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
+  `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
+  list.
+- **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
+  `filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking
+  the callback. `rebuild`, `dump`, and `unitig` all go through this single
+  entry point.
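
The single-entry-point design described in the list above can be sketched as follows. This is only an illustration under stated assumptions: the real `KmerFilter` trait and `iter_partition_kmers` signatures are not shown in this diff, so the `keep` method, the `TotalCountFilter` type, and the `for_each_kept` function are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for the `KmerFilter` trait; the real method name
/// and signature may differ.
pub trait KmerFilter {
    /// `row[g]` is the count of the k-mer in genome `g`.
    fn keep(&self, row: &[u32]) -> bool;
}

/// Illustrative filter in the spirit of `--min-total-count` / `--max-total-count`.
pub struct TotalCountFilter {
    pub min_total: u64,
    pub max_total: u64,
}

impl KmerFilter for TotalCountFilter {
    fn keep(&self, row: &[u32]) -> bool {
        // Sum in u64 so that many large per-genome counts cannot overflow.
        let total: u64 = row.iter().map(|&c| u64::from(c)).sum();
        total >= self.min_total && total <= self.max_total
    }
}

/// Hypothetical analogue of the shared entry point: every filter is applied
/// to each k-mer row, and the callback only sees rows that pass all of them.
pub fn for_each_kept<'a, I, F>(rows: I, filters: &[Box<dyn KmerFilter>], mut callback: F)
where
    I: IntoIterator<Item = (u64, &'a [u32])>,
    F: FnMut(u64, &[u32]),
{
    for (kmer, row) in rows {
        // An empty filter list keeps everything, since `all` is true on empty input.
        if filters.iter().all(|f| f.keep(row)) {
            callback(kmer, row);
        }
    }
}
```

Because filtering happens before the callback, a command built on such an entry point only has to supply its own callback (write a row, print a row, extend a unitig) and inherits the same filtering behaviour.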