docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
@@ -1,9 +1,12 @@
-# Rebuild filters and ingroup/outgroup predicates
+# Kmer filtering and ingroup/outgroup predicates
 
-The `rebuild` command compacts an existing index into a new single-layer index,
-optionally keeping only k-mers that satisfy a set of filters.
-Filters can operate on raw quorum counts over all genomes, or on pre-defined
-**ingroup** and **outgroup** genome sets derived from genome metadata.
+The `rebuild`, `dump`, and `unitig` commands all share the same filtering
+system. Filters can select k-mers based on per-genome quorum counts, optionally
+restricted to **ingroup** and **outgroup** genome sets derived from genome
+metadata.
+
+`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
+that operate on the sum of counts across all genomes.
 
 ## Predicate syntax
 
@@ -73,8 +76,8 @@ For each genome:
 | `--ingroup` | `--outgroup` | Effective behaviour |
 |-------------|--------------|---------------------|
 | not set | not set | all genomes form the ingroup |
-| set | not set | only `--min-count`/`--min-frac` apply to matched genomes |
-| not set | set | only `--max-count`/`--max-frac` apply to matched genomes |
+| set | not set | only ingroup quorum flags apply |
+| not set | set | only outgroup quorum flags apply |
 | set | set | both constraints apply simultaneously |
 
 ## Quorum flags
@@ -89,10 +92,13 @@ For each genome:
 | `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
 | `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
 | `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
-| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N |
-| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N |
+| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
+| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
 | `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |
 
+Defaults: mins = 0 (no lower bound), max counts = group size, max fracs = 1.0
+(no upper bound). A filter with all defaults is a no-op.
+
 Fractions are computed over the size of the classified group, not over total
 genome count. An empty group (no genome classified as ingroup/outgroup) never
 triggers a filter failure.
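The quorum rules in the hunk above (presence threshold, defaults, fractions over the group size, empty groups never failing) can be illustrated with a small Rust sketch. This is not the `GroupQuorumFilter` code from the commit: the `Quorum` struct, the `group_passes` function, and their fields are hypothetical names chosen for the example, and a k-mer row is assumed to be a slice holding one count per genome.

```rust
/// Hypothetical bounds for one group (ingroup or outgroup).
/// These names are illustrative only, not the obikmer API.
#[derive(Clone, Copy)]
struct Quorum {
    min_count: usize,
    max_count: Option<usize>, // None means "group size", i.e. no upper bound
    min_frac: f64,
    max_frac: f64,
}

impl Default for Quorum {
    // Defaults as documented: mins = 0, max count = group size, max frac = 1.0.
    fn default() -> Self {
        Quorum { min_count: 0, max_count: None, min_frac: 0.0, max_frac: 1.0 }
    }
}

/// `counts` is one k-mer row (one count per genome);
/// `group` holds the indices of the genomes classified into the group.
fn group_passes(counts: &[u32], group: &[usize], presence_threshold: u32, q: Quorum) -> bool {
    if group.is_empty() {
        return true; // an empty group never triggers a filter failure
    }
    // A genome counts as "present" when its count is strictly above the threshold.
    let present = group.iter().filter(|&&g| counts[g] > presence_threshold).count();
    // Fractions are taken over the group size, not the total genome count.
    let frac = present as f64 / group.len() as f64;
    let max_count = q.max_count.unwrap_or(group.len());
    present >= q.min_count && present <= max_count && frac >= q.min_frac && frac <= q.max_frac
}

fn main() {
    let row = [3, 0, 1, 0, 0]; // counts for five genomes
    let ingroup = [0, 1, 2];
    let outgroup = [3, 4];
    let in_q = Quorum { min_count: 2, ..Quorum::default() };
    let out_q = Quorum { max_count: Some(0), ..Quorum::default() };
    let keep = group_passes(&row, &ingroup, 0, in_q) && group_passes(&row, &outgroup, 0, out_q);
    println!("keep = {keep}");
}
```

With all-default bounds every row passes, which matches the statement that a filter with all defaults is a no-op.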
@@ -107,17 +113,18 @@ obikmer rebuild src --output dst \
   --ingroup "species=Betula_nana" \
   --outgroup "*" \
   --min-count 2 \
-  --max-count 0
+  --max-outgroup-count 0
 ```
 
-Keep k-mers found in at least 2 *Betula nana* genomes and absent from all *Betula*:
+Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
+other *Betula*:
 
 ```sh
 obikmer rebuild src --output dst \
   --ingroup "species=Betula_nana" \
   --outgroup "genus=Betula" \
   --min-count 2 \
-  --max-count 0
+  --max-outgroup-count 0
 ```
 
 Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
@@ -128,7 +135,7 @@ obikmer rebuild src --output dst \
   --ingroup "taxon~/Betulaceae/Betula" \
   --outgroup "taxon!~/Betulaceae" \
   --min-frac 0.5 \
-  --max-frac 0.1
+  --max-outgroup-frac 0.1
 ```
 
 Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
@@ -138,7 +145,28 @@ obikmer rebuild src --output dst \
   --ingroup "genus=Betula" \
   --outgroup "genus=Alnus" \
   --outgroup "genus=Carpinus" \
-  --max-count 0
+  --max-outgroup-count 0
 ```
 
+The same flags work identically for `dump` and `unitig`. To dump only k-mers
+specific to *Betula nana*:
+
+```sh
+obikmer dump myindex \
+  --ingroup "species=Betula_nana" \
+  --outgroup "*" \
+  --min-count 1 \
+  --max-outgroup-count 0
+```
+
+To enumerate unitigs of the *Betula*-specific subgraph:
+
+```sh
+obikmer unitig myindex \
+  --ingroup "genus=Betula" \
+  --outgroup "*" \
+  --min-count 2 \
+  --max-outgroup-count 0
+```
+
 ## Implementation
@@ -146,9 +174,15 @@ obikmer rebuild src --output dst \
 - **`obikpartitionner::filter::GroupQuorumFilter`** — implements `KmerFilter`
 using pre-computed ingroup and outgroup index vectors. The heavy logic
 (predicate parsing, three-value evaluation, genome classification) happens
-once before the rebuild loop; each k-mer row evaluation is a simple index
+once before any iteration; each k-mer row evaluation is a simple index
 lookup and counter.
 
 - **`obikmer::cmd::predicate`** — predicate parsing (`MetaPred`), path matching
 (`path_matches`), three-value AND/OR evaluation, and `build_group_filter`
 which returns a ready-to-use `GroupQuorumFilter`.
+- **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
+embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
+`UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
+list.
+
+- **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
+`filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking
+the callback. `rebuild`, `dump`, and `unitig` all go through this single
+entry point.
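The single entry point described above, a list of boxed filters checked per k-mer before the callback runs, might look roughly like the sketch below. The real `KmerFilter` trait and `iter_partition_kmers` signature are not shown in this diff, so the trait method `keep`, the `iter_rows` function, and the two total-count filter types are assumptions made purely for illustration.

```rust
/// Hypothetical stand-in for the `KmerFilter` trait; the real signature may differ.
trait KmerFilter {
    /// `counts` is one k-mer row: one count per genome.
    fn keep(&self, counts: &[u32]) -> bool;
}

/// Illustrative filter: sum of per-genome counts must be at least the bound.
struct MinTotalCount(u64);

impl KmerFilter for MinTotalCount {
    fn keep(&self, counts: &[u32]) -> bool {
        counts.iter().map(|&c| u64::from(c)).sum::<u64>() >= self.0
    }
}

/// Illustrative filter: sum of per-genome counts must be at most the bound.
struct MaxTotalCount(u64);

impl KmerFilter for MaxTotalCount {
    fn keep(&self, counts: &[u32]) -> bool {
        counts.iter().map(|&c| u64::from(c)).sum::<u64>() <= self.0
    }
}

/// Invokes `callback` for every row accepted by all filters and
/// returns how many rows were kept. An empty filter list keeps everything.
fn iter_rows<F: FnMut(&[u32])>(
    rows: &[Vec<u32>],
    filters: &[Box<dyn KmerFilter>],
    mut callback: F,
) -> usize {
    let mut kept = 0;
    for row in rows {
        let row = row.as_slice();
        if filters.iter().all(|f| f.keep(row)) {
            callback(row);
            kept += 1;
        }
    }
    kept
}

fn main() {
    let rows = vec![vec![1, 0, 0], vec![2, 3, 1], vec![9, 9, 9]];
    let filters: Vec<Box<dyn KmerFilter>> =
        vec![Box::new(MinTotalCount(2)), Box::new(MaxTotalCount(10))];
    let kept = iter_rows(&rows, &filters, |row| println!("{row:?}"));
    println!("kept {kept} of {} rows", rows.len());
}
```

Passing the filters as a slice of trait objects lets each command assemble whatever mix of filters its arguments call for, while the iteration code stays identical.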