docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
@@ -1,9 +1,12 @@
-# Rebuild filters and ingroup/outgroup predicates
+# Kmer filtering and ingroup/outgroup predicates

-The `rebuild` command compacts an existing index into a new single-layer index,
-optionally keeping only k-mers that satisfy a set of filters.
-Filters can operate on raw quorum counts over all genomes, or on pre-defined
-**ingroup** and **outgroup** genome sets derived from genome metadata.
+The `rebuild`, `dump`, and `unitig` commands all share the same filtering
+system. Filters can select k-mers based on per-genome quorum counts, optionally
+restricted to **ingroup** and **outgroup** genome sets derived from genome
+metadata.
+
+`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
+that operate on the sum of counts across all genomes.

 ## Predicate syntax

@@ -73,8 +76,8 @@ For each genome:
 | `--ingroup` | `--outgroup` | Effective behaviour |
 |-------------|--------------|---------------------|
 | not set | not set | all genomes form the ingroup |
-| set | not set | only `--min-count`/`--min-frac` apply to matched genomes |
-| not set | set | only `--max-count`/`--max-frac` apply to matched genomes |
+| set | not set | only ingroup quorum flags apply |
+| not set | set | only outgroup quorum flags apply |
 | set | set | both constraints apply simultaneously |

 ## Quorum flags
@@ -89,10 +92,13 @@ For each genome:
 | `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
 | `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
 | `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
-| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N |
-| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N |
+| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
+| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
 | `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |

+Defaults: each minimum is 0 (no lower bound); each maximum count is the group
+size, each maximum fraction 1.0 (no upper bound). An all-default filter is a no-op.
+
 Fractions are computed over the size of the classified group, not over total
 genome count. An empty group (no genome classified as ingroup/outgroup) never
 triggers a filter failure.
@@ -107,17 +113,18 @@ obikmer rebuild src --output dst \
  --ingroup  "species=Betula_nana" \
  --outgroup "*" \
  --min-count 2 \
-  --max-count 0
+  --max-outgroup-count 0
 ```

-Keep k-mers found in at least 2 *Betula nana* genomes and absent from all *Betula*:
+Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
+other *Betula*:

 ```sh
 obikmer rebuild src --output dst \
  --ingroup  "species=Betula_nana" \
  --outgroup "genus=Betula" \
  --min-count 2 \
-  --max-count 0
+  --max-outgroup-count 0
 ```

 Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
@@ -128,7 +135,7 @@ obikmer rebuild src --output dst \
  --ingroup  "taxon~/Betulaceae/Betula" \
  --outgroup "taxon!~/Betulaceae" \
  --min-frac 0.5 \
-  --max-frac 0.1
+  --max-outgroup-frac 0.1
 ```

 Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
@@ -138,7 +145,28 @@ obikmer rebuild src --output dst \
  --ingroup  "genus=Betula" \
  --outgroup "genus=Alnus" \
  --outgroup "genus=Carpinus" \
-  --max-count 0
+  --max-outgroup-count 0
+```
+
+Apart from the `rebuild`-only total-count filters, the same flags work
+identically for `dump` and `unitig`. To dump only k-mers specific to *Betula nana*:
+
+```sh
+obikmer dump myindex \
+  --ingroup  "species=Betula_nana" \
+  --outgroup "*" \
+  --min-count 1 \
+  --max-outgroup-count 0
+```
+
+To enumerate unitigs of the *Betula*-specific subgraph:
+
+```sh
+obikmer unitig myindex \
+  --ingroup  "genus=Betula" \
+  --outgroup "*" \
+  --min-count 2 \
+  --max-outgroup-count 0
 ```

 ## Implementation
@@ -146,9 +174,15 @@ obikmer rebuild src --output dst \
 - **`obikpartitionner::filter::GroupQuorumFilter`** — implements `KmerFilter`
  using pre-computed ingroup and outgroup index vectors. The heavy logic
  (predicate parsing, three-value evaluation, genome classification) happens
-  once before the rebuild loop; each k-mer row evaluation is a simple index
+  once before any iteration; each k-mer row evaluation is a simple index
  lookup and counter.

-- **`obikmer::cmd::predicate`** — predicate parsing (`MetaPred`), path matching
-  (`path_matches`), three-value AND/OR evaluation, and `build_group_filter`
-  which returns a ready-to-use `GroupQuorumFilter`.
+- **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
+  embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
+  `UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
+  list.
+
+- **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
+  `filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking
+  the callback. `rebuild`, `dump`, and `unitig` all go through this single
+  entry point.
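
Reviewer sketch (not part of the commit): the quorum semantics documented above can be illustrated with a small, self-contained Rust model. Everything here is an assumption made for illustration: the trait method `keep`, the `Quorum` and `SketchQuorumFilter` types, and the `passes_all` helper are hypothetical stand-ins, since the real `KmerFilter` and `GroupQuorumFilter` signatures do not appear in this diff. The sketch only encodes what the documentation states: presence means a per-genome count strictly above the threshold, fractions are taken over the classified group size, an empty group never fails, and all-default bounds are a no-op.

```rust
/// Hypothetical stand-in for the crate's `KmerFilter` trait
/// (the real method name and signature are not shown in the diff).
trait KmerFilter {
    /// `counts[g]` is the count of the current k-mer in genome `g`.
    fn keep(&self, counts: &[u32]) -> bool;
}

/// Quorum bounds for one group (ingroup or outgroup).
#[derive(Clone, Copy)]
struct Quorum {
    min_count: usize,
    max_count: Option<usize>, // None means "group size", i.e. no upper bound
    min_frac: f64,
    max_frac: f64,
}

impl Default for Quorum {
    /// Documented defaults: minimums 0, maximum count = group size,
    /// maximum fraction 1.0.
    fn default() -> Self {
        Quorum { min_count: 0, max_count: None, min_frac: 0.0, max_frac: 1.0 }
    }
}

impl Quorum {
    /// `present` is the number of group genomes in which the k-mer is present.
    fn accepts(&self, present: usize, group_size: usize) -> bool {
        if group_size == 0 {
            return true; // an empty group never triggers a filter failure
        }
        // Fractions are relative to the group size, not the total genome count.
        let frac = present as f64 / group_size as f64;
        present >= self.min_count
            && present <= self.max_count.unwrap_or(group_size)
            && frac >= self.min_frac
            && frac <= self.max_frac
    }
}

/// Illustrative filter over pre-computed genome index vectors.
struct SketchQuorumFilter {
    ingroup: Vec<usize>,
    outgroup: Vec<usize>,
    presence_threshold: u32,
    in_q: Quorum,
    out_q: Quorum,
}

impl SketchQuorumFilter {
    /// Count the genomes of `group` whose count exceeds the presence threshold.
    fn present_in(&self, group: &[usize], counts: &[u32]) -> usize {
        group.iter().filter(|&&g| counts[g] > self.presence_threshold).count()
    }
}

impl KmerFilter for SketchQuorumFilter {
    fn keep(&self, counts: &[u32]) -> bool {
        let n_in = self.present_in(&self.ingroup, counts);
        let n_out = self.present_in(&self.outgroup, counts);
        self.in_q.accepts(n_in, self.ingroup.len())
            && self.out_q.accepts(n_out, self.outgroup.len())
    }
}

/// A k-mer row is kept only if every filter in the list accepts it.
fn passes_all(filters: &[Box<dyn KmerFilter>], counts: &[u32]) -> bool {
    filters.iter().all(|f| f.keep(counts))
}
```

In this model, `--min-count 2 --max-outgroup-count 0` would correspond to `in_q.min_count = 2` and `out_q.max_count = Some(0)`. Resolving the per-genome classification into index vectors once, up front, keeps the per-k-mer work down to a few indexed reads and comparisons, which matches the design described in the Implementation section.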