feat: add --head and --presence-threshold to dump and distance
Introduces `--head N` to the `dump` command for early iteration termination and `--presence-threshold N` to the `distance` command for Jaccard filtering on count indexes. Updates filter defaults to adapt based on explicit ingroup/outgroup declarations. Fixes a Rust type mismatch in the unitig closure and updates partition iteration callbacks to return `bool` for proper early termination support. Documentation is updated accordingly.
@@ -96,8 +96,16 @@ For each genome:
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |

Defaults: mins = 0 (no lower bound), max counts = group size, max fracs = 1.0
(no upper bound). A filter with all defaults is a no-op.
**Conditional defaults**: the default for `--min-frac` and `--max-outgroup-count` depends on whether the corresponding group was explicitly declared:

| Situation | `--min-frac` default | `--max-outgroup-count` default |
|-----------|----------------------|-------------------------------|
| Neither `--ingroup` nor `--outgroup` | 0.0 (no-op) | no constraint (no-op) |
| `--ingroup` only | **1.0**: all ingroup genomes must carry the k-mer | no constraint |
| `--outgroup` only | 0.0 | **0**: no outgroup genome may carry the k-mer |
| Both declared | **1.0** | **0** |

Explicit flags always override these defaults. All other bounds (`--min-count`, `--max-count`, `--max-frac`, `--min-outgroup-*`) default to 0 / group size / 1.0 regardless of whether groups are declared.
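
The conditional defaults in the table above can be sketched in Rust as follows. This is only an illustration under stated assumptions: `QuorumDefaults`, `conditional_defaults`, and `resolve` are hypothetical names invented for this sketch and are not taken from the obikmer source.

```rust
/// Hypothetical container for the two conditional defaults (not the crate's real type).
#[derive(Debug, PartialEq)]
pub struct QuorumDefaults {
    /// Default for `--min-frac`.
    pub min_frac: f64,
    /// Default for `--max-outgroup-count`; `None` means "no constraint".
    pub max_outgroup_count: Option<usize>,
}

/// Derive the defaults from which groups were explicitly declared,
/// following the four rows of the table.
pub fn conditional_defaults(ingroup_declared: bool, outgroup_declared: bool) -> QuorumDefaults {
    QuorumDefaults {
        // Declaring an ingroup requires every ingroup genome to carry the k-mer.
        min_frac: if ingroup_declared { 1.0 } else { 0.0 },
        // Declaring an outgroup forbids the k-mer in every outgroup genome.
        max_outgroup_count: if outgroup_declared { Some(0) } else { None },
    }
}

/// An explicitly passed flag always wins over the computed default.
pub fn resolve<T>(explicit: Option<T>, default: T) -> T {
    explicit.unwrap_or(default)
}
```

With these helpers, `resolve(Some(0.5), conditional_defaults(true, true).min_frac)` yields `0.5`, while `resolve(None, conditional_defaults(true, false).min_frac)` falls back to `1.0`.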

Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
@@ -169,6 +177,31 @@ obikmer unitig myindex \
--max-outgroup-count 0
```

## Command-specific options

### `dump --head N`

Stops output after the first N k-mers that pass all active filters.
Iteration terminates immediately: subsequent partitions and layers are not scanned.
Useful for quick inspection of large indexes without loading the entire dataset.

```sh
obikmer dump myindex --head 100
obikmer dump myindex --head 20 --ingroup "species=Betula_nana" --min-count 1
```
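
The early-termination behaviour can be pictured with a small Rust sketch in which the per-k-mer callback returns `bool` (`true` to continue, `false` to stop), as the commit message describes for the partition iteration callbacks. Everything else here is an assumption made for illustration: `for_each_in_partition` and `dump_head` are hypothetical names, partitions are modelled as plain `Vec<u64>`, and treating `--head 0` as "emit nothing" is a choice of this sketch rather than documented behaviour.

```rust
/// Visit every k-mer of one partition and stop as soon as the callback
/// returns `false`. Returns `false` when the caller should stop scanning
/// further partitions as well.
fn for_each_in_partition<F>(partition: &[u64], visit: &mut F) -> bool
where
    F: FnMut(u64) -> bool,
{
    for &kmer in partition {
        if !visit(kmer) {
            return false;
        }
    }
    true
}

/// Collect the first `head` k-mers accepted by `keep`, scanning partitions
/// in order. The second value counts the partitions actually opened.
fn dump_head<P>(partitions: &[Vec<u64>], head: usize, keep: P) -> (Vec<u64>, usize)
where
    P: Fn(u64) -> bool,
{
    let mut out = Vec::new();
    let mut opened = 0;
    if head == 0 {
        // Sketch choice: a limit of zero emits nothing and opens nothing.
        return (out, opened);
    }
    for partition in partitions {
        opened += 1;
        let mut visit = |kmer: u64| {
            if keep(kmer) {
                out.push(kmer);
            }
            // Keep iterating only while the limit has not been reached.
            out.len() < head
        };
        if !for_each_in_partition(partition, &mut visit) {
            break; // limit reached: later partitions are never scanned
        }
    }
    (out, opened)
}
```

In this model, asking for two even values from `[[1, 2, 3], [4, 5, 6], [7, 8]]` returns `[2, 4]` after opening only the first two partitions; the third is never touched.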

### `distance --presence-threshold N`

When computing Jaccard distance on a **count index**, a k-mer is considered present in a genome if its count is ≥ N (default 1).
This option is independent of the `--presence-threshold` used in `rebuild`/`query` filtering.

```sh
# Jaccard treating k-mers with count ≥ 2 as present
obikmer distance myindex --metric jaccard --presence-threshold 2
```

This parameter has no effect on presence/absence indexes (where values are already 0/1) or on metrics other than Jaccard.
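
As a rough illustration of the thresholded Jaccard computation, the Rust sketch below binarises two count vectors with the rule "count ≥ N" and returns 1 − |A ∩ B| / |A ∪ B|. It is not the tool's implementation: `jaccard_distance` is a hypothetical helper, the two slices are assumed to list counts for the same k-mers in the same order, the threshold is assumed to be at least 1, and returning 0.0 when neither genome has any k-mer present is a convention chosen for this sketch.

```rust
/// Jaccard distance between two genomes given per-k-mer counts in matching order.
/// A k-mer is "present" in a genome when its count is >= `threshold`.
fn jaccard_distance(counts_a: &[u32], counts_b: &[u32], threshold: u32) -> f64 {
    assert_eq!(counts_a.len(), counts_b.len(), "count vectors must align");
    let mut shared = 0u64; // k-mers present in both genomes
    let mut either = 0u64; // k-mers present in at least one genome
    for (&a, &b) in counts_a.iter().zip(counts_b) {
        let in_a = a >= threshold;
        let in_b = b >= threshold;
        if in_a && in_b {
            shared += 1;
        }
        if in_a || in_b {
            either += 1;
        }
    }
    if either == 0 {
        // Sketch convention: two empty sets are treated as identical.
        return 0.0;
    }
    1.0 - shared as f64 / either as f64
}
```

For counts `[3, 1, 0, 2]` and `[2, 0, 1, 5]`, a threshold of 1 gives a distance of 0.5 (two shared k-mers out of four present), whereas a threshold of 2 gives 0.0 because only the first and last k-mers remain present, and both genomes carry them.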

## Implementation

- **`obikpartitionner::filter::GroupQuorumFilter`**: implements `KmerFilter`
+2
-2
@@ -11,9 +11,9 @@
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as `rebuild` |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as `rebuild`; `--head N` limits output to the first N k-mers |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as `rebuild` |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |