# obikmer

`obikmer` is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.

## Subcommands

| Subcommand | Purpose |
|-------------|---------|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `unitig` | Dump unitigs from a built index to stdout (debug) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |

## Constraints

- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` collapses each kmer and its reverse complement into one representative, halving the kmer space
- Input formats: FASTA, FASTQ, gzip, streaming stdin; `index` reads from stdin automatically when no input files are provided (`-` can also be passed explicitly among other paths)

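
The 2-bit packing and the canonical form described above can be sketched as follows. This is only an illustration: the function names (`encode`, `revcomp`, `canonical`) and the base-to-code mapping (A=0, C=1, G=2, T=3) are assumptions of this sketch, not necessarily what `obikmer` uses internally.

```rust
/// Pack an ASCII DNA kmer into a u64 at 2 bits per base.
/// Assumed mapping for this sketch: A=0, C=1, G=2, T=3.
/// Returns None for non-ACGT bytes or when the kmer cannot fit in a u64
/// (32 bases x 2 bits = 64 bits is the most a u64 can hold).
fn encode(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for &b in seq {
        let code = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        // Shift previous bases left and append the new 2-bit code.
        kmer = (kmer << 2) | code;
    }
    Some(kmer)
}

/// Reverse complement of a kmer of length `k` stored in the low 2k bits.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut x = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With the mapping above, the complement of code c is 3 - c.
        rc = (rc << 2) | (3 - (x & 3));
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the kmer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

With this sketch, `ACG` and its reverse complement `CGT` map to the same canonical value, while the even-length `ACGT` equals its own reverse complement, which is the palindrome case that an odd k rules out.
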
## Parameter constraints (enforced at CLI)

All constraints below are checked by `CommonArgs::validate()` at the start of `superkmer` and `index`. Invalid values exit immediately with an error.

| Parameter | Constraint | Reason |
|-----------|-----------|--------|
| k (`--kmer-size`) | odd | even k allows palindromic kmers (`kmer == revcomp(kmer)`), which break the canonical form invariant |
| k (`--kmer-size`) | k ∈ [11, 31] | the next odd value, k = 33, needs 66 bits and overflows u64 at 2 bits/base; k < 11 gives insufficient specificity |
| m (`--minimizer-size`) | odd | same palindrome argument as k |
| m (`--minimizer-size`) | 3 ≤ m ≤ k−1 | minimizer must be strictly shorter than the kmer |
| z (`-z`, Findere, `index --approx` only) | z ≤ k−1 | effective indexed kmer size is k−z+1, so z ≤ k−1 keeps it ≥ 2; z ≥ k would shrink it to 1 or less |

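
The checks in the table above could be expressed roughly as below. Only the names `CommonArgs` and `validate()` come from this document; the field names, the `Option` used for `z`, the `String` error type, and the error messages are assumptions made for the sketch, and the real implementation may differ.

```rust
/// Hypothetical stand-in for the tool's `CommonArgs`; field names are assumed.
struct CommonArgs {
    kmer_size: usize,
    minimizer_size: usize,
    /// Findere z; assumed to be set only when approximate indexing is requested.
    z: Option<usize>,
}

impl CommonArgs {
    /// Returns Ok(()) when every constraint from the table holds,
    /// otherwise an error message naming the offending parameter.
    fn validate(&self) -> Result<(), String> {
        let (k, m) = (self.kmer_size, self.minimizer_size);

        // k must be odd and within [11, 31].
        if k % 2 == 0 {
            return Err(format!("--kmer-size must be odd, got {k}"));
        }
        if !(11..=31).contains(&k) {
            return Err(format!("--kmer-size must be in [11, 31], got {k}"));
        }

        // m must be odd and satisfy 3 <= m <= k - 1.
        if m % 2 == 0 {
            return Err(format!("--minimizer-size must be odd, got {m}"));
        }
        if m < 3 || m >= k {
            return Err(format!("--minimizer-size must be in [3, {}], got {m}", k - 1));
        }

        // z, when present, must satisfy z <= k - 1.
        if let Some(z) = self.z {
            if z >= k {
                return Err(format!("-z must be at most {}, got {z}", k - 1));
            }
        }
        Ok(())
    }
}
```

A caller would run `args.validate()` before any work starts and terminate on `Err`, which matches the immediate exit on invalid values described above.
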
## Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

| Character | Forbidden | Reason |
|-----------|-----------|--------|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | **allowed** | use shell quoting: `--new-label 'new label=old label'` |

Empty labels are also rejected. Labels derived automatically from the index directory name (when `--label` is omitted) are not validated since they come from the filesystem and are already safe.

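
As an illustration of the rules above, a label check might look like the following. The function name `validate_label` and the error wording are hypothetical; only the forbidden characters, their reasons, and the empty-label rule are taken from this document.

```rust
/// Hypothetical label check implementing the restrictions listed above.
/// Returns Ok(()) for an acceptable label, otherwise a description of the problem.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("genome label must not be empty".to_string());
    }
    for c in label.chars() {
        // Map each forbidden character to the reason given in the table;
        // every other character (spaces included) is accepted.
        let reason = match c {
            '/' => "filesystem path separator",
            '=' => "`--new-label` parser separator",
            '\0' => "null byte",
            '\n' | '\r' | '\t' => "breaks CSV output",
            _ => continue,
        };
        return Err(format!(
            "forbidden character {c:?} in label {label:?}: {reason}"
        ));
    }
    Ok(())
}
```

Under this sketch, a label such as `E. coli K-12` passes because spaces are allowed, while `a/b`, `new=old`, and the empty string are rejected.
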
## Priority operations

- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference

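
To make the semantics of these operations concrete, here is a minimal in-memory sketch using standard-library collections over kmers already packed as `u64` values. It is not how `obikmer` implements them (the tool builds on-disk indexes with a layered MPHF); the function names `count_kmers` and `set_ops` are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Count how many times each (canonical) kmer occurs.
fn count_kmers(kmers: impl IntoIterator<Item = u64>) -> HashMap<u64, u32> {
    let mut counts = HashMap::new();
    for kmer in kmers {
        *counts.entry(kmer).or_insert(0) += 1;
    }
    counts
}

/// Union, intersection, and difference (a minus b) of two kmer sets.
fn set_ops(
    a: &BTreeSet<u64>,
    b: &BTreeSet<u64>,
) -> (BTreeSet<u64>, BTreeSet<u64>, BTreeSet<u64>) {
    let union = a.union(b).copied().collect();
    let intersection = a.intersection(b).copied().collect();
    let difference = a.difference(b).copied().collect();
    (union, intersection, difference)
}
```

A fast presence query on such a set is simply `a.contains(&kmer)`; the point of a dedicated index is to offer the same operations at genome scale without holding every kmer in memory.
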