# obikmer

`obikmer` is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.

## Subcommands

| Subcommand | Purpose |
|-------------|---------|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `unitig` | Dump unitigs from a built index to stdout (debug) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |

## Constraints

- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
- Input formats: FASTA, FASTQ, gzip, streaming stdin; `index` reads from stdin automatically when no input files are provided (`-` can also be passed explicitly among other paths)

## Parameter constraints (enforced at CLI)

All constraints below are checked by `CommonArgs::validate()` at the start of `superkmer` and `index`. Invalid values exit immediately with an error.

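
The checks tabulated in this section can be illustrated with a minimal sketch. It is not the actual `CommonArgs::validate()` code: the `Params` struct, its field names, the `validate_params` function, and the error messages are all hypothetical, chosen only to show the rules in executable form.

```rust
/// Hypothetical stand-in for the CLI parameters (not obikmer's real type).
#[derive(Debug, Clone, Copy)]
pub struct Params {
    /// kmer size (`--kmer-size`)
    pub k: usize,
    /// minimizer size (`--minimizer-size`)
    pub m: usize,
    /// Findere z (`-z`), only relevant for approximate indexing
    pub z: Option<usize>,
}

/// Illustrative re-statement of the documented constraints.
pub fn validate_params(p: &Params) -> Result<(), String> {
    // k must be odd so that no kmer equals its own reverse complement.
    if p.k % 2 == 0 {
        return Err(format!("k = {} must be odd", p.k));
    }
    // k must stay within [11, 31] so the kmer fits in a u64 at 2 bits/base.
    if !(11..=31).contains(&p.k) {
        return Err(format!("k = {} must be in [11, 31]", p.k));
    }
    // Same palindrome argument for the minimizer.
    if p.m % 2 == 0 {
        return Err(format!("m = {} must be odd", p.m));
    }
    // The minimizer must be strictly shorter than the kmer.
    if p.m < 3 || p.m > p.k - 1 {
        return Err(format!("m = {} must satisfy 3 <= m <= k-1", p.m));
    }
    // z is only checked when it is supplied.
    if let Some(z) = p.z {
        if z > p.k - 1 {
            return Err(format!("z = {} must satisfy z <= k-1", z));
        }
    }
    Ok(())
}
```

Returning the first violated rule as an `Err` mirrors the documented behaviour of exiting immediately on an invalid value; a caller would print the message and exit with a non-zero status.
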

| Parameter | Constraint | Reason |
|-----------|-----------|--------|
| k (`--kmer-size`) | odd | even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant |
| k (`--kmer-size`) | k ∈ [11, 31] | 31 is the largest odd k that fits in a u64 at 2 bits/base; k < 11 gives insufficient specificity |
| m (`--minimizer-size`) | odd | same palindrome argument as k |
| m (`--minimizer-size`) | 3 ≤ m ≤ k−1 | minimizer must be strictly shorter than the kmer |
| z (`-z`, Findere, `index --approx` only) | z ≤ k−1 | effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1 |

## Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

| Character | Forbidden | Reason |
|-----------|-----------|--------|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | **allowed** | use shell quoting: `--new-label 'new label=old label'` |

Empty labels are also rejected. Labels derived automatically from the index directory name (when `--label` is omitted) are not validated, since they come from the filesystem and are already safe.

## Priority operations

- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference
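
To make these operations concrete, here is a toy in-memory sketch of canonical kmer counting and set operations. It assumes the 2-bit encoding A=0, C=1, G=2, T=3, which this document does not specify, and it uses standard-library collections rather than obikmer's on-disk layered MPHF index. The function names (`encode_base`, `revcomp`, `count_canonical`, `kmer_set`) are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Encode one base on 2 bits (assumed mapping: A=0, C=1, G=2, T=3).
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None, // N and other ambiguity codes are not encodable
    }
}

/// Reverse complement of a 2-bit packed kmer of length k.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With the assumed encoding, the complement of a code c is 3 - c.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Count canonical kmers of `seq`; windows containing a non-ACGT byte are skipped.
/// Intended for 1 <= k <= 31 so that the kmer fits in a u64.
pub fn count_canonical(seq: &[u8], k: usize) -> HashMap<u64, u32> {
    let mask = (1u64 << (2 * k)) - 1;
    let mut counts: HashMap<u64, u32> = HashMap::new();
    let mut kmer = 0u64;
    let mut valid = 0usize; // number of consecutive encodable bases seen
    for &b in seq {
        match encode_base(b) {
            Some(code) => {
                kmer = ((kmer << 2) | code) & mask;
                valid += 1;
            }
            None => {
                kmer = 0;
                valid = 0;
            }
        }
        if valid >= k {
            // Canonical form: min(kmer, revcomp(kmer)).
            let canon = kmer.min(revcomp(kmer, k));
            *counts.entry(canon).or_insert(0) += 1;
        }
    }
    counts
}

/// Set of distinct canonical kmers of `seq`.
pub fn kmer_set(seq: &[u8], k: usize) -> BTreeSet<u64> {
    count_canonical(seq, k).into_keys().collect()
}
```

With two sets `a` and `b` built by `kmer_set`, the standard `BTreeSet` operators give union (`&a | &b`), intersection (`&a & &b`), and difference (`&a - &b`). Because kmers are stored in canonical form, a sequence and its reverse complement yield the same set.
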