obikmer
obikmer is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.
Subcommands
| Subcommand | Purpose |
|---|---|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `unitig` | Dump unitigs from a built index to stdout (debug) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |
Constraints
- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
- Input formats for `index`/`superkmer`: FASTA (`.fa`, `.fasta`), FASTQ (`.fq`, `.fastq`), GenBank flat file (`.gb`, `.gbk`, `.gbff`), all optionally gzip-compressed; directories expanded recursively; streaming stdin via `-`
- Input formats for `query`: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via `-`
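The 2-bit packing and canonical form listed above can be illustrated with a minimal sketch. The function names (`encode`, `revcomp`, `canonical`) and the code assignment A=0, C=1, G=2, T=3 are assumptions made for this example; obikmer's actual encoding and API may differ.

```rust
/// Pack an ASCII DNA string into a u64, 2 bits per base.
/// ASSUMPTION: A=0, C=1, G=2, T=3 (illustrative, not taken from obikmer).
/// Returns None for non-ACGT characters or sequences longer than 32 bases.
fn encode(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed: u64 = 0;
    for b in seq.bytes() {
        let code = match b {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        packed = (packed << 2) | code;
    }
    Some(packed)
}

/// Reverse complement of a kmer stored in the low 2*k bits.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut fwd = kmer;
    let mut rc: u64 = 0;
    for _ in 0..k {
        // Under the assumed encoding, the complement of a code c is 3 - c.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the kmer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

With this encoding, `ACG` and its reverse complement `CGT` map to the same canonical value, which is what lets both strands share one entry. An even-length kmer such as `ACGT` equals its own reverse complement, which is the palindrome case that the odd-k constraint rules out.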
Parameter constraints (enforced at CLI)
All constraints below are checked by `CommonArgs::validate()` at the start of `superkmer` and `index`. An invalid value makes the command exit immediately with an error.
| Parameter | Constraint | Reason |
|---|---|---|
| `k` (`--kmer-size`) | odd | even k allows palindromic k-mers: `kmer == revcomp(kmer)`, breaking the canonical form invariant |
| `k` (`--kmer-size`) | k ∈ [11, 31] | the next odd value, k = 33, needs 66 bits and no longer fits in a u64 at 2 bits/base; k < 11 gives insufficient specificity |
| `m` (`--minimizer-size`) | odd | same palindrome argument as k |
| `m` (`--minimizer-size`) | 3 ≤ m ≤ k−1 | minimizer must be strictly shorter than the kmer |
| `z` (`-z`, Findere, `index --approx` only) | z ≤ k−1 | effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1 |
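The constraints in this table can be expressed as a short check. This is a sketch only: `check_params` and `effective_size` are hypothetical names invented for illustration and do not reproduce the real `CommonArgs::validate()`, its signature, or its error messages.

```rust
/// Hypothetical stand-in for the CLI validation described in the table.
/// `z` is Some(..) only when approximate (Findere) indexing is requested.
fn check_params(k: u32, m: u32, z: Option<u32>) -> Result<(), String> {
    if k % 2 == 0 {
        return Err(format!("k = {k} must be odd"));
    }
    if !(11..=31).contains(&k) {
        return Err(format!("k = {k} must lie in [11, 31]"));
    }
    if m % 2 == 0 {
        return Err(format!("m = {m} must be odd"));
    }
    // k >= 11 is guaranteed here, so k - 1 cannot underflow.
    if m < 3 || m > k - 1 {
        return Err(format!("m = {m} must satisfy 3 <= m <= {}", k - 1));
    }
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {z} must satisfy z <= {}", k - 1));
        }
    }
    Ok(())
}

/// Effective indexed kmer size s = k - z + 1.
/// Returns None when z exceeds k - 1 (or k is 0), mirroring the table's bound.
fn effective_size(k: u32, z: u32) -> Option<u32> {
    if k == 0 || z > k - 1 {
        None
    } else {
        Some(k - z + 1)
    }
}
```

For example, k = 31 with z = 4 gives an effective size of 28, while z = 31 is rejected because the indexed size would shrink to a single base.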
Genome label constraints
Genome labels are arbitrary Unicode strings with the following restrictions:
| Character | Forbidden | Reason |
|---|---|---|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | allowed | use shell quoting: `--new-label 'new label=old label'` |
Empty labels are also rejected. Labels derived automatically from the index directory name (when --label is omitted) are not validated since they come from the filesystem and are already safe.
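The label rules above amount to a small character filter. The sketch below uses hypothetical helper names (`check_label`, `parse_new_label`) and assumes the `NEW=OLD` argument is split at its first `=`; the actual obikmer parser may be organised differently.

```rust
/// Hypothetical label check following the table: reject empty labels and
/// the characters '/', '=', NUL, newline, carriage return, and tab.
fn check_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("label must not be empty".to_string());
    }
    for c in label.chars() {
        match c {
            '/' | '=' | '\0' | '\n' | '\r' | '\t' => {
                return Err(format!("forbidden character {:?} in label", c));
            }
            _ => {} // spaces and other Unicode characters are allowed
        }
    }
    Ok(())
}

/// Split a `NEW=OLD` argument at the first '=' and validate both halves.
/// ASSUMPTION: splitting at the first '=' is illustrative behaviour.
fn parse_new_label(arg: &str) -> Result<(&str, &str), String> {
    let (new, old) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected NEW=OLD, got {:?}", arg))?;
    check_label(new)?;
    check_label(old)?;
    Ok((new, old))
}
```

Because `=` is forbidden inside labels, an argument such as `a=b=c` is rejected rather than parsed ambiguously, while `'new label=old label'` splits cleanly into two labels containing spaces.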
Priority operations
- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference