obikmer/docmd/index.md
Eric Coissac 0d9be53d1f feat: enforce runtime validation for kmer and minimizer parameters
Introduces `CommonArgs::validate()` to enforce strict constraints on `--kmer-size` (odd, 11–31), `--minimizer-size` (odd, 3–k−1), and `z` (strictly less than k). This validation is applied at the entry point of the `superkmer` and `index` commands to prevent invalid configurations, avoid palindromes, prevent u64 overflow, and ensure positive effective indexing sizes. Documentation is updated to reflect these runtime checks and immediate termination on invalid input.
2026-05-29 09:10:25 +02:00


obikmer

obikmer is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.

Subcommands

| Subcommand | Purpose |
|------------|---------|
| superkmer | Extract super-kmers from a sequence file and write to stdout |
| index | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| merge | Merge multiple built indexes into one |
| rebuild | Filter and compact an existing index into a new single-layer index |
| query | Query an index with sequences and annotate matches |
| dump | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
| annotate | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| distance | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| unitig | Dump unitigs from a built index to stdout (debug) |
| estimate | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| reindex | Convert an index's evidence in-place: exact ↔ approx |
| utils | Miscellaneous index utilities: --new-label NEW=OLD renames a genome label in-place (NEW gets OLD's identity) |

Constraints

  • Target scale: individual genome datasets, tens of Gbases
  • Maximum efficiency in computation, memory, and disk usage
  • k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
  • Canonical form: min(kmer, revcomp(kmer)) reduces strand-symmetric space by half
  • Input formats: FASTA, FASTQ, gzip, streaming stdin; index reads from stdin automatically when no input files are provided (- can also be passed explicitly among other paths)
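
The 2 bits/base packing and the canonical form listed above can be illustrated with a minimal sketch. The base-to-code mapping (A=0, C=1, G=2, T=3) and the function names `encode`, `revcomp`, and `canonical` are assumptions made for this example; they are not taken from the obikmer source, whose internal encoding may differ.

```rust
/// Pack a DNA k-mer into a u64 at 2 bits per base.
/// Assumed mapping for this sketch only: A=0, C=1, G=2, T=3.
/// Returns None for sequences longer than 32 bases or containing other symbols.
pub fn encode(kmer: &str) -> Option<u64> {
    if kmer.len() > 32 {
        return None; // more than 64 bits would be needed
    }
    let mut v: u64 = 0;
    for b in kmer.bytes() {
        let code = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        v = (v << 2) | code;
    }
    Some(v)
}

/// Reverse complement of a packed k-mer of length k (1..=32).
pub fn revcomp(v: u64, k: usize) -> u64 {
    let mut x = v;
    let mut rc: u64 = 0;
    for _ in 0..k {
        // With the mapping above, the complement of a base code c is 3 - c.
        // Reading bases from the low end and pushing them onto rc reverses the order.
        rc = (rc << 2) | (3 - (x & 3));
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
pub fn canonical(v: u64, k: usize) -> u64 {
    v.min(revcomp(v, k))
}
```

Under this mapping, ACG and its reverse complement CGT share the same canonical value, so both strands of a sequence map to one key. An even-length k-mer such as ACGT is its own reverse complement, which is the palindrome case that the odd-k constraint rules out.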

Parameter constraints (enforced at CLI)

All constraints below are checked by CommonArgs::validate() at the start of superkmer and index. An invalid value makes the command exit immediately with an error.

| Parameter | Constraint | Reason |
|-----------|------------|--------|
| k (--kmer-size) | odd | even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant |
| k (--kmer-size) | k ∈ [11, 31] | the next odd value, k = 33, needs 66 bits and overflows u64 at 2 bits/base; k < 11 gives insufficient specificity |
| m (--minimizer-size) | odd | same palindrome argument as k |
| m (--minimizer-size) | 3 ≤ m ≤ k−1 | minimizer must be strictly shorter than the kmer |
| z (-z, Findere, index --approx only) | z ≤ k−1 | effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 0 |
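
As an illustration of the checks in the table above, here is a minimal sketch of such a validation routine. It is not the actual CommonArgs::validate() implementation: the free function `validate_params`, its signature, the error messages, and the helper `effective_kmer_size` are hypothetical and only mirror the documented constraints.

```rust
/// Hypothetical stand-in for the documented parameter checks.
/// `z` is None when the approximate (Findere) mode is not in use.
pub fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 {
        return Err(format!("--kmer-size must be odd, got {}", k));
    }
    if !(11..=31).contains(&k) {
        return Err(format!("--kmer-size must be in [11, 31], got {}", k));
    }
    if m % 2 == 0 {
        return Err(format!("--minimizer-size must be odd, got {}", m));
    }
    // k >= 11 is guaranteed at this point, so k - 1 cannot underflow.
    if m < 3 || m > k - 1 {
        return Err(format!("--minimizer-size must be in [3, {}], got {}", k - 1, m));
    }
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z must be at most {}, got {}", k - 1, z));
        }
    }
    Ok(())
}

/// Effective indexed kmer size in approximate mode: k - z + 1.
/// Only meaningful after validate_params has accepted (k, z).
pub fn effective_kmer_size(k: usize, z: usize) -> usize {
    k - z + 1
}
```

Because m must be odd and k is odd, the bound m ≤ k−1 in practice means m ≤ k−2. Returning a `Result` leaves it to the caller to print the message and exit, which matches the documented behaviour of terminating before any work starts.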

Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

| Character | Forbidden | Reason |
|-----------|-----------|--------|
| / | yes | filesystem path separator |
| = | yes | --new-label parser separator |
| \0 | yes | null byte |
| \n \r \t | yes | break CSV output |
| spaces | allowed | use shell quoting: --new-label 'new label=old label' |

Empty labels are also rejected. Labels derived automatically from the index directory name (when --label is omitted) are not validated since they come from the filesystem and are already safe.
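
The label rules above can be sketched as a small checker, together with a parser for the NEW=OLD argument of --new-label. Both `validate_label` and `parse_new_label` are hypothetical names written for this example. Splitting on the first `=` and validating only the NEW side are assumptions, not documented behaviour of obikmer.

```rust
/// Reject empty labels and labels containing a forbidden character.
pub fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("label must not be empty".to_string());
    }
    for c in label.chars() {
        let why = match c {
            '/' => "filesystem path separator",
            '=' => "--new-label parser separator",
            '\0' => "null byte",
            '\n' | '\r' | '\t' => "breaks CSV output",
            _ => continue, // spaces and other Unicode characters are allowed
        };
        return Err(format!("forbidden character {:?} in label: {}", c, why));
    }
    Ok(())
}

/// Parse a NEW=OLD argument (assumption: split on the first '=').
/// Only NEW is validated here, since OLD names a label that already exists.
pub fn parse_new_label(arg: &str) -> Result<(String, String), String> {
    let (new, old) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected NEW=OLD, got {:?}", arg))?;
    validate_label(new)?;
    if old.is_empty() {
        return Err("OLD label must not be empty".to_string());
    }
    Ok((new.to_string(), old.to_string()))
}
```

Forbidding `=` inside labels is what keeps the NEW side of the argument unambiguous: the first `=` always marks the end of the new label, even when the label contains spaces.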

Priority operations

  • Kmer counting (frequencies)
  • Fast search / query
  • Set operations: union, intersection, difference