obikmer/docmd/index.md
Eric Coissac ce45e2fbe1 refactor: centralize k-mer filtering logic and add validation
Refactor shared `FilterArgs` and `build_group_filter` to return a `Result` with explicit validation for fraction bounds, min/max ordering, and count constraints. Update conditional defaults for `--min-frac` and `--max-outgroup-count` to depend on explicit quorum flags, preventing silent configuration conflicts. Update documentation and MkDocs navigation to reflect the new centralized k-mer filtering system across `rebuild`, `dump`, and `unitig` commands.
2026-06-09 10:22:25 +02:00


obikmer

obikmer is a Rust tool for manipulating, counting, indexing, and performing set operations on DNA sequences represented as k-mer sets.

Subcommands

| Subcommand | Purpose |
|------------|---------|
| superkmer | Extract super-kmers from a sequence file and write to stdout |
| index | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| merge | Merge multiple built indexes into one |
| rebuild | Filter and compact an existing index into a new single-layer index; supports the shared kmer filtering system |
| query | Query an index with sequences and annotate matches |
| dump | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared kmer filtering system; --head N limits output to the first N k-mers |
| annotate | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| distance | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; --presence-threshold N sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1) |
| unitig | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared kmer filtering system |
| estimate | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| reindex | Convert an index's evidence in-place: exact ↔ approx |
| utils | Miscellaneous index utilities: --new-label NEW=OLD renames a genome label; --upgrade-index adds missing layer_meta.json to old indexes |
| pack | Pack per-column matrix files into single-file format to reduce query I/O |

Constraints

  • Target scale: individual genome datasets, tens of Gbases
  • Maximum efficiency in computation, memory, and disk usage
  • k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
  • Canonical form: min(kmer, revcomp(kmer)) reduces strand-symmetric space by half
  • Input formats for index/superkmer: FASTA (.fa, .fasta), FASTQ (.fq, .fastq), GenBank flat file (.gb, .gbk, .gbff), all optionally gzip-compressed; directories expanded recursively; streaming stdin via -
  • Input formats for query: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via -
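The 2-bits-per-base packing and the canonical form listed above can be sketched as follows. This is a minimal illustration, not obikmer's actual code: the function names and the A=0, C=1, G=2, T=3 encoding are assumptions chosen for the example.

```rust
/// Pack a DNA string into a u64 at 2 bits per base.
/// Assumed encoding (illustrative only): A=0, C=1, G=2, T=3.
/// Returns None for sequences longer than 32 bases or containing non-ACGT bytes.
fn encode_kmer(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for b in seq.bytes() {
        let code = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        kmer = (kmer << 2) | code;
    }
    Some(kmer)
}

/// Reverse complement of a packed k-mer of length k (1 <= k <= 32).
/// With the encoding above, the complement of a base code c is 3 - c.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut x = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (x & 3));
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a k-mer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Under these assumptions, ACG and its reverse complement CGT map to the same canonical value, so both strands of a sequence contribute the same k-mer.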

Parameter constraints (enforced at CLI)

All constraints below are checked by CommonArgs::validate() at the start of superkmer and index. An invalid value causes the command to exit immediately with an error.

| Parameter | Constraint | Reason |
|-----------|------------|--------|
| k (--kmer-size) | odd | even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant |
| k (--kmer-size) | k ∈ [11, 31] | k > 31 overflows u64 at 2 bits/base; k < 11 gives insufficient specificity |
| m (--minimizer-size) | odd | same palindrome argument as k |
| m (--minimizer-size) | 3 ≤ m ≤ k−1 | minimizer must be strictly shorter than the kmer |
| z (-z, Findere, index --approx only) | z ≤ k−1 | effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 0 |
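The checks in the table above could be expressed roughly as below. This is a sketch under stated assumptions: the Params struct, its field names, and the error messages are hypothetical, and the real CommonArgs::validate() may be organised differently.

```rust
/// Hypothetical parameter bundle; names are illustrative, not obikmer's API.
struct Params {
    k: u32,         // --kmer-size
    m: u32,         // --minimizer-size
    z: Option<u32>, // -z, only meaningful for approximate indexing
}

/// Apply the documented constraints in order and report the first violation.
fn validate(p: &Params) -> Result<(), String> {
    if p.k % 2 == 0 {
        return Err(format!("k = {} must be odd", p.k));
    }
    if !(11..=31).contains(&p.k) {
        return Err(format!("k = {} must be in [11, 31]", p.k));
    }
    if p.m % 2 == 0 {
        return Err(format!("m = {} must be odd", p.m));
    }
    // k >= 11 is guaranteed here, so k - 1 cannot underflow.
    if p.m < 3 || p.m > p.k - 1 {
        return Err(format!("m = {} must satisfy 3 <= m <= k-1", p.m));
    }
    if let Some(z) = p.z {
        if z > p.k - 1 {
            return Err(format!("z = {} must satisfy z <= k-1", z));
        }
    }
    Ok(())
}
```

Checking the range of k before any subtraction keeps the later k − 1 comparisons safe for unsigned integers.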

Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

| Character | Forbidden | Reason |
|-----------|-----------|--------|
| / | yes | filesystem path separator |
| = | yes | --new-label parser separator |
| \0 | yes | null byte |
| \n \r \t | yes | break CSV output |
| spaces | allowed | use shell quoting: --new-label 'new label=old label' |

Empty labels are also rejected. Labels derived automatically from the index directory name (when --label is omitted) are not validated since they come from the filesystem and are already safe.
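As an illustration of these rules, a label check and a NEW=OLD parser might look like the sketch below. Both function names are hypothetical and the error strings are invented for the example; only the forbidden character set and the NEW=OLD syntax come from this page.

```rust
/// Reject empty labels and labels containing a forbidden character.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("label must not be empty".to_string());
    }
    for c in label.chars() {
        if matches!(c, '/' | '=' | '\0' | '\n' | '\r' | '\t') {
            return Err(format!("forbidden character {:?} in label", c));
        }
    }
    Ok(())
}

/// Split a `NEW=OLD` argument at the first '=' and validate both halves.
/// Because '=' is forbidden inside labels, the split is unambiguous.
fn parse_rename(arg: &str) -> Result<(String, String), String> {
    let (new, old) = arg
        .split_once('=')
        .ok_or_else(|| "expected NEW=OLD".to_string())?;
    validate_label(new)?;
    validate_label(old)?;
    Ok((new.to_string(), old.to_string()))
}
```

Spaces pass through untouched, so 'new label=old label' parses into the two labels once the shell has removed the quotes.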

Priority operations

  • Kmer counting (frequencies)
  • Fast search / query
  • Set operations: union, intersection, difference
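For readers unfamiliar with these operations, the sketch below shows counting and the three set operations on packed k-mers using only the Rust standard library. It illustrates the semantics only; obikmer's index structures (for example the layered MPHF) are not modelled here, and the function names are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Count how many times each packed k-mer occurs.
fn count_kmers(kmers: &[u64]) -> HashMap<u64, u32> {
    let mut counts = HashMap::new();
    for &km in kmers {
        *counts.entry(km).or_insert(0) += 1;
    }
    counts
}

/// Return (union, intersection, difference a \ b) as sorted vectors.
fn set_ops(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> (Vec<u64>, Vec<u64>, Vec<u64>) {
    (
        a.union(b).copied().collect(),
        a.intersection(b).copied().collect(),
        a.difference(b).copied().collect(),
    )
}
```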