Eric Coissac e66adef23d feat: add select command for genome column projection and aggregation

Introduces the `select` CLI command to project and aggregate genome-level k-mer data by column. Adds `filter` as an alias for `rebuild`. The implementation uses parallel partition processing, supports metadata-driven grouping with configurable aggregation operators, and performs atomic in-place rewrites or filtered exports. Updates documentation and navigation accordingly.

2026-06-09 15:09:47 +02:00

obikmer

obikmer is a Rust tool for manipulating, counting, and indexing DNA sequences represented as k-mer sets, and for performing set operations on them.

Subcommands

Subcommand	Purpose
`superkmer`	Extract super-kmers from a sequence file and write to stdout
`index`	Build a complete genome index (scatter → dereplicate → count → layered MPHF)
`merge`	Merge multiple built indexes into one
`rebuild` / `filter`	Filter and compact an existing index into a new single-layer index; supports the shared kmer filtering system (`filter` is an alias for `rebuild`)
`query`	Query an index with sequences and annotate matches
`dump`	Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the shared kmer filtering system; `--head N` limits output to the first N k-mers
`annotate`	Add or update genome metadata from a CSV file; or dump metadata as CSV
`distance`	Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees; `--presence-threshold N` sets the minimum count to consider a k-mer present when computing Jaccard on count indexes (default 1)
`unitig`	Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the shared kmer filtering system
`select`	Project and/or aggregate genome columns into a new or in-place index; the column-axis counterpart of `filter` (see select)
`estimate`	Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing
`reindex`	Convert an index's evidence in-place: exact ↔ approx
`utils`	Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes
`pack`	Pack per-column matrix files into single-file format to reduce query I/O
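
The `--presence-threshold N` option of `distance` can be illustrated with a small sketch. The function below is hypothetical and is not obikmer's implementation: it assumes two genomes are given as per-k-mer count columns of equal length, treats a k-mer as present when its count is at least the threshold, and returns the Jaccard similarity (the corresponding distance would be one minus this value).

```rust
/// Jaccard similarity between two count columns (hypothetical helper).
/// A k-mer is considered present in a genome when its count >= `threshold`.
/// Returns None when no k-mer is present in either genome (empty union).
fn jaccard_with_threshold(a: &[u32], b: &[u32], threshold: u32) -> Option<f64> {
    assert_eq!(a.len(), b.len(), "both columns must cover the same k-mers");
    let mut inter = 0u64;
    let mut union = 0u64;
    for (&ca, &cb) in a.iter().zip(b) {
        let present_a = ca >= threshold;
        let present_b = cb >= threshold;
        if present_a && present_b {
            inter += 1;
        }
        if present_a || present_b {
            union += 1;
        }
    }
    if union == 0 {
        None
    } else {
        Some(inter as f64 / union as f64)
    }
}
```

With the default threshold of 1, any non-zero count makes a k-mer present; raising the threshold ignores low-count k-mers, which changes both the intersection and the union.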

Constraints

Target scale: individual genome datasets, tens of Gbases
Maximum efficiency in computation, memory, and disk usage
k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
Canonical form: min(kmer, revcomp(kmer)); a k-mer and its reverse complement share one key, which halves the k-mer space
Input formats for index/superkmer: FASTA (.fa, .fasta), FASTQ (.fq, .fastq), GenBank flat file (.gb, .gbk, .gbff), all optionally gzip-compressed; directories expanded recursively; streaming stdin via -
Input formats for query: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via -
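
The 2 bits/base packing and the canonical form can be sketched as follows. This is a minimal illustration, not obikmer's code: the base mapping (A=0, C=1, G=2, T=3) and the function names are assumptions, chosen so that the complement of a base code is `3 - code`.

```rust
/// Encode one nucleotide on 2 bits (assumed mapping: A=0, C=1, G=2, T=3).
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None, // N and other ambiguity codes do not fit on 2 bits
    }
}

/// Pack a k-mer of at most 31 bases into a u64, first base in the highest used bits.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 31 {
        return None;
    }
    let mut v = 0u64;
    for &b in seq {
        v = (v << 2) | encode_base(b)?;
    }
    Some(v)
}

/// Reverse complement of a packed k-mer of length `k`.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // Take the last base, complement it, append it to the result.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Under this mapping, `ACG` and its reverse complement `CGT` yield the same canonical value. An even-length k-mer such as `ACGT` equals its own reverse complement, which is the palindrome case that the odd-k requirement rules out.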

Parameter constraints (enforced at CLI)

All constraints below are checked by CommonArgs::validate() at the start of superkmer and index. An invalid value makes the command exit immediately with an error.

Parameter	Constraint	Reason
k (`--kmer-size`)	odd	even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant
k (`--kmer-size`)	k ∈ [11, 31]	k > 31 overflows u64 at 2 bits/base (the next odd value, 33, needs 66 bits); k < 11 gives insufficient specificity
m (`--minimizer-size`)	odd	same palindrome argument as k
m (`--minimizer-size`)	3 ≤ m ≤ k−1	minimizer must be strictly shorter than the kmer
z (`-z`, Findere, `index --approx` only)	z ≤ k−1	effective indexed kmer size is k−z+1; z ≥ k would make it ≤ 1
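
The rules in the table can be expressed as one small check. The sketch below is an assumption about shape only: `validate_params` is a hypothetical free function written for illustration, not the actual `CommonArgs::validate()`, and the error messages are invented.

```rust
/// Check k, m and the optional Findere z against the constraints in the table.
/// Hypothetical stand-in for illustration; returns the first violated rule.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 {
        return Err(format!("k = {k} must be odd"));
    }
    if !(11..=31).contains(&k) {
        return Err(format!("k = {k} must be in [11, 31]"));
    }
    if m % 2 == 0 {
        return Err(format!("m = {m} must be odd"));
    }
    // k >= 11 at this point, so k - 1 cannot underflow.
    if m < 3 || m > k - 1 {
        return Err(format!("m = {m} must satisfy 3 <= m <= {}", k - 1));
    }
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {z} must satisfy z <= {}", k - 1));
        }
    }
    Ok(())
}
```

For example, `validate_params(31, 11, None)` passes, while `validate_params(32, 11, None)` fails on the oddness rule before the range rule is reached.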

Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

Character	Forbidden	Reason
`/`	yes	filesystem path separator
`=`	yes	`--new-label` parser separator
`\0`	yes	null byte
`\n` `\r` `\t`	yes	break CSV output
spaces	allowed	use shell quoting: `--new-label 'new label=old label'`

Empty labels are also rejected. Labels derived automatically from the index directory name (when --label is omitted) are not validated since they come from the filesystem and are already safe.
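
A label check along these lines could look like the sketch below. Both functions are hypothetical and only mirror the table above; in particular, splitting the `--new-label` argument on the first `=` is an assumption about how `NEW=OLD` is parsed.

```rust
/// Reject empty labels and labels containing a forbidden character.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("label must not be empty".to_string());
    }
    for c in label.chars() {
        if matches!(c, '/' | '=' | '\0' | '\n' | '\r' | '\t') {
            return Err(format!("forbidden character {c:?} in label {label:?}"));
        }
    }
    Ok(())
}

/// Split a `NEW=OLD` argument on the first '=' and validate both halves.
/// Because '=' is forbidden inside labels, the split is unambiguous.
fn parse_new_label(arg: &str) -> Result<(String, String), String> {
    let (new, old) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected NEW=OLD, got {arg:?}"))?;
    validate_label(new)?;
    validate_label(old)?;
    Ok((new.to_string(), old.to_string()))
}
```

Spaces pass the check, so `parse_new_label("new label=old label")` yields the pair `("new label", "old label")`, whereas `a=b=c` is rejected because the second half would contain `=`.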

Priority operations

Kmer counting (frequencies)
Fast search / query
Set operations: union, intersection, difference
