Eric Coissac 26ab165807 refactor: add rolling buffer methods and document label constraints

Added `is_empty()`, `clear()`, and `iter()` methods to the rolling statistics buffer to enable standard traversal and state reset operations. Documented genome label constraints, specifying forbidden characters, empty label rejection, space quoting requirements, and auto-derived label bypass rules. Additionally, updated doc comments and added `#[allow(dead_code)]` attributes for `kmer_offset` and `n_kmers` fields to suppress compiler warnings while reserving them for future `--detail` coverage vector logic.

2026-05-26 15:40:23 +02:00


obikmer

obikmer is a Rust tool for manipulating, counting, and indexing DNA sequences represented as kmer sets, and for performing set operations on them.

Subcommands

| Subcommand | Purpose |
|---|---|
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `unitig` | Dump unitigs from a built index to stdout (debug) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |

Constraints

- Target scale: individual genome datasets, tens of Gbases
- Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; a kmer fits in a `u64` (2 bits/base)
- Canonical form: min(kmer, revcomp(kmer)) reduces the strand-symmetric space by half
- Input formats: FASTA, FASTQ, gzip, streaming stdin; `index` reads from stdin automatically when no input files are provided (`-` can also be passed explicitly among other paths)
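
The packing and canonical-form constraints above can be illustrated with a minimal sketch. It assumes the common A=0, C=1, G=2, T=3 code and a first-base-in-high-bits layout; the function names (`encode_base`, `encode_kmer`, `revcomp`, `canonical`) are hypothetical and are not taken from the obikmer source, whose actual encoding may differ.

```rust
/// Encode one nucleotide on 2 bits. The A=0, C=1, G=2, T=3 mapping is an
/// assumption made for this sketch, not necessarily obikmer's encoding.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None, // ambiguous or invalid symbol (e.g. N)
    }
}

/// Pack a kmer of at most 32 bases into a u64, first base in the highest used bits.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    seq.iter()
        .try_fold(0u64, |acc, &b| Some((acc << 2) | encode_base(b)?))
}

/// Reverse complement of a packed kmer of length k.
/// With the mapping above, complementing a base is a bitwise NOT of its 2 bits.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut x = !kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (x & 3); // take the last base, append it to the result
        x >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Because k is odd, no kmer can equal its own reverse complement (the middle base would have to be its own complement), so every kmer pairs with a distinct partner and the canonical space is exactly half of the 4^k possible kmers.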

Genome label constraints

Genome labels are arbitrary Unicode strings with the following restrictions:

| Character | Forbidden | Reason or note |
|---|---|---|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | no (allowed) | use shell quoting: `--new-label 'new label=old label'` |

Empty labels are also rejected. Labels derived automatically from the index directory name (when `--label` is omitted) are not validated, because they come from the filesystem and are already safe.
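
The rules above can be sketched as a small validator. This is an illustration only: `LabelError`, `validate_label`, and `parse_rename` are hypothetical names, and the real obikmer checks and error reporting may be organised differently.

```rust
/// Hypothetical error type, used only for this illustration.
#[derive(Debug, PartialEq)]
enum LabelError {
    Empty,
    Forbidden(char),
}

/// Apply the documented rules: reject empty labels and any label
/// containing '/', '=', NUL, newline, carriage return, or tab.
fn validate_label(label: &str) -> Result<(), LabelError> {
    if label.is_empty() {
        return Err(LabelError::Empty);
    }
    let forbidden = label
        .chars()
        .find(|&c| matches!(c, '/' | '=' | '\0' | '\n' | '\r' | '\t'));
    match forbidden {
        Some(c) => Err(LabelError::Forbidden(c)),
        None => Ok(()),
    }
}

/// Split a `NEW=OLD` argument as passed to `--new-label`.
/// Since '=' is forbidden inside labels, a single split is unambiguous:
/// any extra '=' ends up in OLD and fails validation.
fn parse_rename(arg: &str) -> Option<(&str, &str)> {
    let (new, old) = arg.split_once('=')?;
    if validate_label(new).is_ok() && validate_label(old).is_ok() {
        Some((new, old))
    } else {
        None
    }
}
```

Under these assumptions, `parse_rename("new label=old label")` yields the pair `("new label", "old label")`, while `parse_rename("a=b=c")` is rejected.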

Priority operations

- Kmer counting (frequencies)
- Fast search / query
- Set operations: union, intersection, difference
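
To make these operations concrete, here is a toy sketch on kmers already packed as `u64` values, using only standard-library collections. It shows the semantics only: obikmer's indexes are built on a layered MPHF (see the `index` subcommand), not on `HashMap` or `BTreeSet`, and the function names `count_kmers` and `set_ops` are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Count how many times each packed kmer occurs (kmer frequencies).
fn count_kmers(kmers: &[u64]) -> HashMap<u64, u32> {
    let mut counts = HashMap::new();
    for &kmer in kmers {
        *counts.entry(kmer).or_insert(0) += 1;
    }
    counts
}

/// Union, intersection, and difference (a minus b) of two kmer sets,
/// each returned in ascending order.
fn set_ops(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> (Vec<u64>, Vec<u64>, Vec<u64>) {
    (
        a.union(b).copied().collect(),
        a.intersection(b).copied().collect(),
        a.difference(b).copied().collect(),
    )
}
```

A query then reduces to a membership test (`a.contains(&kmer)`) or a count lookup (`counts.get(&kmer)`) on these structures.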
