# obikmer

`obikmer` is a Rust toolkit for indexing, querying, and comparing DNA sequences represented as sets of k-mers. It targets individual genome datasets (tens of Gbases) with maximum efficiency in computation, memory, and disk usage.

## Key principles

**Compact k-mer encoding.** Each k-mer is stored in a `u64` at 2 bits/base. k is odd, k ∈ [11, 31], fixed at runtime. The canonical form `min(kmer, revcomp(kmer))` halves the effective space by collapsing both strands.

**Superkmer-based partitioning.** Sequences are decomposed into superkmers: maximal runs of k-mers sharing the same minimizer. Superkmers route naturally to partitions via the minimizer hash, enabling partition-parallel indexing and querying with no cross-partition communication.

**Layered MPHF index.** Each partition holds a stack of layers. Each layer is a minimal perfect hash function (MPHF) over the k-mers of one input genome, paired with a per-genome presence/count matrix. Queries scatter k-mers to their partition, probe each layer in order, and aggregate results.

**Approximate indexing (Findere).** With `-z Z`, the index stores k-mers of size `s = k − z + 1` instead of k. At query time, results are produced at size s, then a per-genome sliding window of size z aggregates z consecutive s-mer hits into one confirmed k-mer answer. This reduces the false-positive rate from `1/2^b` per s-mer to `1/2^(b·z)` per k-mer, at the cost of z−1 unconfirmed positions at each sequence break. The aggregation window spans the full query sequence, not individual superkmers, to avoid false negatives at superkmer boundaries.

**Multi-genome.** A single index can hold any number of genomes. Each k-mer slot carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees, and classification are derived from these vectors without rebuilding the index.
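
The compact encoding and canonical form described under Key principles can be illustrated with a minimal sketch. The function names (`encode_base`, `encode_kmer`, `revcomp`, `canonical`) and the bit assignment A=0, C=1, G=2, T=3 are assumptions made for illustration; they are not taken from the obikmer source, which may use a different layout.

```rust
/// Map a nucleotide to a 2-bit code (assumed: A=0, C=1, G=2, T=3).
/// Returns None for any non-ACGT character.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer of at most 32 bases into a u64, 2 bits per base,
/// with the first base in the most significant occupied bits.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &b in seq {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Reverse complement of a packed k-mer of length k.
/// With the coding above, the complement of a base code c is 3 - c.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut rest = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (rest & 3));
        rest >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Under these assumptions, `ACG` and its reverse complement `CGT` pack to different integers but share one canonical value, which is what lets both strands map to a single index entry. Because k is odd, no k-mer equals its own reverse complement, so every strand pair collapses to exactly one representative.
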
## Input formats

| Command | Formats accepted |
|---|---|
| `index`, `superkmer` | FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file (.gb .gbk .gbff), all optionally gzip-compressed. Directories expanded recursively. Streaming stdin via `-`. |
| `query` | FASTA, FASTQ, optionally gzip-compressed. Stdin via `-`. |

Non-ACGT characters act as hard breaks between k-mer segments in all formats.

## Commands

| Command | Role |
|---|---|
| `index` | Build a genome index from sequence files. Runs scatter → dereplicate → count → layered MPHF. Resumes automatically if interrupted. |
| `merge` | Merge multiple independently built indexes into one. |
| `rebuild` | Filter and compact an existing index: apply count thresholds, drop layers, rewrite as a single-layer index. |
| `reindex` | Convert evidence in-place across all layers: exact (evidence.bin) ↔ approximate (fingerprint.bin). Does not touch the MPHF or unitigs. |
| `query` | Query an index with FASTA/FASTQ sequences. Annotates each sequence with per-genome k-mer match counts and optional per-position coverage vectors (`--detail`). Parallel over sequence chunks. |
| `distance` | Compute a pairwise Bray-Curtis or Jaccard distance matrix between all indexed genomes. Optionally outputs a Newick NJ or UPGMA tree. |
| `annotate` | Add or update genome metadata (taxonomy, etc.) from a CSV file; or dump the current metadata as CSV. |
| `estimate` | Dry-run: resolve and print approximate-index parameters (z, evidence bits b, FP rates) given any two of (b, z, fp). Does not touch any index. |
| `dump` | Dump all indexed k-mers as CSV with per-genome counts or presence flags. |
| `superkmer` | Extract superkmers from a sequence file and write to stdout. Diagnostic / pipeline use. |
| `unitig` | Dump the unitig sequences stored in a built index. Debug use. |
| `utils` | Miscellaneous utilities. `--new-label NEW=OLD` renames a genome label in-place. |
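
The relation that `estimate` works with, a per-k-mer false-positive rate of `1/2^(b·z)` and a stored s-mer size of `s = k − z + 1`, can be written out as plain arithmetic. The sketch below is illustrative only: the helper names (`kmer_fp_rate`, `min_evidence_bits`, `smer_size`) are hypothetical, and how obikmer itself resolves or rounds the parameters is an assumption, not something documented here.

```rust
/// Per-k-mer false-positive rate when z consecutive s-mer hits, each
/// checked against a b-bit fingerprint, must all match: 2^-(b*z).
fn kmer_fp_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}

/// Smallest b such that 2^-(b*z) <= target_fp, for a given z >= 1.
/// (Hypothetical resolution rule: round the required bits up.)
fn min_evidence_bits(target_fp: f64, z: u32) -> u32 {
    assert!(target_fp > 0.0 && z >= 1);
    // Total fingerprint bits needed across the window of z s-mers.
    let total_bits = (-target_fp.log2()).ceil().max(0.0) as u32;
    // Ceiling division spreads those bits over the z s-mers.
    (total_bits + z - 1) / z
}

/// Size of the s-mers actually stored in the index: s = k - z + 1.
fn smer_size(k: u32, z: u32) -> u32 {
    k - z + 1
}
```

For instance, with k = 31 and z = 5 the index would store 27-mers, and a target rate of 1e-6 per k-mer needs at least 4 evidence bits under this rule, since 2^-20 is below 1e-6 while 2^-15 is not.
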
## Quick start

```sh
# Build an exact index for each genome independently
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index_a/
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index_b/

# Merge into a single multi-genome index
obikmer merge --output index/ index_a/ index_b/

# Convert to approximate index (z=5, 8-bit fingerprints)
obikmer reindex --approx -z 5 --evidence-bits 8 index/

# Query reads
obikmer query index/ reads.fq.gz > annotated.fa

# Pairwise distances
obikmer distance index/ > distances.tsv
```

## Parameter constraints

| Parameter | Constraint |
|---|---|
| k (`--kmer-size`) | odd, 11 ≤ k ≤ 31 |
| m (`--minimizer-size`) | odd, 3 ≤ m ≤ k−1 |
| z (`-z`, `--approx` only) | 1 ≤ z ≤ k−1 |

## Documentation

Extended architecture and implementation notes are in `docmd/`. Build with `make doc` (requires Python + MkDocs Material).