2026-05-30 15:54:13 +02:00
# obikmer

`obikmer` is a Rust toolkit for indexing, querying, and comparing DNA sequences
represented as sets of k-mers. It targets individual genome datasets (tens of
Gbases) and is designed to minimize computation, memory, and disk usage.

## Key principles

**Compact k-mer encoding.** Each k-mer is stored in a `u64` at 2 bits/base.
k is odd, with k ∈ [11, 31], and is fixed at runtime. The canonical form
`min(kmer, revcomp(kmer))` halves the effective k-mer space by collapsing the two strands.
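
The encoding above can be sketched as follows. This is a minimal illustration, not obikmer's actual code: the base mapping (A=0, C=1, G=2, T=3), the most-significant-first packing, and the function names `encode`, `revcomp`, and `canonical` are all assumptions made for the example.

```rust
/// Pack an ACGT string into a u64 at 2 bits per base.
/// Assumed mapping: A=0, C=1, G=2, T=3, first base in the highest bits.
/// Returns None for inputs longer than 32 bases or containing a non-ACGT byte.
fn encode(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for &base in seq {
        let code = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        kmer = (kmer << 2) | code;
    }
    Some(kmer)
}

/// Reverse complement of a k-mer of length `k`.
/// With the mapping above, complementing a base is `3 - code`,
/// which is a bitwise NOT of its two bits.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut rest = !kmer; // complement every base at once
    let mut rc = 0u64;
    for _ in 0..k {
        // Move the lowest 2-bit group of `rest` to the low end of `rc`,
        // which reverses the base order.
        rc = (rc << 2) | (rest & 3);
        rest >>= 2;
    }
    rc
}

/// Canonical form: the smaller of a k-mer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

With this mapping, `ACG` encodes to 6 and its reverse complement `CGT` to 27, so both strands share the canonical value 6. Because k is odd, no k-mer can equal its own reverse complement, so the canonical choice is never ambiguous.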

**Superkmer-based partitioning.** Sequences are decomposed into superkmers:
maximal runs of consecutive k-mers sharing the same minimizer. Superkmers route naturally to
partitions via the minimizer hash, enabling partition-parallel indexing and querying
with no cross-partition communication.
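
As an illustration of the decomposition, here is a small sketch that works directly on byte strings. It assumes the minimizer of a k-mer is its lexicographically smallest m-mer on the forward strand, and it uses the standard library's `DefaultHasher` as a stand-in for the partition hash; obikmer's actual minimizer ordering and hash function are not specified here, and the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Smallest m-mer (lexicographic order, an assumption) inside one k-mer.
fn minimizer(kmer: &[u8], m: usize) -> &[u8] {
    kmer.windows(m).min().expect("k-mer must hold at least one m-mer")
}

/// Split `seq` into (minimizer, superkmer) pairs, where each superkmer is a
/// maximal run of consecutive k-mers with the same minimizer value.
fn superkmers(seq: &[u8], k: usize, m: usize) -> Vec<(&[u8], &[u8])> {
    let mut out = Vec::new();
    if m == 0 || m > k || seq.len() < k {
        return out;
    }
    let mut start = 0; // index of the first k-mer of the current run
    let mut current = minimizer(&seq[0..k], m);
    for i in 1..=seq.len() - k {
        let next = minimizer(&seq[i..i + k], m);
        if next != current {
            // k-mers start..=i-1 form one run; the last one ends at i + k - 1.
            out.push((current, &seq[start..i + k - 1]));
            start = i;
            current = next;
        }
    }
    out.push((current, &seq[start..]));
    out
}

/// Route a superkmer to a partition from its minimizer alone.
fn partition(minimizer: &[u8], n_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    minimizer.hash(&mut hasher);
    hasher.finish() % n_partitions
}
```

For `GATTACAGG` with k=5 and m=3, this yields two superkmers: `GATTAC` (minimizer `ATT`) and `TTACAGG` (minimizer `ACA`). Adjacent superkmers overlap by k−1 bases so that no k-mer is lost, and every k-mer of a superkmer lands in the same partition because the partition depends only on the minimizer.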

**Layered MPHF index.** Each partition holds a stack of layers. Each layer is a
minimal perfect hash function (MPHF) over the k-mers of one input genome, paired
with a per-genome presence/count matrix. Queries scatter k-mers to their partition,
probe each layer in order, and aggregate results.
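
The probe-and-aggregate step can be pictured with the toy model below. It is only a sketch under strong assumptions: a `HashMap` stands in for the MPHF (a real MPHF also needs stored evidence to reject k-mers it was not built on), aggregation is assumed to be a per-genome sum, and the `Layer` type and `lookup` function are hypothetical names, not obikmer's API.

```rust
use std::collections::HashMap;

/// One layer of a partition: a stand-in for the MPHF (k-mer -> slot)
/// plus one row of per-genome counts for every slot.
struct Layer {
    slots: HashMap<u64, usize>,
    counts: Vec<Vec<u32>>, // counts[slot][genome]
}

/// Probe every layer in order and sum the per-genome counts of the hits.
/// A k-mer absent from all layers yields an all-zero vector.
fn lookup(layers: &[Layer], kmer: u64, n_genomes: usize) -> Vec<u32> {
    let mut total = vec![0u32; n_genomes];
    for layer in layers {
        if let Some(&slot) = layer.slots.get(&kmer) {
            for (sum, count) in total.iter_mut().zip(&layer.counts[slot]) {
                *sum += *count;
            }
        }
    }
    total
}
```

Because each partition owns its own stack of layers, a lookup like this touches only the partition selected by the k-mer's minimizer.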

**Approximate indexing (Findere).** With `-z Z`, the index stores s-mers of size
`s = k − z + 1` instead of k-mers. At query time, hits are first computed per s-mer;
a per-genome sliding window of size z then aggregates z consecutive s-mer hits into one
confirmed k-mer answer. With b evidence bits per s-mer, this reduces the false-positive
rate from `1/2^b` per s-mer to `1/2^(b·z)` per k-mer, at the cost of z−1 unconfirmed
positions at each sequence break. The aggregation window spans the full query sequence,
not individual superkmers, to avoid false negatives at superkmer boundaries.
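
The window aggregation can be sketched for a single genome as follows. The input is assumed to be one boolean per s-mer position along an unbroken ACGT segment of the query; `confirm_kmers` is a hypothetical name and the quadratic `windows` scan is chosen for clarity, not speed.

```rust
/// Turn per-position s-mer hits into per-position k-mer answers.
/// The k-mer starting at position i is confirmed only if the z s-mers
/// starting at i, i+1, ..., i+z-1 are all hits.
fn confirm_kmers(smer_hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 || smer_hits.len() < z {
        // Fewer than z s-mers: the segment is shorter than k, so it
        // contains no k-mer at all.
        return Vec::new();
    }
    smer_hits
        .windows(z)
        .map(|window| window.iter().all(|&hit| hit))
        .collect()
}
```

A segment of length L has L − s + 1 s-mers and therefore L − s + 1 − (z − 1) = L − k + 1 windows, one per k-mer. A single missing s-mer invalidates every k-mer whose window contains it, which is why a false positive now requires z independent fingerprint collisions in a row.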

**Multi-genome.** A single index can hold any number of genomes. Each k-mer slot
carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees,
and classification are derived from these vectors without rebuilding the index.
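
To show how distances fall out of the per-genome vectors, here is a sketch using the textbook definitions of the Jaccard distance (on presence) and the Bray-Curtis dissimilarity (on counts). The row layout `rows[kmer][genome]` and the function names are assumptions for the example; obikmer may normalize or weight differently.

```rust
/// Jaccard distance between genomes `a` and `b`:
/// 1 - |shared k-mers| / |k-mers present in either genome|.
fn jaccard_distance(rows: &[Vec<u32>], a: usize, b: usize) -> f64 {
    let (mut both, mut either) = (0u64, 0u64);
    for row in rows {
        let (in_a, in_b) = (row[a] > 0, row[b] > 0);
        if in_a && in_b {
            both += 1;
        }
        if in_a || in_b {
            either += 1;
        }
    }
    if either == 0 {
        0.0
    } else {
        1.0 - both as f64 / either as f64
    }
}

/// Bray-Curtis dissimilarity between genomes `a` and `b`:
/// 1 - 2 * sum(min(count_a, count_b)) / (sum(count_a) + sum(count_b)).
fn bray_curtis_distance(rows: &[Vec<u32>], a: usize, b: usize) -> f64 {
    let (mut shared, mut total) = (0u64, 0u64);
    for row in rows {
        shared += u64::from(row[a].min(row[b]));
        total += u64::from(row[a]) + u64::from(row[b]);
    }
    if total == 0 {
        0.0
    } else {
        1.0 - 2.0 * shared as f64 / total as f64
    }
}
```

Both functions make a single pass over the count rows, which is why no index rebuild is needed: the vectors already hold everything the distance requires.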

## Input formats

| Command | Formats accepted |
|---|---|
| `index`, `superkmer` | FASTA (`.fa`, `.fasta`), FASTQ (`.fq`, `.fastq`), GenBank flat file (`.gb`, `.gbk`, `.gbff`), all optionally gzip-compressed. Directories are expanded recursively. Streaming from stdin via `-`. |
| `query` | FASTA, FASTQ, optionally gzip-compressed. Stdin via `-`. |

Non-ACGT characters act as hard breaks between k-mer segments in all formats.
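
The hard-break rule can be sketched as a split on any non-ACGT byte, keeping only segments long enough to contain a k-mer. Treating lowercase bases as valid is an assumption of this example, and `acgt_segments` is a hypothetical name.

```rust
/// Split a sequence at every non-ACGT byte and keep the segments that
/// are at least k bases long. No k-mer ever spans a break.
fn acgt_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|base| !matches!(base.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        .filter(|segment| segment.len() >= k)
        .collect()
}
```

For example, `ACGTNACGNNACGTAC` with k=4 yields the segments `ACGT` and `ACGTAC`; the middle piece `ACG` is dropped because it is shorter than k.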

## Commands

| Command | Role |
|---|---|
| `index` | Build a genome index from sequence files. Runs scatter → dereplicate → count → layered MPHF. Resumes automatically if interrupted. |
| `merge` | Merge multiple independently built indexes into one. |
| `rebuild` | Filter and compact an existing index: apply count thresholds, drop layers, rewrite as a single-layer index. |
| `reindex` | Convert evidence in place across all layers: exact (`evidence.bin`) ↔ approximate (`fingerprint.bin`). Does not touch the MPHF or unitigs. |
| `query` | Query an index with FASTA/FASTQ sequences. Annotates each sequence with per-genome k-mer match counts and optional per-position coverage vectors (`--detail`). Parallel over sequence chunks. |
| `distance` | Compute a pairwise Bray-Curtis or Jaccard distance matrix between all indexed genomes. Optionally outputs a Newick NJ or UPGMA tree. |
| `annotate` | Add or update genome metadata (taxonomy, etc.) from a CSV file, or dump the current metadata as CSV. |
| `estimate` | Dry run: resolve and print approximate-index parameters (z, evidence bits b, FP rates) given any two of (b, z, fp). Does not touch any index. |
| `dump` | Dump all indexed k-mers as CSV with per-genome counts or presence flags. |
| `superkmer` | Extract superkmers from a sequence file and write them to stdout. Diagnostic / pipeline use. |
| `unitig` | Dump the unitig sequences stored in a built index. Debug use. |
| `utils` | Miscellaneous utilities. `--new-label NEW=OLD` renames a genome label in place. |
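
The arithmetic behind `estimate` presumably follows from the relation fp = 2^(−b·z) given under Key principles. The sketch below resolves the missing parameter from the other two; the rounding rule (smallest integer that meets the target) and the function names are assumptions, and the command's real output may differ.

```rust
/// False-positive rate per confirmed k-mer for b evidence bits and window z.
fn kmer_fp_rate(b: u32, z: u32) -> f64 {
    (-(f64::from(b) * f64::from(z))).exp2()
}

/// Smallest integer n >= 1 such that 2^-(n * fixed) <= target_fp.
/// Pass b as `fixed` to solve for z, or z as `fixed` to solve for b.
fn smallest_factor(fixed: u32, target_fp: f64) -> u32 {
    let bits_needed = -target_fp.log2(); // total b*z bits required
    (bits_needed / f64::from(fixed)).ceil().max(1.0) as u32
}
```

For a target rate of 1e-12, about 39.9 bits are needed in total, so 8-bit fingerprints require z = 5, and conversely z = 5 requires 8 bits.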

## Quick start

```sh
# Build an exact index for two genomes
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index/
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index/

# Convert to approximate index (z=5, 8-bit fingerprints)
obikmer reindex --approx -z 5 --evidence-bits 8 index/

# Query reads
obikmer query index/ reads.fq.gz > annotated.fa

# Pairwise distances
obikmer distance index/ > distances.tsv
```

## Parameter constraints

| Parameter | Constraint |
|---|---|
| k (`--kmer-size`) | odd, 11 ≤ k ≤ 31 |
| m (`--minimizer-size`) | odd, 3 ≤ m ≤ k−1 |
| z (`-z`, `--approx` only) | 1 ≤ z ≤ k−1 |
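
These constraints translate directly into a validation routine. The sketch below is illustrative only: `check_params` and its error messages are hypothetical, and obikmer's own argument checking may report errors differently.

```rust
/// Check k, m, and an optional z against the constraints in the table.
fn check_params(k: u32, m: u32, z: Option<u32>) -> Result<(), String> {
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(format!("k={k}: must be odd with 11 <= k <= 31"));
    }
    // k >= 11 here, so k - 1 cannot underflow.
    if m % 2 == 0 || m < 3 || m > k - 1 {
        return Err(format!("m={m}: must be odd with 3 <= m <= k-1"));
    }
    if let Some(z) = z {
        if z < 1 || z > k - 1 {
            return Err(format!("z={z}: must satisfy 1 <= z <= k-1"));
        }
    }
    Ok(())
}
```

Since k is odd, k−1 is even, so the largest odd minimizer size that passes is in fact k−2.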

## Documentation

Extended architecture and implementation notes are in `docmd/`. Build them with
`make doc` (requires Python and MkDocs Material).