Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.
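The short-circuit check described above could look roughly like the sketch below. Only the name `passes_all` comes from the description; the `KmerRow` struct, its `counts` field, and the `Filter` alias are hypothetical stand-ins, not the project's actual types.

```rust
/// Hypothetical kmer row: one count per genome (illustration only).
struct KmerRow {
    counts: Vec<u32>,
}

/// Hypothetical filter type: any predicate over a row.
type Filter = Box<dyn Fn(&KmerRow) -> bool>;

/// Returns true only if every filter accepts the row.
/// `Iterator::all` stops at the first filter that returns false,
/// so later filters are not evaluated for a rejected row.
fn passes_all(row: &KmerRow, filters: &[Filter]) -> bool {
    filters.iter().all(|f| f(row))
}
```

An empty filter list accepts every row, which keeps unfiltered iteration as the default behaviour in this sketch.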
obikmer
obikmer is a Rust toolkit for indexing, querying, and comparing DNA sequences
represented as sets of k-mers. It targets individual genome datasets (tens of
Gbases) with maximum efficiency in computation, memory, and disk usage.
Key principles
Compact k-mer encoding. Each k-mer is stored in a u64 at 2 bits/base.
k is odd, k ∈ [11, 31], fixed at runtime. The canonical form min(kmer, revcomp(kmer))
halves the effective space by collapsing both strands.
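As an illustration of the encoding, here is a minimal sketch of 2-bit packing and canonicalisation. The base-to-code mapping (A=0, C=1, G=2, T=3) and the function names are assumptions made for this example; the mapping obikmer actually uses may differ.

```rust
/// Pack an ACGT k-mer (k <= 32) into a u64 at 2 bits per base.
/// Assumed mapping: A=0, C=1, G=2, T=3. Returns None on any other byte.
fn encode(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    let mut value = 0u64;
    for &base in kmer {
        let code = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        value = (value << 2) | code;
    }
    Some(value)
}

/// Reverse complement of an encoded k-mer of length k.
/// With the mapping above, the complement of a code c is 3 - c.
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut rest = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (rest & 3));
        rest >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

For example, `ACG` encodes to 6 and its reverse complement `CGT` to 27, so both strands map to the canonical value 6. Because k is odd, a k-mer can never equal its own reverse complement, so the two strands always collapse into exactly one representative.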
Superkmer-based partitioning. Sequences are decomposed into superkmers — maximal runs of k-mers sharing the same minimizer. Superkmers route naturally to partitions via the minimizer hash, enabling partition-parallel indexing and querying with no cross-partition communication.
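The decomposition can be sketched as follows. This is a deliberately naive version under stated assumptions: the minimizer of a k-mer is taken to be the smallest canonical m-mer value it contains (real implementations usually order m-mers by a hash), the input is a single ACGT-only segment, and all function names are made up for the example.

```rust
/// Map one base to a 2-bit code (assumed mapping A=0, C=1, G=2, T=3).
fn code(b: u8) -> u64 {
    match b.to_ascii_uppercase() {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        _ => panic!("non-ACGT base: split the sequence first"),
    }
}

/// Canonical 2-bit value of an m-mer: min(forward, reverse complement).
fn canonical_mmer(w: &[u8]) -> u64 {
    let (mut fwd, mut rev) = (0u64, 0u64);
    for (i, &b) in w.iter().enumerate() {
        let c = code(b);
        fwd = (fwd << 2) | c;
        rev |= (3 - c) << (2 * i);
    }
    fwd.min(rev)
}

/// Return half-open (start, end) base ranges of the superkmers of `seq`:
/// maximal runs of consecutive k-mers that share the same minimizer value.
fn superkmers(seq: &[u8], k: usize, m: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    if m == 0 || m > 32 || m > k || seq.len() < k {
        return out;
    }
    let mmers: Vec<u64> = seq.windows(m).map(canonical_mmer).collect();
    let per_kmer = k - m + 1; // number of m-mers inside one k-mer
    let minimizer = |i: usize| *mmers[i..i + per_kmer].iter().min().unwrap();
    let n_kmers = seq.len() - k + 1;
    let (mut start, mut current) = (0, minimizer(0));
    for i in 1..n_kmers {
        let mz = minimizer(i);
        if mz != current {
            // The last k-mer of the finished run starts at i - 1.
            out.push((start, i - 1 + k));
            start = i;
            current = mz;
        }
    }
    out.push((start, seq.len()));
    out
}
```

With `k = 5` and `m = 3`, the sequence `AAAAACCCCC` yields the ranges `(0, 7)`, `(3, 8)`, `(4, 9)` and `(5, 10)`. Consecutive superkmers overlap by k−1 bases, so every k-mer belongs to exactly one superkmer. The minimizer value (or a hash of it) would then select the partition. This sketch recomputes each window minimum from scratch; a production version would use a sliding-minimum structure instead.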
Layered MPHF index. Each partition holds a stack of layers. Each layer is a minimal perfect hash function (MPHF) over the k-mers of one input genome, paired with a per-genome presence/count matrix. Queries scatter k-mers to their partition, probe each layer in order, and aggregate results.
Approximate indexing (Findere). With -z Z, the index stores s-mers of size
s = k − z + 1 instead of full k-mers. At query time, hits are first computed per s-mer, then
a per-genome sliding window of size z aggregates z consecutive s-mer hits into one
confirmed k-mer answer. With b evidence bits, and treating fingerprint collisions as
independent, this reduces the false-positive rate from 1/2^b per s-mer to 1/2^(b·z) per k-mer,
at the cost of z−1 unconfirmed positions at each sequence break. The aggregation window
spans the full query sequence, not individual superkmers, to avoid false negatives at
superkmer boundaries.
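The window aggregation can be sketched for a single genome as below. It assumes the s-mer hits for one query sequence are already available as a slice of booleans in sequence order; the function names are illustrative and are not obikmer's API.

```rust
/// Confirm k-mers from s-mer hits for one genome.
/// A k-mer at position i is confirmed only if the z consecutive
/// s-mers starting at i are all hits (k = s + z - 1).
fn confirm_kmers(smer_hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 || smer_hits.len() < z {
        return Vec::new();
    }
    smer_hits
        .windows(z)
        .map(|window| window.iter().all(|&hit| hit))
        .collect()
}

/// Per-k-mer false-positive rate 1/2^(b*z), assuming independent
/// s-mer false positives at rate 1/2^b each.
fn kmer_fp_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

With k = 31 and z = 5, the index holds 27-mers. For the hit pattern `[true, true, true, false, true, true]` and z = 3, only the first of the four k-mer positions is confirmed. With b = 8 and z = 5, the per-k-mer rate under the independence assumption is 2^-40.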
Multi-genome. A single index can hold any number of genomes. Each k-mer slot carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees, and classification are derived from these vectors without rebuilding the index.
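To show how a distance can be read directly off two per-genome vectors, here is a minimal sketch using the standard Jaccard and Bray-Curtis definitions. It assumes each genome is given as a slice of counts indexed by k-mer slot, and that two empty genomes are at distance 0.0. Both are choices made for this example, not documented obikmer behaviour.

```rust
/// Jaccard distance on presence/absence: 1 - |A ∩ B| / |A ∪ B|.
/// A slot counts as present when its count is non-zero.
fn jaccard_distance(a: &[u32], b: &[u32]) -> f64 {
    let (mut inter, mut union) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        if x > 0 && y > 0 {
            inter += 1;
        }
        if x > 0 || y > 0 {
            union += 1;
        }
    }
    if union == 0 {
        0.0 // assumption: two empty genomes are identical
    } else {
        1.0 - inter as f64 / union as f64
    }
}

/// Bray-Curtis distance on counts: 1 - 2 * sum(min(a, b)) / (sum(a) + sum(b)).
fn bray_curtis_distance(a: &[u32], b: &[u32]) -> f64 {
    let (mut shared, mut total) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        shared += x.min(y) as u64;
        total += x as u64 + y as u64;
    }
    if total == 0 {
        0.0 // assumption: two empty genomes are identical
    } else {
        1.0 - 2.0 * shared as f64 / total as f64
    }
}
```

For counts `[1, 0, 2, 3]` and `[1, 1, 0, 1]`, two of the four occupied slots are shared, so the Jaccard distance is 0.5. The shared abundance is 2 out of a total of 9, so the Bray-Curtis distance is 1 − 4/9 = 5/9.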
Input formats
Command             Formats accepted
─────────────────── ──────────────────────────────────────────────────────────────
index, superkmer    FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file
                    (.gb .gbk .gbff), all optionally gzip-compressed.
                    Directories expanded recursively. Streaming stdin via -.
query               FASTA, FASTQ, optionally gzip-compressed. Stdin via -.
Non-ACGT characters act as hard breaks between k-mer segments in all formats.
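The hard-break rule can be sketched as a split on any non-ACGT byte, keeping only segments long enough to hold at least one k-mer. Accepting lowercase bases is an assumption of this example, and `acgt_segments` is a made-up name.

```rust
/// Split a raw sequence at every non-ACGT byte and keep the segments
/// that can yield at least one k-mer (length >= k).
fn acgt_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|b| !matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        .filter(|segment| segment.len() >= k)
        .collect()
}
```

For `ACGTNACGNNACGTA` with k = 4, the segments `ACGT` and `ACGTA` survive; `ACG` is too short, and no k-mer ever spans an `N`.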
Commands
Command   Role
───────── ────────────────────────────────────────────────────────────────────
index     Build a genome index from sequence files.
          Runs scatter → dereplicate → count → layered MPHF.
          Resumes automatically if interrupted.
merge     Merge multiple independently built indexes into one.
rebuild   Filter and compact an existing index: apply count thresholds,
          drop layers, rewrite as a single-layer index.
reindex   Convert evidence in-place across all layers:
          exact (evidence.bin) ↔ approximate (fingerprint.bin).
          Does not touch the MPHF or unitigs.
query     Query an index with FASTA/FASTQ sequences.
          Annotates each sequence with per-genome k-mer match counts
          and optional per-position coverage vectors (--detail).
          Parallel over sequence chunks.
distance  Compute a pairwise Bray-Curtis or Jaccard distance matrix
          between all indexed genomes.
          Optionally outputs a Newick NJ or UPGMA tree.
annotate  Add or update genome metadata (taxonomy, etc.) from a CSV
          file; or dump the current metadata as CSV.
estimate  Dry-run: resolve and print approximate-index parameters
          (z, evidence bits b, FP rates) given any two of (b, z, fp).
          Does not touch any index.
dump      Dump all indexed k-mers as CSV with per-genome counts or
          presence flags.
superkmer Extract superkmers from a sequence file and write to stdout.
          Diagnostic / pipeline use.
unitig    Dump the unitig sequences stored in a built index. Debug use.
utils     Miscellaneous utilities.
          --new-label NEW=OLD renames a genome label in-place.
Quick start
# Build an exact index for each genome independently
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index_a/
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index_b/
# Merge into a single multi-genome index
obikmer merge --output index/ index_a/ index_b/
# Convert to approximate index (z=5, 8-bit fingerprints)
obikmer reindex --approx -z 5 --evidence-bits 8 index/
# Query reads
obikmer query index/ reads.fq.gz > annotated.fa
# Pairwise distances
obikmer distance index/ > distances.tsv
Parameter constraints
Parameter             Constraint
───────────────────── ──────────────
k (--kmer-size)       odd, 11 ≤ k ≤ 31
m (--minimizer-size)  odd, 3 ≤ m ≤ k−1
z (-z, --approx only) 1 ≤ z ≤ k−1
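The constraints above translate directly into a small validation routine. The sketch below is illustrative only: `check_params` and its error messages are invented for this example and do not describe how obikmer reports invalid parameters.

```rust
/// Check k, m and (optionally) z against the documented constraints.
/// `z` is None when no approximate index is requested.
fn check_params(k: u32, m: u32, z: Option<u32>) -> Result<(), String> {
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(format!("k must be odd with 11 <= k <= 31, got {k}"));
    }
    if m % 2 == 0 || !(3..=k - 1).contains(&m) {
        return Err(format!("m must be odd with 3 <= m <= k-1, got {m}"));
    }
    if let Some(z) = z {
        if !(1..=k - 1).contains(&z) {
            return Err(format!("z must satisfy 1 <= z <= k-1, got {z}"));
        }
    }
    Ok(())
}
```

Since k is odd, k−1 is even, so the largest odd m that passes is in practice k−2.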
Documentation
Extended architecture and implementation notes are in docmd/. Build with
make doc (requires Python + MkDocs Material).