
obikmer

obikmer is a Rust toolkit for indexing, querying, and comparing DNA sequences represented as sets of k-mers. It targets individual genome datasets (tens of Gbases) with maximum efficiency in computation, memory, and disk usage.

Key principles

Compact k-mer encoding. Each k-mer is stored in a u64 at 2 bits/base. k is odd, k ∈ [11, 31], fixed at runtime. The canonical form min(kmer, revcomp(kmer)) halves the effective space by collapsing both strands.
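The encoding above can be sketched in a few lines of Rust. This is an illustrative sketch, not obikmer's actual code: the function names and the A=0, C=1, G=2, T=3 code assignment are assumptions made for the example.

```rust
/// Encode one nucleotide on 2 bits; `None` for anything that is not A, C, G or T.
/// The A=0, C=1, G=2, T=3 assignment is an assumption of this sketch.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer (1 <= k <= 31) into the low 2k bits of a u64.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 31 {
        return None;
    }
    seq.iter()
        .try_fold(0u64, |acc, &b| Some((acc << 2) | encode_base(b)?))
}

/// Reverse complement of a packed k-mer of length k.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With the codes above, the complement of a base is 3 - code.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

With this encoding, `ACG` and its reverse complement `CGT` both map to the same canonical value. An odd k guarantees that no k-mer is its own reverse complement, so the two strands of a k-mer always encode to two distinct values.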

Superkmer-based partitioning. Sequences are decomposed into superkmers — maximal runs of k-mers sharing the same minimizer. Superkmers route naturally to partitions via the minimizer hash, enabling partition-parallel indexing and querying with no cross-partition communication.
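A minimal sketch of the decomposition, under simplifying assumptions: the minimizer is taken here as the lexicographically smallest m-mer on the forward strand, and the partition is chosen with the standard library hasher. obikmer's real minimizer ordering and hash function may differ; `minimizer`, `superkmers` and `partition_of` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimizer of a k-mer: here, its lexicographically smallest m-mer
/// (forward strand only, a simplification for this sketch).
fn minimizer(kmer: &[u8], m: usize) -> &[u8] {
    kmer.windows(m).min().expect("k-mer must be at least m long")
}

/// Split `seq` into superkmers: maximal runs of consecutive k-mers that
/// share the same minimizer. Returns (superkmer, minimizer) pairs.
fn superkmers(seq: &[u8], k: usize, m: usize) -> Vec<(&[u8], &[u8])> {
    let mut out = Vec::new();
    if m == 0 || m > k || seq.len() < k {
        return out;
    }
    let mut start = 0; // position of the first k-mer of the current run
    let mut current = minimizer(&seq[0..k], m);
    for i in 1..=seq.len() - k {
        let mz = minimizer(&seq[i..i + k], m);
        if mz != current {
            // The run holds k-mers start..i, i.e. bases start..(i - 1 + k).
            out.push((&seq[start..i - 1 + k], current));
            start = i;
            current = mz;
        }
    }
    out.push((&seq[start..], current));
    out
}

/// Route a superkmer to a partition from its minimizer alone.
fn partition_of(minimizer: &[u8], n_partitions: u64) -> u64 {
    let mut h = DefaultHasher::new();
    minimizer.hash(&mut h);
    h.finish() % n_partitions
}
```

Because the partition depends only on the minimizer, every occurrence of a given k-mer under a given minimizer lands in the same partition, which is what lets partitions be processed independently.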

Layered MPHF index. Each partition holds a stack of layers. Each layer is a minimal perfect hash function (MPHF) over the k-mers of one input genome, paired with a per-genome presence/count matrix. Queries scatter k-mers to their partition, probe each layer in order, and aggregate results.

Approximate indexing (Findere). With -z Z, the index stores k-mers of size s = k - z + 1 instead of k. At query time, results are produced at size s, then a per-genome sliding window of size z aggregates z consecutive s-mer hits into one confirmed k-mer answer. This reduces the false-positive rate from 1/2^b per s-mer to 1/2^(b·z) per k-mer, at the cost of z - 1 unconfirmed positions at each sequence break. The aggregation window spans the full query sequence, not individual superkmers, to avoid false negatives at superkmer boundaries.
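The sliding-window step can be sketched as follows, for one genome and one uninterrupted query segment. The sketch assumes, as in Findere, that a k-mer is confirmed only when all z of the s-mers it contains were found (with s = k - z + 1, each k-mer spans exactly z s-mers). `confirm_kmers` and `kmer_fp_rate` are hypothetical names, not obikmer functions.

```rust
/// Given one hit flag per s-mer of a query segment (in order), return one
/// flag per k-mer: true only if all z consecutive s-mers of that k-mer hit.
fn confirm_kmers(smer_hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 || smer_hits.len() < z {
        return Vec::new();
    }
    let mut out = Vec::with_capacity(smer_hits.len() - z + 1);
    // Number of hits inside the current window of z s-mers.
    let mut hits_in_window = smer_hits[..z].iter().filter(|&&h| h).count();
    out.push(hits_in_window == z);
    for i in z..smer_hits.len() {
        if smer_hits[i] {
            hits_in_window += 1; // s-mer entering the window
        }
        if smer_hits[i - z] {
            hits_in_window -= 1; // s-mer leaving the window
        }
        out.push(hits_in_window == z);
    }
    out
}

/// False-positive rate per confirmed k-mer, 1 / 2^(b*z).
fn kmer_fp_rate(evidence_bits: u32, z: u32) -> f64 {
    0.5f64.powi((evidence_bits * z) as i32)
}
```

A segment with n s-mers yields n - z + 1 k-mer answers, which is where the z - 1 unconfirmed positions at each sequence break come from. A single missing s-mer invalidates every k-mer window that contains it.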

Multi-genome. A single index can hold any number of genomes. Each k-mer slot carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees, and classification are derived from these vectors without rebuilding the index.
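As an illustration of how distances can be derived from per-genome vectors, here is a sketch of the two standard measures. It assumes each genome is given as a flat vector with one entry per k-mer slot; the actual in-index layout and the function names are not taken from obikmer.

```rust
/// Jaccard distance between two presence vectors (one flag per k-mer slot).
fn jaccard_distance(a: &[bool], b: &[bool]) -> f64 {
    let (mut shared, mut either) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        if x && y {
            shared += 1;
        }
        if x || y {
            either += 1;
        }
    }
    // Two empty sets are treated as identical.
    if either == 0 {
        0.0
    } else {
        1.0 - shared as f64 / either as f64
    }
}

/// Bray-Curtis dissimilarity between two count vectors.
fn bray_curtis(a: &[u32], b: &[u32]) -> f64 {
    let (mut shared, mut total) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        shared += x.min(y) as u64; // counts common to both genomes
        total += x as u64 + y as u64;
    }
    if total == 0 {
        0.0
    } else {
        1.0 - 2.0 * shared as f64 / total as f64
    }
}
```

Both functions need only the two vectors being compared, so a full pairwise matrix can be filled without rebuilding the index.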

Input formats

Command              Formats accepted
───────────────────  ──────────────────────────────────────────────────────────────
index, superkmer     FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file
                     (.gb .gbk .gbff), all optionally gzip-compressed.
                     Directories expanded recursively. Streaming stdin via -.
query                FASTA, FASTQ, optionally gzip-compressed. Stdin via -.

Non-ACGT characters act as hard breaks between k-mer segments in all formats.
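The hard-break rule amounts to splitting each sequence on any non-ACGT byte and keeping only the segments long enough to contain a k-mer. A minimal sketch, assuming lowercase bases are accepted (this is not confirmed for obikmer) and using a hypothetical function name:

```rust
/// Split a sequence at every non-ACGT byte and keep the segments that can
/// hold at least one k-mer. Case-insensitivity is an assumption of this sketch.
fn acgt_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|b| !matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        .filter(|seg| seg.len() >= k)
        .collect()
}
```

For example, with k = 3 the sequence `ACGTNNACNGGTTA` yields the segments `ACGT` and `GGTTA`; the two-base run `AC` is too short to contain a k-mer and is dropped.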

Commands

Command    Role
─────────  ────────────────────────────────────────────────────────────────────
index      Build a genome index from sequence files.
           Runs scatter → dereplicate → count → layered MPHF.
           Resumes automatically if interrupted.
merge      Merge multiple independently built indexes into one.
rebuild    Filter and compact an existing index: apply count thresholds,
           drop layers, rewrite as a single-layer index.
reindex    Convert evidence in-place across all layers:
           exact (evidence.bin) ↔ approximate (fingerprint.bin).
           Does not touch the MPHF or unitigs.
query      Query an index with FASTA/FASTQ sequences.
           Annotates each sequence with per-genome k-mer match counts
           and optional per-position coverage vectors (--detail).
           Parallel over sequence chunks.
distance   Compute a pairwise Bray-Curtis or Jaccard distance matrix
           between all indexed genomes.
           Optionally outputs a Newick NJ or UPGMA tree.
annotate   Add or update genome metadata (taxonomy, etc.) from a CSV
           file; or dump the current metadata as CSV.
estimate   Dry-run: resolve and print approximate-index parameters
           (z, evidence bits b, FP rates) given any two of (b, z, fp).
           Does not touch any index.
dump       Dump all indexed k-mers as CSV with per-genome counts or
           presence flags.
superkmer  Extract superkmers from a sequence file and write to stdout.
           Diagnostic / pipeline use.
unitig     Dump the unitig sequences stored in a built index. Debug use.
utils      Miscellaneous utilities.
           --new-label NEW=OLD  renames a genome label in-place.
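
To illustrate what resolving "any two of (b, z, fp)" can look like, the sketch below uses only the relation fp = 1/2^(b·z) given under approximate indexing. It is not the code behind the estimate command, which may apply other rules or rounding; `resolve` is a hypothetical name.

```rust
/// Given exactly two of (b, z, fp), return the full triple (b, z, fp),
/// where fp is the per-k-mer false-positive rate 1 / 2^(b*z) actually achieved.
fn resolve(b: Option<u32>, z: Option<u32>, fp: Option<f64>) -> Option<(u32, u32, f64)> {
    // A target rate must lie strictly between 0 and 1.
    if let Some(f) = fp {
        if !(f > 0.0 && f < 1.0) {
            return None;
        }
    }
    // Total number of evidence bits (b*z) needed to reach a target rate.
    let needed_bits = |fp: f64| (-fp.log2()).max(1.0);
    let (b, z) = match (b, z, fp) {
        (Some(b), Some(z), None) => (b, z),
        (Some(b), None, Some(fp)) if b > 0 => (b, (needed_bits(fp) / b as f64).ceil() as u32),
        (None, Some(z), Some(fp)) if z > 0 => ((needed_bits(fp) / z as f64).ceil() as u32, z),
        _ => return None, // zero, one or three values supplied, or a zero divisor
    };
    if b == 0 || z == 0 {
        return None;
    }
    Some((b, z, 0.5f64.powi((b * z) as i32)))
}
```

Rounding up means the achieved rate is never worse than the requested one: asking for fp = 1e-9 with b = 8 gives z = 4 and an achieved rate of 1/2^32.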

Quick start

# Build an exact index for each genome independently
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index_a/
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index_b/

# Merge into a single multi-genome index
obikmer merge --output index/ index_a/ index_b/

# Convert to approximate index (z=5, 8-bit fingerprints)
obikmer reindex --approx -z 5 --evidence-bits 8 index/

# Query reads
obikmer query index/ reads.fq.gz > annotated.fa

# Pairwise distances
obikmer distance index/ > distances.tsv

Parameter constraints

Parameter              Constraint
─────────────────────  ──────────────
k  (--kmer-size)       odd, 11 ≤ k ≤ 31
m  (--minimizer-size)  odd, 3 ≤ m ≤ k - 1
z  (-z, --approx only) 1 ≤ z ≤ k - 1
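
The constraints in this table can be checked up front, as in the sketch below. The function name and error messages are illustrative and do not reproduce obikmer's own validation.

```rust
/// Validate k, m and (optionally) z against the constraints listed above.
fn check_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(format!("k = {k}: must be odd with 11 <= k <= 31"));
    }
    // k >= 11 here, so k - 1 cannot underflow.
    if m % 2 == 0 || m < 3 || m > k - 1 {
        return Err(format!("m = {m}: must be odd with 3 <= m <= k - 1"));
    }
    if let Some(z) = z {
        if z < 1 || z > k - 1 {
            return Err(format!("z = {z}: must satisfy 1 <= z <= k - 1"));
        }
    }
    Ok(())
}
```

Since both k and m must be odd, the largest usable minimizer size is in practice k - 2.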

Documentation

Extended architecture and implementation notes are in docmd/. Build with make doc (requires Python + MkDocs Material).
