# obikmer

`obikmer` is a Rust toolkit for indexing, querying, and comparing DNA sequences
represented as sets of k-mers. It targets individual genome datasets (tens of
Gbases) with maximum efficiency in computation, memory, and disk usage.

## Key principles

**Compact k-mer encoding.** Each k-mer is stored in a `u64` at 2 bits/base.
k is odd, k ∈ [11, 31], fixed at runtime. The canonical form `min(kmer, revcomp(kmer))`
halves the effective space by collapsing both strands.

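As an illustration of the encoding described above, here is a minimal sketch of 2-bit packing and canonicalization. The base codes (A=0, C=1, G=2, T=3) and the function names are assumptions made for this example, not necessarily what `obikmer` uses internally.

```rust
/// Encode one nucleotide on 2 bits (assumed codes: A=0, C=1, G=2, T=3).
/// Returns None for any non-ACGT character.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer (1 <= k <= 31) into the low 2k bits of a u64,
/// first base in the most significant position.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 31 {
        return None;
    }
    let mut v = 0u64;
    for &b in seq {
        v = (v << 2) | encode_base(b)?;
    }
    Some(v)
}

/// Reverse complement of a packed k-mer of length k.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With the codes above, the complement of a base is 3 - code.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Because k is odd, a k-mer can never equal its own reverse complement, so the two strands always collapse to exactly one canonical value.
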
**Superkmer-based partitioning.** Sequences are decomposed into superkmers:
maximal runs of k-mers sharing the same minimizer. Superkmers route naturally to
partitions via the minimizer hash, enabling partition-parallel indexing and querying
with no cross-partition communication.

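The decomposition into superkmers can be sketched as follows. This is a naive, quadratic illustration that works on ACGT byte strings; it assumes the minimizer is the lexicographically smallest canonical m-mer of each k-mer, which may differ from the ordering `obikmer` actually uses. All function names are hypothetical.

```rust
/// Reverse complement of an ACGT byte string.
fn revcomp(s: &[u8]) -> Vec<u8> {
    s.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other,
        })
        .collect()
}

/// Canonical form: lexicographic minimum of a word and its reverse complement.
fn canonical(s: &[u8]) -> Vec<u8> {
    let rc = revcomp(s);
    if rc.as_slice() < s { rc } else { s.to_vec() }
}

/// Minimizer of one k-mer: its smallest canonical m-mer (assumed ordering).
fn minimizer(kmer: &[u8], m: usize) -> Vec<u8> {
    kmer.windows(m).map(canonical).min().expect("k-mer shorter than m")
}

/// Split `seq` into superkmers. Each entry is (start, end, minimizer), where
/// start..end is the half-open byte range covered by the run of k-mers.
fn superkmers(seq: &[u8], k: usize, m: usize) -> Vec<(usize, usize, Vec<u8>)> {
    let mut out: Vec<(usize, usize, Vec<u8>)> = Vec::new();
    if m == 0 || m > k || seq.len() < k {
        return out;
    }
    for start in 0..=seq.len() - k {
        let mz = minimizer(&seq[start..start + k], m);
        let extend = matches!(out.last(), Some(last) if last.2 == mz);
        if extend {
            // Same minimizer as the previous k-mer: grow the current superkmer.
            out.last_mut().unwrap().1 = start + k;
        } else {
            out.push((start, start + k, mz));
        }
    }
    out
}
```

Consecutive superkmers overlap by k−1 bases, so every k-mer of the input belongs to exactly one superkmer and can be sent to a single partition.
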
**Layered MPHF index.** Each partition holds a stack of layers. Each layer is a
minimal perfect hash function (MPHF) over the k-mers of one input genome, paired
with a per-genome presence/count matrix. Queries scatter k-mers to their partition,
probe each layer in order, and aggregate results.

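The query path over a layer stack might look roughly like the sketch below. It is a conceptual model only: a `HashMap` stands in for the MPHF, the routing hash is made up for the example, and summing the rows of every matching layer is an assumption about how results are aggregated. None of the type or function names come from the `obikmer` code base.

```rust
use std::collections::HashMap;

/// One layer. A HashMap stands in for the MPHF; a real MPHF stores no keys
/// and needs separate evidence to reject k-mers it was not built on.
struct Layer {
    slot_of: HashMap<u64, usize>, // canonical k-mer -> row index
    counts: Vec<Vec<u32>>,        // counts[row][genome]
}

/// A partition is a stack of layers probed in order.
struct Partition {
    layers: Vec<Layer>,
}

/// Route a minimizer to a partition (hypothetical multiplicative hash).
fn partition_of(minimizer: u64, n_partitions: usize) -> usize {
    (minimizer.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize % n_partitions
}

impl Partition {
    /// Probe every layer in order and add up the per-genome rows that match.
    fn lookup(&self, kmer: u64, n_genomes: usize) -> Vec<u32> {
        let mut total = vec![0u32; n_genomes];
        for layer in &self.layers {
            if let Some(&row) = layer.slot_of.get(&kmer) {
                for (t, c) in total.iter_mut().zip(&layer.counts[row]) {
                    *t += c;
                }
            }
        }
        total
    }
}
```

Because routing depends only on the minimizer, each partition can be queried by its own worker without touching the others.
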
**Approximate indexing (Findere).** With `-z Z`, the index stores k-mers of size
`s = k − z + 1` instead of k. At query time, results are produced at size s, then
a per-genome sliding window of size z aggregates z consecutive s-mer hits into one
confirmed k-mer answer. This reduces the false-positive rate from `1/2^b` per s-mer
to `1/2^(b·z)` per k-mer, at the cost of z−1 unconfirmed positions at each sequence
break. The aggregation window spans the full query sequence, not individual superkmers,
to avoid false negatives at superkmer boundaries.

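The sliding-window confirmation and the resulting false-positive rate can be sketched as below. The input is assumed to be one boolean per s-mer position for a single genome; the function names are illustrative, not part of `obikmer`.

```rust
/// Turn per-position s-mer hits into per-position k-mer hits.
/// The k-mer starting at position i is confirmed only if the z consecutive
/// s-mers starting at i, i+1, ..., i+z-1 are all present.
fn confirm_kmers(smer_hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 || smer_hits.len() < z {
        return Vec::new();
    }
    let mut run = 0usize; // length of the current run of consecutive hits
    let mut out = Vec::with_capacity(smer_hits.len() - z + 1);
    for (i, &hit) in smer_hits.iter().enumerate() {
        run = if hit { run + 1 } else { 0 };
        if i + 1 >= z {
            // The window ending at s-mer i is full: emit one k-mer answer.
            out.push(run >= z);
        }
    }
    out
}

/// False-positive rate per k-mer for b evidence bits and window size z: 2^-(b*z).
fn fp_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

A sequence with n s-mers yields n − z + 1 k-mer answers, which is where the z−1 unconfirmed positions come from. With 8 evidence bits and z = 5, `fp_rate(8, 5)` is 2⁻⁴⁰, about 9 × 10⁻¹³.
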
**Multi-genome.** A single index can hold any number of genomes. Each k-mer slot
carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees,
and classification are derived from these vectors without rebuilding the index.

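To make the link between count vectors and distances concrete, here is a small sketch of the two standard measures computed from two genomes' columns (one entry per k-mer slot). It uses the textbook definitions of Jaccard distance and Bray-Curtis dissimilarity; how `obikmer` iterates over slots in practice is not shown, and the function names are hypothetical.

```rust
/// Jaccard distance on presence/absence: 1 - |A ∩ B| / |A ∪ B|.
fn jaccard_distance(a: &[u32], b: &[u32]) -> f64 {
    let (mut inter, mut union) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        if x > 0 && y > 0 {
            inter += 1;
        }
        if x > 0 || y > 0 {
            union += 1;
        }
    }
    if union == 0 { 0.0 } else { 1.0 - inter as f64 / union as f64 }
}

/// Bray-Curtis dissimilarity on counts: 1 - 2 * sum(min(a, b)) / (sum(a) + sum(b)).
fn bray_curtis(a: &[u32], b: &[u32]) -> f64 {
    let (mut shared, mut total) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        shared += x.min(y) as u64;
        total += x as u64 + y as u64;
    }
    if total == 0 { 0.0 } else { 1.0 - 2.0 * shared as f64 / total as f64 }
}
```

Both functions return 0 for identical vectors and 1 for vectors with no shared k-mer; two empty vectors are treated as identical here by convention.
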
## Input formats

    Command              Formats accepted
    ───────────────────  ──────────────────────────────────────────────────────────────
    index, superkmer     FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file
                         (.gb .gbk .gbff), all optionally gzip-compressed.
                         Directories expanded recursively. Streaming stdin via -.
    query                FASTA, FASTQ, optionally gzip-compressed. Stdin via -.

Non-ACGT characters act as hard breaks between k-mer segments in all formats.

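The hard-break rule can be illustrated with a short sketch that splits a sequence at every non-ACGT byte and keeps only the segments long enough to contain a k-mer. Treating lowercase bases as valid is an assumption of this example, as is the function name.

```rust
/// Split a sequence at every non-ACGT byte and keep the segments that can
/// hold at least one k-mer. Lowercase acgt is accepted (an assumption).
fn acgt_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|b| !matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        .filter(|seg| seg.len() >= k)
        .collect()
}
```

For example, with k = 4 the sequence `ACGTNNACGNACGTAC` yields the segments `ACGT` and `ACGTAC`; the 3-base run `ACG` between the two breaks is too short and is dropped.
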
## Commands

    Command    Role
    ─────────  ────────────────────────────────────────────────────────────────────
    index      Build a genome index from sequence files.
               Runs scatter → dereplicate → count → layered MPHF.
               Resumes automatically if interrupted.
    merge      Merge multiple independently built indexes into one.
               Schedules partitions largest-first under a memory budget semaphore
               to avoid OOM on machines with many cores. The worst partition runs
               alone first to calibrate the expansion estimator; subsequent
               partitions run in parallel within the budget.
                 --budget-fraction F  fraction of available RAM to use as budget
                                      (default 0.5; reduce if OOM persists).
    filter     Filter and compact an existing index: apply count thresholds,
               drop layers, rewrite as a single-layer index.
    reindex    Convert evidence in-place across all layers:
               exact (evidence.bin) ↔ approximate (fingerprint.bin).
               Does not touch the MPHF or unitigs.
    query      Query an index with FASTA/FASTQ sequences.
               Annotates each sequence with per-genome k-mer match counts
               and optional per-position coverage vectors (--detail).
               Parallel over sequence chunks.
    distance   Compute a pairwise Bray-Curtis or Jaccard distance matrix
               between all indexed genomes.
               Optionally outputs a Newick NJ or UPGMA tree.
    annotate   Add or update genome metadata (taxonomy, etc.) from a CSV
               file; or dump the current metadata as CSV.
    estimate   Dry-run: resolve and print approximate-index parameters
               (z, evidence bits b, FP rates) given any two of (b, z, fp).
               Does not touch any index.
    dump       Dump all indexed k-mers as CSV with per-genome counts or
               presence flags.
    superkmer  Extract superkmers from a sequence file and write to stdout.
               Diagnostic / pipeline use.
    unitig     Dump the unitig sequences stored in a built index. Debug use.
    utils      Miscellaneous utilities.
                 --new-label NEW=OLD  rename a genome label in-place.
                 --bits-per-kmer      print MPHF / evidence / matrix size breakdown.
                 --stats              per-genome k-mer counts as CSV.
                 --partition-stats    partition size distribution across one or more
                                      indexes (markdown report to stdout). Useful to
                                      diagnose minimizer imbalance before a large merge.
                 --csv FILE           write per-(partition, source) raw data to FILE
                                      (used with --partition-stats).

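The scheduling idea behind `merge` (largest partitions first, gated by a memory budget) can be sketched with a byte-counting semaphore built from `Mutex` and `Condvar`. This is a simplified model, not the actual implementation: the partition size is used directly as its memory cost, the calibration run of the worst partition and the expansion estimator are omitted, and every name is hypothetical.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Semaphore over bytes: `acquire` blocks until `cost` bytes fit in the budget.
struct MemoryBudget {
    free: Mutex<u64>,
    changed: Condvar,
}

impl MemoryBudget {
    fn new(bytes: u64) -> Self {
        Self { free: Mutex::new(bytes), changed: Condvar::new() }
    }
    fn acquire(&self, cost: u64) {
        let mut free = self.free.lock().unwrap();
        while *free < cost {
            free = self.changed.wait(free).unwrap();
        }
        *free -= cost;
    }
    fn release(&self, cost: u64) {
        *self.free.lock().unwrap() += cost;
        self.changed.notify_all();
    }
}

/// Process partitions largest-first under `budget_fraction` of `total_ram`.
/// Returns the peak number of bytes in use at once (for demonstration).
fn merge_partitions(mut sizes: Vec<u64>, total_ram: u64, budget_fraction: f64) -> u64 {
    let budget_bytes = (total_ram as f64 * budget_fraction) as u64;
    sizes.sort_unstable_by(|a, b| b.cmp(a)); // largest first
    let budget = Arc::new(MemoryBudget::new(budget_bytes));
    let usage = Arc::new(Mutex::new((0u64, 0u64))); // (current, peak)
    let mut handles = Vec::new();
    for size in sizes {
        // A partition larger than the whole budget is capped so it runs alone.
        let cost = size.min(budget_bytes);
        budget.acquire(cost); // the scheduler waits here until the partition fits
        let (budget, usage) = (Arc::clone(&budget), Arc::clone(&usage));
        handles.push(thread::spawn(move || {
            {
                let mut u = usage.lock().unwrap();
                u.0 += cost;
                u.1 = u.1.max(u.0);
            }
            // ... the per-partition merge work would happen here ...
            usage.lock().unwrap().0 -= cost;
            budget.release(cost);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let peak = usage.lock().unwrap().1;
    peak
}
```

Starting with the largest partitions means the hardest-to-fit work is placed while the budget is still empty, and smaller partitions fill the remaining headroom afterwards.
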
## Quick start

```sh
# Build an exact index for each genome independently
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index_a/
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index_b/

# Merge into a single multi-genome index
obikmer merge --output index/ index_a/ index_b/

# Convert to approximate index (z=5, 8-bit fingerprints)
obikmer reindex --approx -z 5 --evidence-bits 8 index/

# Query reads
obikmer query index/ reads.fq.gz > annotated.fa

# Pairwise distances
obikmer distance index/ > distances.tsv
```

## Parameter constraints

    Parameter              Constraint
    ─────────────────────  ────────────────
    k (--kmer-size)        odd, 11 ≤ k ≤ 31
    m (--minimizer-size)   odd, 3 ≤ m ≤ k−1
    z (-z, --approx only)  1 ≤ z ≤ k−1

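The constraints above translate directly into a validation routine. The sketch below is only an illustration of the table; the function name, signature, and error messages are assumptions and do not reflect how `obikmer` parses its arguments.

```rust
/// Check k, m and (optionally) z against the documented constraints.
/// `z` is None when no approximate indexing is requested.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(format!("k = {k}: must be odd with 11 <= k <= 31"));
    }
    // k >= 11 is guaranteed from here on, so k - 1 cannot underflow.
    if m % 2 == 0 || m < 3 || m > k - 1 {
        return Err(format!("m = {m}: must be odd with 3 <= m <= {}", k - 1));
    }
    if let Some(z) = z {
        if z < 1 || z > k - 1 {
            return Err(format!("z = {z}: must satisfy 1 <= z <= {}", k - 1));
        }
    }
    Ok(())
}
```

Since k and m are both odd, the bound m ≤ k−1 effectively means m ≤ k−2.
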
## Documentation

Extended architecture and implementation notes are in `docmd/`. Build with
`make doc` (requires Python + MkDocs Material).