- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
5.3 KiB
obik: K-mer Index Management Toolkit for Biological Sequences
obik is a CLI tool from the OBITools4 ecosystem designed for building, inspecting, filtering, and manipulating k-mer indices—compact data structures encoding k-mer occurrences from biological sequences (e.g., FASTA/FASTQ). It enables scalable, parallelized processing of large-scale sequencing data for applications such as taxonomic profiling, contamination screening, and metagenomic analysis.
All documented features are public APIs, accessible via subcommands. Internal implementation details (e.g., low-level k-mer engines) are omitted.
Core Subcommands
obik index
Builds or extends a k-mer set group from raw sequences:
- Configurable
k(2–31) and optional minimizer size (m) for space-efficient hashing. - Filters by k-mer frequency:
--minocc,--maxocc. - Entropy-based low-complexity filtering (
--entropy-threshold,--entropy-size). - Supports metadata tagging at group, set, and per-set levels (
--set-tag,--index-id). - Optionally saves top N frequent k-mers (
--save-freq-kmers) for downstream analysis. - Parallel sequence processing with atomic counters and thread-safe batching.
obik ls
Lists metadata of k-mer sets in an index:
- Accepts glob-like
--set PATTERNs to filter target sets. - Outputs structured metadata: set index, ID, and k-mer count (
count). - Supports multiple formats: CSV (default), JSON, YAML.
- No k-mers themselves are printed—only set-level summaries.
obik summary
Aggregates and reports comprehensive statistics:
- Structural info: k, m, partitions, total sets/unique kmers.
- Per-set stats: ID, count, disk usage (computed recursively).
- Optional pairwise Jaccard distance matrix for similarity analysis.
- Multi-format export (JSON/YAML/CSV) with full metadata preservation.
obik cp
Copies selected or all k-mer sets from a source index to a new destination:
- Requires
<source_index>and<dest_dir>. - Pattern-based selection via
--set PATTERN(glob-style); fails if no match. - Prevents overwrites unless
--force. - Uses atomic copy operations via
CopySetsByIDTo, preserving original structure.
obik mv
Safely moves sets between indices:
- Copy-first, then delete strategy ensures atomicity.
- Supports
--set PATTERNfor selective moves; fails if no sets match patterns. - Removes source sets in reverse order to avoid index renumbering issues.
- Logs progress and final counts for observability.
obik rm
Removes k-mer sets from an index:
- Requires at least one glob-like
--set PATTERN. - Validates existence and match success before deletion.
- Deletes sets in reverse order to preserve indices during bulk removals.
- Fails fast on errors, leaving index consistent.
obik spectrum
Exports k-mer frequency spectra per set:
- Computes histogram: how many distinct kmers occur exactly N times.
- Outputs sparse CSV (only non-zero frequencies), with per-set columns.
- Enables comparative analysis of redundancy/complexity across samples.
obik filter
Filters k-mers from an index using configurable criteria:
- Currently supports entropy-based filtering (
--entropy-threshold,--entropy-size). - Runs in parallel across partitions (per-worker filter instantiation for stateful filters).
- Preserves partitioning structure and
spectrum.binfiles. - Logs per-set statistics (kept %, total processed).
obik match
Annotates query sequences with reference matches:
- Loads a k-mer index and selects target sets via patterns.
- Reads sequences (FASTA/FASTQ), prepares queries in parallel, and merges batches incrementally.
- Matches k-mers against reference sets using
MatchBatch, attaches match positions as attributes (e.g.,"kmer_matched_ref_genome"). - Streams annotated output with paired-end integrity preserved.
obik lowmask
Masks or extracts low-complexity regions in sequences:
- Uses multi-scale entropy analysis (window sizes 1–
level_max) on canonical k-mers. - Three modes: mask (replace with
.or custom char), split, and extract low-complexity fragments. - Preserves metadata (e.g., entropy values) on output sequences.
obik super
Extracts super k-mers from overlapping reads:
- Merges contiguously overlapped kmers sharing a minimizer into longer, non-overlapping super-k-mers.
- Configurable
kandm; parallelized via worker pipeline. - Optimized for alignment-free analysis, read correction, and compression.
Shared Capabilities
Set Selection
- Glob-style pattern matching (
--set PATTERN, repeatable). - Resolves to exact set IDs using
MatchSetIDs.
Output Formatting
- Structured output: CSV, JSON (
--json-output), YAML (--yaml-output) across multiple commands.
Metadata Handling
- Group-, set-, and per-kmer metadata support (
--set-tag,metadata.toml). - Preserved during copy/move/filter operations.
Safety & Observability
- Structured logging (Logrus), progress bars (
progressbar). - Context-aware cancellation and timeout support.
- Detailed error wrapping with
%w.
Parallelism
- Multi-worker pipelines (e.g.,
nworkersfrom system defaults). - Thread-safe accumulation and atomic counters where needed.
Note
: All commands assume a valid
KmerSetGroupindex structure (.kdi,.toml). No k-mer sequences themselves are printed—only metadata, counts, or match annotations.