mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
114 lines
5.3 KiB
Markdown
114 lines
5.3 KiB
Markdown
|
|
# `obik`: K-mer Index Management Toolkit for Biological Sequences
|
|||
|
|
|
|||
|
|
`obik` is a CLI tool from the OBITools4 ecosystem designed for building, inspecting, filtering, and manipulating **k-mer indices**—compact data structures encoding k-mer occurrences from biological sequences (e.g., FASTA/FASTQ). It enables scalable, parallelized processing of large-scale sequencing data for applications such as taxonomic profiling, contamination screening, and metagenomic analysis.
|
|||
|
|
|
|||
|
|
All documented features are **public APIs**, accessible via subcommands. Internal implementation details (e.g., low-level k-mer engines) are omitted.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Core Subcommands
|
|||
|
|
|
|||
|
|
### `obik index`
|
|||
|
|
Builds or extends a k-mer set group from raw sequences:
|
|||
|
|
- Configurable `k` (2–31) and optional minimizer size (`m`) for space-efficient hashing.
|
|||
|
|
- Filters by k-mer frequency: `--minocc`, `--maxocc`.
|
|||
|
|
- Entropy-based low-complexity filtering (`--entropy-threshold`, `--entropy-size`).
|
|||
|
|
- Supports metadata tagging at group, set, and per-set levels (`--set-tag`, `--index-id`).
|
|||
|
|
- Optionally saves top *N* frequent k-mers (`--save-freq-kmers`) for downstream analysis.
|
|||
|
|
- Parallel sequence processing with atomic counters and thread-safe batching.
|
|||
|
|
|
|||
|
|
### `obik ls`
|
|||
|
|
Lists metadata of k-mer sets in an index:
|
|||
|
|
- Accepts glob-like `--set PATTERN`s to filter target sets.
|
|||
|
|
- Outputs structured metadata: set index, ID, and k-mer count (`count`).
|
|||
|
|
- Supports multiple formats: CSV (default), JSON, YAML.
|
|||
|
|
- No k-mers themselves are printed—only set-level summaries.
|
|||
|
|
|
|||
|
|
### `obik summary`
|
|||
|
|
Aggregates and reports comprehensive statistics:
|
|||
|
|
- Structural info: k, m, partitions, total sets/unique kmers.
|
|||
|
|
- Per-set stats: ID, count, disk usage (computed recursively).
|
|||
|
|
- Optional pairwise **Jaccard distance matrix** for similarity analysis.
|
|||
|
|
- Multi-format export (JSON/YAML/CSV) with full metadata preservation.
|
|||
|
|
|
|||
|
|
### `obik cp`
|
|||
|
|
Copies selected or all k-mer sets from a source index to a new destination:
|
|||
|
|
- Requires `<source_index>` and `<dest_dir>`.
|
|||
|
|
- Pattern-based selection via `--set PATTERN` (glob-style); fails if no match.
|
|||
|
|
- Prevents overwrites unless `--force`.
|
|||
|
|
- Uses atomic copy operations via `CopySetsByIDTo`, preserving original structure.
|
|||
|
|
|
|||
|
|
### `obik mv`
|
|||
|
|
Safely moves sets between indices:
|
|||
|
|
- Copy-first, then delete strategy ensures atomicity.
|
|||
|
|
- Supports `--set PATTERN` for selective moves; fails if no sets match patterns.
|
|||
|
|
- Removes source sets in reverse order to avoid index renumbering issues.
|
|||
|
|
- Logs progress and final counts for observability.
|
|||
|
|
|
|||
|
|
### `obik rm`
|
|||
|
|
Removes k-mer sets from an index:
|
|||
|
|
- Requires at least one glob-like `--set PATTERN`.
|
|||
|
|
- Validates existence and match success before deletion.
|
|||
|
|
- Deletes sets in reverse order to preserve indices during bulk removals.
|
|||
|
|
- Fails fast on errors, leaving index consistent.
|
|||
|
|
|
|||
|
|
### `obik spectrum`
|
|||
|
|
Exports k-mer frequency spectra per set:
|
|||
|
|
- Computes histogram: how many distinct kmers occur *exactly N times*.
|
|||
|
|
- Outputs sparse CSV (only non-zero frequencies), with per-set columns.
|
|||
|
|
- Enables comparative analysis of redundancy/complexity across samples.
|
|||
|
|
|
|||
|
|
### `obik filter`
|
|||
|
|
Filters k-mers from an index using configurable criteria:
|
|||
|
|
- Currently supports entropy-based filtering (`--entropy-threshold`, `--entropy-size`).
|
|||
|
|
- Runs in parallel across partitions (per-worker filter instantiation for stateful filters).
|
|||
|
|
- Preserves partitioning structure and `spectrum.bin` files.
|
|||
|
|
- Logs per-set statistics (kept %, total processed).
|
|||
|
|
|
|||
|
|
### `obik match`
|
|||
|
|
Annotates query sequences with reference matches:
|
|||
|
|
- Loads a k-mer index and selects target sets via patterns.
|
|||
|
|
- Reads sequences (FASTA/FASTQ), prepares queries in parallel, and merges batches incrementally.
|
|||
|
|
- Matches k-mers against reference sets using `MatchBatch`, attaches match positions as attributes (e.g., `"kmer_matched_ref_genome"`).
|
|||
|
|
- Streams annotated output with paired-end integrity preserved.
|
|||
|
|
|
|||
|
|
### `obik lowmask`
|
|||
|
|
Masks or extracts low-complexity regions in sequences:
|
|||
|
|
- Uses multi-scale entropy analysis (window sizes 1–`level_max`) on canonical k-mers.
|
|||
|
|
- Three modes: **mask** (replace with `.` or custom char), **split**, and **extract low-complexity fragments**.
|
|||
|
|
- Preserves metadata (e.g., entropy values) on output sequences.
|
|||
|
|
|
|||
|
|
### `obik super`
|
|||
|
|
Extracts *super k-mers* from overlapping reads:
|
|||
|
|
- Merges contiguously overlapped kmers sharing a minimizer into longer, non-overlapping super-k-mers.
|
|||
|
|
- Configurable `k` and `m`; parallelized via worker pipeline.
|
|||
|
|
- Optimized for alignment-free analysis, read correction, and compression.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Shared Capabilities
|
|||
|
|
|
|||
|
|
### Set Selection
|
|||
|
|
- Glob-style pattern matching (`--set PATTERN`, repeatable).
|
|||
|
|
- Resolves to exact set IDs using `MatchSetIDs`.
|
|||
|
|
|
|||
|
|
### Output Formatting
|
|||
|
|
- Structured output: CSV, JSON (`--json-output`), YAML (`--yaml-output`) across multiple commands.
|
|||
|
|
|
|||
|
|
### Metadata Handling
|
|||
|
|
- Group-, set-, and per-kmer metadata support (`--set-tag`, `metadata.toml`).
|
|||
|
|
- Preserved during copy/move/filter operations.
|
|||
|
|
|
|||
|
|
### Safety & Observability
|
|||
|
|
- Structured logging (Logrus), progress bars (`progressbar`).
|
|||
|
|
- Context-aware cancellation and timeout support.
|
|||
|
|
- Detailed error wrapping with `%w`.
|
|||
|
|
|
|||
|
|
### Parallelism
|
|||
|
|
- Multi-worker pipelines (e.g., `nworkers` from system defaults).
|
|||
|
|
- Thread-safe accumulation and atomic counters where needed.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
> **Note**: All commands assume a valid `KmerSetGroup` index structure (`.kdi`, `.toml`). No k-mer sequences themselves are printed—only metadata, counts, or match annotations.
|