obitools4/autodoc/docmd/pkg_obitools_obik.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# `obik`: K-mer Index Management Toolkit for Biological Sequences
`obik` is a CLI tool from the OBITools4 ecosystem designed for building, inspecting, filtering, and manipulating **k-mer indices**—compact data structures encoding k-mer occurrences from biological sequences (e.g., FASTA/FASTQ). It enables scalable, parallelized processing of large-scale sequencing data for applications such as taxonomic profiling, contamination screening, and metagenomic analysis.
All documented features are **public APIs**, accessible via subcommands. Internal implementation details (e.g., low-level k-mer engines) are omitted.
---
## Core Subcommands
### `obik index`
Builds or extends a k-mer set group from raw sequences:
- Configurable `k` (2 to 31) and optional minimizer size (`m`) for space-efficient hashing.
- Filters by k-mer frequency: `--minocc`, `--maxocc`.
- Entropy-based low-complexity filtering (`--entropy-threshold`, `--entropy-size`).
- Supports metadata tagging at group, set, and per-set levels (`--set-tag`, `--index-id`).
- Optionally saves top *N* frequent k-mers (`--save-freq-kmers`) for downstream analysis.
- Parallel sequence processing with atomic counters and thread-safe batching.
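The counting and occurrence filtering described above can be sketched in a few lines of Python. This is an illustration only, not the OBITools4 implementation: the function names are hypothetical, the `minocc`/`maxocc` parameters merely mirror the `--minocc` and `--maxocc` options, and the use of canonical k-mers (the smaller of a k-mer and its reverse complement) is an assumption about how strands are collapsed.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k: int) -> Counter:
    """Count canonical k-mers over all sequences, skipping k-mers with non-ACGT symbols."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def filter_by_occurrence(counts, minocc: int = 1, maxocc=None) -> dict:
    """Keep k-mers seen at least minocc times and, if maxocc is set, at most maxocc times."""
    return {
        kmer: n
        for kmer, n in counts.items()
        if n >= minocc and (maxocc is None or n <= maxocc)
    }


counts = count_kmers(["ACGTACGT"], k=4)
frequent = filter_by_occurrence(counts, minocc=2)
```

With the single input sequence `ACGTACGT` and `k=4`, the k-mers `TACG` and `CGTA` collapse into one canonical form, so only `ACGT` and `CGTA` reach the `minocc=2` threshold.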
### `obik ls`
Lists metadata of k-mer sets in an index:
- Accepts one or more glob-like `--set PATTERN` options to filter target sets.
- Outputs structured metadata: set index, ID, and k-mer count (`count`).
- Supports multiple formats: CSV (default), JSON, YAML.
- No k-mers themselves are printed—only set-level summaries.
### `obik summary`
Aggregates and reports comprehensive statistics:
- Structural info: k, m, partitions, total number of sets and of unique k-mers.
- Per-set stats: ID, count, disk usage (computed recursively).
- Optional pairwise **Jaccard distance matrix** for similarity analysis.
- Multi-format export (JSON/YAML/CSV) with full metadata preservation.
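For reference, the Jaccard distance between two k-mer sets is one minus the ratio of shared k-mers to all k-mers in either set. The sketch below uses plain Python sets and hypothetical function names; how `obik summary` actually stores or traverses its sets is not covered here.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(sets: dict) -> dict:
    """Compute the pairwise Jaccard distance for every ordered pair of set IDs."""
    ids = sorted(sets)
    return {(x, y): jaccard_distance(sets[x], sets[y]) for x in ids for y in ids}


matrix = distance_matrix({
    "sample_a": {"ACG", "CGT", "GTA"},
    "sample_b": {"CGT", "GTA", "TAC"},
})
```

Here the two sets share 2 of 4 distinct k-mers, so their distance is 0.5, and each set is at distance 0 from itself.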
### `obik cp`
Copies selected or all k-mer sets from a source index to a new destination:
- Requires `<source_index>` and `<dest_dir>`.
- Pattern-based selection via `--set PATTERN` (glob-style); fails if no match.
- Prevents overwrites unless `--force`.
- Uses atomic copy operations via `CopySetsByIDTo`, preserving original structure.
### `obik mv`
Safely moves sets between indices:
- Uses a copy-then-delete strategy: sets are copied to the destination before they are removed from the source.
- Supports `--set PATTERN` for selective moves; fails if no sets match patterns.
- Removes source sets in reverse order to avoid index renumbering issues.
- Logs progress and final counts for observability.
### `obik rm`
Removes k-mer sets from an index:
- Requires at least one glob-like `--set PATTERN`.
- Validates existence and match success before deletion.
- Deletes sets in reverse order to preserve indices during bulk removals.
- Fails fast on errors, leaving index consistent.
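The reason for deleting in reverse order can be shown with an ordinary list: removing the highest position first leaves the positions of the remaining targets unchanged. This is a generic Python illustration with a hypothetical helper, not the index's own removal code.

```python
def remove_by_index(items, indices):
    """Remove the given positions from a copy of items, highest position first."""
    result = list(items)
    for i in sorted(set(indices), reverse=True):
        # Deleting from the end never shifts the positions still to be removed.
        del result[i]
    return result


remaining = remove_by_index(["set_a", "set_b", "set_c", "set_d"], [1, 3])
```

Deleting position 1 first would shift `set_d` from position 3 to position 2, so a later `del result[3]` would fail or hit the wrong element.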
### `obik spectrum`
Exports k-mer frequency spectra per set:
- Computes a histogram: how many distinct k-mers occur *exactly N times*.
- Outputs sparse CSV (only non-zero frequencies), with per-set columns.
- Enables comparative analysis of redundancy/complexity across samples.
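A frequency spectrum is simply a count of counts. The Python sketch below builds one from a k-mer count table and writes a sparse CSV with one column per set. The column layout (`frequency` followed by set IDs) is an assumption made for illustration, and the function names are hypothetical; the actual file produced by `obik spectrum` may differ.

```python
import csv
import io
from collections import Counter


def frequency_spectrum(kmer_counts: dict) -> dict:
    """Map each occurrence count N to the number of distinct k-mers seen exactly N times."""
    return dict(sorted(Counter(kmer_counts.values()).items()))


def spectra_to_csv(spectra: dict) -> str:
    """Write one row per frequency observed in at least one set, one column per set."""
    set_ids = sorted(spectra)
    frequencies = sorted({f for spectrum in spectra.values() for f in spectrum})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["frequency", *set_ids])
    for f in frequencies:
        writer.writerow([f, *(spectra[s].get(f, 0) for s in set_ids)])
    return buffer.getvalue()


spectrum = frequency_spectrum({"AAA": 1, "AAC": 1, "ACG": 3})
table = spectra_to_csv({"s1": spectrum, "s2": {1: 1}})
```

Frequencies that no set exhibits (here, 2) produce no row at all, which is what keeps the output sparse.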
### `obik filter`
Filters k-mers from an index using configurable criteria:
- Currently supports entropy-based filtering (`--entropy-threshold`, `--entropy-size`).
- Runs in parallel across partitions (per-worker filter instantiation for stateful filters).
- Preserves partitioning structure and `spectrum.bin` files.
- Logs per-set statistics (kept %, total processed).
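To make entropy-based filtering concrete, here is a minimal Python sketch using Shannon entropy over the overlapping words of a k-mer. It rests on several assumptions: that `--entropy-size` corresponds to a word length, that `--entropy-threshold` is a lower bound below which a k-mer is discarded, and that entropy is measured in bits. The exact entropy definition used by `obik` is not stated here and may differ.

```python
import math
from collections import Counter


def word_entropy(kmer: str, word_size: int = 1) -> float:
    """Shannon entropy (bits) of the overlapping words of length word_size in kmer."""
    words = [kmer[i:i + word_size] for i in range(len(kmer) - word_size + 1)]
    if not words:
        return 0.0
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in Counter(words).values())


def keep_kmer(kmer: str, threshold: float, word_size: int = 1) -> bool:
    """Assumed rule: keep a k-mer only if its entropy reaches the threshold."""
    return word_entropy(kmer, word_size) >= threshold


kept = [kmer for kmer in ["AAAA", "ACAC", "ACGT"] if keep_kmer(kmer, threshold=1.5)]
```

A homopolymer such as `AAAA` has entropy 0, `ACAC` has 1 bit, and `ACGT` has 2 bits, so only `ACGT` passes a threshold of 1.5.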
### `obik match`
Annotates query sequences with reference matches:
- Loads a k-mer index and selects target sets via patterns.
- Reads sequences (FASTA/FASTQ), prepares queries in parallel, and merges batches incrementally.
- Matches k-mers against reference sets using `MatchBatch`, attaches match positions as attributes (e.g., `"kmer_matched_ref_genome"`).
- Streams annotated output with paired-end integrity preserved.
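The matching step can be pictured as sliding a window over each query and recording where its k-mers occur in a reference set. The Python sketch below is illustrative only: the helper names are hypothetical, positions are assumed to be 0-based, canonical k-mers are assumed, and the attribute name is built as `kmer_matched_` plus the set ID by analogy with the `"kmer_matched_ref_genome"` example above.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def match_positions(query: str, reference_kmers: set, k: int) -> list:
    """Return the 0-based start positions of query k-mers present in the reference set."""
    return [
        i
        for i in range(len(query) - k + 1)
        if canonical(query[i:i + k]) in reference_kmers
    ]


def annotate(query: str, reference_sets: dict, k: int) -> dict:
    """Build one attribute per reference set, holding the matched positions."""
    return {
        f"kmer_matched_{set_id}": match_positions(query, kmers, k)
        for set_id, kmers in reference_sets.items()
    }


attributes = annotate("TTACGTT", {"ref_genome": {"ACG"}}, k=3)
```

Both `ACG` (position 2) and its reverse complement `CGT` (position 3) map to the canonical form `ACG`, so both positions are reported.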
### `obik lowmask`
Masks or extracts low-complexity regions in sequences:
- Uses multi-scale entropy analysis (window sizes 1 to `level_max`) on canonical k-mers.
- Three modes: **mask** (replace with `.` or custom char), **split**, and **extract low-complexity fragments**.
- Preserves metadata (e.g., entropy values) on output sequences.
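The **mask** mode can be approximated with a deliberately simplified Python sketch. It uses a single fixed window and per-nucleotide Shannon entropy rather than the multi-scale analysis on canonical k-mers described above, and the function names, the threshold semantics, and the default mask character handling are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(window: str) -> float:
    """Shannon entropy (bits) of the symbol composition of a window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int, threshold: float, mask_char: str = ".") -> str:
    """Replace every position covered by a window whose entropy is below threshold."""
    masked = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if entropy(seq[i:i + window]) < threshold:
            for j in range(i, i + window):
                masked[j] = True
    return "".join(mask_char if m else base for base, m in zip(seq, masked))


result = mask_low_complexity("ACGTAAAAAAACGT", window=4, threshold=1.0)
```

Because a whole window is masked as soon as it falls below the threshold, the masked region extends slightly into the flanking bases (here the `T` before and the `C` after the poly-A run).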
### `obik super`
Extracts *super k-mers* from overlapping reads:
- Merges contiguous, overlapping k-mers that share a minimizer into longer, non-overlapping super-k-mers.
- Configurable `k` and `m`; parallelized via worker pipeline.
- Optimized for alignment-free analysis, read correction, and compression.
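As a minimal sketch of the idea, the Python code below groups consecutive k-mers that share the same minimizer and emits the sequence span they cover. It assumes the minimizer is the lexicographically smallest m-mer of the forward strand, which is a simplification; the ordering and strand handling used by `obik super` may differ, and the function names are hypothetical. In this sketch, adjacent super k-mers share `k - 1` bases of the input sequence, while each k-mer belongs to exactly one super k-mer.

```python
def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest m-mer contained in kmer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def super_kmers(seq: str, k: int, m: int) -> list:
    """Merge runs of consecutive k-mers sharing a minimizer into (minimizer, span) pairs."""
    result = []
    current = None  # minimizer of the run being built
    start = 0       # start position of the first k-mer in the run
    for i in range(len(seq) - k + 1):
        mz = minimizer(seq[i:i + k], m)
        if current is None:
            current, start = mz, i
        elif mz != current:
            # The run ends with the k-mer starting at i - 1, which ends at i - 1 + k.
            result.append((current, seq[start:i - 1 + k]))
            current, start = mz, i
    if current is not None:
        result.append((current, seq[start:]))
    return result


spans = super_kmers("TTACGGG", k=4, m=2)
```

The first three k-mers (`TTAC`, `TACG`, `ACGG`) all have minimizer `AC` and collapse into `TTACGG`; the last k-mer `CGGG` has minimizer `CG` and forms its own super k-mer.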
---
## Shared Capabilities
### Set Selection
- Glob-style pattern matching (`--set PATTERN`, repeatable).
- Resolves to exact set IDs using `MatchSetIDs`.
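Glob-style selection can be illustrated with Python's standard `fnmatch` module. The helper below is hypothetical and is not the `MatchSetIDs` function itself; raising an error when nothing matches mirrors the "fails if no match" behaviour described for `cp` and `mv`, and case-sensitive matching is an assumption.

```python
from fnmatch import fnmatchcase


def select_sets(set_ids, patterns):
    """Return the set IDs matching at least one glob pattern, in their original order."""
    selected = [
        set_id
        for set_id in set_ids
        if any(fnmatchcase(set_id, pattern) for pattern in patterns)
    ]
    if not selected:
        raise ValueError(f"no set matches patterns {list(patterns)!r}")
    return selected


chosen = select_sets(["human", "mouse_1", "mouse_2"], ["mouse_*"])
```

Passing several patterns selects the union of their matches, which corresponds to repeating `--set PATTERN` on the command line.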
### Output Formatting
- Structured output: CSV, JSON (`--json-output`), YAML (`--yaml-output`) across multiple commands.
### Metadata Handling
- Group-, set-, and per-kmer metadata support (`--set-tag`, `metadata.toml`).
- Preserved during copy/move/filter operations.
### Safety & Observability
- Structured logging (Logrus), progress bars (`progressbar`).
- Context-aware cancellation and timeout support.
- Detailed error wrapping with `%w`.
### Parallelism
- Multi-worker pipelines (e.g., `nworkers` from system defaults).
- Thread-safe accumulation and atomic counters where needed.
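The pattern of per-worker local accumulation followed by a guarded merge can be sketched in Python with a thread pool and a lock. This only illustrates the general technique; the tool itself is not written in Python, and the function name and the `nworkers` default used here are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_bases_parallel(batches, nworkers: int = 4) -> int:
    """Sum sequence lengths across batches, merging per-batch totals under a lock."""
    total = 0
    lock = threading.Lock()

    def work(batch):
        nonlocal total
        local = sum(len(seq) for seq in batch)  # accumulate without contention
        with lock:                              # merge into the shared counter once
            total += local

    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        # Consume the iterator so that any worker exception is re-raised here.
        list(pool.map(work, batches))
    return total


n_bases = count_bases_parallel([["ACGT", "AC"], ["GGG"]])
```

Accumulating locally and touching the shared counter once per batch keeps lock contention low compared with updating it for every sequence.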
---
> **Note**: All commands assume a valid `KmerSetGroup` index structure (`.kdi`, `.toml`). No k-mer sequences themselves are printed—only metadata, counts, or match annotations.