obitools4/autodoc/docmd/pkg_obitools_obik.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


obik: K-mer Index Management Toolkit for Biological Sequences

obik is a CLI tool from the OBITools4 ecosystem for building, inspecting, filtering, and manipulating k-mer indices: compact data structures that encode k-mer occurrences in biological sequences (e.g., read from FASTA/FASTQ files). It supports scalable, parallelized processing of large sequencing datasets for applications such as taxonomic profiling, contamination screening, and metagenomic analysis.

All documented features are public APIs, accessible via subcommands. Internal implementation details (e.g., low-level k-mer engines) are omitted.


Core Subcommands

obik index

Builds or extends a k-mer set group from raw sequences:

  • Configurable k (2–31) and optional minimizer size (m) for space-efficient hashing.
  • Filters by k-mer frequency: --minocc, --maxocc.
  • Entropy-based low-complexity filtering (--entropy-threshold, --entropy-size).
  • Supports metadata tagging at group, set, and per-set levels (--set-tag, --index-id).
  • Optionally saves top N frequent k-mers (--save-freq-kmers) for downstream analysis.
  • Parallel sequence processing with atomic counters and thread-safe batching.
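
The occurrence filters can be illustrated with a minimal Python sketch. It is not obik's implementation: the function names are hypothetical, k-mers are kept as plain strings, and counting canonical k-mers (the smaller of a k-mer and its reverse complement) is an assumption made for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k: int) -> Counter:
    """Count canonical k-mers over all sequences, skipping ambiguous bases."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def filter_by_occurrence(counts, min_occ: int = 1, max_occ=None) -> dict:
    """Keep k-mers seen at least min_occ and at most max_occ times (hypothetical helper)."""
    return {
        kmer: n
        for kmer, n in counts.items()
        if n >= min_occ and (max_occ is None or n <= max_occ)
    }


counts = count_kmers(["ACGTT"], k=3)
frequent = filter_by_occurrence(counts, min_occ=2)
```

In this toy input, ACG and CGT collapse to the same canonical k-mer, so only that k-mer passes a minimum occurrence of 2.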

obik ls

Lists metadata of k-mer sets in an index:

  • Accepts glob-like --set PATTERNs to filter target sets.
  • Outputs structured metadata: set index, ID, and k-mer count (count).
  • Supports multiple formats: CSV (default), JSON, YAML.
  • No k-mers themselves are printed—only set-level summaries.

obik summary

Aggregates and reports comprehensive statistics:

  • Structural info: k, m, partitions, total number of sets and of unique k-mers.
  • Per-set stats: ID, count, disk usage (computed recursively).
  • Optional pairwise Jaccard distance matrix for similarity analysis.
  • Multi-format export (JSON/YAML/CSV) with full metadata preservation.
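
For reference, the Jaccard distance between two k-mer sets A and B is 1 - |A ∩ B| / |A ∪ B|. The sketch below computes a pairwise matrix over in-memory Python sets; the function names and the convention that two empty sets have distance 0 are assumptions for illustration, not obik's code.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B| (0.0 for two empty sets, by convention here)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(sets: dict):
    """Return (sorted set IDs, square matrix of pairwise Jaccard distances)."""
    ids = sorted(sets)
    matrix = [[jaccard_distance(sets[x], sets[y]) for y in ids] for x in ids]
    return ids, matrix


ids, matrix = distance_matrix({
    "sample_a": {"AAA", "AAC"},
    "sample_b": {"AAC", "ACG"},
})
```

The two example sets share one k-mer out of three distinct ones, so their distance is 2/3, and the diagonal is 0.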

obik cp

Copies selected or all k-mer sets from a source index to a new destination:

  • Requires <source_index> and <dest_dir>.
  • Pattern-based selection via --set PATTERN (glob-style); fails if no match.
  • Prevents overwrites unless --force.
  • Uses atomic copy operations via CopySetsByIDTo, preserving original structure.

obik mv

Safely moves sets between indices:

  • Copy-first, then-delete strategy: source sets are removed only after the copy has succeeded.
  • Supports --set PATTERN for selective moves; fails if no sets match patterns.
  • Removes source sets in reverse order to avoid index renumbering issues.
  • Logs progress and final counts for observability.

obik rm

Removes k-mer sets from an index:

  • Requires at least one glob-like --set PATTERN.
  • Validates existence and match success before deletion.
  • Deletes sets in reverse order to preserve indices during bulk removals.
  • Fails fast on errors, leaving index consistent.
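
Why reverse order matters can be shown with a plain Python list standing in for the index (an assumption for illustration; the helper name is hypothetical). Deleting from the highest position down means the positions still to be deleted never shift.

```python
def remove_by_index(items: list, indices) -> list:
    """Delete the given positions from items, highest first, so earlier positions stay valid."""
    for i in sorted(set(indices), reverse=True):
        del items[i]
    return items


sets = ["s0", "s1", "s2", "s3"]
remove_by_index(sets, [1, 3])
```

Deleting position 1 first would shift "s3" down to position 2, and the later deletion of position 3 would then fail or hit the wrong set.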

obik spectrum

Exports k-mer frequency spectra per set:

  • Computes a histogram: how many distinct k-mers occur exactly N times.
  • Outputs sparse CSV (only non-zero frequencies), with per-set columns.
  • Enables comparative analysis of redundancy/complexity across samples.
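
A frequency spectrum is a histogram of the occurrence counts. The sketch below derives one from a k-mer count table and writes a sparse CSV with one column per set. The column header "frequency" and the function names are assumptions; the actual CSV layout produced by obik spectrum may differ.

```python
import csv
import io
from collections import Counter


def spectrum(kmer_counts: dict) -> Counter:
    """Map an occurrence count N to the number of distinct k-mers seen exactly N times."""
    return Counter(kmer_counts.values())


def spectra_to_csv(spectra: dict) -> str:
    """Write one row per frequency that is non-zero in at least one set."""
    set_ids = sorted(spectra)
    frequencies = sorted({f for s in spectra.values() for f in s})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["frequency"] + set_ids)
    for f in frequencies:
        writer.writerow([f] + [spectra[sid].get(f, 0) for sid in set_ids])
    return buffer.getvalue()


spec_a = spectrum({"AAA": 1, "AAC": 1, "ACG": 3})
spec_b = spectrum({"AAA": 1})
table = spectra_to_csv({"A": spec_a, "B": spec_b})
```

Frequency 2 does not appear in the output because no k-mer occurs exactly twice in either set.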

obik filter

Filters k-mers from an index using configurable criteria:

  • Currently supports entropy-based filtering (--entropy-threshold, --entropy-size).
  • Runs in parallel across partitions (per-worker filter instantiation for stateful filters).
  • Preserves partitioning structure and spectrum.bin files.
  • Logs per-set statistics (kept %, total processed).
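
One common way to score low complexity is the Shannon entropy of the short words inside a k-mer. The sketch below is a simplified stand-in, not obik's formula: it assumes that --entropy-size corresponds to the word size, that the score is normalised to the range 0 to 1, and that k-mers scoring at or below the threshold are discarded.

```python
import math
from collections import Counter


def word_entropy(kmer: str, word_size: int) -> float:
    """Shannon entropy of the overlapping words in kmer, divided by an upper bound."""
    words = [kmer[i:i + word_size] for i in range(len(kmer) - word_size + 1)]
    n = len(words)
    if n <= 1:
        return 0.0
    counts = Counter(words)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Upper bound: at most n distinct words, and at most 4**word_size over ACGT.
    h_max = math.log2(min(n, 4 ** word_size))
    return h / h_max


def keep_kmer(kmer: str, threshold: float, word_size: int) -> bool:
    """Hypothetical filter: keep only k-mers whose normalised entropy exceeds threshold."""
    return word_entropy(kmer, word_size) > threshold


low = word_entropy("AAAAAAAA", 1)
high = word_entropy("ACGTACGT", 1)
```

A homopolymer scores 0 and is dropped at any positive threshold, while a k-mer using all four bases equally scores 1.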

obik match

Annotates query sequences with reference matches:

  • Loads a k-mer index and selects target sets via patterns.
  • Reads sequences (FASTA/FASTQ), prepares queries in parallel, and merges batches incrementally.
  • Matches k-mers against reference sets using MatchBatch and attaches match positions as attributes (e.g., "kmer_matched_ref_genome").
  • Streams annotated output with paired-end integrity preserved.
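
The annotation step can be pictured as follows. This is a deliberately simplified sketch: reference sets are plain Python sets of strings, k-mers are compared without canonicalisation, positions are 0-based, and the attribute name is built as "kmer_matched_" plus the set ID, which is inferred from the single example above rather than confirmed.

```python
def match_positions(sequence: str, reference_kmers: set, k: int) -> list:
    """Return the 0-based start positions of k-mers of sequence found in reference_kmers."""
    seq = sequence.upper()
    return [
        i for i in range(len(seq) - k + 1)
        if seq[i:i + k] in reference_kmers
    ]


def annotate(record: dict, reference_sets: dict, k: int) -> dict:
    """Attach one position list per reference set (hypothetical record layout)."""
    for set_id, kmers in reference_sets.items():
        record["attributes"][f"kmer_matched_{set_id}"] = match_positions(
            record["sequence"], kmers, k
        )
    return record


read = {"id": "read1", "sequence": "ACGTAC", "attributes": {}}
annotate(read, {"ref_genome": {"CGT", "TAC"}}, k=3)
```

After the call, the read carries the positions 1 and 3 under the attribute for the ref_genome set.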

obik lowmask

Masks or extracts low-complexity regions in sequences:

  • Uses multi-scale entropy analysis (window sizes 1 to level_max) on canonical k-mers.
  • Three modes: mask (replace with . or custom char), split, and extract low-complexity fragments.
  • Preserves metadata (e.g., entropy values) on output sequences.
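
The difference between the three modes can be sketched once the low-complexity intervals are known. The example assumes those intervals have already been found by the entropy scan and are given as half-open (start, end) pairs; the function name is hypothetical, and reading "split" as "return the remaining fragments between low-complexity regions" is an interpretation, not confirmed behaviour.

```python
def apply_low_complexity(seq: str, low: list, mode: str, mask_char: str = ".") -> list:
    """Apply one of the three modes to seq given half-open low-complexity intervals."""
    if mode == "mask":
        out = list(seq)
        for start, end in low:
            # Replace every base of the interval with the (single) mask character.
            out[start:end] = mask_char * (end - start)
        return ["".join(out)]
    if mode == "extract":
        # Return only the low-complexity fragments themselves.
        return [seq[start:end] for start, end in low]
    if mode == "split":
        # Return the fragments that remain between low-complexity intervals.
        parts, pos = [], 0
        for start, end in sorted(low):
            if start > pos:
                parts.append(seq[pos:start])
            pos = max(pos, end)
        if pos < len(seq):
            parts.append(seq[pos:])
        return parts
    raise ValueError(f"unknown mode: {mode}")


sequence = "ACGT" + "A" * 6 + "TGCA"
intervals = [(4, 10)]
masked = apply_low_complexity(sequence, intervals, "mask")
```

With the same intervals, "extract" yields the run of A's and "split" yields the two flanking fragments.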

obik super

Extracts super k-mers from overlapping reads:

  • Merges consecutive overlapping k-mers that share a minimizer into longer, non-overlapping super k-mers.
  • Configurable k and m; parallelized via worker pipeline.
  • Optimized for alignment-free analysis, read correction, and compression.
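
The grouping rule can be sketched in a few lines of Python. This is a textbook-style simplification, not obik's implementation: the minimizer is taken as the lexicographically smallest m-mer of each k-mer, strands are not canonicalised, and runs are cut whenever the minimizer string changes.

```python
def minimizer(kmer: str, m: int) -> str:
    """Lexicographically smallest m-mer inside kmer (a simplifying assumption)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def super_kmers(seq: str, k: int, m: int) -> list:
    """Group runs of consecutive k-mers sharing a minimizer into (minimizer, fragment) pairs."""
    result = []
    current, start = None, 0
    for i in range(len(seq) - k + 1):
        mz = minimizer(seq[i:i + k], m)
        if current is None:
            current, start = mz, i
        elif mz != current:
            # The previous run covers k-mers start..i-1, so it ends at base (i - 1) + k.
            result.append((current, seq[start:i - 1 + k]))
            current, start = mz, i
    if current is not None:
        # The last run extends to the end of the sequence.
        result.append((current, seq[start:]))
    return result


groups = super_kmers("TTACGGG", k=4, m=2)
```

Here the first three 4-mers all have minimizer AC and are merged into one fragment, while the last 4-mer has minimizer CG and forms its own.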

Shared Capabilities

Set Selection

  • Glob-style pattern matching (--set PATTERN, repeatable).
  • Resolves to exact set IDs using MatchSetIDs.
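
Glob-style selection behaves much like shell filename matching. The sketch below uses Python's fnmatch module as a stand-in for MatchSetIDs; the function name select_sets is hypothetical, and raising an error on an empty result mirrors the "fails if no match" behaviour described for cp and mv rather than obik's actual error handling.

```python
from fnmatch import fnmatchcase


def select_sets(all_ids: list, patterns: list) -> list:
    """Return the set IDs matching at least one glob pattern, in index order."""
    selected = [
        set_id for set_id in all_ids
        if any(fnmatchcase(set_id, pattern) for pattern in patterns)
    ]
    if not selected:
        raise ValueError(f"no set matches patterns {patterns!r}")
    return selected


ids = ["human", "mouse_a", "mouse_b"]
mice = select_sets(ids, ["mouse_*"])
```

Passing several patterns selects the union of their matches, which corresponds to repeating --set on the command line.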

Output Formatting

  • Structured output: CSV, JSON (--json-output), YAML (--yaml-output) across multiple commands.

Metadata Handling

  • Group-, set-, and per-kmer metadata support (--set-tag, metadata.toml).
  • Preserved during copy/move/filter operations.

Safety & Observability

  • Structured logging (Logrus), progress bars (progressbar).
  • Context-aware cancellation and timeout support.
  • Detailed error wrapping with %w.

Parallelism

  • Multi-worker pipelines (e.g., with the worker count nworkers taken from system defaults).
  • Thread-safe accumulation and atomic counters where needed.

Note: All commands assume a valid KmerSetGroup index structure (.kdi, .toml). No k-mer sequences themselves are printed; only metadata, counts, or match annotations.