obitools4/autodoc/docmd/pkg_obitools_obiconsensus.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
2026-04-13 13:34:53 +02:00

obiconsensus Package: Semantic Overview

The obiconsensus package delivers scalable, graph-based consensus and denoising tools for high-throughput biological sequence data within the OBITools4 ecosystem. It enables error correction, variant clustering, and consensus reconstruction from related amplicon or metagenomic reads—supporting both single-sample and multi-sample workflows.

Public API Summary

Core Algorithms & Utilities

  • BuildConsensus():
    Constructs a consensus sequence via de Bruijn graph assembly of input reads. Automatically selects optimal k-mer size (fallback: longest common suffix analysis). Detects graph cycles and incrementally increases k until resolved. Optionally persists intermediate graphs (*.gml) and FASTA inputs. Output includes metadata: consensus flag, total read weight (summed abundances), k-mer size used, and graph statistics.

  • SampleWeight():
    Returns a closure that retrieves per-sequence sample abundances (e.g., read counts) from sequence annotations or statistics—enabling weighted graph operations.

  • SeqBySamples():
    Groups sequences by sample identifier, using a configurable annotation key (default: "sample"). Supports grouping based on either statistical attributes (StatsOn) or sequence metadata.

  • BuildDiffSeqGraph():
    Builds a difference graph where nodes represent unique sequences and edges encode single-nucleotide mutations (position + substitution). Uses obialign.D1Or0 for exact alignment or approximate LCS-based distance scaling. Supports parallel edge computation and optional progress bar.

  • MinionDenoise():
    Denoises sequences by identifying high-degree nodes (potential consensus hubs), building local consensuses via BuildConsensus(), and preserving low-degree nodes unchanged. Propagates sample annotations, weights, and metadata.

  • MinionClusterDenoise():
    Denoises via weight-based clustering: aggregates node weights (self + neighbors), selects local maxima as cluster heads, and builds consensus per neighborhood.

  • CLIOBIMinion():
    CLI orchestrator for end-to-end denoising: loads sequences, groups by sample (--sample), builds per-sample difference graphs (optional export via --save-graph), applies denoising (MinionDenoise() or MinionClusterDenoise()), optionally deduplicates output (--unique), and annotates sequence lengths.
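
The cycle-driven choice of k described for BuildConsensus() can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: the helper names (buildGraph, hasCycle, heaviestPath, buildConsensus) are hypothetical, reads are plain strings without abundances, and the consensus is read off by following the best-supported edge from a source node, which is an assumption made for the example.

```go
package main

import "fmt"

// buildGraph links each (k-1)-mer to the (k-1)-mers that follow it in the
// reads, counting how many times each transition is observed.
func buildGraph(reads []string, k int) map[string]map[string]int {
	g := make(map[string]map[string]int)
	for _, r := range reads {
		for i := 0; i+k <= len(r); i++ {
			from, to := r[i:i+k-1], r[i+1:i+k]
			if g[from] == nil {
				g[from] = make(map[string]int)
			}
			g[from][to]++
		}
	}
	return g
}

// hasCycle runs a depth-first search and reports whether a back edge exists.
func hasCycle(g map[string]map[string]int) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int)
	var visit func(n string) bool
	visit = func(n string) bool {
		state[n] = inProgress
		for next := range g[n] {
			if state[next] == inProgress {
				return true
			}
			if state[next] == unvisited && visit(next) {
				return true
			}
		}
		state[n] = done
		return false
	}
	for n := range g {
		if state[n] == unvisited && visit(n) {
			return true
		}
	}
	return false
}

// heaviestPath starts at the best-supported node without incoming edges and
// follows the heaviest outgoing edge until it reaches a dead end.
// It assumes the graph is acyclic, so the walk always terminates.
func heaviestPath(g map[string]map[string]int) string {
	hasIn := make(map[string]bool)
	for _, outs := range g {
		for to := range outs {
			hasIn[to] = true
		}
	}
	start, best := "", -1
	for from, outs := range g {
		if hasIn[from] {
			continue
		}
		w := 0
		for _, c := range outs {
			w += c
		}
		if w > best || (w == best && from < start) {
			start, best = from, w
		}
	}
	if best < 0 {
		return ""
	}
	seq, node := start, start
	for {
		next, bw := "", 0
		for to, c := range g[node] {
			if c > bw || (c == bw && to < next) {
				next, bw = to, c
			}
		}
		if bw == 0 {
			return seq
		}
		seq += next[len(next)-1:]
		node = next
	}
}

// buildConsensus increases k until the graph is acyclic, then returns the
// consensus and the k that was used ("" and 0 if no such k exists).
func buildConsensus(reads []string, k int) (string, int) {
	if len(reads) == 0 {
		return "", 0
	}
	if k < 2 {
		k = 2
	}
	maxK := len(reads[0])
	for _, r := range reads {
		if len(r) < maxK {
			maxK = len(r)
		}
	}
	for ; k <= maxK; k++ {
		if g := buildGraph(reads, k); !hasCycle(g) {
			return heaviestPath(g), k
		}
	}
	return "", 0
}

func main() {
	reads := []string{"ATGATGC", "ATGATGC", "ATGATCC"}
	fmt.Println(buildConsensus(reads, 3)) // ATGATGC 5
}
```

In this toy input the repeated ATG motif creates a cycle for k = 3 and k = 4; at k = 5 the graph becomes acyclic and the erroneous read is outvoted by the two identical ones.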

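The weight-based head selection described for MinionClusterDenoise(), together with the closure idea behind SampleWeight(), might look roughly like the sketch below. Everything here is an assumption made for illustration: the read type, the function names (sampleWeight, hamming, diffGraph, clusterHeads), the tie-breaking rule, and the treatment of isolated nodes as their own heads are not taken from the package.

```go
package main

import "fmt"

// read is a hypothetical stand-in for an annotated sequence.
type read struct {
	seq    string
	counts map[string]int // abundance per sample
}

// sampleWeight returns a closure giving the abundance of read i in one sample.
func sampleWeight(reads []read, sample string) func(i int) int {
	return func(i int) int { return reads[i].counts[sample] }
}

// hamming counts mismatches, or returns -1 when the lengths differ.
func hamming(a, b string) int {
	if len(a) != len(b) {
		return -1
	}
	d := 0
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			d++
		}
	}
	return d
}

// diffGraph returns an adjacency list linking reads that differ by
// 1..maxDist substitutions.
func diffGraph(reads []read, maxDist int) [][]int {
	adj := make([][]int, len(reads))
	for i := range reads {
		for j := i + 1; j < len(reads); j++ {
			if d := hamming(reads[i].seq, reads[j].seq); d >= 1 && d <= maxDist {
				adj[i] = append(adj[i], j)
				adj[j] = append(adj[j], i)
			}
		}
	}
	return adj
}

// clusterHeads sums each node's weight with its neighbours' weights and keeps
// the nodes whose total is a local maximum (ties go to the lower index).
func clusterHeads(adj [][]int, weight func(int) int) []int {
	total := make([]int, len(adj))
	for i, neighbours := range adj {
		total[i] = weight(i)
		for _, j := range neighbours {
			total[i] += weight(j)
		}
	}
	var heads []int
	for i, neighbours := range adj {
		isMax := true
		for _, j := range neighbours {
			if total[j] > total[i] || (total[j] == total[i] && j < i) {
				isMax = false
				break
			}
		}
		if isMax {
			heads = append(heads, i)
		}
	}
	return heads
}

func main() {
	reads := []read{
		{"ACGT", map[string]int{"s1": 10}},
		{"ACGA", map[string]int{"s1": 1}},
		{"ACCT", map[string]int{"s1": 2}},
		{"TTTT", map[string]int{"s1": 5}},
	}
	adj := diffGraph(reads, 1)
	fmt.Println(clusterHeads(adj, sampleWeight(reads, "s1"))) // [0 3]
}
```

A consensus would then be built from each head and its neighbourhood; that step is left out to keep the sketch short.
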
Configuration & CLI Helpers

  • Clustering Mode: --cluster (-C) enables graph-based clustering.
  • Distance Threshold: --distance (-d, default: 1) sets max Hamming distance for edge inclusion.
  • K-mer Control: --kmer-size SIZE (default: -1, meaning the size is selected automatically).
  • Sample Key: --sample (-s, default: "sample") defines the annotation field for sample grouping.
  • Filtering Options:
    • --no-singleton: excludes unique sequences.
    • --low-coverage (default: 0) filters low-abundance sequences.
  • Output Options:
    • --unique (-U) enables deduplication (via obiuniq).
    • --save-graph DIR exports graphs in GraphML.
    • --save-ratio FILE writes edge abundance ratios as CSV.
  • Format Integration: Works with obiconvert via unified input/output option sets (InputOptionSet, OutputOptionSet) for FASTA/FASTQ handling.
  • Getter Functions: Typed accessors (e.g., CLIDistStepMax(), CLIKmerSize()) decouple argument parsing from core logic.
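
The getter-function pattern listed above can be sketched with Go's standard flag package. This is only an illustration of the decoupling idea: OBITools4's actual option parser may differ, parseArgs and CLISampleAttribute() are hypothetical names, and only the option names and defaults quoted in this document are reused.

```go
package main

import (
	"flag"
	"fmt"
)

// Parsed option values stay private to the options code.
var (
	distStepMax     = 1
	kmerSize        = -1
	sampleAttribute = "sample"
)

// parseArgs registers the options on a fresh flag set and parses args.
// Registering resets every value to its default before parsing.
func parseArgs(args []string) error {
	fs := flag.NewFlagSet("obiminion-sketch", flag.ContinueOnError)
	fs.IntVar(&distStepMax, "distance", 1, "maximum number of differences for an edge")
	fs.IntVar(&kmerSize, "kmer-size", -1, "k-mer size, -1 selects it automatically")
	fs.StringVar(&sampleAttribute, "sample", "sample", "annotation key holding the sample name")
	return fs.Parse(args)
}

// Typed getters are the only way core code reads the options.
func CLIDistStepMax() int        { return distStepMax }
func CLIKmerSize() int           { return kmerSize }
func CLISampleAttribute() string { return sampleAttribute } // hypothetical name

func main() {
	if err := parseArgs([]string{"--distance", "2", "--sample", "run"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(CLIDistStepMax(), CLIKmerSize(), CLISampleAttribute()) // 2 -1 run
}
```

Because the algorithms only call the getters, they can be tested or reused without going through command-line parsing at all.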

Design Principles

  • Parallelism: Leverages goroutines and sync.WaitGroup for scalable graph construction.
  • Robustness: Handles edge cases (e.g., single-sequence inputs) gracefully with logging.
  • Extensibility: Modular architecture allows swapping alignment engines or graph representations.
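
As an illustration of the parallelism principle, the sketch below spreads the pairwise comparisons of a difference graph over a pool of goroutines coordinated by sync.WaitGroup. The edge type and the singleMismatch and parallelEdges helpers are assumptions made for this example and do not reflect the package's internal types.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// edge records a single-substitution link and the position of the mismatch.
type edge struct{ from, to, pos int }

// singleMismatch reports the mismatch position when a and b have the same
// length and differ at exactly one position.
func singleMismatch(a, b string) (int, bool) {
	if len(a) != len(b) {
		return 0, false
	}
	pos := -1
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			if pos >= 0 {
				return 0, false
			}
			pos = i
		}
	}
	return pos, pos >= 0
}

// parallelEdges compares every pair (i, j) with i < j. Row indices are handed
// out through a channel; each row slice is written by exactly one goroutine,
// so no lock is needed.
func parallelEdges(seqs []string, workers int) []edge {
	if workers < 1 {
		workers = 1
	}
	rows := make([][]edge, len(seqs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				for j := i + 1; j < len(seqs); j++ {
					if pos, ok := singleMismatch(seqs[i], seqs[j]); ok {
						rows[i] = append(rows[i], edge{i, j, pos})
					}
				}
			}
		}()
	}
	for i := range seqs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	// Concatenate rows in index order so the result is deterministic.
	var edges []edge
	for _, r := range rows {
		edges = append(edges, r...)
	}
	return edges
}

func main() {
	seqs := []string{"ACGT", "ACGA", "ACCT", "TTTT"}
	fmt.Println(parallelEdges(seqs, runtime.NumCPU())) // [{0 1 3} {0 2 2}]
}
```

Collecting results per row and merging them after Wait() keeps the output order independent of goroutine scheduling.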

Purpose: Accurate, reproducible consensus and denoising for NGS amplicon/metagenomic data at scale.