# `obiconsensus` Package: Semantic Overview
The `obiconsensus` package provides scalable, graph-based consensus and denoising tools for high-throughput biological sequence data within the OBITools4 ecosystem. It supports error correction, variant clustering, and consensus reconstruction from related amplicon or metagenomic reads, in both single-sample and multi-sample workflows.
## Public API Summary
### Core Algorithms & Utilities
- **`BuildConsensus()`**:
  Constructs a consensus sequence by *de Bruijn graph* assembly of the input reads. Automatically selects the `k`-mer size (fallback: longest common suffix analysis). Detects cycles in the graph and increases `k` incrementally until they are resolved. Optionally saves intermediate graphs (`*.gml`) and the FASTA inputs. The output carries metadata: consensus flag, total read weight (summed abundances), `k`-mer size used, and graph statistics.

- **`SampleWeight()`**:
  Returns a closure that retrieves per-sequence sample abundances (e.g., read counts) from sequence annotations or statistics, enabling weighted graph operations.

- **`SeqBySamples()`**:
  Groups sequences by sample identifier, using a configurable annotation key (default: `"sample"`). Supports grouping based on either statistical attributes (`StatsOn`) or sequence metadata.

- **`BuildDiffSeqGraph()`**:
  Builds a *difference graph* in which nodes represent unique sequences and edges encode single-nucleotide mutations (position + substitution). Uses `obialign.D1Or0` for exact alignment or approximate LCS-based distance scaling. Supports parallel edge computation and an optional progress bar.

- **`MinionDenoise()`**:
  Denoises sequences by identifying high-degree nodes (potential consensus hubs), building local consensuses via `BuildConsensus()`, and leaving low-degree nodes unchanged. Propagates sample annotations, weights, and metadata.

- **`MinionClusterDenoise()`**:
  Denoises via *weight-based clustering*: aggregates node weights (self + neighbors), selects local maxima as cluster heads, and builds one consensus per neighborhood.

- **`CLIOBIMinion()`**:
  CLI orchestrator for end-to-end denoising: loads sequences, groups them by sample (`--sample`), builds per-sample difference graphs (optional export via `--save-graph`), applies denoising (`MinionDenoise()` or `MinionClusterDenoise()`), optionally deduplicates the output (`--unique`), and annotates sequence lengths.
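The cycle handling described for `BuildConsensus()` can be illustrated with a minimal, self-contained Go sketch. It is not the package's implementation: the names `buildDeBruijn`, `hasCycle`, and `smallestAcyclicK` and the plain `map`-based graph are assumptions made for illustration, read weights are ignored, and `k` is assumed to be at least 2. The sketch builds a de Bruijn graph over (k-1)-mers and raises `k` until the graph contains no cycle.

```go
package main

import "fmt"

// buildDeBruijn returns a de Bruijn graph for the given k: every (k-1)-mer
// maps to the set of (k-1)-mers that follow it in at least one read.
// Hypothetical helper; k is assumed to be >= 2.
func buildDeBruijn(reads []string, k int) map[string]map[string]bool {
	graph := make(map[string]map[string]bool)
	for _, read := range reads {
		for i := 0; i+k <= len(read); i++ {
			from, to := read[i:i+k-1], read[i+1:i+k]
			if graph[from] == nil {
				graph[from] = make(map[string]bool)
			}
			graph[from][to] = true
		}
	}
	return graph
}

// hasCycle reports whether the directed graph contains a cycle,
// using a depth-first search with three node states.
func hasCycle(graph map[string]map[string]bool) bool {
	const (
		white = iota // not visited yet (zero value of the state map)
		grey         // on the current DFS path
		black        // fully explored
	)
	state := make(map[string]int)
	var visit func(node string) bool
	visit = func(node string) bool {
		state[node] = grey
		for next := range graph[node] {
			switch state[next] {
			case grey: // back edge: we returned to the current path
				return true
			case white:
				if visit(next) {
					return true
				}
			}
		}
		state[node] = black
		return false
	}
	for node := range graph {
		if state[node] == white && visit(node) {
			return true
		}
	}
	return false
}

// smallestAcyclicK tries k = kStart, kStart+1, ..., kMax and returns the
// first k whose de Bruijn graph has no cycle, or -1 if every k fails.
func smallestAcyclicK(reads []string, kStart, kMax int) int {
	for k := kStart; k <= kMax; k++ {
		if !hasCycle(buildDeBruijn(reads, k)) {
			return k
		}
	}
	return -1
}

func main() {
	// "ACGTACG" repeats ACG, so small k values produce a cyclic graph.
	reads := []string{"ACGTACGA", "CGTACGAT"}
	fmt.Println(smallestAcyclicK(reads, 3, 8)) // prints 5
}
```

With these two reads, `k = 3` and `k = 4` both yield a cycle through the repeated `ACG`, while at `k = 5` all 4-mers are distinct and the graph becomes a simple path.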
### Configuration & CLI Helpers
- **Clustering Mode**: `--cluster` (`-C`) enables graph-based clustering.
- **Distance Threshold**: `--distance` (`-d`, default: 1) sets the maximum Hamming distance for edge inclusion.
- **K-mer Control**: `--kmer-size SIZE` (default: -1, meaning auto-selected).
- **Sample Key**: `--sample` (`-s`, default: `"sample"`) defines the annotation field used for sample grouping.
- **Filtering Options**:
  - `--no-singleton`: excludes unique sequences.
  - `--low-coverage` (default: 0): filters low-abundance sequences.
- **Output Options**:
  - `--unique` (`-U`): enables deduplication (via `obiuniq`).
  - `--save-graph DIR`: exports graphs in GraphML.
  - `--save-ratio FILE`: writes edge abundance ratios as CSV.
- **Format Integration**: Works with `obiconvert` via unified input/output option sets (`InputOptionSet`, `OutputOptionSet`) for FASTA/FASTQ handling.
- **Getter Functions**: Typed accessors (e.g., `CLIDistStepMax()`, `CLIKmerSize()`) decouple argument parsing from core logic.
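The getter pattern in the last item can be sketched as follows. This is an illustration only: it uses Go's standard `flag` package as a stand-in for whatever option library the package actually relies on, and `newOptionSet`, `optDistStepMax`, `optKmerSize`, and `optSampleKey` are hypothetical names, not the package's API. The standard `flag` package treats `-name` and `--name` as equivalent, and a short alias is obtained here by binding two flag names to the same variable.

```go
package main

import (
	"flag"
	"fmt"
)

// Parsed option values stay private to the package;
// callers read them through the accessors below.
var (
	distStepMax int
	kmerSize    int
	sampleKey   string
)

// newOptionSet registers the options on a fresh FlagSet (hypothetical helper).
// Each short alias is bound to the same variable as its long form.
func newOptionSet() *flag.FlagSet {
	fs := flag.NewFlagSet("denoise-sketch", flag.ContinueOnError)
	fs.IntVar(&distStepMax, "distance", 1, "maximum distance for edge inclusion")
	fs.IntVar(&distStepMax, "d", 1, "shorthand for --distance")
	fs.IntVar(&kmerSize, "kmer-size", -1, "k-mer size (-1 = auto-selected)")
	fs.StringVar(&sampleKey, "sample", "sample", "annotation key holding the sample name")
	fs.StringVar(&sampleKey, "s", "sample", "shorthand for --sample")
	return fs
}

// Typed accessors: core logic depends on these, never on the parser itself.
func optDistStepMax() int  { return distStepMax }
func optKmerSize() int     { return kmerSize }
func optSampleKey() string { return sampleKey }

func main() {
	fs := newOptionSet()
	// A fixed argument list keeps the sketch deterministic;
	// a real tool would pass os.Args[1:] instead.
	if err := fs.Parse([]string{"--distance", "2", "-s", "run"}); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(optDistStepMax(), optKmerSize(), optSampleKey()) // prints: 2 -1 run
}
```

Keeping the variables unexported means the algorithms can be tested by calling them with explicit values, while the command-line layer remains the only code that knows how options are spelled.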
## Design Principles
- **Parallelism**: Leverages goroutines and `sync.WaitGroup` for scalable graph construction.
- **Robustness**: Handles edge cases (e.g., single-sequence inputs) gracefully with logging.
- **Extensibility**: Modular architecture allows swapping alignment engines or graph representations.

*Purpose: Accurate, reproducible consensus and denoising for NGS amplicon/metagenomic data at scale.*
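The parallelism principle listed under Design Principles can be illustrated with a short sketch of parallel edge computation for a difference graph. It is written under stated assumptions and is not the package's code: `edge`, `hamming`, and `diffEdges` are hypothetical names, only substitutions between equal-length sequences are considered (no indels), and a plain Hamming comparison stands in for the alignment routine. Worker goroutines pull row indices from a channel, and a `sync.WaitGroup` marks the point where all rows have been processed.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// edge links two sequence indices (from < to). Hypothetical type.
type edge struct{ from, to int }

// hamming counts mismatching positions, or returns -1 if the lengths differ.
func hamming(a, b string) int {
	if len(a) != len(b) {
		return -1
	}
	d := 0
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			d++
		}
	}
	return d
}

// diffEdges returns every pair (i, j), i < j, whose sequences differ by
// between 1 and maxDist substitutions. Rows are distributed over workers.
func diffEdges(seqs []string, maxDist, workers int) []edge {
	if workers < 1 {
		workers = 1
	}
	rows := make(chan int)
	var (
		mu    sync.Mutex // protects edges
		edges []edge
		wg    sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range rows {
				var local []edge // collected without locking
				for j := i + 1; j < len(seqs); j++ {
					if d := hamming(seqs[i], seqs[j]); d >= 1 && d <= maxDist {
						local = append(local, edge{i, j})
					}
				}
				mu.Lock()
				edges = append(edges, local...)
				mu.Unlock()
			}
		}()
	}
	for i := range seqs {
		rows <- i
	}
	close(rows)
	wg.Wait()
	// Workers finish in arbitrary order, so sort for a reproducible result.
	sort.Slice(edges, func(a, b int) bool {
		if edges[a].from != edges[b].from {
			return edges[a].from < edges[b].from
		}
		return edges[a].to < edges[b].to
	})
	return edges
}

func main() {
	seqs := []string{"ACGT", "ACGA", "TCGA", "GGGG"}
	fmt.Println(diffEdges(seqs, 1, 4)) // prints [{0 1} {1 2}]
}
```

Each worker accumulates edges in a local slice and takes the mutex once per row, which keeps lock contention low compared with locking on every edge.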