mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-29 19:40:40 +00:00

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

obiconsensus(1) — OBITools4 Manual

NAME

obiconsensus — denoise Oxford Nanopore Technologies (ONT) reads by building consensus sequences

SYNOPSIS

obiconsensus [OPTIONS] [FILE...]

DESCRIPTION

obiconsensus corrects sequencing errors in long reads produced by Oxford Nanopore Technologies (ONT) sequencers. Because ONT reads have a higher error rate than reads from short-read technologies, reads originating from the same biological molecule can differ slightly from one another. obiconsensus groups these related reads and builds a single, more reliable consensus sequence for each group.

The tool works by constructing a difference graph: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by --distance). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a de Bruijn graph approach. The result is a set of high-quality representative sequences, one per cluster.
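As an illustration of the difference graph described above, here is a minimal Python sketch. It is not the obiconsensus implementation: the function names are hypothetical, and it assumes that "differences" are substitutions between reads of equal length (Hamming distance), which may not match how the tool measures distance.

```python
from __future__ import annotations

from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def difference_graph(reads: list[str], max_distance: int = 1):
    """Return (abundance of each unique read, adjacency sets).

    Assumption: only substitutions are counted, so reads of different
    lengths are never connected.
    """
    counts = Counter(reads)
    graph = {seq: set() for seq in counts}
    for a, b in combinations(counts, 2):
        if len(a) == len(b) and hamming(a, b) <= max_distance:
            graph[a].add(b)
            graph[b].add(a)
    return counts, graph


# Toy reads: one abundant sequence, two single-error variants, one unrelated read.
reads = ["ACGT", "ACGT", "ACGT", "ACGA", "TCGT", "GGGG"]
counts, graph = difference_graph(reads, max_distance=1)
print(counts["ACGT"], sorted(graph["ACGT"]))
```

In this toy input the abundant read ACGT is linked to both variants, the two variants are not linked to each other (they differ at two positions), and GGGG stays isolated.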

Two denoising strategies are available:

Standard mode (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
Clustering mode (--cluster): groups reads around local abundance maxima and builds a consensus from each neighbourhood.
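The idea of a local abundance maximum can be sketched as follows. This is only an interpretation of the clustering mode: the page does not state the exact criterion, so the rule used here (a read is a centre when no neighbour is more abundant) and the function names are assumptions. The hub criterion of the standard mode is not described in this page and is not modelled.

```python
def local_maxima(counts, graph):
    """Reads whose abundance is at least that of every neighbour (assumed rule)."""
    return sorted(
        seq
        for seq, neighbours in graph.items()
        if all(counts[seq] >= counts[n] for n in neighbours)
    )


def neighbourhoods(centres, graph):
    """Each centre together with its immediate neighbours."""
    return {c: sorted(graph[c] | {c}) for c in centres}


# Toy difference graph: abundance per unique read and adjacency sets.
counts = {"ACGT": 10, "ACGA": 2, "TCGT": 1, "GGGG": 4}
graph = {
    "ACGT": {"ACGA", "TCGT"},
    "ACGA": {"ACGT"},
    "TCGT": {"ACGT"},
    "GGGG": set(),
}

centres = local_maxima(counts, graph)
groups = neighbourhoods(centres, graph)
print(centres)
print(groups)
```

Each neighbourhood would then be passed to the consensus step. An isolated read such as GGGG forms a neighbourhood of its own.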

Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with --out.

The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: sample). Each sample's reads are denoised independently.
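Per-sample processing amounts to partitioning the reads by the value of one annotation attribute before denoising. The sketch below uses a hypothetical in-memory record layout (dicts with "sequence" and "annotations" keys) and skips records that lack the attribute. How obiconsensus itself treats such records is not stated here.

```python
from collections import defaultdict


def split_by_sample(records, attribute="sample"):
    """Group sequences by the value of one annotation attribute.

    Records without the attribute are skipped (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for record in records:
        sample = record["annotations"].get(attribute)
        if sample is not None:
            groups[sample].append(record["sequence"])
    return dict(groups)


# Made-up records for illustration only.
records = [
    {"sequence": "ACGT", "annotations": {"sample": "soil_1"}},
    {"sequence": "ACGA", "annotations": {"sample": "soil_1"}},
    {"sequence": "GGGG", "annotations": {"sample": "soil_2"}},
    {"sequence": "TTTT", "annotations": {}},
]
by_sample = split_by_sample(records)
print(by_sample)
```

Passing a different attribute name, for example `split_by_sample(records, "library")`, mirrors what `--sample library` selects on the command line.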

INPUT FORMATS

obiconsensus recognises the following input formats automatically. A specific format can be forced with the corresponding flag:

Flag	Format
`--fasta`	FASTA
`--fastq`	FASTQ
`--embl`	EMBL flat file
`--genbank`	GenBank flat file
`--ecopcr`	ecoPCR output
`--csv`	CSV tabular format

Header annotation styles can be selected with --input-OBI-header (OBITools format) or --input-json-header (JSON format).

OUTPUT FORMATS

By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:

--fasta-output — write FASTA
--fastq-output — write FASTQ
--json-output — write JSON
--output-OBI-header / -O — annotate FASTA/FASTQ title lines in OBITools format
--output-json-header — annotate FASTA/FASTQ title lines in JSON format
--compress / -Z — compress output with gzip

Use --out FILE / -o FILE to write results to a file instead of standard output.

DENOISING OPTIONS

--distance INT, -d INT: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.
--cluster, -C: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.
--kmer-size SIZE: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of -1 means the size is estimated automatically from the data. Manual adjustment is rarely needed.
--no-singleton: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.
--low-coverage FLOAT: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.
--sample ATTRIBUTE, -s ATTRIBUTE: Name of the sequence annotation attribute that identifies the sample of origin. Default: sample. Each unique value of this attribute is treated as an independent sample during denoising.

OUTPUT ANNOTATION OPTIONS

--unique, -U: After denoising, dereplicate the output sequences (equivalent to running obiuniq). Identical consensus sequences across samples are merged into a single record carrying abundance information.
--save-graph DIRECTORY: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.
--save-ratio FILE: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.
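The effect of --unique can be pictured as merging identical sequences and summing their per-sample counts. The following sketch assumes a simple input of (sequence, sample, count) tuples and reuses the attribute names `merged_sample` and `count` from this page. The function name and data are hypothetical, and the real dereplication may track more information.

```python
from collections import Counter, defaultdict


def dereplicate(consensus_records):
    """Merge identical sequences across samples.

    `consensus_records` is an iterable of (sequence, sample, count) tuples.
    Returns one entry per distinct sequence with a per-sample map and a total.
    """
    per_sequence = defaultdict(Counter)
    for sequence, sample, count in consensus_records:
        per_sequence[sequence][sample] += count
    return {
        sequence: {
            "merged_sample": dict(samples),
            "count": sum(samples.values()),
        }
        for sequence, samples in per_sequence.items()
    }


# Made-up denoised output: the same consensus appears in two samples.
denoised = [("ACGT", "s1", 5), ("ACGT", "s2", 3), ("GGGG", "s1", 2)]
merged = dereplicate(denoised)
print(merged)
```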

PERFORMANCE OPTIONS

--max-cpu INT: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.
--batch-size INT: Minimum number of sequences processed together in a single batch. Default: 1.
--batch-size-max INT: Maximum number of sequences in a single batch. Default: 2000.
--batch-mem STRING: Maximum memory allocated per batch (e.g., 128M, 1G). Default: 128M. Set to 0 to disable the memory limit.
--no-progressbar: Disable the progress bar.
--no-order: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.

OTHER OPTIONS

--u-to-t: Convert uracil (U) to thymine (T) in all input sequences. Use this option when RNA sequences need to be processed as DNA.
--skip-empty: Remove sequences of length zero from the output.
--solexa: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.
--silent-warning: Suppress warning messages.
--debug: Enable detailed logging for troubleshooting.
--version: Print the version number and exit.
--help, -h: Display a brief help message and exit.

OUTPUT ATTRIBUTES

Each output consensus sequence carries several annotation attributes describing how it was built:

Attribute	Description
`consensus`	Boolean flag: `true` if the sequence is a true consensus, `false` if it was kept unchanged (e.g., isolated singleton)
`merged_sample`	Map of sample names to read counts contributing to this consensus
`count`	Total number of reads merged into this consensus across all samples
`kmer_size`	Size of the k-mers used to build the de Bruijn graph for this consensus
`seq_length`	Length of the consensus sequence
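When the output is written with --output-json-header, these attributes can be read back with any JSON parser. The sketch below assumes a title line of the form `>identifier {JSON object}`. That layout, the identifier, and the attribute values are illustrative assumptions, not taken from real obiconsensus output.

```python
import json


def parse_title(line):
    """Split a FASTA/FASTQ title line into (identifier, annotations).

    Assumes the annotations, if present, are a JSON object that follows
    the identifier after a single space.
    """
    body = line[1:].strip()  # drop the leading '>' or '@'
    seq_id, _, rest = body.partition(" ")
    rest = rest.strip()
    if not rest.startswith("{"):
        return seq_id, {}
    # raw_decode parses one JSON value and ignores any trailing text.
    annotations, _end = json.JSONDecoder().raw_decode(rest)
    return seq_id, annotations


# Made-up title line for illustration only.
title = (
    '>seq_1 {"consensus":true,"count":8,"kmer_size":12,'
    '"merged_sample":{"s1":5,"s2":3},"seq_length":250}'
)
seq_id, annotations = parse_title(title)
print(seq_id, annotations["count"], annotations["merged_sample"])
```

A quick consistency check on such records is that the values in `merged_sample` add up to `count`.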

EXAMPLES

Basic denoising of a FASTQ file:

obiconsensus reads.fastq > denoised.fastq

Increase the allowed distance between reads to 2:

obiconsensus --distance 2 reads.fastq > denoised.fastq

Use clustering mode and remove singletons:

obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq

Denoise, then dereplicate the output:

obiconsensus --unique reads.fastq > denoised_uniq.fastq

Save denoising graphs for inspection:

obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq

Specify the sample annotation attribute:

obiconsensus --sample library reads.fastq > denoised.fastq

NOTES

obiconsensus was designed primarily for Oxford Nanopore Technologies (ONT) amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, obiclean may be more appropriate.

The automatic k-mer size selection (--kmer-size -1) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.
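To see why a larger k-mer size can resolve circular structures, the sketch below builds a small de Bruijn graph and searches for the smallest k at which it becomes acyclic. It is a simplified illustration: the function names are hypothetical, k is increased in steps of one, and the actual step size, stopping rule, and fallback strategy used by obiconsensus are not described in this page.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Edges between (k-1)-mers, weighted by the number of supporting k-mers."""
    edges = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]][kmer[1:]] += 1
    return edges


def has_cycle(edges):
    """Depth-first search with three colours to detect a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for nxt in edges.get(node, {}):
            if colour[nxt] == GREY:
                return True  # back edge: the path returns to an open node
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(edges))


def smallest_acyclic_k(reads, start, max_k):
    """First k in [start, max_k] giving an acyclic graph, else None."""
    for k in range(start, max_k + 1):
        if not has_cycle(de_bruijn(reads, k)):
            return k
    return None  # a caller would switch to a fallback strategy here


# The repeated motif ACG makes the graph circular for small k.
reads = ["ACGTACGA"] * 3
chosen_k = smallest_acyclic_k(reads, start=3, max_k=7)
print(chosen_k)
```

With k = 3 or 4 the repeated motif folds the path back onto itself. At k = 5 every (k-1)-mer in the read is distinct, so the graph is a simple path and a consensus can be read along it.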
