Files
obitools4/autodoc/cmd/obiconsensus.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

7.8 KiB

obiconsensus(1) — OBITools4 Manual

NAME

obiconsensus — denoise Oxford Nanopore Technology (ONT) reads by building consensus sequences

SYNOPSIS

obiconsensus [OPTIONS] [FILE...]

DESCRIPTION

obiconsensus is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technology (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. obiconsensus groups these related reads and builds a single, more reliable consensus sequence for each group.

The tool works by constructing a difference graph: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by --distance). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a de Bruijn graph approach. The result is a set of high-quality representative sequences, one per cluster.

Two denoising strategies are available:

  • Standard mode (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
  • Clustering mode (--cluster): groups reads around local abundance maxima and builds a consensus from each neighbourhood.

Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with --out.

The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: sample). Each sample's reads are denoised independently.

INPUT FORMATS

obiconsensus recognises the following input formats automatically. A specific format can be forced with the corresponding flag:

Flag Format
--fasta FASTA
--fastq FASTQ
--embl EMBL flat file
--genbank GenBank flat file
--ecopcr ecoPCR output
--csv CSV tabular format

Header annotation styles can be selected with --input-OBI-header (OBITools format) or --input-json-header (JSON format).

OUTPUT FORMATS

By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:

  • --fasta-output — write FASTA
  • --fastq-output — write FASTQ
  • --json-output — write JSON
  • --output-OBI-header / -O — annotate FASTA/FASTQ title lines in OBITools format
  • --output-json-header — annotate FASTA/FASTQ title lines in JSON format
  • --compress / -Z — compress output with gzip

Use --out FILE / -o FILE to write results to a file instead of standard output.

DENOISING OPTIONS

--distance INT, -d INT
Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.
--cluster, -C
Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.
--kmer-size SIZE
Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of -1 means the size is estimated automatically from the data. Manual adjustment is rarely needed.
--no-singleton
Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.
--low-coverage FLOAT
Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.
--sample ATTRIBUTE, -s ATTRIBUTE
Name of the sequence annotation attribute that identifies the sample of origin. Default: sample. Each unique value of this attribute is treated as an independent sample during denoising.

OUTPUT ANNOTATION OPTIONS

--unique, -U
After denoising, dereplicate the output sequences (equivalent to running obiuniq). Identical consensus sequences across samples are merged into a single record carrying abundance information.
--save-graph DIRECTORY
Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.
--save-ratio FILE
Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.

PERFORMANCE OPTIONS

--max-cpu INT
Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.
--batch-size INT
Minimum number of sequences processed together in a single batch. Default: 1.
--batch-size-max INT
Maximum number of sequences in a single batch. Default: 2000.
--batch-mem STRING
Maximum memory allocated per batch (e.g., 128M, 1G). Default: 128M. Set to 0 to disable the memory limit.
--no-progressbar
Disable the progress bar.
--no-order
When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.

OTHER OPTIONS

--u-to-t
Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context.
--skip-empty
Remove sequences of length zero from the output.
--solexa
Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.
--silent-warning
Suppress warning messages.
--debug
Enable detailed logging for troubleshooting.
--version
Print the version number and exit.
--help, -h
Display a brief help message and exit.

OUTPUT ATTRIBUTES

Each output consensus sequence carries several annotation attributes describing how it was built:

Attribute Description
consensus Boolean flag: true if the sequence is a true consensus, false if it was kept unchanged (e.g., isolated singleton)
merged_sample Map of sample names to read counts contributing to this consensus
count Total number of reads merged into this consensus across all samples
kmer_size Size of the k-mers used to build the de Bruijn graph for this consensus
seq_length Length of the consensus sequence

EXAMPLES

Basic denoising of a FASTQ file:

obiconsensus reads.fastq > denoised.fastq

Increase the allowed distance between reads to 2:

obiconsensus --distance 2 reads.fastq > denoised.fastq

Use clustering mode and remove singletons:

obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq

Denoise, then dereplicate the output:

obiconsensus --unique reads.fastq > denoised_uniq.fastq

Save denoising graphs for inspection:

obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq

Specify the sample annotation attribute:

obiconsensus --sample library reads.fastq > denoised.fastq

SEE ALSO

obiuniq(1), obiclean(1), obigrep(1), obiconvert(1)

NOTES

obiconsensus was designed primarily for Oxford Nanopore Technology amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, obiclean may be more appropriate.

The automatic k-mer size selection (--kmer-size -1) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.