obiconsensus(1) — OBITools4 Manual
NAME
obiconsensus — denoise Oxford Nanopore Technology (ONT) reads by building consensus sequences
SYNOPSIS
obiconsensus [OPTIONS] [FILE...]
DESCRIPTION
obiconsensus is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technology (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. obiconsensus groups these related reads and builds a single, more reliable consensus sequence for each group.
The tool works by constructing a difference graph: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by --distance). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a de Bruijn graph approach. The result is a set of high-quality representative sequences, one per cluster.
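The difference graph can be pictured with a small Python sketch. It is an illustration only, not the OBITools4 implementation: the function names are hypothetical, and the distance is assumed to be a Hamming distance between equal-length reads, whereas obiconsensus may count differences in another way.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; assumes equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def difference_graph(reads, max_distance=1):
    """Return (counts, edges): one node per unique read, one edge per close pair."""
    counts = Counter(reads)
    edges = set()
    for a, b in combinations(sorted(counts), 2):
        # Simplifying assumption: only equal-length reads are compared.
        if len(a) == len(b) and hamming(a, b) <= max_distance:
            edges.add((a, b))
    return counts, edges


reads = ["ACGT", "ACGT", "ACGA", "TTTT"]
counts, edges = difference_graph(reads, max_distance=1)
print(counts["ACGT"], sorted(edges))
```

In this toy input, ACGT and ACGA differ at one position and are linked, while TTTT stays isolated.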
Two denoising strategies are available:
- Standard mode (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
- Clustering mode (--cluster): groups reads around local abundance maxima and builds a consensus from each neighbourhood.
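One plausible reading of "local abundance maxima" is sketched here in Python. The rule used (a node is a centre when its count is at least that of every neighbour) and the function name are assumptions made for illustration; the exact criteria obiconsensus applies to hubs and maxima are not specified in this manual.

```python
def local_maxima(counts, edges):
    """Map each node that is at least as abundant as all its neighbours
    to the sorted list of those neighbours."""
    neighbours = {node: set() for node in counts}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {
        node: sorted(neighbours[node])
        for node in counts
        if all(counts[node] >= counts[n] for n in neighbours[node])
    }


# Hypothetical read counts and difference-graph edges.
counts = {"ACGT": 10, "ACGA": 2, "ACGC": 1, "TTTT": 3}
edges = [("ACGT", "ACGA"), ("ACGT", "ACGC"), ("ACGA", "ACGC")]
centres = local_maxima(counts, edges)
print(centres)
```

With these made-up numbers, ACGT is the centre of a three-read neighbourhood and the isolated read TTTT is its own centre.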
Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with --out.
The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: sample). Each sample's reads are denoised independently.
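Per-sample processing amounts to partitioning the reads by the value of the sample attribute before any denoising takes place. A minimal Python sketch, assuming each record is a (sequence, annotations) pair and using a hypothetical helper name; how obiconsensus treats reads that lack the attribute is not covered here.

```python
from collections import defaultdict


def split_by_sample(records, attribute="sample"):
    """Group sequences by the value of the given annotation attribute."""
    groups = defaultdict(list)
    for sequence, annotations in records:
        # Raises KeyError if the attribute is missing (a choice of this sketch).
        groups[annotations[attribute]].append(sequence)
    return dict(groups)


# Hypothetical annotated reads.
records = [
    ("ACGT", {"sample": "S1"}),
    ("ACGA", {"sample": "S1"}),
    ("TTTT", {"sample": "S2"}),
]
by_sample = split_by_sample(records)
print(by_sample)
```

Each list in the result would then be denoised on its own, so reads from S1 never influence the consensus built for S2.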
INPUT FORMATS
obiconsensus recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|---|---|
| --fasta | FASTA |
| --fastq | FASTQ |
| --embl | EMBL flat file |
| --genbank | GenBank flat file |
| --ecopcr | ecoPCR output |
| --csv | CSV tabular format |
Header annotation styles can be selected with --input-OBI-header (OBITools format) or --input-json-header (JSON format).
OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
- --fasta-output: write FASTA
- --fastq-output: write FASTQ
- --json-output: write JSON
- --output-OBI-header / -O: annotate FASTA/FASTQ title lines in OBITools format
- --output-json-header: annotate FASTA/FASTQ title lines in JSON format
- --compress / -Z: compress output with gzip
Use --out FILE / -o FILE to write results to a file instead of standard output.
DENOISING OPTIONS
- --distance INT, -d INT: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.
- --cluster, -C: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.
- --kmer-size SIZE: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of -1 means the size is estimated automatically from the data. Manual adjustment is rarely needed.
- --no-singleton: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.
- --low-coverage FLOAT: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.
- --sample ATTRIBUTE, -s ATTRIBUTE: Name of the sequence annotation attribute that identifies the sample of origin. Default: sample. Each unique value of this attribute is treated as an independent sample during denoising.
OUTPUT ANNOTATION OPTIONS
- --unique, -U: After denoising, dereplicate the output sequences (equivalent to running obiuniq). Identical consensus sequences across samples are merged into a single record carrying abundance information.
- --save-graph DIRECTORY: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.
- --save-ratio FILE: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.
PERFORMANCE OPTIONS
- --max-cpu INT: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.
- --batch-size INT: Minimum number of sequences processed together in a single batch. Default: 1.
- --batch-size-max INT: Maximum number of sequences in a single batch. Default: 2000.
- --batch-mem STRING: Maximum memory allocated per batch (e.g., 128M, 1G). Default: 128M. Set to 0 to disable the memory limit.
- --no-progressbar: Disable the progress bar.
- --no-order: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.
OTHER OPTIONS
- --u-to-t: Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context.
- --skip-empty: Remove sequences of length zero from the output.
- --solexa: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.
- --silent-warning: Suppress warning messages.
- --debug: Enable detailed logging for troubleshooting.
- --version: Print the version number and exit.
- --help, -h: Display a brief help message and exit.
OUTPUT ATTRIBUTES
Each output consensus sequence carries several annotation attributes describing how it was built:
| Attribute | Description |
|---|---|
| consensus | Boolean flag: true if the sequence is a true consensus, false if it was kept unchanged (e.g., isolated singleton) |
| merged_sample | Map of sample names to read counts contributing to this consensus |
| count | Total number of reads merged into this consensus across all samples |
| kmer_size | Size of the k-mers used to build the de Bruijn graph for this consensus |
| seq_length | Length of the consensus sequence |
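When the output carries JSON annotations, these attributes can be read with any JSON parser. The Python sketch below assumes a title line made of an identifier, a space, and a JSON object; that layout, the identifier cons_1, and every value in the record are invented for illustration and should be checked against real output.

```python
import json


def parse_title(line):
    """Split a FASTA/FASTQ title line into (identifier, annotations)."""
    identifier, _, payload = line.lstrip(">@").partition(" ")
    return identifier, (json.loads(payload) if payload else {})


# Hypothetical title line; the values are made up.
title = (
    '>cons_1 {"consensus":true,"merged_sample":{"S1":7,"S2":5},'
    '"count":12,"kmer_size":15,"seq_length":412}'
)
identifier, attrs = parse_title(title)
total = sum(attrs["merged_sample"].values())
print(identifier, attrs["consensus"], total)
```

In this made-up record the per-sample counts in merged_sample add up to count, which is what the attribute descriptions suggest.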
EXAMPLES
Basic denoising of a FASTQ file:
obiconsensus reads.fastq > denoised.fastq
Increase the allowed distance between reads to 2:
obiconsensus --distance 2 reads.fastq > denoised.fastq
Use clustering mode and remove singletons:
obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq
Denoise, then dereplicate the output:
obiconsensus --unique reads.fastq > denoised_uniq.fastq
Save denoising graphs for inspection:
obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq
Specify the sample annotation attribute:
obiconsensus --sample library reads.fastq > denoised.fastq
SEE ALSO
obiuniq(1), obiclean(1), obigrep(1), obiconvert(1)
NOTES
obiconsensus was designed primarily for Oxford Nanopore Technology amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, obiclean may be more appropriate.
The automatic k-mer size selection (--kmer-size -1) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.
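Why a larger k-mer size removes circular structures can be seen on a toy de Bruijn graph. The Python sketch below is a simplified illustration, not the OBITools4 code: the starting size, the step of 1, the upper bound, and the function names are all assumptions, and the fallback strategy is left to the caller.

```python
def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
            graph.setdefault(kmer[1:], set())
    return graph


def has_cycle(graph):
    """Depth-first search with three states: 0 unseen, 1 on the stack, 2 done."""
    state = dict.fromkeys(graph, 0)

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in graph)


def smallest_acyclic_k(reads, start=3, stop=31):
    """Increase k until the graph has no cycle; None means 'use a fallback'."""
    for k in range(start, stop + 1):
        if not has_cycle(de_bruijn_edges(reads, k)):
            return k
    return None


# Hypothetical reads containing the short repeat ACGACG.
reads = ["ACGACGTT", "ACGACGTT", "ACGACGTA"]
k = smallest_acyclic_k(reads)
print(k)
```

The repeat ACGACG creates a cycle for k = 3 and k = 4; once the k-mers are long enough to span the repeated unit, the graph becomes a simple path with one branch at the final base.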