# obiconsensus(1) — OBITools4 Manual ## NAME `obiconsensus` — denoise Oxford Nanopore Technology (ONT) reads by building consensus sequences ## SYNOPSIS ``` obiconsensus [OPTIONS] [FILE...] ``` ## DESCRIPTION `obiconsensus` is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technology (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. `obiconsensus` groups these related reads and builds a single, more reliable consensus sequence for each group. The tool works by constructing a *difference graph*: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by `--distance`). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a *de Bruijn graph* approach. The result is a set of high-quality representative sequences, one per cluster. Two denoising strategies are available: - **Standard mode** (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours. - **Clustering mode** (`--cluster`): groups reads around local abundance maxima and builds a consensus from each neighbourhood. Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with `--out`. The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: `sample`). Each sample's reads are denoised independently. ## INPUT FORMATS `obiconsensus` recognises the following input formats automatically. A specific format can be forced with the corresponding flag: | Flag | Format | |------|--------| | `--fasta` | FASTA | | `--fastq` | FASTQ | | `--embl` | EMBL flat file | | `--genbank` | GenBank flat file | | `--ecopcr` | ecoPCR output | | `--csv` | CSV tabular format | Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format). ## OUTPUT FORMATS By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced: - `--fasta-output` — write FASTA - `--fastq-output` — write FASTQ - `--json-output` — write JSON - `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format - `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format - `--compress` / `-Z` — compress output with gzip Use `--out FILE` / `-o FILE` to write results to a file instead of standard output. ## DENOISING OPTIONS `--distance INT`, `-d INT` : Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together. `--cluster`, `-C` : Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences. `--kmer-size SIZE` : Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of `-1` means the size is estimated automatically from the data. Manual adjustment is rarely needed. `--no-singleton` : Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal. `--low-coverage FLOAT` : Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples. `--sample ATTRIBUTE`, `-s ATTRIBUTE` : Name of the sequence annotation attribute that identifies the sample of origin. Default: `sample`. Each unique value of this attribute is treated as an independent sample during denoising. ## OUTPUT ANNOTATION OPTIONS `--unique`, `-U` : After denoising, dereplicate the output sequences (equivalent to running `obiuniq`). Identical consensus sequences across samples are merged into a single record carrying abundance information. `--save-graph DIRECTORY` : Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure. `--save-ratio FILE` : Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning. ## PERFORMANCE OPTIONS `--max-cpu INT` : Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage. `--batch-size INT` : Minimum number of sequences processed together in a single batch. Default: 1. `--batch-size-max INT` : Maximum number of sequences in a single batch. Default: 2000. `--batch-mem STRING` : Maximum memory allocated per batch (e.g., `128M`, `1G`). Default: `128M`. Set to `0` to disable the memory limit. `--no-progressbar` : Disable the progress bar. `--no-order` : When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs. ## OTHER OPTIONS `--u-to-t` : Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context. `--skip-empty` : Remove sequences of length zero from the output. `--solexa` : Interpret quality scores using the Solexa encoding rather than the standard Phred encoding. `--silent-warning` : Suppress warning messages. `--debug` : Enable detailed logging for troubleshooting. `--version` : Print the version number and exit. `--help`, `-h` : Display a brief help message and exit. ## OUTPUT ATTRIBUTES Each output consensus sequence carries several annotation attributes describing how it was built: | Attribute | Description | |-----------|-------------| | `consensus` | Boolean flag: `true` if the sequence is a true consensus, `false` if it was kept unchanged (e.g., isolated singleton) | | `merged_sample` | Map of sample names to read counts contributing to this consensus | | `count` | Total number of reads merged into this consensus across all samples | | `kmer_size` | Size of the k-mers used to build the de Bruijn graph for this consensus | | `seq_length` | Length of the consensus sequence | ## EXAMPLES **Basic denoising of a FASTQ file:** ```sh obiconsensus reads.fastq > denoised.fastq ``` **Increase the allowed distance between reads to 2:** ```sh obiconsensus --distance 2 reads.fastq > denoised.fastq ``` **Use clustering mode and remove singletons:** ```sh obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq ``` **Denoise, then dereplicate the output:** ```sh obiconsensus --unique reads.fastq > denoised_uniq.fastq ``` **Save denoising graphs for inspection:** ```sh obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq ``` **Specify the sample annotation attribute:** ```sh obiconsensus --sample library reads.fastq > denoised.fastq ``` ## SEE ALSO `obiuniq`(1), `obiclean`(1), `obigrep`(1), `obiconvert`(1) ## NOTES `obiconsensus` was designed primarily for Oxford Nanopore Technology amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, `obiclean` may be more appropriate. The automatic k-mer size selection (`--kmer-size -1`) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.