mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
189 lines
7.8 KiB
Markdown
# obiconsensus(1) — OBITools4 Manual

## NAME

`obiconsensus` — denoise Oxford Nanopore Technologies (ONT) reads by building consensus sequences

## SYNOPSIS

```
obiconsensus [OPTIONS] [FILE...]
```

## DESCRIPTION

`obiconsensus` corrects sequencing errors in long reads produced by Oxford Nanopore Technologies (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, reads originating from the same biological molecule can differ slightly from one another. `obiconsensus` groups these related reads and builds a single, more reliable consensus sequence for each group.

The tool works by constructing a *difference graph*: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by `--distance`). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a *de Bruijn graph* approach. The result is a set of high-quality representative sequences, one per cluster.

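The difference graph described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: it assumes that only substitutions between equal-length reads count as differences, and the names `hamming` and `difference_graph` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions, or None when the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def difference_graph(reads, max_distance=1):
    """Return (abundance per unique read, adjacency sets of the graph)."""
    counts = Counter(reads)                    # one node per unique read
    graph = {seq: set() for seq in counts}
    for a, b in combinations(counts, 2):       # compare every pair of nodes
        d = hamming(a, b)
        if d is not None and d <= max_distance:
            graph[a].add(b)                    # undirected edge a <-> b
            graph[b].add(a)
    return counts, graph


reads = ["ACGT", "ACGT", "ACGT", "ACGA", "TTTT"]
counts, graph = difference_graph(reads, max_distance=1)
```

In this toy input, `ACGT` and `ACGA` are linked because they differ at one position, while `TTTT` stays isolated.
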
Two denoising strategies are available:

- **Standard mode** (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
- **Clustering mode** (`--cluster`): groups reads around local abundance maxima and builds a consensus from each neighbourhood.

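To make "local abundance maxima" concrete, here is a small Python sketch of the idea behind clustering mode. It assumes that a read is a local maximum when it is strictly more abundant than every neighbour in the difference graph. The exact rule used by `obiconsensus` is not stated here, and the data and function names are hypothetical.

```python
# Hypothetical abundances and difference graph for five unique reads.
counts = {"ACGT": 10, "ACGA": 2, "ACCT": 1, "TTTT": 4, "TTTA": 1}
graph = {
    "ACGT": {"ACGA", "ACCT"},
    "ACGA": {"ACGT"},
    "ACCT": {"ACGT"},
    "TTTT": {"TTTA"},
    "TTTA": {"TTTT"},
}


def local_maxima(counts, graph):
    """Reads that are strictly more abundant than all of their neighbours."""
    return sorted(
        s for s in counts if all(counts[s] > counts[n] for n in graph[s])
    )


def neighbourhoods(counts, graph):
    """Map each local maximum to itself plus its immediate neighbours."""
    return {
        centre: sorted(graph[centre] | {centre})
        for centre in local_maxima(counts, graph)
    }


groups = neighbourhoods(counts, graph)
```

Each entry of `groups` is one neighbourhood from which a consensus would then be built.
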
Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with `--out`.

The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: `sample`). Each sample's reads are denoised independently.

## INPUT FORMATS

`obiconsensus` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:

| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |

Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).

## OUTPUT FORMATS

By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:

- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip

Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.

## DENOISING OPTIONS

`--distance INT`, `-d INT`
: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.

`--cluster`, `-C`
: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.

`--kmer-size SIZE`
: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of `-1` means the size is estimated automatically from the data. Manual adjustment is rarely needed.

`--no-singleton`
: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.

`--low-coverage FLOAT`
: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.

`--sample ATTRIBUTE`, `-s ATTRIBUTE`
: Name of the sequence annotation attribute that identifies the sample of origin. Default: `sample`. Each unique value of this attribute is treated as an independent sample during denoising.

## OUTPUT ANNOTATION OPTIONS

`--unique`, `-U`
: After denoising, dereplicate the output sequences (equivalent to running `obiuniq`). Identical consensus sequences across samples are merged into a single record carrying abundance information.

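The merging performed by dereplication can be sketched in a few lines of Python. This only illustrates the idea and is neither the `obiuniq` nor the `obiconsensus` code; the record layout (sequence, sample, count) and the function name `dereplicate` are assumptions made for the example.

```python
from collections import defaultdict


def dereplicate(records):
    """Merge records sharing the same sequence, summing counts per sample."""
    merged = defaultdict(lambda: defaultdict(int))
    for sequence, sample, count in records:
        merged[sequence][sample] += count
    return [
        {
            "sequence": sequence,
            "count": sum(by_sample.values()),   # total across samples
            "merged_sample": dict(by_sample),   # per-sample breakdown
        }
        for sequence, by_sample in merged.items()
    ]


# Hypothetical consensus sequences produced for two samples, A and B.
records = [("ACGT", "A", 8), ("ACGT", "B", 4), ("TTTT", "A", 3)]
unique_records = dereplicate(records)
```

Here the two `ACGT` records collapse into one entry with a total count of 12 and a per-sample map recording 8 reads from A and 4 from B.
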
`--save-graph DIRECTORY`
: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.

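Because GraphML is plain XML, a saved graph can be inspected with the Python standard library alone. The sketch below counts how many neighbours each node has. The embedded document is a hypothetical stand-in: the node identifiers and any attributes that `obiconsensus` actually writes are not specified here, and only the standard GraphML namespace and the `node`/`edge` elements are assumed.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace; the sample content itself is invented.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="sample_A" edgedefault="undirected">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n0" target="n1"/>
    <edge source="n0" target="n2"/>
  </graph>
</graphml>"""


def degree_table(graphml_text):
    """Return {node id: number of incident edges} for a GraphML document."""
    root = ET.fromstring(graphml_text)
    degrees = {node.get("id"): 0 for node in root.iterfind(".//g:node", NS)}
    for edge in root.iterfind(".//g:edge", NS):
        degrees[edge.get("source")] += 1
        degrees[edge.get("target")] += 1
    return degrees


degrees = degree_table(SAMPLE)
```

To inspect a real file, read its contents and pass them to `degree_table` in place of `SAMPLE`.
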
`--save-ratio FILE`
: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.

## PERFORMANCE OPTIONS

`--max-cpu INT`
: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.

`--batch-size INT`
: Minimum number of sequences processed together in a single batch. Default: 1.

`--batch-size-max INT`
: Maximum number of sequences in a single batch. Default: 2000.

`--batch-mem STRING`
: Maximum memory allocated per batch (e.g., `128M`, `1G`). Default: `128M`. Set to `0` to disable the memory limit.

`--no-progressbar`
: Disable the progress bar.

`--no-order`
: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.

## OTHER OPTIONS

`--u-to-t`
: Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context.

`--skip-empty`
: Remove sequences of length zero from the output.

`--solexa`
: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.

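The difference between the two quality scales can be illustrated numerically. Solexa scores are based on the odds p/(1-p) of an error, Phred scores on the error probability p itself, which gives the standard relation Q_phred = 10 · log10(10^(Q_solexa/10) + 1). The sketch below applies that published relation; it is not taken from the `obiconsensus` source, and the function name is hypothetical.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# The two scales diverge at low quality and converge at high quality.
low = solexa_to_phred(0)     # about 3.01
high = solexa_to_phred(40)   # about 40.0004
```

For scores above roughly 10 the two scales are nearly identical, so the encoding matters mostly for low-quality bases.
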
`--silent-warning`
: Suppress warning messages.

`--debug`
: Enable detailed logging for troubleshooting.

`--version`
: Print the version number and exit.

`--help`, `-h`
: Display a brief help message and exit.

## OUTPUT ATTRIBUTES

Each output consensus sequence carries several annotation attributes describing how it was built:

| Attribute | Description |
|-----------|-------------|
| `consensus` | Boolean flag: `true` if the sequence is a true consensus, `false` if it was kept unchanged (e.g., isolated singleton) |
| `merged_sample` | Map of sample names to read counts contributing to this consensus |
| `count` | Total number of reads merged into this consensus across all samples |
| `kmer_size` | Size of the k-mers used to build the de Bruijn graph for this consensus |
| `seq_length` | Length of the consensus sequence |

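As an illustration of how these attributes might be consumed downstream, the Python sketch below parses a title line written in the JSON header style. It assumes that the title consists of an identifier, a space, and a JSON object; the identifier and all values are invented for the example.

```python
import json

# Hypothetical title line (without the leading '>' or '@').
title = (
    'consensus_1 {"consensus":true,"count":12,"kmer_size":9,'
    '"merged_sample":{"A":8,"B":4},"seq_length":150}'
)


def parse_title(title):
    """Split a title line into (identifier, dict of annotation attributes)."""
    identifier, _, annotation = title.partition(" ")
    attributes = json.loads(annotation) if annotation else {}
    return identifier, attributes


identifier, attributes = parse_title(title)

# In this invented record, count equals the sum of the per-sample counts.
total_from_samples = sum(attributes["merged_sample"].values())
```

Comparing `count` with the sum over `merged_sample` is a simple consistency check on such a record.
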
## EXAMPLES

**Basic denoising of a FASTQ file:**

```sh
obiconsensus reads.fastq > denoised.fastq
```

**Increase the allowed distance between reads to 2:**

```sh
obiconsensus --distance 2 reads.fastq > denoised.fastq
```

**Use clustering mode and remove singletons:**

```sh
obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq
```

**Denoise, then dereplicate the output:**

```sh
obiconsensus --unique reads.fastq > denoised_uniq.fastq
```

**Save denoising graphs for inspection:**

```sh
obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq
```

**Specify the sample annotation attribute:**

```sh
obiconsensus --sample library reads.fastq > denoised.fastq
```

## SEE ALSO

`obiuniq`(1), `obiclean`(1), `obigrep`(1), `obiconvert`(1)

## NOTES

`obiconsensus` was designed primarily for Oxford Nanopore Technologies amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, `obiclean` may be more appropriate.

The automatic k-mer size selection (`--kmer-size -1`) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.

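The effect of k-mer size on circular structures can be seen in a small Python sketch. It builds a de Bruijn graph whose nodes are (k-1)-mers and whose edges are k-mers, then increases k until the graph contains no cycle. This is a simplified illustration, not the assembly code of `obiconsensus`; the function names and the stopping rule are assumptions.

```python
def de_bruijn(reads, k):
    """Adjacency sets: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
            graph.setdefault(kmer[1:], set())
    return graph


def has_cycle(graph):
    """Depth-first search with three node states to detect a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = dict.fromkeys(graph, WHITE)

    def visit(node):
        state[node] = GREY                     # node is on the current path
        for nxt in graph[node]:
            if state[nxt] == GREY:             # back edge: cycle found
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK                    # fully explored
        return False

    return any(state[n] == WHITE and visit(n) for n in graph)


def smallest_acyclic_k(reads, k_start, k_max):
    """First k in [k_start, k_max] giving an acyclic graph, else None."""
    for k in range(k_start, k_max + 1):
        if not has_cycle(de_bruijn(reads, k)):
            return k
    return None


# The repeat "ACG" creates a cycle for k = 3 and k = 4 but not for k = 5.
k = smallest_acyclic_k(["ACGACGT"], k_start=3, k_max=6)
```

Longer k-mers span the repeated motif, which is why increasing k can resolve a circular graph at the cost of requiring longer exact matches between reads.
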