# obiconsensus(1) — OBITools4 Manual

## NAME

`obiconsensus` — denoise Oxford Nanopore Technology (ONT) reads by building consensus sequences

## SYNOPSIS

```
obiconsensus [OPTIONS] [FILE...]
```

## DESCRIPTION

`obiconsensus` is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technology (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. `obiconsensus` groups these related reads and builds a single, more reliable consensus sequence for each group.

The tool works by constructing a *difference graph*: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by `--distance`). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a *de Bruijn graph* approach. The result is a set of high-quality representative sequences, one per cluster.
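
The difference graph described above can be illustrated with a small sketch. It is not the code used by `obiconsensus`: the function names are hypothetical, the distance is a plain Hamming distance (substitutions only, equal-length reads), and every pair of unique reads is compared naively. Whether the real tool also counts insertions and deletions is not stated here.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions, or None when the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def difference_graph(reads, max_distance=1):
    """Return (abundance of each unique read, adjacency sets between unique reads)."""
    abundance = Counter(reads)                 # one node per unique read
    edges = {seq: set() for seq in abundance}
    for a, b in combinations(abundance, 2):    # naive all-pairs comparison
        d = hamming(a, b)
        if d is not None and d <= max_distance:
            edges[a].add(b)
            edges[b].add(a)
    return abundance, edges


# Toy data: one abundant read, two single-substitution variants, one unrelated read.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC", "ACGTAG", "TTTTTT"]
abundance, edges = difference_graph(reads, max_distance=1)
for seq in abundance:
    print(seq, abundance[seq], sorted(edges[seq]))
```

In this toy input the abundant read is linked to both variants, the two variants are not linked to each other (they differ at two positions), and the unrelated read stays isolated.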

Two denoising strategies are available:

- **Standard mode** (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
- **Clustering mode** (`--cluster`): groups reads around local abundance maxima and builds a consensus from each neighbourhood.
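
To make the idea of a local abundance maximum concrete, here is a minimal sketch under stated assumptions: a node counts as a maximum when it is strictly more abundant than each of its neighbours, and its neighbourhood is the node plus its direct neighbours. The strictness, the handling of ties, and the function names are illustrative choices, not the documented behaviour of `--cluster`.

```python
def local_maxima(abundance, edges):
    """Nodes strictly more abundant than every neighbour (isolated nodes qualify)."""
    return [
        node
        for node, count in abundance.items()
        if all(count > abundance[n] for n in edges[node])
    ]


def neighbourhoods(abundance, edges):
    """Map each local maximum to the set made of itself and its direct neighbours."""
    return {seed: {seed} | edges[seed] for seed in local_maxima(abundance, edges)}


# Hypothetical graph: abundance per unique read and adjacency sets.
abundance = {"ACGT": 10, "ACGA": 2, "ACGC": 1, "TTTT": 4, "TTTA": 1, "GGGG": 1}
edges = {
    "ACGT": {"ACGA", "ACGC"},
    "ACGA": {"ACGT", "ACGC"},
    "ACGC": {"ACGT", "ACGA"},
    "TTTT": {"TTTA"},
    "TTTA": {"TTTT"},
    "GGGG": set(),
}

seeds = local_maxima(abundance, edges)
groups = neighbourhoods(abundance, edges)
print(seeds)
```

With this definition two neighbours of equal abundance would both fail the test, so a real implementation needs an explicit tie-breaking rule.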

Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with `--out`.

The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: `sample`). Each sample's reads are denoised independently.
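
The per-sample partition can be pictured as a simple grouping step. The sketch below assumes each read is available as a dictionary with a `sequence` string and an `annotations` mapping; that record layout and the function name are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def split_by_sample(records, attribute="sample"):
    """Group sequences by the value of the given annotation attribute.

    A record lacking the attribute raises KeyError; how obiconsensus itself
    treats such reads is not covered by this sketch.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["annotations"][attribute]].append(record["sequence"])
    return dict(groups)


records = [
    {"sequence": "ACGTAC", "annotations": {"sample": "siteA"}},
    {"sequence": "ACGTAG", "annotations": {"sample": "siteB"}},
    {"sequence": "ACGAAC", "annotations": {"sample": "siteA"}},
]

by_sample = split_by_sample(records)
for sample, sequences in by_sample.items():
    print(sample, sequences)
```

Each list would then be denoised on its own, so a variant seen in `siteA` never influences the consensus built for `siteB`.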

## INPUT FORMATS

`obiconsensus` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:

| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |

Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
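
As an illustration of the JSON header style, the sketch below parses a title line in which the identifier is followed by a JSON object and an optional free-text definition. That exact layout, the example attribute values, and the function name are assumptions made for the example; check real files before relying on it.

```python
import json


def parse_json_title(line):
    """Split a FASTA/FASTQ title line into (identifier, annotations, definition)."""
    if not line.startswith((">", "@")):
        raise ValueError("not a FASTA/FASTQ title line")
    parts = line[1:].strip().split(None, 1)
    if not parts:
        raise ValueError("title line has no identifier")
    identifier = parts[0]
    rest = parts[1].lstrip() if len(parts) > 1 else ""
    if not rest.startswith("{"):
        return identifier, {}, rest            # no JSON annotations present
    # raw_decode parses one JSON value and reports where it ended,
    # which leaves any trailing definition text untouched.
    annotations, end = json.JSONDecoder().raw_decode(rest)
    return identifier, annotations, rest[end:].strip()


title = '>read_001 {"sample":"siteA","count":3} example definition'
identifier, annotations, definition = parse_json_title(title)
print(identifier, annotations["sample"], annotations["count"], definition)
```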

## OUTPUT FORMATS

By default, the output is written as FASTQ when quality scores are present and as FASTA otherwise. The format can be forced:

- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip

Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.

## DENOISING OPTIONS

`--distance INT`, `-d INT`
: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.

`--cluster`, `-C`
: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.

`--kmer-size SIZE`
: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of `-1` means the size is estimated automatically from the data. Manual adjustment is rarely needed.

`--no-singleton`
: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.

`--low-coverage FLOAT`
: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.

`--sample ATTRIBUTE`, `-s ATTRIBUTE`
: Name of the sequence annotation attribute that identifies the sample of origin. Default: `sample`. Each unique value of this attribute is treated as an independent sample during denoising.
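
To show what the k-mer size controls, here is a minimal de Bruijn graph sketch. Each k-mer becomes a weighted edge from its (k-1)-prefix to its (k-1)-suffix, and a consensus is read by a greedy walk along the heaviest edges. The greedy walk is a deliberate simplification chosen for the example, not the assembly procedure of `obiconsensus`, and all function names are hypothetical.

```python
from collections import Counter


def de_bruijn_edges(reads, k):
    """Count every k-mer as an edge (prefix of length k-1 -> suffix of length k-1)."""
    weights = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            weights[(kmer[:-1], kmer[1:])] += 1
    return weights


def greedy_consensus(reads, k):
    """Follow the heaviest outgoing edge from the most common read start."""
    successors = {}
    for (src, dst), weight in de_bruijn_edges(reads, k).items():
        successors.setdefault(src, []).append((weight, dst))
    starts = Counter(r[:k - 1] for r in reads if len(r) >= k)
    node = starts.most_common(1)[0][0]
    consensus, seen = node, {node}
    while node in successors:
        _, nxt = max(successors[node])     # heaviest edge; ties broken by node name
        if nxt in seen:                    # stop rather than loop on a cycle
            break
        consensus += nxt[-1]
        seen.add(nxt)
        node = nxt
    return consensus


# Three identical reads plus two reads carrying one substitution each.
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGTAGCA", "ACCTTGCA"]
consensus = greedy_consensus(reads, k=4)
print(consensus)
```

Smaller values of `k` merge more reads onto shared nodes but make repeated words, and therefore cycles, more likely; larger values do the opposite.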

## OUTPUT ANNOTATION OPTIONS

`--unique`, `-U`
: After denoising, dereplicate the output sequences (equivalent to running `obiuniq`). Identical consensus sequences across samples are merged into a single record carrying abundance information.

`--save-graph DIRECTORY`
: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.

`--save-ratio FILE`
: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.
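
The effect of `--unique` on identical consensus sequences can be sketched as a merge of per-sample counts. The input tuples and the function name below are hypothetical; the output keys reuse the `merged_sample` and `count` attribute names from this manual.

```python
from collections import defaultdict


def dereplicate(records):
    """Merge (sequence, sample, count) tuples that share the same sequence."""
    merged = defaultdict(lambda: defaultdict(int))
    for sequence, sample, count in records:
        merged[sequence][sample] += count
    return [
        {
            "sequence": sequence,
            "merged_sample": dict(per_sample),
            "count": sum(per_sample.values()),
        }
        for sequence, per_sample in merged.items()
    ]


records = [
    ("ACGTTGCA", "siteA", 3),
    ("ACGTTGCA", "siteB", 2),
    ("TTGACCAT", "siteA", 1),
]

unique = dereplicate(records)
for record in unique:
    print(record)
```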

## PERFORMANCE OPTIONS

`--max-cpu INT`
: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.

`--batch-size INT`
: Minimum number of sequences processed together in a single batch. Default: 1.

`--batch-size-max INT`
: Maximum number of sequences in a single batch. Default: 2000.

`--batch-mem STRING`
: Maximum memory allocated per batch (e.g., `128M`, `1G`). Default: `128M`. Set to `0` to disable the memory limit.

`--no-progressbar`
: Disable the progress bar.

`--no-order`
: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.

## OTHER OPTIONS

`--u-to-t`
: Convert uracil (U) to thymine (T) in all input sequences. Use this option when the input contains RNA sequences that must be processed as DNA.

`--skip-empty`
: Remove sequences of length zero from the output.

`--solexa`
: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.

`--silent-warning`
: Suppress warning messages.

`--debug`
: Enable detailed logging for troubleshooting.

`--version`
: Print the version number and exit.

`--help`, `-h`
: Display a brief help message and exit.

## OUTPUT ATTRIBUTES

Each output consensus sequence carries several annotation attributes describing how it was built:

| Attribute | Description |
|-----------|-------------|
| `consensus` | Boolean flag: `true` if the sequence was rebuilt as a consensus, `false` if it was kept unchanged (e.g., an isolated singleton) |
| `merged_sample` | Map of sample names to read counts contributing to this consensus |
| `count` | Total number of reads merged into this consensus across all samples |
| `kmer_size` | Size of the k-mers used to build the de Bruijn graph for this consensus |
| `seq_length` | Length of the consensus sequence |
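
A quick sanity check on these attributes might look like the sketch below. It assumes the annotations of a record have already been parsed into a dictionary, and it infers two relations from the descriptions in the table: `count` equals the sum of the `merged_sample` values, and `seq_length` equals the length of the sequence. Both relations and the function name are assumptions, not guarantees made by the tool.

```python
def check_record(sequence, annotations):
    """Return a list of human-readable inconsistencies (empty when none are found)."""
    problems = []
    if not isinstance(annotations.get("consensus"), bool):
        problems.append("consensus is missing or not a boolean")
    if annotations.get("seq_length") != len(sequence):
        problems.append("seq_length does not match the sequence length")
    merged = annotations.get("merged_sample", {})
    if merged and annotations.get("count") != sum(merged.values()):
        problems.append("count differs from the sum of merged_sample")
    return problems


good = {
    "consensus": True,
    "merged_sample": {"siteA": 3, "siteB": 2},
    "count": 5,
    "kmer_size": 4,
    "seq_length": 8,
}
bad = dict(good, count=4)

print(check_record("ACGTTGCA", good))
print(check_record("ACGTTGCA", bad))
```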

## EXAMPLES

**Basic denoising of a FASTQ file:**

```sh
obiconsensus reads.fastq > denoised.fastq
```

**Increase the allowed distance between reads to 2:**

```sh
obiconsensus --distance 2 reads.fastq > denoised.fastq
```

**Use clustering mode and remove singletons:**

```sh
obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq
```

**Denoise, then dereplicate the output:**

```sh
obiconsensus --unique reads.fastq > denoised_uniq.fastq
```

**Save denoising graphs for inspection:**

```sh
obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq
```

**Specify the sample annotation attribute:**

```sh
obiconsensus --sample library reads.fastq > denoised.fastq
```

## SEE ALSO

`obiuniq`(1), `obiclean`(1), `obigrep`(1), `obiconvert`(1)

## NOTES

`obiconsensus` was designed primarily for Oxford Nanopore Technology amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, `obiclean` may be more appropriate.

The automatic k-mer size selection (`--kmer-size -1`) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.
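
The retry behaviour described above can be sketched as a loop that rebuilds the de Bruijn graph with a larger `k` until it contains no cycle. The step of one, the upper bound, and returning `None` to signal the fallback are assumptions made for the example; the function names are hypothetical.

```python
def de_bruijn_successors(reads, k):
    """Adjacency sets of the de Bruijn graph whose nodes are (k-1)-mers."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
            graph.setdefault(kmer[1:], set())
    return graph


def has_cycle(graph):
    """Iterative depth-first search with white/grey/black node colouring."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(graph, WHITE)
    for root in graph:
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if colour[child] == GREY:      # back edge: a cycle exists
                    return True
                if colour[child] == WHITE:
                    colour[child] = GREY
                    stack.append((child, iter(graph[child])))
                    break
            else:                              # all children explored
                colour[node] = BLACK
                stack.pop()
    return False


def smallest_acyclic_k(reads, k_start, k_max):
    """First k in [k_start, k_max] giving an acyclic graph, or None for a fallback."""
    for k in range(k_start, k_max + 1):
        if not has_cycle(de_bruijn_successors(reads, k)):
            return k
    return None


chosen = smallest_acyclic_k(["ACGACGTT"], k_start=3, k_max=8)
print(chosen)
```

The repeated `ACG` in the toy read creates a cycle for `k` of 3 and 4; at `k` of 5 every node is distinct and the graph becomes a simple path. A homopolymer such as `AAAAAA` never becomes acyclic in this range, which is the situation where a fallback would be needed.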