mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Files

T

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

14 KiB

Raw Blame History

NAME

obiuniq — dereplicate sequence data sets

SYNOPSIS

obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
        [--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
        [--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
        [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
        [--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
        [--input-OBI-header] [--input-json-header] [--json-output]
        [--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
        [--no-order] [--no-progressbar] [--no-singleton]
        [--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
        [--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
        [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
        [--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
        [--with-leaves] [<args>]

DESCRIPTION

obiuniq groups identical sequences together and replaces them with a single representative, recording the total number of original occurrences as an abundance count. This process — called dereplication — is a standard step in amplicon sequencing workflows: it dramatically reduces the number of sequence records to process, while preserving exact counts needed for downstream statistical analyses.

By default, two sequences are considered identical if and only if their nucleotide strings are the same. Using --category-attribute (repeatable), additional metadata fields can be included in the identity criterion. For example, grouping by sample name keeps the same sequence as separate records when it occurs in different samples, enabling per-sample abundance tracking.

For each group of identical sequences, obiuniq emits one output record carrying the merged metadata of all members. The --merge option (repeatable) instructs the command to also record, in an attribute named merged_<KEY>, the distribution of KEY attribute values across the sequences collapsed into each group — useful for provenance tracking and quality control.

Sequences that appear only once in the entire dataset (singletons) can be removed with --no-singleton. Singletons often represent sequencing errors rather than genuine biological variants, so their removal is a common noise-reduction step.

INPUT

obiuniq accepts biological sequence data in FASTA, FASTQ, EMBL, GenBank, ecoPCR, or CSV format (auto-detected by default, or forced with format flags such as --fasta, --fastq, --embl, etc.). Input is read from one or more files given as positional arguments, or from standard input when no files are provided.

When multiple input files are provided, obiuniq assumes they are ordered (e.g., paired-end reads in the same read order). If no such ordering exists, use --no-order to signal that files can be consumed independently.

FASTA/FASTQ header annotations are parsed heuristically by default. Use --input-OBI-header for OBI-formatted headers or --input-json-header for JSON-formatted headers. RNA sequences can be normalised to DNA on the fly with --u-to-t.

OUTPUT

obiuniq writes dereplicated sequences to standard output or to the file specified by --out. Each output record represents one group of identical sequences (identical under the chosen grouping criterion). The output carries the merged metadata from all input records in the group.

The output format defaults to FASTA. Even when the input contains quality scores (FASTQ), quality information is not preserved across merged sequences, so the output is written in FASTA format unless --fastq-output is explicitly requested. Output annotations follow the OBI header format when --output-OBI-header is set, or JSON when --output-json-header is set. The output can be gzip-compressed with --compress.

For each output record:

The abundance count reflects how many input sequences were merged into the group.
Attributes created by --merge KEY are named merged_KEY and map each observed value of the KEY attribute to the count of input sequences carrying that value within the group.
All other attributes are merged from the contributing records according to the standard OBITools4 merging rules.

Observed output example

>seq008 {"count":1,"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt

OPTIONS

Dereplication Options

--category-attribute|-c <CATEGORY> (default: [])
Adds one metadata attribute to the grouping criterion. Two sequences are placed in the same group only when they are nucleotide-identical and share the same value for every attribute listed with -c. This option can be repeated to combine multiple attributes (e.g., -c sample -c primer). Records that lack a listed attribute receive the value set by --na-value.

--chunk-count <int> (default: 100)
Controls how many internal partitions the dataset is split into during processing. A higher value reduces per-partition memory usage at the cost of more temporary files; a lower value increases per-partition memory but reduces I/O overhead. Tune this when processing very large or very small datasets.

--in-memory (default: false)
Stores intermediate data chunks in RAM rather than in temporary disk files. Speeds up processing on datasets that fit comfortably in available memory; omit this flag (the default) for large datasets that exceed available RAM.

--merge|-m <KEY> (default: [])
Creates an output attribute named merged_KEY that maps each observed value of the KEY attribute to the count of input sequences carrying that value within the group. Repeat to track multiple attributes. Useful for tracking which sample or category contributions were collapsed into each group.

--na-value <NA_NAME> (default: "NA")
Value assigned to a category attribute when a sequence record does not carry that attribute. All sequences lacking the attribute are grouped together under this placeholder, rather than being treated as incomparable.

--no-singleton (default: false)
Discards all output records whose abundance count is exactly one — i.e., sequences that occur only once across the entire input. Removing singletons is a standard heuristic for excluding sequencing errors from further analysis.

Input Options

--batch-mem <string> (default: "", env: OBIBATCHMEM)
Maximum memory budget per processing batch (e.g. 128K, 64M, 1G). Set to 0 to disable the memory ceiling. Overrides --batch-size-max when both are set.

--batch-size <int> (default: 10, env: OBIBATCHSIZE)
Minimum number of sequences per batch (floor).

--batch-size-max <int> (default: 2000, env: OBIBATCHSIZEMAX)
Maximum number of sequences per batch (ceiling).

--csv (default: false)
Parse input as CSV format.

--ecopcr (default: false)
Parse input as ecoPCR output format.

--embl (default: false)
Parse input as EMBL flatfile format.

--fasta (default: false)
Parse input as FASTA format.

--fastq (default: false)
Parse input as FASTQ format.

--genbank (default: false)
Parse input as GenBank flatfile format.

--input-OBI-header (default: false)
Treat FASTA/FASTQ title line annotations as OBI-format key=value pairs.

--input-json-header (default: false)
Treat FASTA/FASTQ title line annotations as JSON objects.

--no-order (default: false)
When multiple input files are provided, indicates that there is no ordering relationship among them.

--skip-empty (default: false)
Suppress sequences of length zero from the output.

--solexa (default: false, env: OBISOLEXA)
Decode quality strings according to the Solexa specification rather than the standard Phred encoding.

--u-to-t (default: false)
Convert uracil (U) to thymine (T) in all input sequences, normalising RNA to DNA representation.

Output Options

--compress|-Z (default: false)
Compress output using gzip.

--fasta-output (default: false)
Write output in FASTA format (default when no quality scores are available).

--fastq-output (default: false)
Write output in FASTQ format (default when quality scores are present).

--json-output (default: false)
Write output in JSON format.

--out|-o <FILENAME> (default: "-")
Write output to the specified file instead of standard output.

--output-OBI-header|-O (default: false)
Write FASTA/FASTQ title line annotations in OBI format.

--output-json-header (default: false)
Write FASTA/FASTQ title line annotations in JSON format.

Taxonomy Options

--fail-on-taxonomy (default: false)
Cause obiuniq to exit with an error if a taxid in the data is not a currently valid taxon in the loaded taxonomy.

--raw-taxid (default: false)
Print taxids in output without supplementary information (taxon name and rank).

--taxonomy|-t <string> (default: "")
Path to the taxonomy database used to validate or update taxids.

--update-taxid (default: false)
Automatically replace merged taxids with the most recent valid taxid.

--with-leaves (default: false)
When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation.

Execution Options

--max-cpu <int> (default: 16, env: OBIMAXCPU)
Number of parallel threads used to compute the result.

--debug (default: false, env: OBIDEBUG)
Enable debug mode by setting the log level to debug.

--no-progressbar (default: false)
Disable the progress bar.

--silent-warning (default: false, env: OBIWARNING)
Suppress warning messages.

--pprof (default: false)
Enable the pprof profiling server (address logged at startup).

--pprof-goroutine <int> (default: 6060, env: OBIPPROFGOROUTINE)
Port for the goroutine blocking profile endpoint.

--pprof-mutex <int> (default: 10, env: OBIPPROFMUTEX)
Rate for the mutex contention profile.

--version (default: false)
Print the version string and exit.

--help|-h|-? (default: false)
Print usage information and exit.

EXAMPLES

# Dereplicate a FASTQ file of amplicon reads; write unique sequences to a FASTA output file.
obiuniq reads.fastq -o out_basic.fastq

Expected output: 4 sequences written to out_basic.fastq.

# Dereplicate keeping sequences separate per sample (category attribute),
# and discard singletons to remove likely sequencing errors.
obiuniq -c sample --no-singleton reads.fastq -o out_no_singleton.fastq

Expected output: 2 sequences written to out_no_singleton.fastq.

# Dereplicate per sample, recording the sample distribution in 'merged_sample',
# and use 'UNKNOWN' for reads missing the sample attribute.
obiuniq -c sample --merge sample --na-value UNKNOWN reads.fastq -o out_merge.fastq

Expected output: 5 sequences written to out_merge.fastq.

# Process a dataset entirely in memory using 200 internal partitions,
# writing gzip-compressed output.
obiuniq --in-memory --chunk-count 200 --compress -o out_inmemory.fastq.gz reads.fastq

Expected output: 4 sequences written to out_inmemory.fastq.gz.

# Dereplicate reads from two sample files with no assumed ordering between them,
# grouping by both sample and primer attributes.
obiuniq --no-order -c sample -c primer sample1.fastq sample2.fastq -o out_multifile.fastq

Expected output: 4 sequences written to out_multifile.fastq.

NOTES

For datasets that do not fit in RAM, obiuniq uses temporary disk-backed chunk files by default. The number of chunks is controlled by --chunk-count (default 100). Increasing this value lowers per-chunk memory requirements; decreasing it reduces I/O at the cost of higher peak memory. Use --in-memory only when the full working set fits in available RAM, as exceeding memory will degrade performance or cause out-of-memory failures.

Singletons (sequences with abundance = 1) are a common source of noise in amplicon sequencing, often arising from PCR or sequencing errors. The --no-singleton flag is therefore recommended for most metabarcoding workflows, unless the study design requires retaining all observed variants.

When the --category-attribute option is used, records that lack the specified attribute are grouped together under the --na-value placeholder (default "NA"). This ensures that all records participate in dereplication without being silently dropped, but users should be aware that heterogeneous records with different missing attributes may be unintentionally merged.

14 KiB Raw Blame History