mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

obidemerge

NAME

obidemerge — split merged sequence records back into individual, sample-annotated copies

SYNOPSIS

obidemerge [options] [input_files...]

DESCRIPTION

In a typical metabarcoding workflow, obiuniq or similar tools collapse identical sequences from multiple samples into a single representative record. That record carries a statistics attribute (for example merged_sample) that stores, for every original sample, how many times the sequence was observed. This compact representation is convenient for clustering and denoising, but some downstream analyses need the original, per-sample view.

obidemerge reverses that merging step. For each input sequence, it reads the statistics stored under a chosen attribute (by default sample) and produces one output sequence per entry in that statistics map. Each output sequence is a copy of the original, but:

its sample attribute (or whichever slot you chose) is set to the name of the individual sample,
its read count is set to the abundance recorded for that sample.

The original statistics attribute is removed from all output sequences.

Sequences that carry no statistics for the chosen slot are passed through unchanged.
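The behaviour described above can be sketched in a few lines of Python. This is an illustrative model only, not the OBITools4 implementation: the record layout (a dict with `id`, `sequence` and `annotations` keys) and the function name `demerge` are assumptions made for the example, and the statistics are assumed to live under `merged_<slot>`.

```python
from typing import Any, Dict, Iterator

Record = Dict[str, Any]  # hypothetical in-memory record, not an OBITools4 type


def demerge(record: Record, slot: str = "sample") -> Iterator[Record]:
    """Yield one copy of `record` per entry in its merged_<slot> statistics."""
    stats_key = "merged_" + slot  # assumed storage name for the statistics
    stats = record["annotations"].get(stats_key)
    if not stats:
        # No statistics for this slot: pass the record through unchanged.
        yield record
        return
    for sample_name, abundance in stats.items():
        # Copy every annotation except the statistics map itself.
        annotations = {
            k: v for k, v in record["annotations"].items() if k != stats_key
        }
        annotations[slot] = sample_name   # name of the individual sample
        annotations["count"] = abundance  # abundance recorded for that sample
        yield {
            "id": record["id"],
            "sequence": record["sequence"],
            "annotations": annotations,
        }


merged = {
    "id": "seq001",
    "sequence": "acgtacgt",
    "annotations": {
        "count": 9,
        "merged_sample": {"sampleA": 5, "sampleB": 3, "sampleC": 1},
    },
}
demerged = list(demerge(merged))
```

Here `demerged` holds three records, one per sample, each without the `merged_sample` map.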

The command reads sequences from one or more files, or from standard input when no file is given, and writes the results to standard output or to the file specified with --out.

INPUT FORMATS

obidemerge accepts all sequence formats supported by OBITools4:

Format	Description
FASTA	Plain nucleotide sequences with annotation in the title line
FASTQ	Sequences with per-base quality scores
EMBL	European Nucleotide Archive flat-file format
GenBank	NCBI GenBank flat-file format
ecoPCR	Output produced by the ecoPCR tool
CSV	Comma-separated values with sequence and metadata columns

The format is detected automatically from the file extension or content. You can override detection with the format flags listed under Input format options below.

Annotations embedded in FASTA/FASTQ title lines can follow the OBI key=value style (--input-OBI-header) or JSON style (--input-json-header).

OUTPUT FORMATS

By default, the output format mirrors the input:

If the input contains quality scores, output is FASTQ.
Otherwise, output is FASTA with JSON-style annotations (use --output-OBI-header for OBI key=value style).

You can force a specific format with --fasta-output, --fastq-output, or --json-output.
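The selection rule above can be written as a small decision function. This is a sketch under stated assumptions: the function name `choose_output_format` is hypothetical, and raising an error when FASTQ is forced without quality scores is one possible reading of "requires quality scores", not a documented behaviour of the tool.

```python
from typing import Optional


def choose_output_format(has_quality: bool, forced: Optional[str] = None) -> str:
    """Return "fasta", "fastq" or "json" following the rules described above."""
    if forced is not None:
        # Assumption: forcing FASTQ without quality scores is treated as an error.
        if forced == "fastq" and not has_quality:
            raise ValueError("FASTQ output requires quality scores in the input")
        return forced
    # Default: mirror the input.
    return "fastq" if has_quality else "fasta"
```

For instance, `choose_output_format(True, "fasta")` models passing --fasta-output on FASTQ input.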

OPTIONS

Demerge option

--demerge <slot>, -d <slot>: Name of the sequence attribute that holds the per-sample statistics to expand. Each key in that statistics map becomes a separate output sequence. Default: sample

Output options

--out <FILENAME>, -o <FILENAME>: Write output to this file instead of standard output. Use - for standard output. Default: - (standard output)
--fasta-output: Write output in FASTA format, even when quality scores are available. Default: false
--fastq-output: Write output in FASTQ format (requires quality scores in the input). Default: false
--json-output: Write output in JSON format, one record per line. Default: false
--output-OBI-header, -O: Write FASTA/FASTQ title lines in OBI key=value annotation style. Default: false (JSON-style headers)
--output-json-header: Write FASTA/FASTQ title lines in JSON annotation style. Default: false
--compress, -Z: Compress the output with gzip. Default: false
--skip-empty: Discard sequences of length zero from the output. Default: false

Input format options

--fasta: Force reading in FASTA format.
--fastq: Force reading in FASTQ format.
--embl: Force reading in EMBL flat-file format.
--genbank: Force reading in GenBank flat-file format.
--ecopcr: Force reading in ecoPCR output format.
--csv: Force reading in CSV format.
--input-OBI-header: Parse FASTA/FASTQ title lines as OBI-style key=value annotations.
--input-json-header: Parse FASTA/FASTQ title lines as JSON annotations.
--solexa: Decode quality scores using the Solexa/Illumina 1.0 convention instead of the standard Phred scale. Use this only for very old sequencing data. Default: false
--u-to-t: Convert uracil (U) to thymine (T) in all sequences. Useful when working with RNA-derived data that should be treated as DNA. Default: false
--no-order: When reading from several input files, do not attempt to preserve the order of records across files. May improve speed when order does not matter. Default: false

Taxonomy options

--taxonomy <path>, -t <path>: Path to the OBITools4 taxonomy database. Required only if taxonomic identifiers need to be resolved or validated during output. Default: none
--fail-on-taxonomy: Stop with an error if a taxonomic identifier in the data is not found in the loaded taxonomy database. Default: false
--raw-taxid: Print taxonomic identifiers as plain numbers, without appending the taxon name and rank. Default: false
--update-taxid: Automatically replace deprecated taxonomic identifiers with their current equivalents, as declared in the taxonomy database. Default: false
--with-leaves: When a taxonomy is extracted from the sequence file itself, treat each sequence as a leaf node under its annotated taxonomic identifier. Default: false

Performance options

--max-cpu <int>: Maximum number of parallel processing threads. Increase for faster processing on multi-core machines. Default: 16 (or the value of the OBIMAXCPU environment variable)
--batch-size <int>: Minimum number of sequences processed together as a group. Default: 1
--batch-size-max <int>: Maximum number of sequences processed together as a group. Default: 2000
--batch-mem <size>: Maximum memory used per processing group (e.g. 64M, 1G). Set to 0 to disable the memory limit and rely on --batch-size-max alone. Default: 128M
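To give an intuition for how the three batch options might interact, here is a hedged Python sketch. It is not taken from the OBITools4 source: the helper names, the binary interpretation of the `K`/`M`/`G` suffixes, the use of sequence length as a memory estimate, and the exact rule for closing a batch are all assumptions.

```python
from typing import Iterable, Iterator, List

# Assumption: suffixes are binary multiples (1M = 1024 * 1024 bytes).
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as "64M" or "1G" to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def batches(
    seqs: Iterable[str],
    size_min: int = 1,
    size_max: int = 2000,
    mem: str = "128M",
) -> Iterator[List[str]]:
    """Group sequences, closing a batch on the size or memory limit."""
    limit = parse_size(mem)  # 0 disables the memory limit
    batch: List[str] = []
    used = 0
    for seq in seqs:
        batch.append(seq)
        used += len(seq)  # crude memory estimate: one byte per base
        full = len(batch) >= size_max
        over_mem = limit > 0 and used >= limit and len(batch) >= size_min
        if full or over_mem:
            yield batch
            batch, used = [], 0
    if batch:
        yield batch  # final, possibly smaller, batch
```

With `mem="0"`, only `size_max` decides where a batch ends, which mirrors the description of --batch-mem 0.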

Display options

--no-progressbar: Hide the progress bar. Default: false
--silent-warning: Suppress warning messages. Default: false
--debug: Enable verbose debug logging. Default: false
--version: Print the OBITools4 version and exit.
--help, -h, -?: Print this help message and exit.

EXAMPLES

Example 1 — basic demerge with an explicit slot name

After running obiuniq, the file unique.fasta contains merged sequences whose merged_sample attribute records abundance per sample. Demerge back to one sequence per sample:

obidemerge -d sample unique.fasta > per_sample_merged.fasta

Expected output: 7 sequences written to per_sample_merged.fasta.

Example 2 — demerge with the default `sample` slot

Because sample is the default slot, the -d flag can be omitted when the statistics were merged on sample:

obidemerge unique.fasta > per_sample_default.fasta

Expected output: 7 sequences written to per_sample_default.fasta.

Example 3 — write compressed output to a file

obidemerge -d sample -o per_sample.fasta.gz --compress unique.fasta

Expected output: 7 sequences written (compressed) to per_sample.fasta.gz.

Example 4 — pipeline use: cluster, then demerge

Obtain unique sequences, cluster them, then expand the clusters back to individual sample records for ecological analysis:

obiuniq -m sample reads.fastq \
  | obiclean ... \
  | obidemerge -d sample -o demerged.fasta

Example 5 — process multiple input files

obidemerge -d sample run1_unique.fasta run2_unique.fasta > combined_demerged.fasta

Expected output: 6 sequences written to combined_demerged.fasta.

NOTES

Relationship to obiuniq. obiuniq --merge sample stores per-sample counts under an attribute named merged_sample. To expand them again, call obidemerge with -d sample, or omit -d, since sample is the default. The -d option takes the logical slot name (here sample), not the internal storage name (merged_sample).

Read counts after demerging. Each output sequence has its read count set to the value recorded in the statistics map for that sample. If you sum the counts of all output sequences that share the same identifier, you recover the total count of the original merged record.
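The count invariant described above can be checked with a few lines of Python. The `(identifier, count)` pairs below are illustrative values, and the variable names are chosen for the example.

```python
from collections import defaultdict

# Illustrative (identifier, count) pairs, one per demerged output record.
demerged_counts = [
    ("seq001", 5),
    ("seq001", 3),
    ("seq001", 1),
    ("seq002", 2),
    ("seq002", 7),
]

# Sum the per-sample counts of records that share an identifier.
totals = defaultdict(int)
for seq_id, count in demerged_counts:
    totals[seq_id] += count

# Totals the original merged records would have carried in this example.
original_totals = {"seq001": 9, "seq002": 9}
assert dict(totals) == original_totals
```

A mismatch between the two dictionaries would indicate that records were lost or duplicated somewhere between merging and demerging.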

Order of output sequences. The order in which the per-sample copies of a single merged sequence appear in the output is not guaranteed. If a stable order is required, pipe the output through obisort.

OUTPUT

obidemerge writes one sequence record per sample entry found in the statistics attribute. Each output record is a copy of the input sequence, with:

the statistics attribute (merged_<slot>) removed,
the <slot> attribute set to the sample name,
the count attribute set to the abundance for that sample.

Sequences with no statistics for the chosen slot are passed through unchanged.

Observed output example

>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc
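Output of this shape is straightforward to post-process. The Python sketch below assumes that each title line is an identifier followed by a single JSON object, as in the example above; the reader function `read_json_fasta` is hypothetical and not part of OBITools4. It aggregates read counts per sample.

```python
import json
from typing import Dict, Iterator, List, Tuple


def read_json_fasta(text: str) -> Iterator[Tuple[str, Dict, str]]:
    """Yield (identifier, annotations, sequence) from JSON-annotated FASTA text."""
    seq_id = None
    annotations: Dict = {}
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, annotations, "".join(chunks)
            # Assumption: "<id> <json object>" with nothing after the JSON.
            seq_id, _, rest = line[1:].partition(" ")
            annotations = json.loads(rest) if rest.strip() else {}
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, annotations, "".join(chunks)


example = """\
>seq001 {"count":5,"sample":"sampleA"}
acgt
>seq001 {"count":3,"sample":"sampleB"}
acgt
>seq004 {"count":6}
aaaa
"""

# Total reads per sample; pass-through records have no sample attribute.
per_sample: Dict[str, int] = {}
for _seq_id, ann, _seq in read_json_fasta(example):
    name = ann.get("sample", "(no sample)")
    per_sample[name] = per_sample.get(name, 0) + ann.get("count", 1)
```

Records such as seq004, which were passed through unchanged, end up under the placeholder key `"(no sample)"`.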
