Files
obitools4/autodoc/cmd/obidemerge.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

10 KiB

obidemerge

NAME

obidemerge — split merged sequence records back into individual, sample-annotated copies

SYNOPSIS

obidemerge [options] [input_files...]

DESCRIPTION

In a typical metabarcoding workflow, obiuniq or similar tools collapse identical sequences from multiple samples into a single representative record. That record carries a statistics attribute (for example merged_sample) that stores, for every original sample, how many times the sequence was observed. This compact representation is convenient for clustering and denoising, but some downstream analyses need the original, per-sample view.

obidemerge reverses that merging step. For each input sequence, it reads the statistics stored under a chosen attribute (by default sample) and produces one output sequence per entry in that statistics map. Each output sequence is a copy of the original, but:

  • its sample attribute (or whichever slot you chose) is set to the name of the individual sample,
  • its read count is set to the abundance recorded for that sample.

The original statistics attribute is removed from all output sequences.

Sequences that carry no statistics for the chosen slot are passed through unchanged.

The command reads sequences from one or more files, or from standard input when no file is given, and writes the results to standard output or to the file specified with --out.

INPUT FORMATS

obidemerge accepts all sequence formats supported by OBITools4:

Format Description
FASTA Plain nucleotide sequences with annotation in the title line
FASTQ Sequences with per-base quality scores
EMBL European Nucleotide Archive flat-file format
GenBank NCBI GenBank flat-file format
ecoPCR Output produced by the ecoPCR tool
CSV Comma-separated values with sequence and metadata columns

The format is detected automatically from the file extension or content. You can override detection with the format flags listed under Input format options below.

Annotations embedded in FASTA/FASTQ title lines can follow the OBI key=value style (--input-OBI-header) or JSON style (--input-json-header).

OUTPUT FORMATS

By default, the output format mirrors the input:

  • If the input contains quality scores, output is FASTQ.
  • Otherwise, output is FASTA with OBI-style annotations.

You can force a specific format with --fasta-output, --fastq-output, or --json-output.

OPTIONS

Demerge option

--demerge <slot>, -d <slot>
Name of the sequence attribute that holds the per-sample statistics to expand. Each key in that statistics map becomes a separate output sequence. Default: sample

Output options

--out <FILENAME>, -o <FILENAME>
Write output to this file instead of standard output. Use - for standard output. Default: - (standard output)
--fasta-output
Write output in FASTA format, even when quality scores are available. Default: false
--fastq-output
Write output in FASTQ format (requires quality scores in the input). Default: false
--json-output
Write output in JSON format, one record per line. Default: false
--output-OBI-header, -O
Write FASTA/FASTQ title lines in OBI key=value annotation style. Default: false (JSON-style headers)
--output-json-header
Write FASTA/FASTQ title lines in JSON annotation style. Default: false
--compress, -Z
Compress the output with gzip. Default: false
--skip-empty
Discard sequences of length zero from the output. Default: false

Input format options

--fasta
Force reading in FASTA format.
--fastq
Force reading in FASTQ format.
--embl
Force reading in EMBL flat-file format.
--genbank
Force reading in GenBank flat-file format.
--ecopcr
Force reading in ecoPCR output format.
--csv
Force reading in CSV format.
--input-OBI-header
Parse FASTA/FASTQ title lines as OBI-style key=value annotations.
--input-json-header
Parse FASTA/FASTQ title lines as JSON annotations.
--solexa
Decode quality scores using the Solexa/Illumina 1.0 convention instead of the standard Phred scale. Use this only for very old sequencing data. Default: false
--u-to-t
Convert uracil (U) to thymine (T) in all sequences. Useful when working with RNA-derived data that should be treated as DNA. Default: false
--no-order
When reading from several input files, do not attempt to preserve the order of records across files. May improve speed when order does not matter. Default: false

Taxonomy options

--taxonomy <path>, -t <path>
Path to the OBITools4 taxonomy database. Required only if taxonomic identifiers need to be resolved or validated during output. Default: none
--fail-on-taxonomy
Stop with an error if a taxonomic identifier in the data is not found in the loaded taxonomy database. Default: false
--raw-taxid
Print taxonomic identifiers as plain numbers, without appending the taxon name and rank. Default: false
--update-taxid
Automatically replace deprecated taxonomic identifiers with their current equivalents, as declared in the taxonomy database. Default: false
--with-leaves
When a taxonomy is extracted from the sequence file itself, treat each sequence as a leaf node under its annotated taxonomic identifier. Default: false

Performance options

--max-cpu <int>
Maximum number of parallel processing threads. Increase for faster processing on multi-core machines. Default: 16 (or the value of the OBIMAXCPU environment variable)
--batch-size <int>
Minimum number of sequences processed together as a group. Default: 1
--batch-size-max <int>
Maximum number of sequences processed together as a group. Default: 2000
--batch-mem <size>
Maximum memory used per processing group (e.g. 64M, 1G). Set to 0 to disable the memory limit and rely on --batch-size-max alone. Default: 128M

Display options

--no-progressbar
Hide the progress bar. Default: false
--silent-warning
Suppress warning messages. Default: false
--debug
Enable verbose debug logging. Default: false
--version
Print the OBITools4 version and exit.
--help, -h, -?
Print this help message and exit.

EXAMPLES

Example 1 — basic demerge using the default slot

After running obiuniq, the file unique.fasta contains merged sequences whose merged_sample attribute records abundance per sample. Demerge back to one sequence per sample:

obidemerge -d sample unique.fasta > per_sample_merged.fasta

Expected output: 7 sequences written to per_sample_merged.fasta.

Example 2 — demerge with the default sample slot

If the statistics are already stored under the attribute named sample (the default), no -d flag is needed:

obidemerge unique.fasta > per_sample_default.fasta

Expected output: 7 sequences written to per_sample_default.fasta.

Example 3 — write compressed output to a file

obidemerge -d sample -o per_sample.fasta.gz --compress unique.fasta

Expected output: 7 sequences written (compressed) to per_sample.fasta.gz.

Example 4 — pipeline use: cluster, then demerge

Obtain unique sequences, cluster them, then expand the clusters back to individual sample records for ecological analysis:

obiuniq -m sample reads.fastq \
  | obiclean ... \
  | obidemerge -d sample -o demerged.fasta

Example 5 — process multiple input files

obidemerge -d sample run1_unique.fasta run2_unique.fasta > combined_demerged.fasta

Expected output: 6 sequences written to combined_demerged.fasta.

SEE ALSO

obiuniq(1) — collapses identical sequences and records per-sample counts (the inverse operation) obiclean(1) — removes PCR/sequencing artefacts from a set of unique sequences obiannotate(1) — adds or modifies sequence attributes obigrep(1) — filters sequences by attributes or sequence content obicount(1) — counts sequences and total reads in a file

NOTES

Relationship to obiuniq. obiuniq --merge sample stores per-sample counts under an attribute named merged_sample. When you later call obidemerge, you must therefore pass -d sample to match that attribute name. The -d option takes the logical slot name (here sample), not the internal storage name (merged_sample).

Read counts after demerging. Each output sequence has its read count set to the value recorded in the statistics map for that sample. If you sum the counts of all output sequences that share the same identifier, you recover the total count of the original merged record.

Order of output sequences. The order in which the per-sample copies of a single merged sequence appear in the output is not guaranteed. If a stable order is required, pipe the output through obisort.

OUTPUT

obidemerge writes one sequence record per sample entry found in the statistics attribute. Each output record is a copy of the input sequence, with:

  • the statistics attribute (merged_<slot>) removed,
  • the <slot> attribute set to the sample name,
  • the count attribute set to the abundance for that sample.

Sequences with no statistics for the chosen slot are passed through unchanged.

Observed output example

>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc