mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,300 @@
# NAME

obicomplement — reverse complement of sequences

---

# SYNOPSIS

```
obicomplement [--batch-mem <string>] [--batch-size <int>]
              [--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
              [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
              [--fasta-output] [--fastq] [--fastq-output] [--genbank]
              [--help|-h|-?] [--input-OBI-header] [--input-json-header]
              [--json-output] [--max-cpu <int>] [--no-order]
              [--no-progressbar] [--out|-o <FILENAME>]
              [--output-OBI-header|-O] [--output-json-header]
              [--paired-with <FILENAME>] [--raw-taxid] [--silent-warning]
              [--skip-empty] [--solexa] [--taxonomy|-t <string>] [--u-to-t]
              [--update-taxid] [--with-leaves] [<args>]
```

---

# DESCRIPTION

`obicomplement` computes the reverse complement of every sequence in the
input. For each input sequence, the nucleotides are first reversed, then
each base is replaced by its Watson–Crick complement (A↔T, C↔G), yielding
the strand that would pair with the original sequence read in the opposite
direction.

When quality scores are present (FASTQ data), they are reversed along with
the sequence so that each quality value remains associated with its
corresponding base. Ambiguous IUPAC characters (e.g. `N`, `R`, `Y`) are
complemented according to IUPAC rules (for example `R`↔`Y`, while `N`
stays `N`) and are kept in the output.
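
The operation described above can be illustrated with a few lines of Python. This is only a sketch of the principle, not the code used by `obicomplement`: the function name `reverse_complement` and the translation table are assumptions made for the example, and the table follows the standard IUPAC complement pairs. Characters absent from the table, such as the gap symbol `-`, are left unchanged.

```python
# Illustrative sketch only; not the obitools4 implementation.
# Standard IUPAC nucleotide codes and their complements (U is treated like T).
_IUPAC = "ACGTURYSWKMBDHVN"
_COMPL = "TGCAAYRSWMKVHDBN"
_TABLE = str.maketrans(_IUPAC + _IUPAC.lower(), _COMPL + _COMPL.lower())


def reverse_complement(seq, qualities=None):
    """Return (reverse-complemented sequence, reversed qualities or None)."""
    # Complement every base, then reverse the whole string.
    rc = seq.translate(_TABLE)[::-1]
    # Qualities are only reversed, so each value stays with its base.
    rq = qualities[::-1] if qualities is not None else None
    return rc, rq


seq, qual = reverse_complement("ACGTRN", "IIHG#!")
print(seq)   # NYACGT
print(qual)  # !#GHII
```

The quality string is reversed but never complemented, which is what keeps each score attached to the base it was measured for.
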

This operation is commonly needed when reads were sequenced from the
opposite strand, when a primer is designed on the reverse strand, or when
preparing sequences for strand-aware downstream analyses.

The command reads from standard input or from one or more files, processes
sequences in parallel, and writes the result to standard output or to the
file specified with `--out`.

---

# INPUT

`obicomplement` accepts biological sequence data in FASTA, FASTQ, EMBL,
GenBank, ecoPCR output, and CSV formats. When no format flag is given, the
format is inferred automatically from the file contents or extension.

Input is read from standard input when no filename argument is provided, or
from one or more files passed as positional arguments. Gzip-compressed files
are handled transparently.

Paired-end data can be provided with `--paired-with`, which specifies the
file containing the second mate. Both mates are reverse-complemented and
written to separate output files.

---

# OUTPUT

The output is a sequence file in which every sequence is the reverse
complement of the corresponding input sequence. The output format matches
the input by default (FASTA if no quality data, FASTQ if quality data are
present), and can be overridden with `--fasta-output`, `--fastq-output`, or
`--json-output`.

All annotations (attributes stored in the sequence header) are preserved
unchanged. Quality scores, when present, are reversed to stay aligned with
their bases.

## Observed output example

```
>seq001 {"definition":"basic DNA sequence"}
cgatcgatcgatcgatcgat
>seq002 {"definition":"GC-rich sequence"}
gcgcgcgcgcgcgcgcgcgc
>seq003 {"definition":"AT-rich sequence"}
atatatatatatatatatat
>seq004 {"definition":"palindromic sequence"}
aattccggaattccggaatt
>seq005 {"definition":"mixed sequence"}
agctagcatgcatagccgat
```

---

# OPTIONS

## Input format

**`--fasta`**
: Default: false. Force parsing of input as FASTA format.

**`--fastq`**
: Default: false. Force parsing of input as FASTQ format.

**`--embl`**
: Default: false. Force parsing of input as EMBL flatfile format.

**`--genbank`**
: Default: false. Force parsing of input as GenBank flatfile format.

**`--ecopcr`**
: Default: false. Force parsing of input as ecoPCR output format.

**`--csv`**
: Default: false. Force parsing of input as CSV format.

**`--solexa`**
: Default: false. Decode quality scores using the Solexa/Illumina pre-1.3
  convention instead of the standard Phred+33 encoding.

**`--input-OBI-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using the OBI
  key=value format.

**`--input-json-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using JSON
  format.

**`--no-order`**
: Default: false. When several input files are given, declare that no
  ordering relationship exists among them, allowing the reader to interleave
  records freely.

**`--paired-with <FILENAME>`**
: Default: none. File containing the paired (R2) reads. When set,
  `obicomplement` processes both mates and writes them to separate output
  files.

## Sequence preprocessing

**`--u-to-t`**
: Default: false. Convert uracil (U) to thymine (T) before computing the
  reverse complement. Useful when processing RNA sequences that must be
  treated as DNA.

**`--skip-empty`**
: Default: false. Discard sequences of length zero from the output.

## Output format

**`--fasta-output`**
: Default: false. Write output in FASTA format regardless of whether quality
  scores are present.

**`--fastq-output`**
: Default: false. Write output in FASTQ format (requires quality data).

**`--json-output`**
: Default: false. Write output in JSON format.

**`--out|-o <FILENAME>`**
: Default: `-` (standard output). File used to save the output.

**`--output-OBI-header|-O`**
: Default: false. Write FASTA/FASTQ header annotations in OBI key=value
  format.

**`--output-json-header`**
: Default: false. Write FASTA/FASTQ header annotations in JSON format.

**`--compress|-Z`**
: Default: false. Compress the output with gzip.

## Taxonomy

**`--taxonomy|-t <string>`**
: Default: none. Path to a taxonomy database. Required only when the input
  sequences carry taxid annotations that need to be validated or updated.

**`--fail-on-taxonomy`**
: Default: false. Cause `obicomplement` to exit with an error if a taxid
  referenced in the data is not a currently valid node in the loaded
  taxonomy.

**`--update-taxid`**
: Default: false. Automatically replace taxids that have been declared
  merged into a newer node by the taxonomy database.

**`--raw-taxid`**
: Default: false. Print taxids without appending the taxon name and rank.

**`--with-leaves`**
: Default: false. When the taxonomy is extracted from the sequence file,
  attach sequences as leaves of their taxid node.

## Performance and diagnostics

**`--max-cpu <int>`**
: Default: 16 (env: `OBIMAXCPU`). Number of parallel threads used to
  process sequences.

**`--batch-size <int>`**
: Default: 1 (env: `OBIBATCHSIZE`). Minimum number of sequences per
  processing batch.

**`--batch-size-max <int>`**
: Default: 2000 (env: `OBIBATCHSIZEMAX`). Maximum number of sequences per
  processing batch.

**`--batch-mem <string>`**
: Default: `128M` (env: `OBIBATCHMEM`). Maximum memory allocated per batch
  (e.g. `128K`, `64M`, `1G`). Set to `0` to disable the memory limit.

**`--no-progressbar`**
: Default: false. Disable the progress bar printed to stderr.

**`--silent-warning`**
: Default: false (env: `OBIWARNING`). Suppress warning messages.

**`--debug`**
: Default: false (env: `OBIDEBUG`). Enable debug logging.

---

# EXAMPLES

```bash
# Reverse complement all sequences in a FASTA file
obicomplement sequences.fasta > out_default.fasta
```

**Expected output:** 5 sequences written to `out_default.fasta`.

```bash
# Reverse complement a FASTQ file, preserving quality scores
obicomplement reads.fastq --fastq-output --out out_fastq.fastq
```

**Expected output:** 5 sequences written to `out_fastq.fastq`.

```bash
# Convert RNA sequences to their reverse complement DNA strand
obicomplement --u-to-t rna_sequences.fasta > out_rna_rc.fasta
```

**Expected output:** 3 sequences written to `out_rna_rc.fasta`.

```bash
# Reverse complement paired-end reads into two separate output files
obicomplement R1.fastq --paired-with R2.fastq --out out_paired.fastq
```

**Expected output:** 3 sequences written to `out_paired_R1.fastq` and 3 sequences to `out_paired_R2.fastq`.

```bash
# Reverse complement and compress output, skipping any empty sequences
obicomplement --skip-empty --compress sequences.fasta --out out_compressed.fasta.gz
```

**Expected output:** 5 sequences written to `out_compressed.fasta.gz` (gzip-compressed FASTA).

```bash
# Reverse complement with OBI-format header output
obicomplement --output-OBI-header sequences.fasta --out out_obi.fasta
```

**Expected output:** 5 sequences written to `out_obi.fasta`.

```bash
# Reverse complement with explicit JSON-format header output
obicomplement --output-json-header sequences.fasta --out out_jsonheader.fasta
```

**Expected output:** 5 sequences written to `out_jsonheader.fasta`.

```bash
# Reverse complement and write full JSON output format
obicomplement --json-output sequences.fasta --out out_json.json
```

**Expected output:** 5 sequences written to `out_json.json`.

---

# SEE ALSO

- `obiconvert` — format conversion and sequence filtering pipeline
- `obipairing` — paired-end read merging (uses reverse complement internally)
- `obigrep` — sequence filtering and selection

---

# NOTES

Quality scores (Phred-scaled) are reversed in lock-step with the sequence
so that positional quality information remains valid after the reverse
complement operation. This is essential for downstream tools that rely on
per-base quality for alignment or variant calling.

Ambiguous IUPAC characters and gap symbols (`-`) are handled gracefully:
standard ambiguous bases are complemented according to IUPAC rules, while
gap and missing-data symbols are preserved unchanged.
@@ -0,0 +1,188 @@
# obiconsensus(1) — OBITools4 Manual

## NAME

`obiconsensus` — denoise Oxford Nanopore Technologies (ONT) reads by building consensus sequences

## SYNOPSIS

```
obiconsensus [OPTIONS] [FILE...]
```

## DESCRIPTION

`obiconsensus` is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technologies (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. `obiconsensus` groups these related reads and builds a single, more reliable consensus sequence for each group.

The tool works by constructing a *difference graph*: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by `--distance`). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a *de Bruijn graph* approach. The result is a set of high-quality representative sequences, one per cluster.

Two denoising strategies are available:

- **Standard mode** (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
- **Clustering mode** (`--cluster`): groups reads around local abundance maxima and builds a consensus from each neighbourhood.

Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with `--out`.

The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: `sample`). Each sample's reads are denoised independently.
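
The difference graph and the notion of local abundance maxima described above can be sketched in Python. This is a simplified illustration under explicit assumptions, not the `obiconsensus` algorithm: differences are counted as substitutions between reads of equal length (Hamming distance), a "local maximum" is taken to be a read strictly more abundant than all of its neighbours, and the names `difference_graph` and `local_maxima` are invented for the example.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions; None when the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def difference_graph(counts, max_dist=1):
    """Map each unique read to the set of reads within max_dist substitutions."""
    graph = {seq: set() for seq in counts}
    for a, b in combinations(counts, 2):
        d = hamming(a, b)
        if d is not None and d <= max_dist:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def local_maxima(counts, graph):
    """Reads strictly more abundant than every neighbour (assumed definition)."""
    return sorted(s for s, nbrs in graph.items()
                  if all(counts[s] > counts[n] for n in nbrs))


# Abundance of each unique read within one hypothetical sample.
counts = {"ACGT": 10, "ACGA": 2, "ACCT": 1, "TTTT": 5}
graph = difference_graph(counts, max_dist=1)
print(sorted(graph["ACGT"]))        # ['ACCT', 'ACGA']
print(local_maxima(counts, graph))  # ['ACGT', 'TTTT']
```

In this toy sample, `ACGT` would be the centre of a neighbourhood containing its two one-substitution variants, while `TTTT` has no neighbour and stands alone.
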

## INPUT FORMATS

`obiconsensus` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:

| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |

Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).

## OUTPUT FORMATS

By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:

- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip

Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.

## DENOISING OPTIONS

`--distance INT`, `-d INT`
: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.

`--cluster`, `-C`
: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.

`--kmer-size SIZE`
: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of `-1` means the size is estimated automatically from the data. Manual adjustment is rarely needed.

`--no-singleton`
: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.

`--low-coverage FLOAT`
: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.

`--sample ATTRIBUTE`, `-s ATTRIBUTE`
: Name of the sequence annotation attribute that identifies the sample of origin. Default: `sample`. Each unique value of this attribute is treated as an independent sample during denoising.

## OUTPUT ANNOTATION OPTIONS

`--unique`, `-U`
: After denoising, dereplicate the output sequences (equivalent to running `obiuniq`). Identical consensus sequences across samples are merged into a single record carrying abundance information.

`--save-graph DIRECTORY`
: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.

`--save-ratio FILE`
: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.

## PERFORMANCE OPTIONS

`--max-cpu INT`
: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.

`--batch-size INT`
: Minimum number of sequences processed together in a single batch. Default: 1.

`--batch-size-max INT`
: Maximum number of sequences in a single batch. Default: 2000.

`--batch-mem STRING`
: Maximum memory allocated per batch (e.g., `128M`, `1G`). Default: `128M`. Set to `0` to disable the memory limit.

`--no-progressbar`
: Disable the progress bar.

`--no-order`
: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.

## OTHER OPTIONS

`--u-to-t`
: Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context.

`--skip-empty`
: Remove sequences of length zero from the output.

`--solexa`
: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.

`--silent-warning`
: Suppress warning messages.

`--debug`
: Enable detailed logging for troubleshooting.

`--version`
: Print the version number and exit.

`--help`, `-h`
: Display a brief help message and exit.

## OUTPUT ATTRIBUTES

Each output consensus sequence carries several annotation attributes describing how it was built:

| Attribute | Description |
|-----------|-------------|
| `consensus` | Boolean flag: `true` if the sequence is a true consensus, `false` if it was kept unchanged (e.g., isolated singleton) |
| `merged_sample` | Map of sample names to read counts contributing to this consensus |
| `count` | Total number of reads merged into this consensus across all samples |
| `kmer_size` | Size of the k-mers used to build the de Bruijn graph for this consensus |
| `seq_length` | Length of the consensus sequence |
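
To make the relationship between `count` and `merged_sample` concrete, the following Python sketch merges identical sequences coming from several samples, in the spirit of the dereplication performed by `--unique`. It is an illustration only: the `dereplicate` function and the `(sequence, sample, count)` tuple layout are assumptions of the example, not an `obiconsensus` data structure.

```python
from collections import defaultdict


def dereplicate(records):
    """Merge identical sequences; records are (sequence, sample, count) tuples."""
    merged = {}
    for seq, sample, count in records:
        entry = merged.setdefault(
            seq, {"count": 0, "merged_sample": defaultdict(int)})
        entry["count"] += count                   # total across all samples
        entry["merged_sample"][sample] += count   # per-sample contribution
    return {seq: {"count": e["count"],
                  "merged_sample": dict(e["merged_sample"]),
                  "seq_length": len(seq)}
            for seq, e in merged.items()}


records = [("ACGT", "s1", 3), ("ACGT", "s2", 4), ("TTGA", "s1", 1)]
result = dereplicate(records)
print(result["ACGT"])
# {'count': 7, 'merged_sample': {'s1': 3, 's2': 4}, 'seq_length': 4}
```

By construction, `count` always equals the sum of the values stored in `merged_sample`.
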

## EXAMPLES

**Basic denoising of a FASTQ file:**

```sh
obiconsensus reads.fastq > denoised.fastq
```

**Increase the allowed distance between reads to 2:**

```sh
obiconsensus --distance 2 reads.fastq > denoised.fastq
```

**Use clustering mode and remove singletons:**

```sh
obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq
```

**Denoise, then dereplicate the output:**

```sh
obiconsensus --unique reads.fastq > denoised_uniq.fastq
```

**Save denoising graphs for inspection:**

```sh
obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq
```

**Specify the sample annotation attribute:**

```sh
obiconsensus --sample library reads.fastq > denoised.fastq
```

## SEE ALSO

`obiuniq`(1), `obiclean`(1), `obigrep`(1), `obiconvert`(1)

## NOTES

`obiconsensus` was designed primarily for Oxford Nanopore Technologies amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, `obiclean` may be more appropriate.

The automatic k-mer size selection (`--kmer-size -1`) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.
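
The effect of the k-mer size on circular structures can be shown with a small Python sketch. It is a toy model built on assumptions, not the `obiconsensus` assembler: the graph links each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix, "assembly failure" is reduced to "the graph contains a cycle", and the function names and the `k_start`/`k_max` parameters are hypothetical.

```python
def de_bruijn_edges(reads, k):
    """Edges of the de Bruijn graph: (k-1)-mer prefix -> (k-1)-mer suffix."""
    edges = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.setdefault(kmer[:-1], set()).add(kmer[1:])
    return edges


def has_cycle(edges):
    """Iterative depth-first search looking for a back edge."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    for start in edges:
        if colour.get(start, WHITE) != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(edges.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:
                colour[node] = BLACK
                stack.pop()
            elif colour.get(nxt, WHITE) == GREY:
                return True  # edge back to a node on the current path
            elif colour.get(nxt, WHITE) == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(edges.get(nxt, ()))))
    return False


def smallest_acyclic_k(reads, k_start, k_max):
    """Increase k until the graph has no cycle; None means 'use a fallback'."""
    for k in range(k_start, k_max + 1):
        if not has_cycle(de_bruijn_edges(reads, k)):
            return k
    return None


# The repeat "ACGACG" creates a cycle for k = 3 and k = 4, but not for k = 5.
print(smallest_acyclic_k(["ACGACGT", "ACGACGT"], k_start=3, k_max=7))  # 5
```

Longer k-mers span the repeated motif, which is why increasing k can remove the cycle; returning `None` stands in for the fallback strategy mentioned above.
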
@@ -0,0 +1,179 @@
# NAME

obiconvert — conversion of sequence files to various formats

---

# SYNOPSIS

```
obiconvert [--batch-mem <string>] [--batch-size <int>]
           [--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
           [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
           [--fasta-output] [--fastq] [--fastq-output] [--genbank]
           [--help|-h|-?] [--input-OBI-header] [--input-json-header]
           [--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
           [--out|-o <FILENAME>] [--output-OBI-header|-O]
           [--output-json-header] [--paired-with <FILENAME>] [--pprof]
           [--pprof-goroutine <int>] [--pprof-mutex <int>] [--raw-taxid]
           [--silent-warning] [--skip-empty] [--solexa]
           [--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
           [--with-leaves] [<args>]
```

---

# DESCRIPTION

obiconvert is a command-line tool that converts biological sequence data between several standard bioinformatics formats. It reads from one format and writes to another, with support for quality scores, taxonomic annotations, and various input/output combinations. Large datasets are handled through configurable batching, parallel execution, and memory management.

Biologists use obiconvert to standardize sequence data for compatibility with different bioinformatics tools, to extract the content of FASTQ files into more readable formats, or to convert FASTQ data to FASTA. The tool detects the input format automatically and selects the output format from the data: FASTQ when quality scores are present, FASTA otherwise. To force a specific output format regardless of input content, use the explicit output flags (`--fasta-output`, `--fastq-output`, `--json-output`). Without `--fasta-output`, a FASTQ input with quality scores stays in FASTQ format even when the output filename has a `.fasta` extension.

---

# INPUT

obiconvert accepts input in multiple biological sequence formats:

- **FASTA**: Standard text-based format with `>` headers and sequence data
- **FASTQ**: Text-based format that stores each sequence together with its per-base quality scores
- **GenBank**: Comprehensive biological record format with annotations
- **EMBL**: EMBL flatfile format for sequence and feature information
- **ecoPCR**: Output format of the ecoPCR in silico PCR tool
- **CSV**: Tabular sequence data with configurable delimiters

Input is provided as positional arguments (file paths or `-` for stdin). The tool automatically detects the input format based on file content and can handle multiple files in sequence. When paired-end sequencing is used, the `--paired-with` option specifies the mate read file.

---

# OUTPUT

obiconvert produces sequence data in several output formats depending on input content and user selection:

- **FASTA**: Text format with sequence only (use `--fasta-output` to force)
- **FASTQ**: Format including quality scores (default when quality data present; use `--fastq-output` to force)
- **JSON**: Structured output with all sequence metadata and attributes (use `--json-output`)

The tool preserves all sequence annotations (taxonomic information, custom attributes) during conversion. When converting to FASTA/FASTQ formats, title line annotations can be formatted as OBI structured data or JSON using the `--output-OBI-header`/`--output-json-header` flags. Sequences of zero length can be suppressed with `--skip-empty`.

## Observed output example

```
>seq001 {"definition":"DNA sequence with quality scores for FASTQ to FASTA conversion"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002 {"definition":"Second sequence with moderate quality scores"}
gctagctagctagctagctagctagctagctagctagct
>seq003 {"definition":"Third sequence with high quality scores"}
ttaaccggttaaccggttaaccggttaaccggttaaccg
>seq004 {"definition":"Fourth sequence with variable quality scores"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacg
```

---

# OPTIONS

## Input Format Options
- **--fasta**: Read data following the FASTA format. (default: false)
- **--fastq**: Read data following the FASTQ format. (default: false)
- **--genbank**: Read data following the GenBank flatfile format. (default: false)
- **--embl**: Read data following the EMBL flatfile format. (default: false)
- **--ecopcr**: Read data following the ecoPCR output format. (default: false)
- **--csv**: Read data following the CSV format. (default: false)

## Input Header Options
- **--input-OBI-header**: FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--input-json-header**: FASTA/FASTQ title line annotations follow JSON format. (default: false)

## Output Format Options
- **--fasta-output**: Write sequences in FASTA format (used automatically when no quality data are available). (default: false)
- **--fastq-output**: Write sequences in FASTQ format (used automatically when quality data are available). (default: false)
- **--json-output**: Write sequences in JSON format. (default: false)

## Output Header Options
- **--output-OBI-header|-O**: Output FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--output-json-header**: Output FASTA/FASTQ title line annotations follow JSON format. (default: false)

## Processing Options
- **--skip-empty**: Sequences of length zero are suppressed from the output. (default: false)
- **--no-order**: When several input files are provided, indicates that there is no order among them. (default: false)
- **--u-to-t**: Convert uracil to thymine. (default: false)
- **--update-taxid**: Automatically update taxids that have been declared merged into a newer one. (default: false)
- **--raw-taxid**: When set, taxids are printed without any supplementary information (taxon name and rank). (default: false)
- **--fail-on-taxonomy**: Fail with an error if a taxid in use is not currently valid. (default: false)
- **--with-leaves**: If the taxonomy is extracted from a sequence file, sequences are added as leaves of their taxid annotation. (default: false)

## File Options
- **--out|-o <FILENAME>**: Filename used for saving the output. (default: "-")
- **--paired-with <FILENAME>**: Filename containing the paired reads. (default: "")

## Performance Options
- **--batch-mem <string>**: Maximum memory per batch (e.g. 128K, 64M, 1G). Set to 0 to disable. (default: "", interpreted as 128M; env: OBIBATCHMEM)
- **--batch-size <int>**: Minimum number of sequences per batch (floor). (default: 1, env: OBIBATCHSIZE)
- **--batch-size-max <int>**: Maximum number of sequences per batch (ceiling). (default: 2000, env: OBIBATCHSIZEMAX)
- **--max-cpu <int>**: Number of parallel threads computing the result. (default: 16, env: OBIMAXCPU)
- **--compress|-Z**: Compress the output using gzip. (default: false)

## Debug Options
- **--debug**: Enable debug mode by setting the log level to debug. (default: false, env: OBIDEBUG)
- **--silent-warning**: Suppress warning messages. (default: false, env: OBIWARNING)
- **--no-progressbar**: Disable the progress bar. (default: false)

## Profiling Options
- **--pprof**: Enable the pprof server. Look at the log for details. (default: false)
- **--pprof-goroutine <int>**: Enable the goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
- **--pprof-mutex <int>**: Enable profiling of mutex locks. (default: 10, env: OBIPPROFMUTEX)

## Utility Options
- **--taxonomy|-t <string>**: Path to the taxonomy database. (default: "")
- **--solexa**: Decode quality strings according to the Solexa specification. (default: false, env: OBISOLEXA)
- **--help|-h|-?**: Show the help message. (default: false)
- **--version**: Print the version and exit. (default: false)

---

# EXAMPLES

## Convert FASTQ to FASTA
```bash
# Convert quality-score data from FASTQ to readable FASTA format
obiconvert --fastq --fasta-output input.fastq -o output.fasta
```

**Expected output:** 4 sequences written to `output.fasta`.

## Convert FASTA to JSON
```bash
# Convert sequences to structured JSON format preserving all annotations
obiconvert --fasta --json-output input.fasta -o output.json
```

**Expected output:** 3 sequences written to `output.json`.

## Process paired-end sequencing data
```bash
# Convert paired FASTQ files preserving read pairing
obiconvert --fastq --fasta-output forward.fastq --paired-with reverse.fastq -o merged_sequences.fasta
```

**Expected output:** 4 sequences written to `merged_sequences_R1.fasta` and `merged_sequences_R2.fasta`.

---

# SEE ALSO

- obiannotate: Add taxonomic and functional annotations to sequences
- obicount: Count sequences in files
- obigrep: Filter sequences based on attributes or patterns
- obisummary: Generate statistics from sequence files
- obiuniq: Remove duplicate sequences

---

# NOTES

obiconvert selects the output format from the data it reads, preferring FASTQ when quality scores are available and FASTA otherwise. The output format is not determined by the output filename extension; to force a specific format, use `--fasta-output`, `--fastq-output`, or `--json-output` explicitly.
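
The selection rule above can be summarized as a small decision function. The Python sketch below is a restatement of the documented behaviour, not obiconvert's source code: the function name `choose_output_format` is hypothetical, and rejecting several conflicting flags is an assumption of the example, since the precedence between them is not described here.

```python
def choose_output_format(has_quality, fasta_output=False,
                         fastq_output=False, json_output=False):
    """Pick the output format; the output filename is deliberately not an input."""
    forced = [name for name, flag in (("fasta", fasta_output),
                                      ("fastq", fastq_output),
                                      ("json", json_output)) if flag]
    if len(forced) > 1:
        # Assumption: conflicting flags are rejected in this sketch.
        raise ValueError("choose a single output format flag")
    if forced:
        return forced[0]
    # No explicit flag: follow the data.
    return "fastq" if has_quality else "fasta"


print(choose_output_format(has_quality=True))                     # fastq
print(choose_output_format(has_quality=True, fasta_output=True))  # fasta
print(choose_output_format(has_quality=False))                    # fasta
```

The absence of a filename parameter reflects the point made above: naming the output `output.fasta` does not by itself produce FASTA.
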

Memory usage is controlled through batch processing, with configurable memory limits per batch to handle large datasets efficiently. Progress reporting can be disabled for scripting purposes using `--no-progressbar`.

When working with taxonomic data, ensure the taxonomy database is accessible and properly formatted to avoid failures during sequence annotation processing.
@@ -0,0 +1,190 @@
# NAME

obicount — count the sequences in a sequence file

---

# SYNOPSIS

```
obicount [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
         [--csv] [--debug] [--ecopcr] [--embl] [--fasta] [--fastq]
         [--genbank] [--help|-h|-?] [--input-OBI-header]
         [--input-json-header] [--max-cpu <int>] [--no-order] [--pprof]
         [--pprof-goroutine <int>] [--pprof-mutex <int>] [--reads|-r]
         [--silent-warning] [--solexa] [--symbols|-s] [--u-to-t]
         [--variants|-v] [--version] [<args>]
```

---

# DESCRIPTION
|
||||
|
||||
obicount is a command-line tool designed to count biological sequences from various input formats. It helps biologists quickly obtain quantitative metrics about sequence collections, which is essential for quality control, data assessment, and pipeline monitoring. The tool can count reads (total sequences), variants (unique sequence strings), or symbols (sum of character lengths), providing flexibility to focus on specific aspects of sequence data depending on the analysis needs.
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
obicount accepts input from files or stdin, supporting multiple biological sequence formats:
|
||||
- FASTA (.fasta[.gz])
|
||||
- FASTQ (.fastq|.fq[.gz])
|
||||
- GenBank/EMBL (.gb|.gbff|.dat[.gz])
|
||||
- ecoPCR format (.ecopcr[.gz])
|
||||
- CSV format (--csv flag)
|
||||
|
||||
Input can be provided as multiple filenames or read from stdin. The tool automatically detects file formats and parses sequences accordingly.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
obicount outputs one or more of the following metrics, depending on the flags used:
|
||||
|
||||
- **Read counts**: Total number of sequences in the input
|
||||
- **Variant counts**: Number of unique sequence strings (distinct sequences)
|
||||
- **Symbol counts**: Sum of all character lengths across all sequences
|
||||
|
||||
When no specific counting flags are provided (-r, -v, -s), all three metrics are reported by default. Output is printed to stdout in CSV format with headers: `entities,n` for the type of entity counted, followed by the count value.
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## General Options
|
||||
- --help|-h|-?
|
||||
Show help message and exit.
|
||||
|
||||
- --max-cpu <int>
|
||||
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU).
|
||||
|
||||
- --debug
|
||||
Enable debug mode, by setting log level to debug. (default: false, env: OBIDEBUG)
|
||||
|
||||
- --silent-warning
|
||||
Stop printing of the warning message (default: false, env: OBIWARNING)
|
||||
|
||||
## Input Format Options
|
||||
- --fasta
|
||||
Read data following the fasta format. (default: false)
|
||||
|
||||
- --fastq
|
||||
Read data following the fastq format. (default: false)
|
||||
|
||||
- --genbank
|
||||
Read data following the Genbank flatfile format. (default: false)
|
||||
|
||||
- --embl
|
||||
Read data following the EMBL flatfile format. (default: false)
|
||||
|
||||
- --ecopcr
|
||||
Read data following the ecoPCR output format. (default: false)
|
||||
|
||||
- --csv
|
||||
Read data following the CSV format. (default: false)
|
||||
|
||||
## Input Header Options
|
||||
- --input-OBI-header
|
||||
FASTA/FASTQ title line annotations follow OBI format. (default: false)
|
||||
|
||||
- --input-json-header
|
||||
FASTA/FASTQ title line annotations follow json format. (default: false)
|
||||
|
||||
## Counting Mode Options
|
||||
- --reads|-r
|
||||
Prints read counts. (default: false)
|
||||
|
||||
- --variants|-v
|
||||
Prints variant counts. (default: false)
|
||||
|
||||
- --symbols|-s
|
||||
Prints symbol counts. (default: false)
|
||||
|
||||
## Processing Options
|
||||
- --u-to-t
|
||||
Convert Uracil to Thymine. (default: false, env: OBISOLEXA)
|
||||
|
||||
- --solexa
|
||||
Decodes quality string according to the Solexa specification. (default: false, env: OBISOLEXA)
|
||||
|
||||
- --no-order
|
||||
When several input files are provided, indicates that there is no order among them. (default: false)
|
||||
|
||||
## Performance Options
|
||||
- --batch-mem <string>
|
||||
Maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M). Set to 0 to disable. (default: "", env: OBIBATCHMEM)
|
||||
|
||||
- --batch-size <int>
|
||||
Minimum number of sequences per batch (floor, default 1) (default: 1, env: OBIBATCHSIZE)
|
||||
|
||||
- --batch-size-max <int>
|
||||
Maximum number of sequences per batch (ceiling, default 2000) (default: 2000, env: OBIBATCHSIZEMAX)
|
||||
|
||||
- --max-cpu <int>
|
||||
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU)
|
||||
|
||||
## Profiling Options
|
||||
- --pprof
|
||||
Enable pprof server. Look at the log for details. (default: false)
|
||||
|
||||
- --pprof-goroutine <int>
|
||||
Enable profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
|
||||
|
||||
- --pprof-mutex <int>
|
||||
Enable profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
|
||||
|
||||
- --version
|
||||
Prints the version and exits. (default: false)
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
## Count the total number of sequences in a FASTA file

Useful for a quick assessment of dataset size.

`obicount input.fasta`

**Expected output:** 4 sequences, out_default.txt

## Count only the number of unique sequence variants

Helpful for identifying genetic diversity in population data.

`obicount --variants input.fasta`

**Expected output:** 4 sequences, out_variants.txt

## Count the sum of all sequence symbol lengths (nucleotides/amino acids)

Useful for estimating total data volume or computing average read length.

`obicount --symbols input.fasta`

**Expected output:** 4 sequences, out_symbols.txt

## Count reads from FASTQ format with quality scores

Essential for assessing read throughput in sequencing data.

`obicount --fastq --reads input.fastq`

**Expected output:** 4 sequences, out_fastq_reads.txt
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
time="2026-04-02T19:33:11+02:00" level=info msg="Number of workers set 16"
time="2026-04-02T19:33:11+02:00" level=info msg="Found 1 files to process"
time="2026-04-02T19:33:11+02:00" level=info msg="input.fasta mime type: text/fasta"
entities,n
variants,5
reads,5
symbols,435
```
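The `entities,n` CSV is easy to post-process in a script. The following is a minimal sketch, not part of `obicount` itself: it hard-codes the three counts from the observed output above in a shell variable (`counts`, a name chosen here for illustration) and derives the mean sequence length as symbols divided by reads. In a real pipeline the variable would be filled from the command's standard output instead.

```bash
# Sample obicount CSV output (values copied from the observed example)
counts='entities,n
variants,5
reads,5
symbols,435'

# Mean sequence length = symbols / reads.
# Lines that are not "reads" or "symbols" rows (header, log lines) are ignored.
mean_len=$(printf '%s\n' "$counts" |
  awk -F, '$1 == "reads"   { r = $2 }
           $1 == "symbols" { s = $2 }
           END { if (r > 0) printf "%.1f", s / r }')

echo "mean length: $mean_len"
```

Matching on the first CSV field keeps the script independent of the order in which the metrics are printed.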
|
||||
|
||||
---
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
- obiconvert - Convert between biological sequence file formats
|
||||
- obiuniq - Remove duplicate sequences from files
|
||||
|
||||
---
|
||||
|
||||
# NOTES
|
||||
|
||||
_(not available)_
|
||||
@@ -0,0 +1,315 @@
|
||||
# NAME
|
||||
|
||||
obicsv — converts sequence files to CSV format
|
||||
|
||||
---
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
```
obicsv [--auto] [--batch-mem <string>] [--batch-size <int>]
       [--batch-size-max <int>] [--compress|-Z] [--count] [--csv] [--debug]
       [--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
       [--fastq] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
       [--input-json-header] [--keep|-k <KEY>]... [--max-cpu <int>]
       [--na-value <NAVALUE>] [--no-order] [--no-progressbar] [--obipairing]
       [--out|-o <FILENAME>] [--pprof] [--pprof-goroutine <int>]
       [--pprof-mutex <int>] [--quality|-q] [--raw-taxid] [--sequence|-s]
       [--silent-warning] [--solexa] [--taxon] [--taxonomy|-t <string>]
       [--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]
```
|
||||
|
||||
---
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
obicsv converts biological sequence data into CSV format for easy inspection, spreadsheet analysis, or integration with other tools. A biologist might use it to export sequences from OBITools for quality control, taxonomic inspection, or downstream analysis in R or Python.
|
||||
|
||||
Columns must be explicitly selected: use `--ids` for the identifier, `--sequence` for the nucleotide sequence, `--quality` for quality scores, `--taxon` for taxonomic information, `--auto` to auto-detect annotation attributes, or `--keep` for specific named attributes. Multiple flags can be combined freely.
|
||||
|
||||
The command uses parallel workers to process large datasets efficiently and can write output to stdout or directly to a file.
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
obicsv accepts input from files or stdin. The input format is automatically detected based on the file extension, but can be explicitly specified using format flags.
|
||||
|
||||
Supported input formats:
|
||||
- FASTA (`--fasta`)
|
||||
- FASTQ (`--fastq`)
|
||||
- GenBank (`--genbank`)
|
||||
- EMBL (`--embl`)
|
||||
- ecoPCR output (`--ecopcr`)
|
||||
- CSV (`--csv`)
|
||||
|
||||
Input sources:
|
||||
- Local files (specified as arguments)
|
||||
- stdin (when no input file is provided)
|
||||
- Remote URLs (`http://`, `https://`, `ftp://`)
|
||||
- Directories (automatically scanned for valid files)
|
||||
|
||||
Header formats:
|
||||
- OBI format (`--input-OBI-header`)
|
||||
- JSON format (`--input-json-header`)
|
||||
- Auto-detection (default)
|
||||
|
||||
Taxonomy database can be provided with `--taxonomy|-t`.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
The output is a CSV file with one row per sequence. The columns included depend on the flags used:
|
||||
|
||||
| Column | Flag | Description |
|--------|------|-------------|
| id | `--ids\|-i` | Sequence identifier |
| sequence | `--sequence\|-s` | DNA/RNA sequence |
| qualities | `--quality\|-q` | Quality scores (ASCII-encoded) |
| definition | `--definition\|-d` | Sequence description/annotation |
| count | `--count` | Number of reads represented by this sequence |
| taxid | `--taxon` | NCBI taxonomy identifier |
| scientific_name | `--taxon` | Taxonomic scientific name |
| custom attributes | `--keep\|-k` | Any attribute stored in sequence annotations |
|
||||
|
||||
If `--auto` is used, columns are automatically determined based on the attributes present in the first batch of sequences.
|
||||
|
||||
Missing values are written as the NA value (default: "NA").
|
||||
|
||||
## Observed output example
|
||||
|
||||
```csv
id,sequence
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
```
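Because the output is plain CSV, standard text tools can consume it directly. As a rough sketch, the snippet below computes the length of each sequence from an `id,sequence` table. The three short records are made up for illustration (they are not the sequences shown above), and the approach assumes no field contains an embedded comma or quote.

```bash
# Hypothetical obicsv-style output with the id and sequence columns
csv='id,sequence
s1,atgcatgc
s2,ggggaaaa
s3,cc'

# Skip the header line, then print "id,length" for every record
lengths=$(printf '%s\n' "$csv" | awk -F, 'NR > 1 { print $1 "," length($2) }')

printf '%s\n' "$lengths"
```

For tables that include free-text columns such as `definition`, a CSV-aware parser is a safer choice than splitting on commas.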
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## Output Columns
|
||||
|
||||
These flags control which columns appear in the CSV output.
|
||||
|
||||
- **`--ids|-i`**
|
||||
- Default: `false`
|
||||
- Meaning: Include the sequence identifier column. Useful for tracking or linking sequences.
|
||||
|
||||
- **`--sequence|-s`**
|
||||
- Default: `false`
|
||||
- Meaning: Include the nucleotide or amino acid sequence. This is the main biological data.
|
||||
|
||||
- **`--quality|-q`**
|
||||
- Default: `false`
|
||||
- Meaning: Include quality scores for each position. Essential for quality control and filtering.
|
||||
|
||||
- **`--definition|-d`**
|
||||
- Default: `false`
|
||||
- Meaning: Include the sequence description or definition from the source file.
|
||||
|
||||
- **`--count`**
|
||||
- Default: `false`
|
||||
- Meaning: Include the count attribute, i.e. the number of original reads that were collapsed into this sequence (e.g., by dereplication with `obiuniq`).
|
||||
|
||||
- **`--taxon`**
|
||||
- Default: `false`
|
||||
- Meaning: Include taxonomic information. Outputs both the NCBI taxid and the scientific name. Requires a taxonomy database (see `--taxonomy`).
|
||||
|
||||
- **`--obipairing`**
|
||||
- Default: `false`
|
||||
- Meaning: Include attributes that were added by the `obipairing` command (pairing scores, mismatches, etc.).
|
||||
|
||||
- **`--auto`**
|
||||
- Default: `false`
|
||||
- Meaning: Automatically detect which columns to output by examining the first batch of sequences. Outputs all annotation attributes found in the headers. Can be combined with `--ids`, `--sequence`, etc. to add those columns on top of the auto-detected ones.
|
||||
|
||||
- **`--keep|-k <KEY>`**
|
||||
- Default: `none`
|
||||
- Meaning: Keep only the specified attribute(s). Can be used multiple times to keep several columns. Useful for extracting specific annotations.
|
||||
|
||||
- **`--na-value <NAVALUE>`**
|
||||
- Default: `"NA"`
|
||||
- Meaning: String to use for missing or unavailable values in the CSV. Customize for compatibility with other tools (e.g., empty string, "NA", "null").
|
||||
|
||||
## Input/Output Files
|
||||
|
||||
- **`--out|-o <FILENAME>`**
|
||||
- Default: `"-"` (stdout)
|
||||
- Meaning: Write output to the specified file instead of stdout.
|
||||
|
||||
- **`--compress|-Z`**
|
||||
- Default: `false`
|
||||
- Meaning: Compress the output using gzip.
|
||||
|
||||
## Input Format
|
||||
|
||||
- **`--fasta`**, **`--fastq`**, **`--genbank`**, **`--embl`**, **`--ecopcr`**, **`--csv`**
|
||||
- Default: auto-detection
|
||||
- Meaning: Explicitly specify the input format.
|
||||
|
||||
- **`--input-OBI-header`**, **`--input-json-header`**
|
||||
- Default: auto-detection
|
||||
- Meaning: Specify the header format in FASTA/FASTQ files (OBI or JSON annotations).
|
||||
|
||||
- **`--u-to-t`**
|
||||
- Default: `false`
|
||||
- Meaning: Convert Uracil to Thymine. Useful for RNA sequences.
|
||||
|
||||
- **`--solexa`**
|
||||
- Default: `false`
|
||||
- Meaning: Decode quality strings according to the Solexa specification instead of Phred.
|
||||
|
||||
## Taxonomy
|
||||
|
||||
- **`--taxonomy|-t <string>`**
|
||||
- Default: `""`
|
||||
- Meaning: Path to the taxonomy database directory. Required for `--taxon` output.
|
||||
|
||||
- **`--fail-on-taxonomy`**
|
||||
- Default: `false`
|
||||
- Meaning: Make OBITools fail if a used taxid is not currently valid.
|
||||
|
||||
- **`--update-taxid`**
|
||||
- Default: `false`
|
||||
- Meaning: Automatically update taxids that have been merged to their newest valid taxid.
|
||||
|
||||
- **`--raw-taxid`**
|
||||
- Default: `false`
|
||||
- Meaning: Print only taxids without supplementary information (name and rank).
|
||||
|
||||
- **`--with-leaves`**
|
||||
- Default: `false`
|
||||
- Meaning: Add sequences as leaves of their taxid annotation when taxonomy is extracted from a sequence file.
|
||||
|
||||
## Performance
|
||||
|
||||
- **`--max-cpu <int>`**
|
||||
- Default: `16`
|
||||
- Meaning: Number of parallel threads for processing.
|
||||
|
||||
- **`--batch-size <int>`**
|
||||
- Default: `1`
|
||||
- Meaning: Minimum number of sequences per batch.
|
||||
|
||||
- **`--batch-size-max <int>`**
|
||||
- Default: `2000`
|
||||
- Meaning: Maximum number of sequences per batch.
|
||||
|
||||
- **`--batch-mem <string>`**
|
||||
- Default: `"128M"`
|
||||
- Meaning: Maximum memory per batch (e.g., 128K, 64M, 1G).
|
||||
|
||||
- **`--no-order`**
|
||||
- Default: `false`
|
||||
- Meaning: When multiple input files are provided, indicates there is no order among them.
|
||||
|
||||
- **`--no-progressbar`**
|
||||
- Default: `false`
|
||||
- Meaning: Disable the progress bar.
|
||||
|
||||
## Other Options
|
||||
|
||||
- **`--debug`**
|
||||
- Default: `false`
|
||||
- Meaning: Enable debug mode by setting log level to debug.
|
||||
|
||||
- **`--pprof`**
|
||||
- Default: `false`
|
||||
- Meaning: Enable pprof server.
|
||||
|
||||
- **`--pprof-goroutine <int>`**
|
||||
- Default: `6060`
|
||||
- Meaning: Enable profiling of goroutine blocking.
|
||||
|
||||
- **`--pprof-mutex <int>`**
|
||||
- Default: `10`
|
||||
- Meaning: Enable profiling of mutex lock.
|
||||
|
||||
- **`--silent-warning`**
|
||||
- Default: `false`
|
||||
- Meaning: Suppress warning messages.
|
||||
|
||||
- **`--version`**
|
||||
- Default: `false`
|
||||
- Meaning: Print version information and exit.
|
||||
|
||||
- **`--help|-h|-?`**
|
||||
- Default: `false`
|
||||
- Meaning: Print help information.
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
**Export sequences with identifiers to CSV**
|
||||
|
||||
Extracts sequence IDs and sequences from a FASTQ file.
|
||||
```bash
obicsv --ids --sequence sequences.fastq -o output1.csv
```
|
||||
|
||||
**Expected output:** 3 sequences written to `output1.csv`.
|
||||
|
||||
**Export sequences with quality scores**
|
||||
|
||||
Useful for quality control and filtering in downstream tools.
|
||||
```bash
obicsv --ids --sequence --quality sequences.fastq -o output2.csv
```
|
||||
|
||||
**Expected output:** 3 sequences written to `output2.csv`.
|
||||
|
||||
**Export with taxonomic information**
|
||||
|
||||
Includes taxid and scientific name for taxonomic analysis.
|
||||
```bash
obicsv --ids --sequence --taxon --taxonomy /path/to/taxonomy sequences.fasta -o output.csv
```
|
||||
|
||||
**Auto-detect annotation columns from sequence headers**
|
||||
|
||||
Automatically discovers all annotation attributes present in the sequence headers and outputs them as CSV columns. Combined with `--ids` to also include the sequence identifier.
|
||||
```bash
obicsv --auto --ids sequences.fasta -o output4.csv
```
|
||||
|
||||
**Expected output:** 3 rows in `output4.csv` with columns `id`, `sample`, `taxid` (attributes found in sequence headers).
|
||||
|
||||
**Extract specific attributes**
|
||||
|
||||
Keeps only the specified attributes as columns. Attributes not present in a sequence are written as the NA value.
|
||||
```bash
obicsv --keep sample --keep taxid sequences.fasta -o output5.csv
```
|
||||
|
||||
**Expected output:** 3 rows in `output5.csv` with columns `taxid`, `sample`.
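Downstream scripts often need to drop rows where an attribute was missing. The sketch below shows one way to do that with `awk`, assuming the default NA string (`NA`) and a table without quoted fields. The CSV content, including the taxid and sample values, is invented for the example.

```bash
# Hypothetical output of a --keep run; values are illustrative only
csv='taxid,sample
9606,soil_A
NA,soil_B
7742,NA'

# Keep the header and every row in which no field equals the NA string
complete=$(printf '%s\n' "$csv" | awk -F, -v na="NA" '
  NR == 1 { print; next }
  { for (i = 1; i <= NF; i++) if ($i == na) next; print }')

printf '%s\n' "$complete"
```

If a custom string was set with `--na-value`, pass the same string through the `na` variable.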
|
||||
|
||||
**Export with compression**
|
||||
|
||||
Writes gzip-compressed CSV output for large datasets.
|
||||
```bash
obicsv --ids --sequence -Z sequences.fasta -o output6.csv.gz
```
|
||||
|
||||
**Expected output:** 3 sequences written to `output6.csv.gz`.
|
||||
|
||||
---
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
- `obiconvert` — input/output handling framework
|
||||
- `obipairing` — pairing information (used with `--obipairing`)
|
||||
- For other export formats, use `obiconvert` with `--fasta-output`, `--fastq-output`, or `--json-output`
|
||||
|
||||
---
|
||||
|
||||
# NOTES
|
||||
|
||||
- Without any column selection flag (`--ids`, `--sequence`, `--quality`, `--taxon`, `--auto`, `--keep`), the output contains no columns and no data.
|
||||
- The `--taxon` option requires a valid taxonomy database specified with `--taxonomy`.
|
||||
- Output is written to stdout by default; use `--out` to write to a file.
|
||||
- Missing attributes are written as the NA value (customizable with `--na-value`).
|
||||
- Input sequences are processed using streaming iterators to minimize memory footprint, even for large files.
|
||||
@@ -0,0 +1,321 @@
|
||||
# obidemerge
|
||||
|
||||
## NAME
|
||||
|
||||
`obidemerge` — split merged sequence records back into individual, sample-annotated copies
|
||||
|
||||
## SYNOPSIS
|
||||
|
||||
```
obidemerge [options] [input_files...]
```
|
||||
|
||||
## DESCRIPTION
|
||||
|
||||
In a typical metabarcoding workflow, `obiuniq` or similar tools collapse identical sequences
|
||||
from multiple samples into a single representative record. That record carries a statistics
|
||||
attribute (for example `merged_sample`) that stores, for every original sample, how many
|
||||
times the sequence was observed. This compact representation is convenient for clustering
|
||||
and denoising, but some downstream analyses need the original, per-sample view.
|
||||
|
||||
`obidemerge` reverses that merging step. For each input sequence, it reads the statistics
|
||||
stored under a chosen attribute (by default `sample`) and produces one output sequence per
|
||||
entry in that statistics map. Each output sequence is a copy of the original, but:
|
||||
|
||||
- its `sample` attribute (or whichever slot you chose) is set to the name of the individual
|
||||
sample,
|
||||
- its read count is set to the abundance recorded for that sample.
|
||||
|
||||
The original statistics attribute is removed from all output sequences.
|
||||
|
||||
Sequences that carry no statistics for the chosen slot are passed through unchanged.
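The expansion rule can be illustrated with a small `awk` sketch. This is only a model of the behaviour described above, not the tool's implementation: each record is written in a simplified, made-up text form (`identifier sampleX:count,...`, with `-` standing for "no statistics") rather than in real FASTA with JSON annotations.

```bash
# Simplified stand-in for merged records: "<id> <sample>:<count>,..."
# A "-" in the second column marks a record without per-sample statistics.
merged='seq001 sampleA:5,sampleB:3,sampleC:1
seq004 -'

demerged=$(printf '%s\n' "$merged" | awk '{
  if ($2 == "-") { print $1; next }        # no statistics: pass through unchanged
  n = split($2, entries, ",")              # one entry per sample
  for (i = 1; i <= n; i++) {
    split(entries[i], kv, ":")             # kv[1] = sample name, kv[2] = abundance
    print $1, "sample=" kv[1], "count=" kv[2]
  }
}')

printf '%s\n' "$demerged"
```

One merged record with three samples yields three output records, while the record without statistics appears once.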
|
||||
|
||||
The command reads sequences from one or more files, or from standard input when no file is
|
||||
given, and writes the results to standard output or to the file specified with `--out`.
|
||||
|
||||
## INPUT FORMATS
|
||||
|
||||
`obidemerge` accepts all sequence formats supported by OBITools4:
|
||||
|
||||
| Format | Description |
|--------|-------------|
| FASTA | Plain nucleotide sequences with annotation in the title line |
| FASTQ | Sequences with per-base quality scores |
| EMBL | European Nucleotide Archive flat-file format |
| GenBank | NCBI GenBank flat-file format |
| ecoPCR | Output produced by the ecoPCR tool |
| CSV | Comma-separated values with sequence and metadata columns |
|
||||
|
||||
The format is detected automatically from the file extension or content. You can override
|
||||
detection with the format flags listed under **Input format options** below.
|
||||
|
||||
Annotations embedded in FASTA/FASTQ title lines can follow the OBI key=value style
|
||||
(`--input-OBI-header`) or JSON style (`--input-json-header`).
|
||||
|
||||
## OUTPUT FORMATS
|
||||
|
||||
By default, the output format mirrors the input:
|
||||
|
||||
- If the input contains quality scores, output is FASTQ.
|
||||
- Otherwise, output is FASTA with JSON-style annotations (use `--output-OBI-header` for OBI-style title lines).
|
||||
|
||||
You can force a specific format with `--fasta-output`, `--fastq-output`, or `--json-output`.
|
||||
|
||||
## OPTIONS
|
||||
|
||||
### Demerge option
|
||||
|
||||
`--demerge <slot>`, `-d <slot>`
|
||||
: Name of the sequence attribute that holds the per-sample statistics to expand.
|
||||
Each key in that statistics map becomes a separate output sequence.
|
||||
**Default:** `sample`
|
||||
|
||||
### Output options
|
||||
|
||||
`--out <FILENAME>`, `-o <FILENAME>`
|
||||
: Write output to this file instead of standard output. Use `-` for standard output.
|
||||
**Default:** `-` (standard output)
|
||||
|
||||
`--fasta-output`
|
||||
: Write output in FASTA format, even when quality scores are available.
|
||||
**Default:** false
|
||||
|
||||
`--fastq-output`
|
||||
: Write output in FASTQ format (requires quality scores in the input).
|
||||
**Default:** false
|
||||
|
||||
`--json-output`
|
||||
: Write output in JSON format, one record per line.
|
||||
**Default:** false
|
||||
|
||||
`--output-OBI-header`, `-O`
|
||||
: Write FASTA/FASTQ title lines in OBI key=value annotation style.
|
||||
**Default:** false (JSON-style headers)
|
||||
|
||||
`--output-json-header`
|
||||
: Write FASTA/FASTQ title lines in JSON annotation style.
|
||||
**Default:** false
|
||||
|
||||
`--compress`, `-Z`
|
||||
: Compress the output with gzip.
|
||||
**Default:** false
|
||||
|
||||
`--skip-empty`
|
||||
: Discard sequences of length zero from the output.
|
||||
**Default:** false
|
||||
|
||||
### Input format options
|
||||
|
||||
`--fasta`
|
||||
: Force reading in FASTA format.
|
||||
|
||||
`--fastq`
|
||||
: Force reading in FASTQ format.
|
||||
|
||||
`--embl`
|
||||
: Force reading in EMBL flat-file format.
|
||||
|
||||
`--genbank`
|
||||
: Force reading in GenBank flat-file format.
|
||||
|
||||
`--ecopcr`
|
||||
: Force reading in ecoPCR output format.
|
||||
|
||||
`--csv`
|
||||
: Force reading in CSV format.
|
||||
|
||||
`--input-OBI-header`
|
||||
: Parse FASTA/FASTQ title lines as OBI-style key=value annotations.
|
||||
|
||||
`--input-json-header`
|
||||
: Parse FASTA/FASTQ title lines as JSON annotations.
|
||||
|
||||
`--solexa`
|
||||
: Decode quality scores using the Solexa/Illumina 1.0 convention instead of the standard
|
||||
Phred scale. Use this only for very old sequencing data.
|
||||
**Default:** false
|
||||
|
||||
`--u-to-t`
|
||||
: Convert uracil (U) to thymine (T) in all sequences. Useful when working with RNA-derived
|
||||
data that should be treated as DNA.
|
||||
**Default:** false
|
||||
|
||||
`--no-order`
|
||||
: When reading from several input files, do not attempt to preserve the order of records
|
||||
across files. May improve speed when order does not matter.
|
||||
**Default:** false
|
||||
|
||||
### Taxonomy options
|
||||
|
||||
`--taxonomy <path>`, `-t <path>`
|
||||
: Path to the OBITools4 taxonomy database. Required only if taxonomic identifiers need to
|
||||
be resolved or validated during output.
|
||||
**Default:** none
|
||||
|
||||
`--fail-on-taxonomy`
|
||||
: Stop with an error if a taxonomic identifier in the data is not found in the loaded
|
||||
taxonomy database.
|
||||
**Default:** false
|
||||
|
||||
`--raw-taxid`
|
||||
: Print taxonomic identifiers as plain numbers, without appending the taxon name and rank.
|
||||
**Default:** false
|
||||
|
||||
`--update-taxid`
|
||||
: Automatically replace deprecated taxonomic identifiers with their current equivalents,
|
||||
as declared in the taxonomy database.
|
||||
**Default:** false
|
||||
|
||||
`--with-leaves`
|
||||
: When a taxonomy is extracted from the sequence file itself, treat each sequence as a
|
||||
leaf node under its annotated taxonomic identifier.
|
||||
**Default:** false
|
||||
|
||||
### Performance options
|
||||
|
||||
`--max-cpu <int>`
|
||||
: Maximum number of parallel processing threads. Increase for faster processing on
|
||||
multi-core machines.
|
||||
**Default:** 16 (or the value of the `OBIMAXCPU` environment variable)
|
||||
|
||||
`--batch-size <int>`
|
||||
: Minimum number of sequences processed together as a group.
|
||||
**Default:** 1
|
||||
|
||||
`--batch-size-max <int>`
|
||||
: Maximum number of sequences processed together as a group.
|
||||
**Default:** 2000
|
||||
|
||||
`--batch-mem <size>`
|
||||
: Maximum memory used per processing group (e.g. `64M`, `1G`). Set to `0` to disable the
|
||||
memory limit and rely on `--batch-size-max` alone.
|
||||
**Default:** `128M`
|
||||
|
||||
### Display options
|
||||
|
||||
`--no-progressbar`
|
||||
: Hide the progress bar.
|
||||
**Default:** false
|
||||
|
||||
`--silent-warning`
|
||||
: Suppress warning messages.
|
||||
**Default:** false
|
||||
|
||||
`--debug`
|
||||
: Enable verbose debug logging.
|
||||
**Default:** false
|
||||
|
||||
`--version`
|
||||
: Print the OBITools4 version and exit.
|
||||
|
||||
`--help`, `-h`, `-?`
|
||||
: Print this help message and exit.
|
||||
|
||||
## EXAMPLES
|
||||
|
||||
### Example 1 — basic demerge with an explicit slot name
|
||||
|
||||
After running `obiuniq`, the file `unique.fasta` contains merged sequences whose
|
||||
`merged_sample` attribute records abundance per sample. Demerge back to one
|
||||
sequence per sample:
|
||||
<!-- corrected: -d sample (not -d merged_sample) because HasStatsOn("sample") looks for the merged_sample attribute -->
|
||||
|
||||
```bash
obidemerge -d sample unique.fasta > per_sample_merged.fasta
```
|
||||
|
||||
**Expected output:** 7 sequences written to `per_sample_merged.fasta`.
|
||||
|
||||
### Example 2 — demerge with the default `sample` slot
|
||||
|
||||
Because `sample` is the default slot name, the `-d` flag can be omitted. This command is
equivalent to Example 1:
|
||||
|
||||
```bash
obidemerge unique.fasta > per_sample_default.fasta
```
|
||||
|
||||
**Expected output:** 7 sequences written to `per_sample_default.fasta`.
|
||||
|
||||
### Example 3 — write compressed output to a file
|
||||
|
||||
```bash
obidemerge -d sample -o per_sample.fasta.gz --compress unique.fasta
```
|
||||
|
||||
**Expected output:** 7 sequences written (compressed) to `per_sample.fasta.gz`.
|
||||
|
||||
### Example 4 — pipeline use: cluster, then demerge
|
||||
|
||||
Obtain unique sequences, cluster them, then expand the clusters back to individual
|
||||
sample records for ecological analysis:
|
||||
|
||||
```bash
obiuniq -m sample reads.fastq \
  | obiclean ... \
  | obidemerge -d sample -o demerged.fasta
```
|
||||
|
||||
### Example 5 — process multiple input files
|
||||
|
||||
```bash
obidemerge -d sample run1_unique.fasta run2_unique.fasta > combined_demerged.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `combined_demerged.fasta`.
|
||||
|
||||
## SEE ALSO
|
||||
|
||||
`obiuniq(1)` — collapses identical sequences and records per-sample counts (the inverse operation)
|
||||
`obiclean(1)` — removes PCR/sequencing artefacts from a set of unique sequences
|
||||
`obiannotate(1)` — adds or modifies sequence attributes
|
||||
`obigrep(1)` — filters sequences by attributes or sequence content
|
||||
`obicount(1)` — counts sequences and total reads in a file
|
||||
|
||||
## NOTES
|
||||
|
||||
**Relationship to `obiuniq`.**
|
||||
`obiuniq --merge sample` stores per-sample counts under an attribute named `merged_sample`.
|
||||
When you later call `obidemerge`, pass `-d sample` (or rely on the default), not
`-d merged_sample`: the `-d` option takes the **logical** slot name (here `sample`), not the
internal storage name (`merged_sample`).
|
||||
<!-- corrected: -d sample is correct (not -d merged_sample); the tool prepends "merged_" internally when looking up the attribute -->
|
||||
|
||||
**Read counts after demerging.**
|
||||
Each output sequence has its read count set to the value recorded in the statistics map for
|
||||
that sample. If you sum the counts of all output sequences that share the same identifier,
|
||||
you recover the total count of the original merged record.
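This invariant can be checked with a few lines of shell. The sketch below sums the `count` value per identifier over FASTA title lines with JSON annotations. The five headers follow the style of this page's observed output, and the parsing assumes that `"count":` is followed directly by an integer, which may not hold for every header layout.

```bash
# Title lines of demerged records (JSON-style annotations)
headers='>seq001 {"count":5,"sample":"sampleA"}
>seq001 {"count":3,"sample":"sampleB"}
>seq001 {"count":1,"sample":"sampleC"}
>seq002 {"count":2,"sample":"sampleA"}
>seq002 {"count":7,"sample":"sampleD"}'

# For each title line, extract the integer after "count": and add it to the
# running total of the sequence identifier (first word without the ">").
totals=$(printf '%s\n' "$headers" | awk '
  /^>/ && match($0, /"count":[0-9]+/) {
    id = substr($1, 2)
    total[id] += substr($0, RSTART + 8, RLENGTH - 8)   # 8 = length of "count": with quotes
  }
  END { for (id in total) print id "," total[id] }' | sort)

printf '%s\n' "$totals"
```

Each total should equal the `count` of the corresponding merged record before demerging.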
|
||||
|
||||
**Order of output sequences.**
|
||||
The order in which the per-sample copies of a single merged sequence appear in the output
|
||||
is not guaranteed. If a stable order is required, pipe the output through `obisort`.
|
||||
|
||||
## OUTPUT
|
||||
|
||||
`obidemerge` writes one sequence record per sample entry found in the statistics attribute.
|
||||
Each output record is a copy of the input sequence, with:
|
||||
|
||||
- the statistics attribute (`merged_<slot>`) removed,
|
||||
- the `<slot>` attribute set to the sample name,
|
||||
- the `count` attribute set to the abundance for that sample.
|
||||
|
||||
Sequences with no statistics for the chosen slot are passed through unchanged.
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc
```
|
||||
@@ -0,0 +1,296 @@
|
||||
# NAME
|
||||
|
||||
obidistribute — divides an input set of sequences into subsets
|
||||
|
||||
---
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
```
obidistribute --pattern|-p <string> [--append|-A] [--batch-mem <string>]
              [--batch-size <int>] [--batch-size-max <int>]
              [--batches|-n <int>] [--classifier|-c <string>] [--compress|-Z]
              [--csv] [--debug] [--directory|-d <string>] [--ecopcr] [--embl]
              [--fasta] [--fasta-output] [--fastq] [--fastq-output]
              [--genbank] [--hash|-H <int>] [--help|-h|-?]
              [--input-OBI-header] [--input-json-header] [--json-output]
              [--max-cpu <int>] [--na-value <string>] [--no-order]
              [--no-progressbar] [--out|-o <FILENAME>]
              [--output-OBI-header|-O] [--output-json-header] [--pprof]
              [--pprof-goroutine <int>] [--pprof-mutex <int>]
              [--silent-warning] [--skip-empty] [--solexa] [--u-to-t]
              [--version] [<args>]
```
|
||||
|
||||
---
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
`obidistribute` splits a set of biological sequences into multiple output files according to one of three distribution strategies: annotation-based classification, round-robin batch assignment, or hash-based sharding.
|
||||
|
||||
The most common use case in metabarcoding is demultiplexing: sequences carry a tag annotation (e.g., `sample_id`) and `obidistribute` writes each sample's sequences into its own file. The output filename for each group is built from a user-supplied pattern containing `%s`, which is replaced by the classifier value or batch index.
|
||||
|
||||
When no classifier is specified, sequences can be split into a fixed number of batches (`--batches`) for parallel downstream processing, or sharded deterministically by hash (`--hash`) to ensure reproducible partitioning regardless of input order.
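
The two classifier-free modes can be contrasted with a small Python sketch. It rests on assumptions: round-robin is applied here per sequence (the tool may well assign whole read batches), CRC32 stands in for whatever hash function `obidistribute` really uses, and `batch_index` and `shard_index` are hypothetical names. Index bases follow the NOTES section (1-based batches, 0-based shards).

```python
import zlib

def batch_index(position, n_batches):
    """Round-robin: 1-based batch index for the sequence at 0-based `position`."""
    return position % n_batches + 1

def shard_index(sequence, n_shards):
    """Hash sharding: 0-based shard index that depends only on the sequence."""
    # CRC32 is a stand-in; the hash used by obidistribute is not specified here.
    return zlib.crc32(sequence.encode("ascii")) % n_shards

sequences = ["acgt" + "a" * i for i in range(10)]  # 10 made-up sequences

batch_sizes = {}
for pos, _ in enumerate(sequences):
    b = batch_index(pos, 3)
    batch_sizes[b] = batch_sizes.get(b, 0) + 1
print(batch_sizes)  # {1: 4, 2: 3, 3: 3}

shards = [shard_index(s, 4) for s in sequences]
print(shards)
```

Round-robin output depends on input order, whereas the shard of a sequence is the same wherever it appears in the input, which is why hashing gives a reproducible partition.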
|
||||
|
||||
Output files can be organised into subdirectories (one per classifier value) using `--directory`, and existing files can be extended rather than overwritten with `--append`. Sequences lacking the classifier annotation are assigned to a file whose name uses the NA value (default: `"NA"`).
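
As a rough illustration of classifier-based naming, the following Python sketch groups records by the filename they would be written to. The record layout (dicts with an `annotations` map) and the helper names `output_name` and `dispatch` are assumptions made for the example. Subdirectory handling with `--directory` is deliberately left out.

```python
from collections import defaultdict

def output_name(pattern, value):
    """Replace the `%s` placeholder of the pattern with the group value."""
    return pattern.replace("%s", str(value))

def dispatch(records, classifier, pattern, na_value="NA"):
    """Group record ids by output filename, using the NA value when the tag is missing."""
    groups = defaultdict(list)
    for rec in records:
        value = rec["annotations"].get(classifier, na_value)
        groups[output_name(pattern, value)].append(rec["id"])
    return dict(groups)

# Made-up input: 4 + 3 + 2 annotated records and 1 record without `sample_id`.
records = (
    [{"id": f"a{i}", "annotations": {"sample_id": "sampleA"}} for i in range(4)]
    + [{"id": f"b{i}", "annotations": {"sample_id": "sampleB"}} for i in range(3)]
    + [{"id": f"c{i}", "annotations": {"sample_id": "sampleC"}} for i in range(2)]
    + [{"id": "u0", "annotations": {}}]
)
files = dispatch(records, "sample_id", "out_ex1_%s.fastq")
for name, ids in files.items():
    print(name, len(ids))
```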
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
`obidistribute` reads biological sequences from one or more files supplied as positional arguments, or from standard input when no files are given. All major NGS and flat-file formats are supported and auto-detected:
|
||||
|
||||
- FASTA / FASTQ (plain or gzip-compressed)
|
||||
- GenBank and EMBL flat files
|
||||
- ecoPCR output
|
||||
- CSV
|
||||
|
||||
Format can be forced with `--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, or `--csv`. Header annotation style can be specified with `--input-OBI-header` or `--input-json-header`.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
Each distribution group produces a separate output file named according to the `--pattern` template. The `%s` placeholder in the pattern is replaced by the classifier value, batch index, or hash shard index, depending on the chosen distribution mode.
|
||||
|
||||
Output format follows the same rules as other OBITools commands: FASTQ is used when quality scores are present, FASTA otherwise. The format can be forced with `--fasta-output`, `--fastq-output`, or `--json-output`. All annotations present in the input sequences are preserved in the output files.
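
The default format rule can be sketched as follows. This is an assumption-laden illustration, not the OBITools4 writer: records are plain dicts, `format_record` is a hypothetical name, and the compact JSON header with sorted keys simply mimics the observed output examples in this document.

```python
import json

def format_record(rec):
    """Render a record as FASTQ when it has quality scores, as FASTA otherwise."""
    annotations = json.dumps(rec["annotations"], sort_keys=True, separators=(",", ":"))
    header = rec["id"] + " " + annotations
    if rec.get("quality"):
        return "@" + header + "\n" + rec["sequence"] + "\n+\n" + rec["quality"] + "\n"
    return ">" + header + "\n" + rec["sequence"] + "\n"

with_quality = {
    "id": "seq001",
    "sequence": "atcgatcgatcgatcgatcg",
    "quality": "I" * 20,
    "annotations": {"sample_id": "sampleA"},
}
without_quality = {"id": "seq004", "sequence": "aaaacccc", "annotations": {"count": 6}}

print(format_record(with_quality), end="")
print(format_record(without_quality), end="")
```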
|
||||
|
||||
When `--directory` is used together with `--classifier`, output files are placed in subdirectories named after the classifier values, allowing hierarchical organisation of results.
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
@seq001 {"sample_id":"sampleA"}
atcgatcgatcgatcgatcg
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample_id":"sampleA"}
gctagctagctagctagcta
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample_id":"sampleA"}
ttagctaatcggtaatcggt
+
IIIIIIIIIIIIIIIIIIII
@seq009 {"sample_id":"sampleA"}
atgatgatgatgatgatgat
+
IIIIIIIIIIIIIIIIIIII
```
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## Distribution mode
|
||||
|
||||
- **`--pattern|-p <string>`** — _(required)_
|
||||
Default: none.
|
||||
The template used to build output filenames. The variable part is represented by `%s`. Example: `toto_%s.fastq`.
|
||||
|
||||
- **`--classifier|-c <string>`**
|
||||
Default: `""`.
|
||||
The name of an annotation tag on the sequences. Sequences are dispatched into separate files based on the value of this tag. The tag value must be a string, integer, or boolean.
|
||||
|
||||
- **`--batches|-n <int>`**
|
||||
Default: `0`.
|
||||
Splits the input into exactly *N* batches by round-robin assignment, regardless of sequence metadata.
|
||||
|
||||
- **`--hash|-H <int>`**
|
||||
Default: `0`.
|
||||
Splits the input into at most *N* batches using a hash of the sequence. Produces deterministic, reproducible sharding.
|
||||
|
||||
- **`--directory|-d <string>`**
|
||||
Default: `""`.
|
||||
Used together with `--classifier`: organises output files into subdirectories named after classifier values.
|
||||
|
||||
## Output file handling
|
||||
|
||||
- **`--append|-A`**
|
||||
Default: `false`.
|
||||
Appends sequences to output files if they already exist, instead of overwriting them.
|
||||
|
||||
- **`--na-value <string>`**
|
||||
Default: `"NA"`.
|
||||
Value used as the filename component when a sequence does not have the classifier tag defined.
|
||||
|
||||
- **`--compress|-Z`**
|
||||
Default: `false`.
|
||||
Compresses all output files using gzip.
|
||||
|
||||
## Input format
|
||||
|
||||
- **`--fasta`**
|
||||
Default: `false`.
|
||||
Read data following the FASTA format.
|
||||
|
||||
- **`--fastq`**
|
||||
Default: `false`.
|
||||
Read data following the FASTQ format.
|
||||
|
||||
- **`--embl`**
|
||||
Default: `false`.
|
||||
Read data following the EMBL flatfile format.
|
||||
|
||||
- **`--genbank`**
|
||||
Default: `false`.
|
||||
Read data following the GenBank flatfile format.
|
||||
|
||||
- **`--ecopcr`**
|
||||
Default: `false`.
|
||||
Read data following the ecoPCR output format.
|
||||
|
||||
- **`--csv`**
|
||||
Default: `false`.
|
||||
Read data following the CSV format.
|
||||
|
||||
- **`--input-OBI-header`**
|
||||
Default: `false`.
|
||||
FASTA/FASTQ title line annotations follow OBI format.
|
||||
|
||||
- **`--input-json-header`**
|
||||
Default: `false`.
|
||||
FASTA/FASTQ title line annotations follow JSON format.
|
||||
|
||||
- **`--solexa`**
|
||||
Default: `false`.
|
||||
Decodes quality string according to the Solexa specification.
|
||||
|
||||
- **`--u-to-t`**
|
||||
Default: `false`.
|
||||
Convert uracil (U) to thymine (T).
|
||||
|
||||
- **`--skip-empty`**
|
||||
Default: `false`.
|
||||
Suppress zero-length sequences from the output.
|
||||
|
||||
- **`--no-order`**
|
||||
Default: `false`.
|
||||
When several input files are provided, indicates that there is no order among them.
|
||||
|
||||
## Output format
|
||||
|
||||
- **`--fasta-output`**
|
||||
Default: `false`.
|
||||
Write sequences in FASTA format (default if no quality data available).
|
||||
|
||||
- **`--fastq-output`**
|
||||
Default: `false`.
|
||||
Write sequences in FASTQ format (default if quality data available).
|
||||
|
||||
- **`--json-output`**
|
||||
Default: `false`.
|
||||
Write sequences in JSON format.
|
||||
|
||||
- **`--output-OBI-header|-O`**
|
||||
Default: `false`.
|
||||
Output FASTA/FASTQ title line annotations follow OBI format.
|
||||
|
||||
- **`--output-json-header`**
|
||||
Default: `false`.
|
||||
Output FASTA/FASTQ title line annotations follow JSON format.
|
||||
|
||||
- **`--out|-o <FILENAME>`**
|
||||
Default: `"-"`.
|
||||
Filename used for saving the output.
|
||||
|
||||
## Performance
|
||||
|
||||
- **`--max-cpu <int>`**
|
||||
Default: the number of available CPUs (`16` in the help output this page was generated from).
|
||||
Number of parallel threads computing the result.
|
||||
|
||||
- **`--batch-size <int>`**
|
||||
Default: `1`.
|
||||
Minimum number of sequences per batch.
|
||||
|
||||
- **`--batch-size-max <int>`**
|
||||
Default: `2000`.
|
||||
Maximum number of sequences per batch.
|
||||
|
||||
- **`--batch-mem <string>`**
|
||||
Default: `""` (128M).
|
||||
Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
|
||||
|
||||
## Diagnostic & debug
|
||||
|
||||
- **`--debug`**
|
||||
Default: `false`.
|
||||
Enable debug mode, by setting log level to debug.
|
||||
|
||||
- **`--no-progressbar`**
|
||||
Default: `false`.
|
||||
Disable the progress bar printing.
|
||||
|
||||
- **`--silent-warning`**
|
||||
Default: `false`.
|
||||
Suppress warning messages.
|
||||
|
||||
- **`--pprof`**
|
||||
Default: `false`.
|
||||
Enable pprof server. Look at the log for details.
|
||||
|
||||
- **`--pprof-goroutine <int>`**
|
||||
Default: `6060`.
|
||||
Enable the goroutine blocking profile.
|
||||
|
||||
- **`--pprof-mutex <int>`**
|
||||
Default: `10`.
|
||||
Enable profiling of mutex lock.
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
```bash
# Demultiplex sequences by sample_id annotation into per-sample FASTQ files
obidistribute --classifier sample_id --pattern out_ex1_%s.fastq --no-progressbar --input-json-header reads.fastq
```
|
||||
|
||||
**Expected output:** 10 sequences written to 4 files: `out_ex1_sampleA.fastq` (4 sequences), `out_ex1_sampleB.fastq` (3 sequences), `out_ex1_sampleC.fastq` (2 sequences), `out_ex1_NA.fastq` (1 sequence).
|
||||
|
||||
```bash
# Demultiplex into subdirectories, one directory per sample
obidistribute --classifier sample_id --directory --pattern %s/reads.fastq reads.fastq
```
|
||||
|
||||
```bash
# Split a large dataset into 3 batches of near-equal size for parallel processing
obidistribute --batches 3 --pattern chunk_%s.fasta --fasta-output --no-progressbar sequences.fasta
```
|
||||
|
||||
**Expected output:** 10 sequences written to 3 files: `chunk_1.fasta` (4 sequences), `chunk_2.fasta` (3 sequences), `chunk_3.fasta` (3 sequences). Batch indices are 1-based.
|
||||
|
||||
```bash
# Hash-based sharding into 4 reproducible shards
obidistribute --hash 4 --pattern shard_%s.fastq --no-progressbar reads.fastq
```
|
||||
|
||||
**Expected output:** 10 sequences written to 4 files: `shard_0.fastq` through `shard_3.fastq`. Shard indices are 0-based.
|
||||
|
||||
```bash
# Append new sequences to existing per-sample files (incremental demultiplexing)
obidistribute --classifier sample_id --pattern samples_%s.fastq --append new_reads.fastq
```
|
||||
|
||||
```bash
# Demultiplex sequences, replacing the NA label for unclassified sequences
obidistribute --classifier sample_id --na-value unclassified --pattern out_ex6_%s.fastq --no-progressbar --input-json-header reads.fastq
```
|
||||
|
||||
**Expected output:** 10 sequences written to 4 files including `out_ex6_unclassified.fastq` (1 sequence without `sample_id` annotation).
|
||||
|
||||
---
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
`obiconvert`, `obisplit`, `obigrep`
|
||||
|
||||
---
|
||||
|
||||
# NOTES
|
||||
|
||||
- Sequences that lack the annotation specified by `--classifier` are written to the file whose name is built using the `--na-value` (default: `"NA"`).
|
||||
- The three distribution modes (`--classifier`, `--batches`, `--hash`) are mutually exclusive.
|
||||
- When using `--directory` together with `--classifier`, subdirectories are created automatically if they do not exist.
|
||||
- Batch indices produced by `--batches` are 1-based; hash shard indices produced by `--hash` are 0-based.
|
||||
@@ -0,0 +1,326 @@
|
||||
# obigrep(1) — OBITools4 Manual
|
||||
|
||||
## NAME
|
||||
|
||||
`obigrep` — select a subset of sequence records on various criteria
|
||||
|
||||
## SYNOPSIS
|
||||
|
||||
```
obigrep [OPTIONS] [FILE...]
```
|
||||
|
||||
## DESCRIPTION
|
||||
|
||||
`obigrep` filters a set of biological sequence records (FASTA, FASTQ, or any other supported input format) and writes only those matching all specified criteria to the output. Its name is modelled on the Unix `grep` command, but instead of filtering lines in a text file, it filters sequence records.
|
||||
|
||||
Filtering criteria can be combined freely: only sequence records satisfying **all** specified conditions are retained. The selection can be inverted with `--inverse-match` to keep the records that would otherwise be discarded.
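
The combination rule can be pictured with a short Python sketch. It is only a model of the semantics stated above: `min_length`, `min_count`, and `select` are hypothetical helpers, records are plain dicts, and treating a missing `count` attribute as 1 is an assumption.

```python
def min_length(n):
    """Predicate: sequence is at least `n` bases long."""
    return lambda rec: len(rec["sequence"]) >= n

def min_count(n):
    """Predicate: sequence was observed at least `n` times."""
    # A missing `count` attribute is treated as 1 here; this is an assumption.
    return lambda rec: rec["annotations"].get("count", 1) >= n

def select(records, predicates, inverse=False):
    """Keep records satisfying all predicates, or the complement when `inverse` is set."""
    return [rec for rec in records if all(p(rec) for p in predicates) != inverse]

records = [
    {"id": "r1", "sequence": "a" * 120, "annotations": {"count": 15}},
    {"id": "r2", "sequence": "a" * 120, "annotations": {"count": 3}},
    {"id": "r3", "sequence": "a" * 40, "annotations": {"count": 20}},
]
filters = [min_length(100), min_count(10)]
kept = select(records, filters)
dropped = select(records, filters, inverse=True)
print([r["id"] for r in kept])     # ['r1']
print([r["id"] for r in dropped])  # ['r2', 'r3']
```

Inversion applies to the combined result, not to each criterion separately, so a record failing any single filter ends up in the inverted selection.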
|
||||
|
||||
Sequences are read from one or more files, or from standard input if no file is given. Results are written to standard output or to a file specified with `--out`. Records that do not pass the filters can optionally be saved to a separate file with `--save-discarded`.
|
||||
|
||||
## INPUT FORMATS
|
||||
|
||||
`obigrep` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
|
||||
|
||||
| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |
|
||||
|
||||
Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
|
||||
|
||||
## OUTPUT FORMATS
|
||||
|
||||
By default, records are written as FASTQ when quality scores are present and as FASTA otherwise. The following options override the format, the header style, and the compression:
|
||||
|
||||
- `--fasta-output` — write FASTA
|
||||
- `--fastq-output` — write FASTQ
|
||||
- `--json-output` — write JSON
|
||||
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
|
||||
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
|
||||
- `--compress` / `-Z` — compress output with gzip
|
||||
|
||||
Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.
|
||||
|
||||
## FILTERING OPTIONS
|
||||
|
||||
### By sequence length
|
||||
|
||||
- `--min-length LENGTH` / `-l LENGTH`
|
||||
Keep only sequences at least *LENGTH* bases long.
|
||||
|
||||
- `--max-length LENGTH` / `-L LENGTH`
|
||||
Keep only sequences at most *LENGTH* bases long.
|
||||
|
||||
### By read abundance
|
||||
|
||||
Sequence records can carry a `count` attribute recording how many times the sequence was observed. The following options filter on that count:
|
||||
|
||||
- `--min-count COUNT` / `-c COUNT`
|
||||
Keep only sequences observed at least *COUNT* times (default: 1).
|
||||
|
||||
- `--max-count COUNT` / `-C COUNT`
|
||||
Keep only sequences observed at most *COUNT* times.
|
||||
|
||||
### By sequence pattern
|
||||
|
||||
- `--sequence PATTERN` / `-s PATTERN`
|
||||
Keep records whose nucleotide sequence matches the regular expression *PATTERN* (case-insensitive). This option can be repeated; all patterns must match.
|
||||
|
||||
- `--approx-pattern PATTERN`
|
||||
Keep records whose sequence contains an approximate match to *PATTERN*. The number of allowed differences is controlled by `--pattern-error`. This option can be repeated.
|
||||
|
||||
- `--pattern-error N`
|
||||
Maximum number of differences tolerated when using `--approx-pattern`: substitutions only, or substitutions, insertions, and deletions if `--allows-indels` is set (default: 0, i.e. exact match).
|
||||
|
||||
- `--allows-indels`
|
||||
Allow insertions and deletions (in addition to substitutions) when performing approximate pattern matching.
|
||||
|
||||
- `--only-forward`
|
||||
Search patterns on the forward strand only. By default both strands are searched.
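
To make the matching rules concrete, here is a brute-force Python sketch of approximate IUPAC matching on both strands. It only counts substitutions (it does not model `--allows-indels`), and it is not the algorithm used by `obigrep`. The function names are hypothetical, and the primer is the one used in the EXAMPLES section.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(pattern):
    """Reverse complement of an IUPAC pattern or sequence."""
    return pattern.upper().translate(COMPLEMENT)[::-1]

def mismatches(pattern, window):
    """Number of positions where the base is not allowed by the pattern code."""
    return sum(base not in IUPAC[code] for code, base in zip(pattern, window))

def approx_match(pattern, sequence, max_errors=0, only_forward=False):
    """True if `sequence` contains `pattern` with at most `max_errors` substitutions."""
    sequence = sequence.upper()
    patterns = [pattern.upper()]
    if not only_forward:
        # Searching the reverse-complemented pattern covers the reverse strand.
        patterns.append(reverse_complement(pattern))
    for pat in patterns:
        for start in range(len(sequence) - len(pat) + 1):
            if mismatches(pat, sequence[start:start + len(pat)]) <= max_errors:
                return True
    return False

primer = "GGGCWATGTTTCATAAYGGG"
exact_hit = "tttt" + "gggcaatgtttcataatggg" + "tttt"    # W=a and Y=t both allowed
one_error = "tttt" + "gggcaatgtttcataagggg" + "tttt"    # g at the Y position
print(approx_match(primer, exact_hit))                  # True
print(approx_match(primer, one_error))                  # False
print(approx_match(primer, one_error, max_errors=2))    # True
```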
|
||||
|
||||
### By identifier or definition
|
||||
|
||||
- `--identifier PATTERN` / `-I PATTERN`
|
||||
Keep records whose identifier matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
|
||||
|
||||
- `--id-list FILENAME`
|
||||
Keep only records whose identifier appears in *FILENAME*, a plain-text file with one identifier per line.
|
||||
|
||||
- `--definition PATTERN` / `-D PATTERN`
|
||||
Keep records whose definition line matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
|
||||
|
||||
### By attribute (metadata)
|
||||
|
||||
Sequence records can carry arbitrary key/value annotations:
|
||||
|
||||
- `--has-attribute KEY` / `-A KEY`
|
||||
Keep records that possess an attribute named *KEY*, regardless of its value. Can be repeated.
|
||||
|
||||
- `--attribute KEY=PATTERN` / `-a KEY=PATTERN`
|
||||
Keep records for which the value of attribute *KEY* matches the regular expression *PATTERN* (case-sensitive). Can be repeated; all constraints must be satisfied.
|
||||
|
||||
### By custom boolean expression
|
||||
|
||||
- `--predicate EXPRESSION` / `-p EXPRESSION`
|
||||
Keep records for which the boolean expression *EXPRESSION* evaluates to true. Attributes are accessed via the `annotations` map (e.g. `annotations["count"]`). The special variable `sequence` refers to the sequence object; its length can be obtained with `len(sequence)`. Can be repeated; all expressions must be true.
|
||||
|
||||
Example: `-p 'annotations["count"] >= 10 && len(sequence) < 200'`
|
||||
|
||||
### By taxonomy
|
||||
|
||||
Taxonomy-based filtering requires a taxonomy database to be provided with `--taxonomy`.
|
||||
|
||||
- `--taxonomy PATH` / `-t PATH`
|
||||
Path to the taxonomy database.
|
||||
|
||||
- `--restrict-to-taxon TAXID` / `-r TAXID`
|
||||
Keep only records whose taxon is *TAXID* itself or one of its descendants. Can be repeated; a record is kept if it satisfies at least one of the provided taxids.
|
||||
|
||||
- `--ignore-taxon TAXID` / `-i TAXID`
|
||||
Discard records whose taxon is *TAXID* itself or one of its descendants. Can be repeated.
|
||||
|
||||
- `--valid-taxid`
|
||||
Keep only records that carry a valid, recognised taxonomic identifier.
|
||||
|
||||
- `--require-rank RANK_NAME`
|
||||
Keep only records whose taxon has a defined ancestor at the given rank (e.g. *species*, *genus*, *family*). Can be repeated.
|
||||
|
||||
- `--update-taxid`
|
||||
Automatically update merged taxids to their current valid equivalent.
|
||||
|
||||
- `--fail-on-taxonomy`
|
||||
Exit with an error if a taxid referenced in the data is not valid.
|
||||
|
||||
- `--with-leaves`
|
||||
When the taxonomy is extracted from a sequence file, attach each sequence as a leaf node under its annotated taxid.
|
||||
|
||||
- `--raw-taxid`
|
||||
Print taxids in output files without supplementary information (taxon name and rank).
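
The clade tests behind `--restrict-to-taxon` and `--ignore-taxon` can be sketched with a toy taxonomy in Python. Everything here is illustrative: the `PARENT` map is a heavily truncated child-to-parent table (intermediate ranks are omitted), and `lineage`, `in_clade`, and `keep` are hypothetical helpers rather than OBITools4 functions.

```python
# Toy child -> parent map using NCBI-style taxids:
# 9606 Homo sapiens, 9605 Homo, 9598 Pan troglodytes, 9596 Pan, 9604 Hominidae.
PARENT = {9606: 9605, 9605: 9604, 9598: 9596, 9596: 9604, 9604: 1}

def lineage(taxid):
    """Yield `taxid` followed by its ancestors up to the root."""
    while taxid is not None:
        yield taxid
        taxid = PARENT.get(taxid)

def in_clade(taxid, clade):
    """True if `taxid` is `clade` itself or one of its descendants."""
    return clade in lineage(taxid)

def keep(taxid, restrict_to=(), ignore=()):
    """Restriction: at least one clade must match. Ignore: no clade may match."""
    if restrict_to and not any(in_clade(taxid, t) for t in restrict_to):
        return False
    return not any(in_clade(taxid, t) for t in ignore)

print(keep(9606, restrict_to=[9605]))        # True: Homo sapiens is within Homo
print(keep(9598, restrict_to=[9605]))        # False: Pan troglodytes is not
print(keep(9598, restrict_to=[9605, 9596]))  # True: one of the clades matches
print(keep(9606, ignore=[9605]))             # False: discarded by the ignore rule
```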
|
||||
|
||||
### Inversion
|
||||
|
||||
- `--inverse-match` / `-v`
|
||||
Invert the selection: output the records that would otherwise be discarded.
|
||||
|
||||
## PAIRED-END OPTIONS
|
||||
|
||||
When paired-end sequencing data are provided (forward and reverse reads stored in two files), `obigrep` can apply filters taking both reads into account.
|
||||
|
||||
- `--paired-with FILENAME`
|
||||
File containing the reverse (paired) reads.
|
||||
|
||||
- `--paired-mode MODE`
|
||||
How to combine the filter result from the forward and reverse reads. *MODE* is one of:
|
||||
|
||||
| Mode | Meaning |
|------|---------|
| `forward` | Keep the pair if the **forward** read passes (default) |
| `reverse` | Keep the pair if the **reverse** read passes |
| `and` | Keep the pair if **both** reads pass |
| `or` | Keep the pair if **at least one** read passes |
| `andnot` | Keep the pair if the **forward** passes and the **reverse** does not |
| `xor` | Keep the pair if **exactly one** read passes |
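
The modes in the table reduce to simple boolean combinations, sketched below in Python. `PAIRED_MODES` and `keep_pair` are hypothetical names used only for this illustration. The two arguments are the filter results for the forward and reverse reads.

```python
PAIRED_MODES = {
    "forward": lambda fwd, rev: fwd,
    "reverse": lambda fwd, rev: rev,
    "and":     lambda fwd, rev: fwd and rev,
    "or":      lambda fwd, rev: fwd or rev,
    "andnot":  lambda fwd, rev: fwd and not rev,
    "xor":     lambda fwd, rev: fwd != rev,
}

def keep_pair(mode, forward_passes, reverse_passes):
    """Decide whether a read pair is kept under the given paired mode."""
    return PAIRED_MODES[mode](forward_passes, reverse_passes)

# Print the truth table: columns are (fwd, rev) = TT, TF, FT, FF.
for mode in PAIRED_MODES:
    row = [keep_pair(mode, f, r) for f in (True, False) for r in (True, False)]
    print(f"{mode:8s}", row)
```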
|
||||
|
||||
## OUTPUT CONTROL
|
||||
|
||||
- `--save-discarded FILENAME`
|
||||
Write sequence records that do **not** pass the filters to *FILENAME*.
|
||||
|
||||
- `--out FILENAME` / `-o FILENAME`
|
||||
Write the selected records to *FILENAME* (default: standard output).
|
||||
|
||||
- `--skip-empty`
|
||||
Suppress sequences of length zero from the output.
|
||||
|
||||
## PERFORMANCE OPTIONS
|
||||
|
||||
- `--max-cpu N`
|
||||
Number of parallel processing threads (default: number of available CPUs).
|
||||
|
||||
- `--batch-size N`
|
||||
Minimum number of sequences per processing batch (default: 1).
|
||||
|
||||
- `--batch-size-max N`
|
||||
Maximum number of sequences per processing batch (default: 2000).
|
||||
|
||||
- `--batch-mem SIZE`
|
||||
Maximum memory per batch (e.g. `128M`, `1G`). Overrides `--batch-size-max` when memory is the limiting factor. Can also be set via the environment variable `OBIBATCHMEM`.
|
||||
|
||||
- `--no-order`
|
||||
When multiple input files are provided, indicates that no ordering is assumed between them, which can improve throughput.
|
||||
|
||||
- `--no-progressbar`
|
||||
Disable the progress bar.
|
||||
|
||||
## MISCELLANEOUS OPTIONS
|
||||
|
||||
- `--u-to-t`
|
||||
Convert uracil (U) to thymine (T) in all sequences (useful for RNA data).
|
||||
|
||||
- `--solexa`
|
||||
Decode quality scores according to the legacy Solexa specification instead of the standard Phred encoding.
|
||||
|
||||
- `--silent-warning`
|
||||
Suppress warning messages.
|
||||
|
||||
- `--debug`
|
||||
Enable verbose debug logging.
|
||||
|
||||
- `--version`
|
||||
Print version information and exit.
|
||||
|
||||
- `--help` / `-h` / `-?`
|
||||
Display the help message and exit.
|
||||
|
||||
## EXAMPLES
|
||||
|
||||
Keep all sequences at least 100 bases long:
|
||||
|
||||
```bash
obigrep --min-length 100 input.fasta > out_min_length.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `out_min_length.fasta`.
|
||||
|
||||
Select sequences observed at least 10 times:
|
||||
|
||||
```bash
obigrep --min-count 10 input.fasta > out_min_count.fasta
```
|
||||
|
||||
**Expected output:** 4 sequences written to `out_min_count.fasta`.
|
||||
|
||||
Keep sequences whose identifier starts with `BOLD`:
|
||||
|
||||
```bash
obigrep --identifier '^BOLD' input.fasta > out_bold.fasta
```
|
||||
|
||||
**Expected output:** 2 sequences written to `out_bold.fasta`.
|
||||
|
||||
Select only sequences carrying the IUPAC primer motif `GGGCWATGTTTCATAAYGGG` with up to 2 mismatches:
|
||||
|
||||
```bash
obigrep --approx-pattern GGGCWATGTTTCATAAYGGG --pattern-error 2 input.fasta > out_primer.fasta
```
|
||||
|
||||
**Expected output:** 2 sequences written to `out_primer.fasta`.
|
||||
|
||||
Retain sequences belonging to the genus *Homo* (taxid 9605) in an NCBI taxonomy:
|
||||
|
||||
```bash
obigrep --taxonomy /data/ncbi_tax --restrict-to-taxon 9605 input.fasta
```
|
||||
|
||||
Keep sequences that have a `sample` attribute equal to `lake1` and save the rest to a separate file:
|
||||
|
||||
```bash
obigrep --attribute sample='^lake1$' --save-discarded discarded.fasta \
    input.fasta > lake1.fasta
```
|
||||
|
||||
**Expected output:** 5 sequences written to `lake1.fasta`, 5 sequences written to `discarded.fasta`.
|
||||
|
||||
Invert a length filter (keep only sequences shorter than 50 bases):
|
||||
|
||||
```bash
obigrep --min-length 50 --inverse-match input.fasta > out_short.fasta
```
|
||||
|
||||
**Expected output:** 1 sequence written to `out_short.fasta`.
|
||||
|
||||
Apply a custom predicate (sequences with count ≥ 5):
|
||||
|
||||
```bash
obigrep -p 'annotations["count"] >= 5' input.fasta > out_predicate.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `out_predicate.fasta`.
|
||||
|
||||
## OUTPUT
|
||||
|
||||
### Attribute table
|
||||
|
||||
Attributes present on sequence records are preserved unchanged in the output. No new attributes are added by `obigrep` itself — only filtering occurs.
|
||||
|
||||
| Attribute | Type | Description |
|-----------|------|-------------|
| `count` | integer | Number of times the sequence was observed (read from input) |
| `sample` | string | Sample identifier (read from input) |
|
||||
|
||||
Any other annotations present in the input are carried through to the output unmodified.
|
||||
|
||||
### Observed output example
|
||||
|
||||
```
>seq001 {"count":15,"sample":"lake1"}
acgtacgtacgtacgtacgtgggcaatgtttcataatgggacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":3,"sample":"lake1"}
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
>seq004 {"count":2,"sample":"lake1"}
aaacccgggtttagctagctagctagctagctagctagctagctagctagctagctagct
agctagctagctagctagctagctagctagctagctagctagctagctagctagctagct
atacgtatcgatcg
>BOLD_005 {"count":8,"sample":"pond1"}
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq008 {"count":7,"sample":"river2"}
ttacgatcgatcgatcgatcgggcaatgtttcataaggggacgatcgatcgatcgatcga
tcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
```
|
||||
|
||||
## SEE ALSO
|
||||
|
||||
`obiannotate`(1), `obiuniq`(1), `obiconvert`(1), `obitag`(1), `obisplit`(1)
|
||||
|
||||
## OBITools4
|
||||
|
||||
`obigrep` is part of the **OBITools4** suite for analysing DNA metabarcoding and environmental DNA data.
|
||||
@@ -0,0 +1,257 @@
|
||||
# NAME
|
||||
|
||||
obijoin — merge annotations contained in a file to another file
|
||||
|
||||
---
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
```
obijoin --join-with|-j <string> [--batch-mem <string>] [--batch-size <int>]
        [--batch-size-max <int>] [--by|-b <string>]... [--compress|-Z]
        [--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
        [--fasta-output] [--fastq] [--fastq-output] [--genbank]
        [--help|-h|-?] [--input-OBI-header] [--input-json-header]
        [--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
        [--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
        [--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
        [--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
        [--taxonomy|-t <string>] [--u-to-t] [--update-id|-i]
        [--update-quality|-q] [--update-sequence|-s] [--update-taxid]
        [--version] [--with-leaves] [<args>]
```
|
||||
|
||||
---
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
`obijoin` merges annotations from a secondary file into a primary sequence dataset. For each sequence in the primary input, it looks up matching records in the secondary file based on one or more shared attribute keys, then copies all annotations from matched partner records onto the primary sequence.
|
||||
|
||||
The join is a **left outer join**: every sequence in the primary dataset is preserved in the output, whether or not a match is found in the secondary file. Unmatched sequences simply receive no additional annotations. Key matching is exact string equality.
|
||||
|
||||
A common use case is enriching amplicon or read sequences with external sample metadata. The secondary file (the *annotation source*) can be a FASTA/FASTQ sequence file, a CSV table, an EMBL or GenBank flat file, or any other format accepted by OBITools4. This makes it straightforward to prepare a simple spreadsheet with sample identifiers and metadata columns, save it as CSV, and merge it directly into a sequence dataset. The CSV format is auto-detected, so no format conversion or extra flag is needed.
|
||||
|
||||
In addition to transferring annotations, `obijoin` can optionally replace the sequence identifier, nucleotide sequence, or quality scores of each primary sequence with values from its matched partner, controlled by the `--update-id`, `--update-sequence`, and `--update-quality` flags.
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
`obijoin` accepts a primary sequence dataset on standard input or as one or more file arguments. The supported formats are automatically detected and include FASTA, FASTQ, EMBL, GenBank, ecoPCR output, CSV, and JSON. Format-specific flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) can force a specific parser when auto-detection is ambiguous.
|
||||
|
||||
The secondary file, supplied via `--join-with`, is loaded entirely into memory before processing begins. It supports the same set of formats, including CSV, and its format is auto-detected.
|
||||
|
||||
When multiple primary input files are provided and their ordering across files is irrelevant, `--no-order` allows the reader to return batches in whichever order they complete, improving throughput.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
The output is a sequence file in FASTA or FASTQ format (determined automatically by the presence of quality data), written to standard output or to the file specified by `--out`. Alternative output formats can be requested with `--fasta-output`, `--fastq-output`, or `--json-output`. The output can be gzip-compressed with `--compress`.
|
||||
|
||||
Each output sequence carries all annotations from the primary dataset, enriched with every annotation attribute copied from the matched partner record. If a field name exists in both, the partner value overwrites the primary value. When `--update-id`, `--update-sequence`, or `--update-quality` are set, the corresponding sequence-level fields are also replaced with the partner's values.
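
The join semantics can be sketched in Python as follows. This is a simplified model, not the `obijoin` implementation: `parse_by` and `left_join` are hypothetical helpers, the secondary table is made up, at most one partner per key is assumed (the last one wins), and the default join on the sequence identifier as well as the `--update-*` flags are not modelled.

```python
def parse_by(spec):
    """Split a `--by` value into (primary_attr, secondary_attr)."""
    primary, _, secondary = spec.partition("=")
    return primary, secondary or primary

def left_join(primary, secondary, by):
    """Left outer join: every primary record is kept, matched or not."""
    keys = [parse_by(spec) for spec in by]
    index = {}
    for partner in secondary:
        if all(sec in partner for _, sec in keys):
            index[tuple(str(partner[sec]) for _, sec in keys)] = partner
    joined = []
    for rec in primary:
        annotations = dict(rec["annotations"])
        if all(pri in annotations for pri, _ in keys):
            key = tuple(str(annotations[pri]) for pri, _ in keys)  # exact string equality
            annotations.update(index.get(key, {}))  # partner values win on clashes
        joined.append({**rec, "annotations": annotations})
    return joined

primary = [
    {"id": "seq001", "annotations": {"barcode": "ATGC", "sample": "S1"}},
    {"id": "seq003", "annotations": {"barcode": "TTTT", "sample": "S3"}},
]
secondary = [  # made-up metadata rows, e.g. as they might be read from a CSV table
    {"sample": "S1", "experiment": "amplicon_run1", "location": "Paris"},
    {"sample": "S2", "experiment": "amplicon_run2", "location": "Lyon"},
]
joined = left_join(primary, secondary, by=["sample"])
for rec in joined:
    print(rec["id"], rec["annotations"])
```

Here `seq001` gains `experiment` and `location`, while `seq003` has no partner and keeps only its original annotations.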
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
>seq001 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
atgcatgcatgcatgcatgc
>seq002 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
>seq004 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
aaaaatttttcccccggggg
>seq005 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gggggaaaaatttttccccc
>seq006 {"barcode":"AAAA","sample":"S4"}
ccccccgggggtttttaaaaa
```
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## Required
|
||||
|
||||
`--join-with|-j <string>`
|
||||
: Path to the secondary file whose records are joined onto the primary sequences. This parameter is mandatory. The file can be in any format accepted by OBITools4 (FASTA, FASTQ, CSV, EMBL, GenBank, ecoPCR); the format is auto-detected. Default: none.
|
||||
|
||||
## Join control
|
||||
|
||||
`--by|-b <string>`
|
||||
: Declares a join key as an attribute name or a `primary_attr=secondary_attr` mapping. Repeat the flag to join on multiple keys simultaneously; all keys must match for a record pair to be considered a hit (intersection semantics). When omitted, the join defaults to matching by sequence identifier (`id`). Default: `[]`.
|
||||
|
||||
`--update-id|-i`
|
||||
: Replace the identifier of each primary sequence with the identifier from its matched partner record. Default: `false`.
|
||||
|
||||
`--update-sequence|-s`
|
||||
: Replace the nucleotide or amino acid sequence of each primary sequence with the sequence from its matched partner. Default: `false`.
|
||||
|
||||
`--update-quality|-q`
|
||||
: Replace the per-base quality scores of each primary sequence with the quality scores from its matched partner. Relevant only when both datasets carry quality information (FASTQ). Default: `false`.
|
||||
|
||||
## Input format
|
||||
|
||||
`--csv`
|
||||
: Read the primary input data in OBITools CSV format (e.g., sequences exported by `obicsv`). This flag applies to the primary input only; secondary files supplied via `--join-with` are always auto-detected. Default: `false`.
|
||||
|
||||
`--ecopcr`
|
||||
: Read data following the ecoPCR output format. Default: `false`.
|
||||
|
||||
`--embl`
|
||||
: Read data following the EMBL flatfile format. Default: `false`.
|
||||
|
||||
`--fasta`
|
||||
: Read data following the FASTA format. Default: `false`.
|
||||
|
||||
`--fastq`
|
||||
: Read data following the FASTQ format. Default: `false`.
|
||||
|
||||
`--genbank`
|
||||
: Read data following the GenBank flatfile format. Default: `false`.
|
||||
|
||||
`--input-OBI-header`
|
||||
: Treat FASTA/FASTQ title line annotations as OBI format. Default: `false`.
|
||||
|
||||
`--input-json-header`
|
||||
: Treat FASTA/FASTQ title line annotations as JSON format. Default: `false`.
|
||||
|
||||
`--solexa`
|
||||
: Decode the quality string according to the Solexa specification. Default: `false`.
|
||||
|
||||
`--u-to-t`
|
||||
: Convert uracil (U) to thymine (T) in input sequences. Default: `false`.
|
||||
|
||||
`--skip-empty`
|
||||
: Suppress sequences of length zero from the output. Default: `false`.
|
||||
|
||||
`--no-order`
|
||||
: When several input files are provided, indicates that there is no order among them. Default: `false`.
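
For readers unfamiliar with the Solexa encoding selected by `--solexa`, the following Python sketch shows the standard conversion from Solexa scores to Phred scores. It assumes the usual ASCII offset of 64 and rounds to the nearest integer; how OBITools4 itself performs the conversion is not described here, and the function names are hypothetical.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_solexa(quality_string, offset=64):
    """Decode an ASCII quality string assumed to use the Solexa+64 encoding."""
    return [round(solexa_to_phred(ord(c) - offset)) for c in quality_string]


print(decode_solexa("@Jh"))
```

The two scales differ mostly at low quality values; for high scores the Solexa and Phred values are nearly identical.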
|
||||
|
||||
## Output format
|
||||
|
||||
`--out|-o <FILENAME>`
|
||||
: Filename used for saving the output. Default: `-` (standard output).
|
||||
|
||||
`--fasta-output`
|
||||
: Write sequences in FASTA format (default when no quality data are available). Default: `false`.
|
||||
|
||||
`--fastq-output`
|
||||
: Write sequences in FASTQ format (default when quality data are available). Default: `false`.
|
||||
|
||||
`--json-output`
|
||||
: Write sequences in JSON format. Default: `false`.
|
||||
|
||||
`--output-OBI-header|-O`
|
||||
: Output FASTA/FASTQ title line annotations in OBI format. Default: `false`.
|
||||
|
||||
`--output-json-header`
|
||||
: Output FASTA/FASTQ title line annotations in JSON format. Default: `false`.
|
||||
|
||||
`--compress|-Z`
|
||||
: Compress the output using gzip. Default: `false`.
|
||||
|
||||
## Taxonomy
|
||||
|
||||
`--taxonomy|-t <string>`
|
||||
: Path to the taxonomy database. Default: `""`.
|
||||
|
||||
`--fail-on-taxonomy`
|
||||
: Cause `obijoin` to fail with an error if a taxid encountered is not currently valid. Default: `false`.
|
||||
|
||||
`--raw-taxid`
|
||||
: Print taxids in files without supplementary information (taxon name and rank). Default: `false`.
|
||||
|
||||
`--update-taxid`
|
||||
: Automatically update taxids that are declared as merged to a newer one. Default: `false`.
|
||||
|
||||
`--with-leaves`
|
||||
: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation. Default: `false`.
|
||||
|
||||
## Performance
|
||||
|
||||
`--max-cpu <int>`
|
||||
: Number of parallel threads used to compute the result. Default: `16`.
|
||||
|
||||
`--batch-size <int>`
|
||||
: Minimum number of sequences per processing batch. Default: `1`.
|
||||
|
||||
`--batch-size-max <int>`
|
||||
: Maximum number of sequences per processing batch. Default: `2000`.
|
||||
|
||||
`--batch-mem <string>`
|
||||
: Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable. Default: `128M`.
|
||||
|
||||
## Diagnostics
|
||||
|
||||
`--no-progressbar`
|
||||
: Disable the progress bar. Default: `false`.
|
||||
|
||||
`--silent-warning`
|
||||
: Stop printing warning messages. Default: `false`.
|
||||
|
||||
`--debug`
|
||||
: Enable debug mode by setting the log level to debug. Default: `false`.
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
```bash
# Annotate amplicon sequences with sample metadata from a CSV table,
# matching on the sample attribute. CSV format is auto-detected.
obijoin --join-with metadata.csv --by sample input.fasta > out_basic.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `out_basic.fasta`.
|
||||
|
||||
```bash
# Join using a cross-attribute key: primary sequences have a 'sample' attribute,
# while the annotation CSV uses 'well' for the same identifier.
obijoin --join-with well_metadata.csv --by sample=well input.fasta > out_crosskey.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `out_crosskey.fasta`.
|
||||
|
||||
```bash
# Join on two keys simultaneously: match only when both sample and barcode agree,
# then update sequence identifiers with those from the reference file.
obijoin --join-with references.fasta \
    --by sample --by barcode \
    --update-id \
    input.fasta > out_multikey.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `out_multikey.fasta`.
|
||||
|
||||
```bash
# Replace sequences and quality scores of reads with values from a corrected FASTQ file,
# joining by sequence ID (default when no --by is specified).
obijoin --join-with corrected.fastq \
    --update-sequence --update-quality \
    input.fastq > out_updated.fastq
```
|
||||
|
||||
**Expected output:** 3 sequences written to `out_updated.fastq`.
|
||||
|
||||
```bash
# Use an OBITools CSV file as primary input (--csv flag), join with a metadata CSV,
# then write compressed FASTA output without showing the progress bar.
obijoin --join-with metadata.csv --by sample \
    --csv --fasta-output --compress \
    --no-progressbar \
    --out out_compressed.fasta.gz \
    primary.csv
```
|
||||
|
||||
**Expected output:** 3 sequences written to `out_compressed.fasta.gz`.
|
||||
|
||||
---
|
||||
|
||||
# NOTES
|
||||
|
||||
- The secondary file supplied via `--join-with` is loaded entirely into memory before the join begins. For very large secondary files this may require significant RAM.
|
||||
- Key matching is based on exact string equality; no regular expression or fuzzy matching is applied.
|
||||
- The join is a left outer join: primary sequences without a matching partner in the secondary file are still emitted, unchanged, in the output.
|
||||
- When the annotation source is a plain CSV spreadsheet (columns = attributes, rows = records), the format is auto-detected — no `--csv` flag is needed. The `--csv` flag applies exclusively to the primary input and is intended for sequences stored in OBITools CSV format.
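
The notes above (in-memory secondary file, exact string matching, left outer join) can be sketched in Python as follows. This is an illustrative model only: `left_join` is a hypothetical name, records are plain dictionaries, and the sketch assumes at most one secondary record per key value and that partner attributes overwrite primary attributes of the same name.

```python
def left_join(primary, secondary, key="sample"):
    """Yield every primary record, annotated when a partner exists."""
    # The whole secondary dataset is indexed in memory before the join starts.
    index = {str(rec[key]): rec for rec in secondary if key in rec}
    for rec in primary:
        partner = index.get(str(rec[key])) if key in rec else None
        if partner is None:
            # Left outer join: unmatched primary records are emitted unchanged.
            yield dict(rec)
        else:
            extra = {k: v for k, v in partner.items() if k != "id"}
            yield {**rec, **extra}


reads = [{"id": "s1", "sample": "A"}, {"id": "s2", "sample": "Z"}]
meta = [{"sample": "A", "site": "lake"}]
for rec in left_join(reads, meta):
    print(rec)
```

Because the index holds every secondary record, memory use grows with the size of the secondary file, while the primary input can still be streamed.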
|
||||
@@ -0,0 +1,205 @@
|
||||
# NAME
|
||||
|
||||
obimicrosat — looks for microsatellite sequences in a sequence file
|
||||
|
||||
---
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
```
|
||||
obimicrosat [options] [<filename>...]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
`obimicrosat` scans DNA sequences for simple sequence repeats (SSRs), also called
microsatellites: tandem repetitions of a short motif (1–6 bp by default). For each
sequence containing a qualifying repeat, the command annotates it with the location,
unit sequence, repeat count, and flanking regions, then writes it to output. Sequences
with no detected microsatellite are silently discarded.
|
||||
|
||||
The detection works in two passes. A first regular expression finds any tandem repeat
satisfying the unit-length and repeat-count constraints. The true minimal repeat unit
is then determined, and a second scan refines the exact boundaries. The repeat unit is
normalized to the lexicographically smallest string among all rotations of the unit and
of its reverse complement, which allows equivalent loci to be grouped consistently
across samples.
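
A minimal Python sketch of the two-pass idea is given here, under stated assumptions: the exact regular expressions used by `obimicrosat` are not shown in this page, so the patterns, the function names `minimal_unit` and `find_microsat`, and the 0-based half-open coordinates returned are all hypothetical choices made for illustration.

```python
import re


def minimal_unit(unit):
    """Shortest prefix whose repetition rebuilds `unit` ('acac' -> 'ac')."""
    for size in range(1, len(unit) + 1):
        if len(unit) % size == 0 and unit[:size] * (len(unit) // size) == unit:
            return unit[:size]


def find_microsat(seq, min_unit=1, max_unit=6, min_count=5):
    """Return (start, end, unit) of the first qualifying repeat, or None."""
    seq = seq.lower()
    # Pass 1: any motif of min_unit..max_unit bases repeated min_count times.
    first_pass = re.compile(
        r"([acgt]{%d,%d})\1{%d,}" % (min_unit, max_unit, min_count - 1)
    )
    for first in first_pass.finditer(seq):
        unit = minimal_unit(first.group(1))
        if len(unit) < min_unit:
            continue  # e.g. 'aa' repeated is really a mononucleotide run
        # Pass 2: rescan with the minimal unit to settle the exact boundaries.
        second = re.compile("(?:%s){%d,}" % (unit, min_count)).search(seq, first.start())
        return second.start(), second.end(), unit
    return None


start, end, unit = find_microsat("ttgca" + "ac" * 8 + "gtcca")
print(start + 1, end, unit)  # converted to 1-based inclusive coordinates
```

The first pass may report a composite unit such as `acac`; reducing it to `ac` before the second scan is what lets the repeat be extended to its full length.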
|
||||
|
||||
By default, when the canonical form of a unit requires the reverse complement, the
whole sequence is reoriented so that the microsatellite is always reported on the
direct strand of the normalized unit. This behaviour can be disabled with
`--not-reoriented`.
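
The normalization rule and the resulting orientation can be sketched in Python as below. This is an assumed reading of the description, not the tool's source code; `canonical_unit` and its helpers are hypothetical names, and ties between the two strands are arbitrarily resolved as `direct`.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def rotations(unit):
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def canonical_unit(unit):
    """Return (normalized_unit, orientation) for an observed repeat unit."""
    direct = min(rotations(unit))
    reverse = min(rotations(reverse_complement(unit)))
    if reverse < direct:
        # The smallest form lies on the other strand: reorientation is needed.
        return reverse, "reverse"
    return direct, "direct"


for observed in ("ac", "ca", "gt", "ttc"):
    print(observed, canonical_unit(observed))
```

With this rule a GT repeat and an AC repeat share the normalized unit `ac`, which matches the two records shown in the observed output example.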
|
||||
|
||||
A common use case is identifying polymorphic SSR markers for population genetics, or
|
||||
flagging repeat-rich regions before designing PCR primers.
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
Accepts one or more sequence files on the command line. If no file is given, sequences
|
||||
are read from standard input. Supported formats include FASTA, FASTQ, JSON/OBI, GenBank,
|
||||
EMBL, ecoPCR output, and CSV. Compressed files (gzip) are handled transparently.
|
||||
Format is detected automatically unless overridden by input flags.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
Outputs only the sequences in which a microsatellite was found. Each retained sequence
|
||||
carries the following additional attributes:
|
||||
|
||||
| Attribute | Content |
|---|---|
| `microsat` | Full repeat region as a string |
| `microsat_from` | 1-based start position of the repeat |
| `microsat_to` | End position of the repeat (inclusive) |
| `microsat_unit` | Repeat unit as observed in the sequence |
| `microsat_unit_normalized` | Lexicographically smallest canonical form |
| `microsat_unit_orientation` | `direct` or `reverse` |
| `microsat_unit_length` | Length of the repeat unit (bp) |
| `microsat_unit_count` | Number of complete unit repetitions |
| `seq_length` | Total length of the (possibly reoriented) sequence |
| `microsat_left` | Flanking sequence to the left of the repeat |
| `microsat_right` | Flanking sequence to the right of the repeat |
|
||||
|
||||
When a sequence is reoriented (reverse-complemented), `_cmp` is appended to its
|
||||
identifier.
|
||||
|
||||
The output format follows the same rules as the rest of OBITools4: FASTQ when quality
|
||||
scores are present, FASTA or JSON/OBI otherwise, configurable via output flags.
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
>seq001 {"definition":"dinucleotide AC repeat 16x with 40bp non-repetitive flanks","microsat":"acacacacacacacacacacacacacacacac","microsat_from":40,"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg","microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca","microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"direct","seq_length":109}
agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca
cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca
>seq006_cmp {"definition":"GT repeat 16x with 40bp non-repetitive flanks canonical form is AC","microsat":"acacacacacacacacacacacacacacacac","microsat_from":39,"microsat_left":"tggtaacgatctatgccgcaagacttgcatgactatcg","microsat_right":"cgtaagctagacttgccctgaaggcatgcaagttcgact","microsat_to":70,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"reverse","seq_length":109}
tggtaacgatctatgccgcaagacttgcatgactatcgacacacacacacacacacacac
acacacacaccgtaagctagacttgccctgaaggcatgcaagttcgact
```
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## Microsatellite detection
|
||||
|
||||
**`--min-unit-length` / `-m`**
|
||||
- Default: `1`
|
||||
- Minimum length in base pairs of the repeated motif. Set to `2` to exclude
  mononucleotide repeats, `3` to also exclude dinucleotide repeats, etc.
|
||||
|
||||
**`--max-unit-length` / `-M`**
|
||||
- Default: `6`
|
||||
- Maximum length in base pairs of the repeated motif. Increasing this value detects
|
||||
longer repeat units (minisatellites) at the cost of more complex patterns.
|
||||
|
||||
**`--min-unit-count`**
|
||||
- Default: `5`
|
||||
- Minimum number of times the motif must be repeated. A value of `5` with a 2 bp unit
|
||||
requires at least 10 bp of pure repeat.
|
||||
|
||||
**`--min-length` / `-l`**
|
||||
- Default: `20`
|
||||
- Minimum total length (in bp) of the repeat region. This filter applies after the
|
||||
unit-count filter and is useful to exclude very short but technically qualifying
|
||||
repeats.
|
||||
|
||||
**`--min-flank-length` / `-f`**
|
||||
- Default: `0`
|
||||
- Minimum length of the flanking sequence on each side of the repeat. Sequences with
|
||||
flanks shorter than this threshold are discarded, which is useful when the output
|
||||
will feed a primer-design step.
|
||||
|
||||
**`--not-reoriented` / `-n`**
|
||||
- Default: `false` (reorientation is active by default)
|
||||
- When set, sequences are never reverse-complemented to match the canonical orientation
|
||||
of the repeat unit. The microsatellite is reported as found, in its original
|
||||
orientation.
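
The way the count, length, and flank thresholds interact can be worked out with a little arithmetic. The Python sketch below assumes that the length filter applies to the repeat region measured in whole units and that flank lengths are derived from the 1-based inclusive coordinates; the function names are hypothetical.

```python
import math


def min_repeats_required(unit_length, min_unit_count=5, min_length=20):
    """Smallest repeat count that passes both the count and the length filter."""
    return max(min_unit_count, math.ceil(min_length / unit_length))


def flanks_ok(seq_length, start, end, min_flank_length=0):
    """start and end are 1-based inclusive coordinates of the repeat."""
    left = start - 1
    right = seq_length - end
    return left >= min_flank_length and right >= min_flank_length


for size in (1, 2, 3, 6):
    print(size, min_repeats_required(size))
print(flanks_ok(109, 40, 71, min_flank_length=30))
```

With the defaults, the 20 bp length filter is the binding constraint for short units (a dinucleotide needs 10 copies), while the 5-copy filter only becomes binding for units of 4 bp or more.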
|
||||
|
||||
## Input / output
|
||||
|
||||
Inherited from the standard OBITools4 conversion layer. Common flags include:
|
||||
|
||||
- **`--input-OBI-header`**: parse OBI-style FASTA/FASTQ headers.
- **`--input-json-header`**: parse JSON-encoded headers.
- **`--skip-empty`**: skip sequences with no nucleotides.
- **`--u-to-t`**: convert U to T (RNA → DNA).
- **`--output-json-header`**: write JSON-encoded headers.
- **`--output-OBI-header`** / **`-O`**: write OBI-style headers.
- **`--compress`** / **`-Z`**: compress output with gzip.
- **`--max-cpu`**: number of parallel threads used for processing.
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
```bash
# Detect default microsatellites (unit 1–6 bp, ≥5 repeats, ≥20 bp total)
obimicrosat sequences.fasta > out_default.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `out_default.fasta`.
|
||||
|
||||
```bash
# Restrict to di- and trinucleotide repeats only
obimicrosat -m 2 -M 3 sequences.fasta > out_dinucleotide.fasta
```
|
||||
|
||||
**Expected output:** 4 sequences written to `out_dinucleotide.fasta`
|
||||
(mononucleotide and tetranucleotide repeats excluded).
|
||||
|
||||
```bash
# Require at least 30 bp flanking sequence on each side (for primer design)
obimicrosat -f 30 sequences.fasta > out_primer_ready.fasta
```
|
||||
|
||||
**Expected output:** 3 sequences written to `out_primer_ready.fasta`
|
||||
(sequences with flanks shorter than 30 bp are discarded).
|
||||
|
||||
```bash
# Keep sequences in their original orientation (no reverse-complement)
obimicrosat --not-reoriented sequences.fasta > out_no_reorient.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `out_no_reorient.fasta`
|
||||
(GT-repeat sequence kept as-is without `_cmp` suffix; `microsat_unit_orientation` is `reverse`).
|
||||
|
||||
```bash
# Require at least 8 repeat units and a minimum repeat length of 30 bp
obimicrosat --min-unit-count 8 -l 30 sequences.fasta > out_strict.fasta
```
|
||||
|
||||
**Expected output:** 4 sequences written to `out_strict.fasta`
|
||||
(short or low-count repeats excluded).
|
||||
|
||||
---
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
`obigrep` — filter sequences by annotation after microsatellite detection.
|
||||
`obiannotate` — add or modify sequence annotations.
|
||||
`obiconvert` — format conversion for sequence files.
|
||||
|
||||
---
|
||||
|
||||
# NOTES
|
||||
|
||||
- Only sequences with at least one qualifying microsatellite are written to output;
  all others are silently filtered out.
- The normalization algorithm considers all rotations of the unit and their reverse
  complements, selecting the lexicographically smallest string. This ensures consistent
  grouping of loci regardless of which strand was sequenced.
- When reorientation is active (the default), sequences whose canonical unit falls on
  the reverse strand are reverse-complemented and their ID receives the suffix `_cmp`.
  Coordinate attributes (`microsat_from`, `microsat_to`) always refer to the
  (possibly reoriented) output sequence.
- Repetitive low-complexity sequences may match multiple overlapping patterns; only the
  first match is reported per sequence.
- Flanking sequences must be **non-repetitive** to avoid the tool detecting a tandem
  repeat within the flank instead of the intended SSR. When designing synthetic test
  data, ensure flanking regions do not contain tandem repeat motifs of their own.
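
Since the reported coordinates refer to the possibly reoriented output sequence, it can be useful to map them back to the strand that was originally read. The Python sketch below shows the arithmetic for 1-based inclusive intervals, using the `seq006_cmp` record from the observed output as input; `flip_interval` is a hypothetical helper, not an OBITools4 function.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def flip_interval(start, end, seq_length):
    """Map a 1-based inclusive interval onto the reverse-complemented sequence."""
    return seq_length - end + 1, seq_length - start + 1


# Reoriented record: repeat reported at 39..70 on a 109 bp sequence.
reoriented = (
    "tggtaacgatctatgccgcaagacttgcatgactatcg"
    + "ac" * 16
    + "cgtaagctagacttgccctgaaggcatgcaagttcgact"
)
start, end = flip_interval(39, 70, len(reoriented))
original = reverse_complement(reoriented)
print(start, end, original[start - 1:end])
```

On the original strand the same locus appears as a GT repeat, which is why the record was reverse-complemented to reach the canonical `ac` unit.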
|
||||
@@ -0,0 +1,384 @@
|
||||
# NAME
|
||||
|
||||
obiscript — executes a Lua script on the input sequences
|
||||
|
||||
---
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
```
obiscript [--allows-indels] [--approx-pattern <PATTERN>]...
          [--attribute|-a <KEY=VALUE>]... [--batch-mem <string>]
          [--batch-size <int>] [--batch-size-max <int>] [--compress|-Z]
          [--csv] [--debug] [--definition|-D <PATTERN>]... [--ecopcr]
          [--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
          [--fastq-output] [--genbank] [--has-attribute|-A <KEY>]...
          [--help|-h|-?] [--id-list <FILENAME>]
          [--identifier|-I <PATTERN>]... [--ignore-taxon|-i <TAXID>]...
          [--input-OBI-header] [--input-json-header] [--inverse-match|-v]
          [--json-output] [--max-count|-C <COUNT>] [--max-cpu <int>]
          [--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
          [--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
          [--only-forward] [--out|-o <FILENAME>] [--output-OBI-header|-O]
          [--output-json-header]
          [--paired-mode <forward|reverse|and|or|andnot|xor>]
          [--pattern-error <int>] [--pprof] [--pprof-goroutine <int>]
          [--pprof-mutex <int>] [--predicate|-p <EXPRESSION>]...
          [--raw-taxid] [--require-rank <RANK_NAME>]...
          [--restrict-to-taxon|-r <TAXID>]... [--script|-S <string>]
          [--sequence|-s <PATTERN>]... [--silent-warning] [--skip-empty]
          [--solexa] [--taxonomy|-t <string>] [--template] [--u-to-t]
          [--update-taxid] [--valid-taxid] [--version] [--with-leaves]
          [<args>]
```
|
||||
|
||||
---
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
`obiscript` applies a user-provided Lua script to a stream of biological sequences. For each input sequence record, the script's `worker(sequence)` function is called, giving the user full programmatic access to the sequence's identifier, data, quality scores, and metadata attributes. This makes it possible to implement custom annotation logic, computed filters, or record transformations that go beyond what fixed-function OBITools commands offer.
|
||||
|
||||
The Lua script may also define two optional lifecycle hooks: `begin()`, called once before any sequence is processed (useful for initialising counters or opening files), and `finish()`, called after the last sequence (useful for printing summary statistics or flushing output). A thread-safe shared table `obicontext` is available across all workers, allowing aggregation across parallel executions.
|
||||
|
||||
Sequences are read from files or standard input in any format supported by OBITools4 (FASTA, FASTQ, EMBL, GenBank, ecoPCR, CSV). The sequence filtering flags (such as `--min-length`, `--predicate`, etc.) select which sequences the Lua script is applied to; sequences that do not match the filter pass through to the output unchanged without the script being executed on them. All sequences, scripted or not, are written to the output.
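
The selection semantics described above can be modelled in a few lines of Python. This is a conceptual sketch rather than the actual dispatch code: `run_script` is a hypothetical name, records are plain dictionaries, and the `inverse` parameter stands in for `--inverse-match`. The rule that a worker returning nothing drops the record follows the NOTES section of this page.

```python
def run_script(records, selected, worker, inverse=False):
    """Apply `worker` to selected records; pass the others through unchanged."""
    for record in records:
        chosen = selected(record) != inverse
        if not chosen:
            yield record          # not selected by the filter flags: untouched
            continue
        result = worker(record)
        if result is not None:    # a worker returning nothing drops the record
            yield result


def tag(record):
    return {**record, "scripted": True}


reads = [{"id": "a", "seq": "acgt"}, {"id": "b", "seq": "acgtacgtacgt"}]
out = list(run_script(reads, lambda r: len(r["seq"]) >= 10, tag))
print(out)
```

Only the second record is long enough to be handed to the worker, yet both records appear in the output.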
|
||||
|
||||
To get started, use `--template` to print a minimal Lua script skeleton with stubs for all three hooks and inline documentation.
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
`obiscript` reads biological sequences from one or more files supplied as positional arguments, or from standard input if no files are given. All formats supported by OBITools4 are accepted: FASTA, FASTQ, EMBL flatfile, GenBank flatfile, ecoPCR output, and CSV. Format auto-detection is used by default; explicit format flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) override it. Header annotation style can be forced with `--input-OBI-header` or `--input-json-header`.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
All sequences, whether or not the script was applied to them, are written to standard output, or to the file given by `--out`. Any modifications made to sequence records inside `worker()` (identifier, sequence, attributes) are reflected in the output. The output format defaults to FASTA when no quality data are present and to FASTQ otherwise; use `--fasta-output`, `--fastq-output`, or `--json-output` to override. Header annotation style in FASTA/FASTQ output can be set with `--output-OBI-header` or `--output-json-header`. Output can be gzip-compressed with `--compress`.
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
>sample1_seq001 {"definition":"control sequence for annotation test","sample":"sample1"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>sample1_seq002 {"definition":"another control sequence from sample1","sample":"sample1"}
gctagctagctagctagctagctagctagctagctagctagctagcta
>sample2_seq003 {"definition":"second sample sequence","sample":"sample2"}
ttaattaattaattaattaattaattaattaattaattaattaattaa
>sample2_seq004 {"definition":"second sample another sequence","sample":"sample2"}
ccggccggccggccggccggccggccggccggccggccggccggccgg
>sample3_seq005 {"definition":"third sample first sequence","sample":"sample3"}
aaaattttccccggggaaaattttccccggggaaaattttccccgggg
>sample3_seq006 {"definition":"third sample second sequence","sample":"sample3"}
ttttaaaaccccggggttttaaaaccccggggttttaaaaccccgggg
```
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## Script
|
||||
|
||||
### `--script|-S <string>`
|
||||
- Default: `""`
|
||||
- Path to the Lua script file to execute. The file must exist and be syntactically valid Lua. The script should define a `worker(sequence)` function, and optionally `begin()` and `finish()`.
|
||||
|
||||
### `--template`
|
||||
- Default: `false`
|
||||
- Print a minimal Lua script template to standard output, with stubs for `begin()`, `worker()`, and `finish()` and inline documentation, then exit. Use this to bootstrap a new script.
|
||||
|
||||
## Sequence filtering (selects sequences on which the script is applied; non-matching sequences pass through unchanged)
|
||||
|
||||
### `--predicate|-p <EXPRESSION>`
|
||||
- Default: `[]`
|
||||
- Boolean expression evaluated for each sequence record. Attribute keys are accessible as variable names; `sequence` refers to the record itself. Multiple `-p` options are combined with AND.
|
||||
|
||||
### `--sequence|-s <PATTERN>`
|
||||
- Default: `[]`
|
||||
- Regular expression matched against the nucleotide sequence. Case-insensitive. Multiple patterns are combined with AND.
|
||||
|
||||
### `--identifier|-I <PATTERN>`
|
||||
- Default: `[]`
|
||||
- Regular expression matched against the sequence identifier. Case-insensitive.
|
||||
|
||||
### `--definition|-D <PATTERN>`
|
||||
- Default: `[]`
|
||||
- Regular expression matched against the sequence definition line. Case-insensitive.
|
||||
|
||||
### `--approx-pattern <PATTERN>`
|
||||
- Default: `[]`
|
||||
- Pattern matched approximately against the sequence. Use `--pattern-error` to set the maximum number of errors allowed.
|
||||
|
||||
### `--pattern-error <int>`
|
||||
- Default: `0`
|
||||
- Maximum number of errors (mismatches) allowed during approximate pattern matching.
|
||||
|
||||
### `--allows-indels`
|
||||
- Default: `false`
|
||||
- Allow insertions and deletions (in addition to mismatches) during approximate pattern matching.
|
||||
|
||||
### `--only-forward`
|
||||
- Default: `false`
|
||||
- Restrict pattern matching to the forward strand only.
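
To make the interplay of `--approx-pattern`, `--pattern-error`, and `--only-forward` concrete, here is a deliberately naive Python sketch. It counts mismatches only (no indels, no IUPAC codes) with a sliding window, and checks the reverse-complemented pattern unless told otherwise. The real matcher in OBITools4 is not described in this page and is certainly more elaborate; `approx_find` is a hypothetical name.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def approx_find(pattern, seq, max_errors=0, only_forward=False):
    """Return (strand, position) of the first window within max_errors, or None."""
    pattern, seq = pattern.lower(), seq.lower()
    strands = [("forward", pattern)]
    if not only_forward:
        # Searching the reverse-complemented pattern covers the other strand.
        strands.append(("reverse", pattern.translate(COMPLEMENT)[::-1]))
    for strand, pat in strands:
        for pos in range(len(seq) - len(pat) + 1):
            if mismatches(pat, seq[pos:pos + len(pat)]) <= max_errors:
                return strand, pos
    return None


print(approx_find("acgt", "ttacctaa", max_errors=1))
print(approx_find("aacc", "ttggttaa"))
print(approx_find("aacc", "ttggttaa", only_forward=True))
```

Allowing indels, as `--allows-indels` does, would require an edit-distance comparison instead of the position-by-position count used here.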
|
||||
|
||||
### `--has-attribute|-A <KEY>`
|
||||
- Default: `[]`
|
||||
- Apply the script only to records that have an attribute with key `<KEY>`; others pass through.
|
||||
|
||||
### `--attribute|-a <KEY=VALUE>`
|
||||
- Default: `{}`
|
||||
- Apply the script only to records where the attribute `KEY` matches the regular expression `VALUE`. Case-sensitive. Multiple `-a` options are combined with AND.
|
||||
|
||||
### `--id-list <FILENAME>`
|
||||
- Default: `""`
|
||||
- Path to a text file containing one sequence identifier per line. The script is applied only to records whose identifier appears in the file; others pass through.
|
||||
|
||||
### `--min-length|-l <LENGTH>`
|
||||
- Default: `1`
|
||||
- Apply the script only to sequences whose length is at least `LENGTH`; shorter sequences pass through unchanged.
|
||||
|
||||
### `--max-length|-L <LENGTH>`
|
||||
- Default: `2000000000`
|
||||
- Apply the script only to sequences whose length is at most `LENGTH`; longer sequences pass through unchanged.
|
||||
|
||||
### `--min-count|-c <COUNT>`
|
||||
- Default: `1`
|
||||
- Apply the script only to sequences with a count (abundance) of at least `COUNT`; others pass through unchanged.
|
||||
|
||||
### `--max-count|-C <COUNT>`
|
||||
- Default: `2000000000`
|
||||
- Apply the script only to sequences with a count (abundance) of at most `COUNT`; others pass through unchanged.
|
||||
|
||||
### `--inverse-match|-v`
|
||||
- Default: `false`
|
||||
- Invert the selection: apply the script to records that do NOT match the filter criteria; matching records pass through unchanged.
|
||||
|
||||
## Taxonomic filtering
|
||||
|
||||
### `--taxonomy|-t <string>`
|
||||
- Default: `""`
|
||||
- Path to the taxonomy database. Required for taxonomy-based options.
|
||||
|
||||
### `--restrict-to-taxon|-r <TAXID>`
|
||||
- Default: `[]`
|
||||
- Apply the script only to sequences whose taxid belongs to the specified taxon; others pass through unchanged.
|
||||
|
||||
### `--ignore-taxon|-i <TAXID>`
|
||||
- Default: `[]`
|
||||
- Do not apply the script to sequences whose taxid belongs to the specified taxon; they pass through unchanged.
|
||||
|
||||
### `--require-rank <RANK_NAME>`
|
||||
- Default: `[]`
|
||||
- Apply the script only to sequences whose taxon has the specified rank (e.g., `species`, `genus`); others pass through unchanged.
|
||||
|
||||
### `--valid-taxid`
|
||||
- Default: `false`
|
||||
- Apply the script only to sequences that carry a currently valid NCBI taxid; others pass through unchanged.
|
||||
|
||||
### `--fail-on-taxonomy`
|
||||
- Default: `false`
|
||||
- Abort with an error if a taxid used during filtering is not currently valid.
|
||||
|
||||
### `--update-taxid`
|
||||
- Default: `false`
|
||||
- Automatically replace taxids declared as merged with their current equivalent.
|
||||
|
||||
### `--raw-taxid`
|
||||
- Default: `false`
|
||||
- Print taxids in output without supplementary information (taxon name and rank).
|
||||
|
||||
### `--with-leaves`
|
||||
- Default: `false`
|
||||
- When extracting taxonomy from a sequence file, attach sequences as leaves of their taxid annotation.
|
||||
|
||||
## Paired-end mode
|
||||
|
||||
### `--paired-mode <forward|reverse|and|or|andnot|xor>`
|
||||
- Default: `"forward"`
|
||||
- When paired reads are provided, determines how filter conditions are applied to both reads of a pair.
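
This page does not define the six modes individually. The Python sketch below shows one plausible reading inferred purely from their names, and should be treated as an assumption: in particular, `andnot` is interpreted here as "the forward read matches and the reverse read does not". `PAIRED_MODES` and `pair_selected` are hypothetical names.

```python
# Assumed meaning of each mode, given the filter result on each read of a pair.
PAIRED_MODES = {
    "forward": lambda fwd, rev: fwd,
    "reverse": lambda fwd, rev: rev,
    "and": lambda fwd, rev: fwd and rev,
    "or": lambda fwd, rev: fwd or rev,
    "andnot": lambda fwd, rev: fwd and not rev,
    "xor": lambda fwd, rev: fwd != rev,
}


def pair_selected(predicate, forward_read, reverse_read, mode="forward"):
    """Combine the predicate results of both reads according to `mode`."""
    return PAIRED_MODES[mode](predicate(forward_read), predicate(reverse_read))


long_enough = lambda read: len(read) >= 5
for mode in PAIRED_MODES:
    print(mode, pair_selected(long_enough, "acgtacgt", "acg", mode=mode))
```

With the default `forward` mode only the first read of each pair is tested, so the short reverse read in this example does not affect the outcome.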
|
||||
|
||||
## Input format
|
||||
|
||||
### `--fasta`
|
||||
- Default: `false`
|
||||
- Force FASTA format parsing.
|
||||
|
||||
### `--fastq`
|
||||
- Default: `false`
|
||||
- Force FASTQ format parsing.
|
||||
|
||||
### `--embl`
|
||||
- Default: `false`
|
||||
- Force EMBL flatfile format parsing.
|
||||
|
||||
### `--genbank`
|
||||
- Default: `false`
|
||||
- Force GenBank flatfile format parsing.
|
||||
|
||||
### `--ecopcr`
|
||||
- Default: `false`
|
||||
- Force ecoPCR output format parsing.
|
||||
|
||||
### `--csv`
|
||||
- Default: `false`
|
||||
- Force CSV format parsing.
|
||||
|
||||
### `--input-OBI-header`
|
||||
- Default: `false`
|
||||
- Parse FASTA/FASTQ title line annotations as OBI format.
|
||||
|
||||
### `--input-json-header`
|
||||
- Default: `false`
|
||||
- Parse FASTA/FASTQ title line annotations as JSON format.
|
||||
|
||||
### `--solexa`
|
||||
- Default: `false`
|
||||
- Decode quality strings according to the Solexa specification.
|
||||
|
||||
### `--u-to-t`
|
||||
- Default: `false`
|
||||
- Convert uracil (U) to thymine (T) in sequences.
|
||||
|
||||
### `--skip-empty`
|
||||
- Default: `false`
|
||||
- Suppress sequences of length zero from the output.
|
||||
|
||||
### `--no-order`
|
||||
- Default: `false`
|
||||
- When multiple input files are provided, indicates that no ordering is assumed among them.
|
||||
|
||||
## Output format
|
||||
|
||||
### `--out|-o <FILENAME>`
|
||||
- Default: `"-"` (standard output)
|
||||
- File path for saving the output.
|
||||
|
||||
### `--fasta-output`
|
||||
- Default: `false`
|
||||
- Write output in FASTA format.
|
||||
|
||||
### `--fastq-output`
|
||||
- Default: `false`
|
||||
- Write output in FASTQ format.
|
||||
|
||||
### `--json-output`
|
||||
- Default: `false`
|
||||
- Write output in JSON format.
|
||||
|
||||
### `--output-OBI-header|-O`
|
||||
- Default: `false`
|
||||
- Write FASTA/FASTQ title line annotations in OBI format.
|
||||
|
||||
### `--output-json-header`
|
||||
- Default: `false`
|
||||
- Write FASTA/FASTQ title line annotations in JSON format.
|
||||
|
||||
### `--compress|-Z`
|
||||
- Default: `false`
|
||||
- Compress output using gzip.
|
||||
|
||||
## Performance
|
||||
|
||||
### `--max-cpu <int>`
|
||||
- Default: `16` (env: `OBIMAXCPU`)
|
||||
- Number of parallel threads used for processing.
|
||||
|
||||
### `--batch-size <int>`
|
||||
- Default: `1` (env: `OBIBATCHSIZE`)
|
||||
- Minimum number of sequences per processing batch.
|
||||
|
||||
### `--batch-size-max <int>`
|
||||
- Default: `2000` (env: `OBIBATCHSIZEMAX`)
|
||||
- Maximum number of sequences per processing batch.
|
||||
|
||||
### `--batch-mem <string>`
|
||||
- Default: `""` → `128M` (env: `OBIBATCHMEM`)
|
||||
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
|
||||
|
||||
## Diagnostics
|
||||
|
||||
### `--debug`
|
||||
- Default: `false` (env: `OBIDEBUG`)
|
||||
- Enable debug logging.
|
||||
|
||||
### `--no-progressbar`
|
||||
- Default: `false`
|
||||
- Disable the progress bar.
|
||||
|
||||
### `--silent-warning`
|
||||
- Default: `false` (env: `OBIWARNING`)
|
||||
- Suppress warning messages.
|
||||
|
||||
### `--pprof`
|
||||
- Default: `false`
|
||||
- Enable the pprof profiling HTTP server (see log for address).
|
||||
|
||||
### `--pprof-goroutine <int>`
|
||||
- Default: `6060` (env: `OBIPPROFGOROUTINE`)
|
||||
- Port for goroutine blocking profile.
|
||||
|
||||
### `--pprof-mutex <int>`
|
||||
- Default: `10` (env: `OBIPPROFMUTEX`)
|
||||
- Rate for mutex lock profiling.
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
```bash
# Print a starter script template and save it to my_script.lua
obiscript --template > my_script.lua
```
|
||||
|
||||
**Expected output:** Lua template with `begin()`, `worker()`, and `finish()` stubs written to `my_script.lua`.
|
||||
|
||||
```bash
# Add a custom annotation to every sequence record
# (the script sets a new attribute 'sample' from the identifier prefix)
obiscript --script annotate.lua --fasta-output sequences.fasta > annotated.fasta
```
|
||||
|
||||
**Expected output:** 6 sequences written to `annotated.fasta`.
|
||||
|
||||
```bash
# Count reads per taxon using the finish() hook, filtering to a specific taxon
obiscript --script count_taxa.lua \
    --restrict-to-taxon 6231 \
    --taxonomy /data/ncbi_tax \
    sequences.fasta > filtered_annotated.fasta
```
|
||||
|
||||
```bash
# Apply a script to FASTQ sequences with a length filter
obiscript --script process_pairs.lua \
    --min-length 100 \
    --out result.fastq \
    reads.fastq
```
|
||||
|
||||
**Expected output:** 4 sequences written to `result.fastq`.
|
||||
|
||||
```bash
# Run on FASTQ input, output JSON, using 4 CPU threads
obiscript --script enrich.lua \
    --json-output \
    --max-cpu 4 \
    sequences.fastq > enriched.json
```
|
||||
|
||||
**Expected output:** 4 sequences written to `enriched.json`.
|
||||
|
||||
---
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
`obigrep` — filter sequences using the same selection criteria without scripting.
|
||||
`obiannotate` — add or modify sequence attributes without scripting.
|
||||
|
||||
---
|
||||
|
||||
# NOTES
|
||||
|
||||
- The Lua `worker(sequence)` function is called in parallel across multiple goroutines. Use the thread-safe `obicontext` table (with `obicontext:lock()` / `obicontext:unlock()`) for any shared mutable state accessed across workers.
|
||||
- The `begin()` and `finish()` hooks each run in a single goroutine and do not need locking for their own internal state.
|
||||
- Sequence records modified inside `worker()` must be returned (or the original returned unmodified) for the record to appear in the output. Returning `nil` drops the sequence.
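
The concurrency model described in these notes can be illustrated with a Python analogy. This is not the Lua API: the dictionary `context` and the lock `context_lock` merely stand in for the shared `obicontext` table and its locking, and threads stand in for goroutines. The three function names mirror the hooks of an `obiscript` script.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

context = {"reads": 0}          # stands in for the shared obicontext table
context_lock = threading.Lock()


def begin():
    context["reads"] = 0        # runs once, before any worker


def worker(record):
    with context_lock:          # shared state is only touched under the lock
        context["reads"] += record.get("count", 1)
    return record               # returning the record keeps it in the output


def finish():
    return context["reads"]     # runs once, after the last worker


records = [{"id": f"seq{i}", "count": i} for i in range(1, 101)]
begin()
with ThreadPoolExecutor(max_workers=4) as pool:
    output = list(pool.map(worker, records))
total = finish()
print(len(output), total)
```

Without the lock, concurrent read-modify-write updates of the shared counter could be lost, which is the situation the locking calls mentioned above are meant to prevent.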
|
||||
@@ -0,0 +1,271 @@
|
||||
# NAME
|
||||
|
||||
obisummary — summarize the main information from a sequence file
|
||||
|
||||
---
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
```
|
||||
obisummary [--batch-mem <string>] [--batch-size <int>]
|
||||
[--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
|
||||
[--fasta] [--fastq] [--genbank] [--help|-h|-?]
|
||||
[--input-OBI-header] [--input-json-header] [--json-output]
|
||||
[--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
|
||||
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
|
||||
[--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
`obisummary` reads a set of biological sequences and computes a statistical
|
||||
summary of their content and annotations. Rather than producing a new sequence
|
||||
file, it outputs a single structured record describing the dataset as a whole.
|
||||
|
||||
The summary covers three main areas. First, global counts: the total number of
|
||||
reads (sequences weighted by their `count` attribute), the number of distinct
|
||||
sequence variants, and the total sequence length across all records. Second,
|
||||
annotation profiling: `obisummary` inspects every annotation key present in
|
||||
the dataset and classifies it as a scalar attribute (single value per
|
||||
sequence), a map attribute (key-to-count mapping), or a vector attribute
|
||||
(multi-value per sequence). Third, per-sample statistics: when sequences carry
|
||||
sample information (via `merged_sample` or equivalent per-sample annotations),
|
||||
`obisummary` reports for each sample the number of reads, the number of
|
||||
variants, and the number of singletons. If `obiclean` has been run previously,
|
||||
the summary also captures `obiclean_status` and related quality flags per
|
||||
sample.
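
The global counts described above can be illustrated with a minimal Python sketch. It is not the `obisummary` implementation: the toy records, the `global_counts` helper, and the fallback of a missing `count` attribute to 1 are assumptions made for illustration, and each record is treated as one variant (as after dereplication).

```python
# Toy records standing in for parsed, dereplicated sequences (hypothetical data).
records = [
    {"id": "seq001", "sequence": "atcgatcgatcgatcgatcg", "count": 4},
    {"id": "seq004", "sequence": "gctagctagctagctagcta", "count": 2},
    {"id": "seq007", "sequence": "tttttttttttttttttttt"},  # no count attribute
]


def global_counts(records):
    """Compute reads, variants and total_length for a list of records."""
    return {
        # Reads are weighted by the `count` attribute (assumed to default to 1).
        "reads": sum(rec.get("count", 1) for rec in records),
        # One record is treated as one variant.
        "variants": len(records),
        # Lengths are summed per record, not weighted by count.
        "total_length": sum(len(rec["sequence"]) for rec in records),
    }


summary = global_counts(records)
print(summary)
```

With these three records the sketch reports 7 reads, 3 variants, and a total length of 60.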
|
||||
|
||||
The output is a single JSON record by default, or YAML when `--yaml-output` is
|
||||
requested. <!-- corrected: actual default output is JSON, not YAML -->
|
||||
`obisummary` is typically used after processing steps such as
|
||||
`obiclean` or `obiuniq` to quickly validate the state of a dataset before
|
||||
downstream analysis.
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
`obisummary` accepts biological sequence data from one or more files supplied
|
||||
as positional arguments, or from standard input when no files are given.
|
||||
Supported formats include FASTA, FASTQ, GenBank flatfile, EMBL flatfile,
|
||||
ecoPCR output, and CSV. By default the format is detected automatically; use
|
||||
the format flags (`--fasta`, `--fastq`, `--genbank`, `--embl`, `--ecopcr`,
|
||||
`--csv`) to force a specific parser.
|
||||
|
||||
FASTA/FASTQ annotation headers may follow the OBI format (`--input-OBI-header`)
|
||||
or JSON format (`--input-json-header`). RNA sequences can be read as DNA by
|
||||
converting uracil to thymine with `--u-to-t`. Quality strings encoded according
|
||||
to the Solexa specification are handled with `--solexa`.
|
||||
|
||||
When multiple input files are provided, `obisummary` assumes they are ordered;
|
||||
use `--no-order` to indicate that no ordering exists among them.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
`obisummary` writes a single structured record to standard output. The default
|
||||
format is JSON; use `--yaml-output` to obtain YAML instead.
|
||||
<!-- corrected: actual default output is JSON, not YAML -->
|
||||
|
||||
The record contains three top-level sections:
|
||||
|
||||
- **`count`**: global metrics including `variants` (distinct sequences),
|
||||
`reads` (total weighted count), and `total_length` (sum of all sequence
|
||||
lengths).
|
||||
|
||||
- **`annotations`**: a breakdown of all annotation keys found in the dataset,
|
||||
classified as `scalar_attributes`, `map_attributes`, or `vector_attributes`,
|
||||
together with the observed keys and their occurrence counts within each
|
||||
category.
|
||||
|
||||
- **`samples`**: when sample information is present, `sample_count` and a
|
||||
per-sample `sample_stats` table with `reads`, `variants`, and `singletons`
|
||||
fields. If `obiclean` data is present, an `obiclean_bad` field is also
|
||||
reported per sample.
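
As a rough illustration of the per-sample fields, the following Python sketch derives `reads`, `variants`, and `singletons` from `merged_sample` maps. The toy records and the `per_sample_stats` helper are hypothetical, and the definition of a singleton used here (a variant seen exactly once in a given sample) is an assumption, not something confirmed by the text above.

```python
from collections import defaultdict

# Toy dereplicated records carrying a merged_sample map (hypothetical data).
records = [
    {"id": "seq001", "merged_sample": {"s1": 3, "s2": 1}},
    {"id": "seq004", "merged_sample": {"s1": 2}},
    {"id": "seq007", "merged_sample": {"s2": 1}},
]


def per_sample_stats(records):
    """Accumulate reads, variants and singletons for every sample."""
    stats = defaultdict(lambda: {"reads": 0, "variants": 0, "singletons": 0})
    for rec in records:
        for sample, n in rec.get("merged_sample", {}).items():
            entry = stats[sample]
            entry["reads"] += n        # reads of this variant in the sample
            entry["variants"] += 1     # the variant occurs in the sample
            if n == 1:                 # assumed definition of a singleton
                entry["singletons"] += 1
    return dict(stats)


sample_stats = per_sample_stats(records)
for name in sorted(sample_stats):
    print(name, sample_stats[name])
```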
|
||||
|
||||
When `--map` is used, each named map attribute is detailed in the `map_attributes` section of the output.
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
|
||||
{
|
||||
"annotations": {
|
||||
"keys": {
|
||||
"scalar": {
|
||||
"count": 5
|
||||
}
|
||||
},
|
||||
"map_attributes": 0,
|
||||
"scalar_attributes": 1,
|
||||
"vector_attributes": 0
|
||||
},
|
||||
"count": {
|
||||
"reads": 21,
|
||||
"total_length": 100,
|
||||
"variants": 5
|
||||
}
|
||||
}
|
||||
```
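
Because the record is plain JSON, it can be consumed by any JSON parser. The sketch below loads the observed example with Python's standard `json` module and derives two simple ratios; the variable names are illustrative only.

```python
import json

# The observed output example shown above, embedded as a string.
raw = """
{
  "annotations": {
    "keys": {"scalar": {"count": 5}},
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
"""

summary = json.loads(raw)
counts = summary["count"]

# Derived figures that are handy when sanity-checking a dataset.
mean_length = counts["total_length"] / counts["variants"]
reads_per_variant = counts["reads"] / counts["variants"]

print(f"{counts['variants']} variants, {counts['reads']} reads")
print(f"mean length: {mean_length:.1f}")
print(f"reads per variant: {reads_per_variant:.1f}")
```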
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## Summary output
|
||||
|
||||
**`--json-output`**
|
||||
- Default: `false`
|
||||
- Print the result as a JSON record (this is the default behaviour; this flag
|
||||
makes the choice explicit).
|
||||
<!-- corrected: JSON is the default output format, not YAML -->
|
||||
|
||||
**`--yaml-output`**
|
||||
- Default: `false`
|
||||
- Print the result as a YAML record instead of the default JSON format.
|
||||
<!-- corrected: YAML is not the default; JSON is -->
|
||||
|
||||
**`--map <string>`**
|
||||
- Default: `[]` (none)
|
||||
- Name of a map attribute to include in the summary. This option may be
|
||||
repeated to request multiple map attributes. Each named attribute will be
|
||||
detailed in the `map_attributes` section of the output.
|
||||
|
||||
## Input format
|
||||
|
||||
**`--fasta`**
|
||||
- Default: `false`
|
||||
- Read data following the FASTA format.
|
||||
|
||||
**`--fastq`**
|
||||
- Default: `false`
|
||||
- Read data following the FASTQ format.
|
||||
|
||||
**`--genbank`**
|
||||
- Default: `false`
|
||||
- Read data following the GenBank flatfile format.
|
||||
|
||||
**`--embl`**
|
||||
- Default: `false`
|
||||
- Read data following the EMBL flatfile format.
|
||||
|
||||
**`--ecopcr`**
|
||||
- Default: `false`
|
||||
- Read data following the ecoPCR output format.
|
||||
|
||||
**`--csv`**
|
||||
- Default: `false`
|
||||
- Read data following the CSV format.
|
||||
|
||||
**`--input-OBI-header`**
|
||||
- Default: `false`
|
||||
- FASTA/FASTQ title line annotations follow OBI format.
|
||||
|
||||
**`--input-json-header`**
|
||||
- Default: `false`
|
||||
- FASTA/FASTQ title line annotations follow JSON format.
|
||||
|
||||
**`--solexa`**
|
||||
- Default: `false`
|
||||
- Decode quality strings according to the Solexa specification.
|
||||
|
||||
**`--u-to-t`**
|
||||
- Default: `false`
|
||||
- Convert uracil (U) to thymine (T) when reading RNA sequences.
|
||||
|
||||
## Batch control
|
||||
|
||||
**`--batch-size <int>`**
|
||||
- Default: `1`
|
||||
- Minimum number of sequences per processing batch.
|
||||
|
||||
**`--batch-size-max <int>`**
|
||||
- Default: `2000`
|
||||
- Maximum number of sequences per processing batch.
|
||||
|
||||
**`--batch-mem <string>`**
|
||||
- Default: `""` (128M effective)
|
||||
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable
|
||||
the memory limit.
|
||||
|
||||
## Processing
|
||||
|
||||
**`--max-cpu <int>`**
|
||||
- Default: `16`
|
||||
- Number of parallel threads used to compute the result.
|
||||
|
||||
**`--no-order`**
|
||||
- Default: `false`
|
||||
- When several input files are provided, indicates that there is no order
|
||||
among them.
|
||||
|
||||
## General
|
||||
|
||||
**`--debug`**
|
||||
- Default: `false`
|
||||
- Enable debug mode by setting the log level to debug.
|
||||
|
||||
**`--silent-warning`**
|
||||
- Default: `false`
|
||||
- Stop printing warning messages.
|
||||
|
||||
**`--version`**
|
||||
- Default: `false`
|
||||
- Print the version and exit.
|
||||
|
||||
**`--help` / `-h` / `-?`**
|
||||
- Default: `false`
|
||||
- Display help and exit.
|
||||
|
||||
**`--pprof`**
|
||||
- Default: `false`
|
||||
- Enable the pprof profiling server. Consult the log for the server address.
|
||||
|
||||
**`--pprof-goroutine <int>`**
|
||||
- Default: `6060`
|
||||
- Port for goroutine blocking profile.
|
||||
|
||||
**`--pprof-mutex <int>`**
|
||||
- Default: `10`
|
||||
- Rate for mutex lock profiling.
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
```bash
|
||||
# Get a JSON summary of a FASTA file produced by obiclean
|
||||
obisummary cleaned.fasta > out_default.yaml
|
||||
```
|
||||
|
||||
**Expected output:** a JSON summary record in `out_default.yaml`.
|
||||
|
||||
```bash
|
||||
# Get the summary as an explicit JSON record for programmatic processing
|
||||
obisummary --json-output cleaned.fasta > out_json.json
|
||||
```
|
||||
|
||||
**Expected output:** a JSON summary record in `out_json.json`.
|
||||
|
||||
```bash
|
||||
# Get a YAML record from a FASTQ file
|
||||
obisummary --yaml-output --fastq reads.fastq > out_yaml.yaml
|
||||
```
|
||||
|
||||
**Expected output:** a YAML summary record in `out_yaml.yaml`.
|
||||
|
||||
```bash
|
||||
# Summarise data read from standard input, forcing FASTA format
|
||||
obigrep -p 'annotations.count > 1' sequences.fasta | obisummary --fasta > out_pipeline.yaml
|
||||
```
|
||||
|
||||
**Expected output:** a JSON summary record in `out_pipeline.yaml` (3 variants, 10 reads).
|
||||
|
||||
---
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
`obiclean`, `obiuniq`, `obicount`
|
||||
@@ -0,0 +1,347 @@
|
||||
# NAME
|
||||
|
||||
obiuniq — dereplicate sequence data sets
|
||||
|
||||
---
|
||||
|
||||
# SYNOPSIS
|
||||
|
||||
```
|
||||
obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
|
||||
[--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
|
||||
[--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
|
||||
[--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
|
||||
[--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
|
||||
[--input-OBI-header] [--input-json-header] [--json-output]
|
||||
[--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
|
||||
[--no-order] [--no-progressbar] [--no-singleton]
|
||||
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
|
||||
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
|
||||
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
|
||||
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
|
||||
[--with-leaves] [<args>]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# DESCRIPTION
|
||||
|
||||
`obiuniq` groups identical sequences together and replaces them with a single
|
||||
representative, recording the total number of original occurrences as an
|
||||
abundance count. This process — called dereplication — is a standard step in
|
||||
amplicon sequencing workflows: it dramatically reduces the number of sequence
|
||||
records to process, while preserving exact counts needed for downstream
|
||||
statistical analyses.
|
||||
|
||||
By default, two sequences are considered identical if and only if their
|
||||
nucleotide strings are the same. Using `--category-attribute` (repeatable),
|
||||
additional metadata fields can be included in the identity criterion. For
|
||||
example, grouping by sample name keeps the same sequence as separate records
|
||||
when it occurs in different samples, enabling per-sample abundance tracking.
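
The grouping rule can be sketched in a few lines of Python. This is an illustration of the identity criterion only, not the `obiuniq` implementation; the `dereplicate` helper and the toy reads are hypothetical, and the placeholder for missing attributes mirrors the `--na-value` default.

```python
def dereplicate(records, categories=(), na_value="NA"):
    """Group records by sequence plus the listed category attributes."""
    groups = {}
    for rec in records:
        # Grouping key: the sequence itself, then one value per category.
        # Records lacking a category fall back to the placeholder value.
        key = (rec["sequence"],) + tuple(
            str(rec.get(cat, na_value)) for cat in categories
        )
        if key not in groups:
            # The first record seen becomes the representative of its group.
            groups[key] = {"id": rec["id"], "sequence": rec["sequence"], "count": 0}
        groups[key]["count"] += rec.get("count", 1)
    return list(groups.values())


# Toy reads (hypothetical data); r4 carries no sample attribute.
reads = [
    {"id": "r1", "sequence": "atcg", "sample": "s1"},
    {"id": "r2", "sequence": "atcg", "sample": "s1"},
    {"id": "r3", "sequence": "atcg", "sample": "s2"},
    {"id": "r4", "sequence": "gcta"},
]

plain = dereplicate(reads)
per_sample = dereplicate(reads, categories=("sample",))
print([(g["id"], g["count"]) for g in plain])
print([(g["id"], g["count"]) for g in per_sample])
```

Without categories the three `atcg` reads collapse into one record; with `sample` as a category the same sequence is kept once per sample.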
|
||||
|
||||
For each group of identical sequences, `obiuniq` emits one output record
|
||||
carrying the merged metadata of all members. The `--merge` option (repeatable)
|
||||
instructs the command to also record, in an attribute named `merged_<KEY>`, the
|
||||
distribution of `KEY` attribute values across the sequences collapsed into each
|
||||
group — useful for provenance tracking and quality control. <!-- corrected: actual attribute name is merged_KEY (not KEY); tracks attribute value distributions, not a list of sequence IDs -->
|
||||
|
||||
Sequences that appear only once in the entire dataset (singletons) can be
|
||||
removed with `--no-singleton`. Singletons often represent sequencing errors
|
||||
rather than genuine biological variants, so their removal is a common
|
||||
noise-reduction step.
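
Since the singleton filter applies to output records, its effect depends on the grouping criterion. The Python sketch below, built on hypothetical toy data and a hypothetical `without_singletons` helper, shows how a sequence that survives when grouping by sequence alone can lose some of its reads once a category attribute splits it into smaller groups.

```python
from collections import Counter

# Toy reads as (sequence, sample) pairs (hypothetical data).
reads = [
    ("atcg", "s1"),
    ("atcg", "s1"),
    ("atcg", "s2"),
    ("gcta", "s1"),
]

# Abundance per sequence, and per (sequence, sample) pair.
by_sequence = Counter(seq for seq, _ in reads)
by_sequence_and_sample = Counter(reads)


def without_singletons(counts):
    """Keep only groups whose abundance is greater than one."""
    return {key: n for key, n in counts.items() if n > 1}


kept_plain = without_singletons(by_sequence)
kept_per_sample = without_singletons(by_sequence_and_sample)
print(kept_plain)
print(kept_per_sample)
```

Here `atcg` keeps all three reads in the plain case, but only the two reads from `s1` remain once the sample is part of the grouping key.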
|
||||
|
||||
---
|
||||
|
||||
# INPUT
|
||||
|
||||
`obiuniq` accepts biological sequence data in FASTA, FASTQ, EMBL, GenBank,
|
||||
ecoPCR, or CSV format (auto-detected by default, or forced with format flags
|
||||
such as `--fasta`, `--fastq`, `--embl`, etc.). Input is read from one or more
|
||||
files given as positional arguments, or from standard input when no files are
|
||||
provided.
|
||||
|
||||
When multiple input files are provided, `obiuniq` assumes they are ordered
|
||||
(e.g., paired-end reads in the same read order). If no such ordering exists,
|
||||
use `--no-order` to signal that files can be consumed independently.
|
||||
|
||||
FASTA/FASTQ header annotations are parsed heuristically by default. Use
|
||||
`--input-OBI-header` for OBI-formatted headers or `--input-json-header` for
|
||||
JSON-formatted headers. RNA sequences can be normalised to DNA on the fly with
|
||||
`--u-to-t`.
|
||||
|
||||
---
|
||||
|
||||
# OUTPUT
|
||||
|
||||
`obiuniq` writes dereplicated sequences to standard output or to the file
|
||||
specified by `--out`. Each output record represents one group of identical
|
||||
sequences (identical under the chosen grouping criterion). The output carries
|
||||
the merged metadata from all input records in the group.
|
||||
|
||||
The output format defaults to FASTA. Even when the input contains quality
|
||||
scores (FASTQ), quality information is not preserved across merged sequences,
|
||||
so the output is written in FASTA format unless `--fastq-output` is explicitly
|
||||
requested. <!-- corrected: actual output is always FASTA when dereplicating; quality scores are dropped during merging -->
|
||||
Output annotations follow the OBI header format when `--output-OBI-header` is
|
||||
set, or JSON when `--output-json-header` is set. The output can be
|
||||
gzip-compressed with `--compress`.
|
||||
|
||||
For each output record:
|
||||
- The abundance count reflects how many input sequences were merged into the
|
||||
group.
|
||||
- Attributes created by `--merge KEY` are named `merged_KEY` and map each
|
||||
observed value of the `KEY` attribute to the count of input sequences
|
||||
carrying that value within the group. <!-- corrected: attribute name is merged_KEY; value is a map not a list -->
|
||||
- All other attributes are merged from the contributing records according to
|
||||
the standard OBITools4 merging rules.
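
The shape of a `merged_KEY` attribute can be illustrated with a short Python sketch. The `merged_attribute` helper and the toy group are hypothetical; skipping records that lack the attribute is an assumption made here for simplicity, not documented behaviour.

```python
from collections import Counter


def merged_attribute(group, key):
    """Map each observed value of `key` to the number of records carrying it."""
    # Records without the attribute are skipped (assumption for this sketch).
    return dict(Counter(str(rec[key]) for rec in group if key in rec))


# One group of identical sequences before collapsing (hypothetical data).
group = [
    {"id": "r1", "sequence": "atcg", "sample": "s1"},
    {"id": "r2", "sequence": "atcg", "sample": "s1"},
    {"id": "r3", "sequence": "atcg", "sample": "s2"},
]

# The collapsed record: abundance plus the value distribution of `sample`.
collapsed = {
    "id": group[0]["id"],
    "sequence": group[0]["sequence"],
    "count": len(group),
    "merged_sample": merged_attribute(group, "sample"),
}
print(collapsed)
```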
|
||||
|
||||
## Observed output example
|
||||
|
||||
```
|
||||
>seq008 {"count":1,"primer":"p1"}
|
||||
cccccccccccccccccccc
|
||||
>seq001 {"count":4,"primer":"p1"}
|
||||
atcgatcgatcgatcgatcg
|
||||
>seq004 {"count":2,"primer":"p1","sample":"s1"}
|
||||
gctagctagctagctagcta
|
||||
>seq007 {"count":1,"primer":"p1","sample":"s2"}
|
||||
tttttttttttttttttttt
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# OPTIONS
|
||||
|
||||
## Dereplication Options
|
||||
|
||||
**`--category-attribute|-c <CATEGORY>`** (default: `[]`)
|
||||
Adds one metadata attribute to the grouping criterion. Two sequences are
|
||||
placed in the same group only when they are nucleotide-identical **and** share
|
||||
the same value for every attribute listed with `-c`. This option can be
|
||||
repeated to combine multiple attributes (e.g., `-c sample -c primer`).
|
||||
Records that lack a listed attribute receive the value set by `--na-value`.
|
||||
|
||||
**`--chunk-count <int>`** (default: `100`)
|
||||
Controls how many internal partitions the dataset is split into during
|
||||
processing. A higher value reduces per-partition memory usage at the cost of
|
||||
more temporary files; a lower value increases per-partition memory but reduces
|
||||
I/O overhead. Tune this when processing very large or very small datasets.
|
||||
|
||||
**`--in-memory`** (default: `false`)
|
||||
Stores intermediate data chunks in RAM rather than in temporary disk files.
|
||||
Speeds up processing on datasets that fit comfortably in available memory;
|
||||
omit this flag (the default) for large datasets that exceed available RAM.
|
||||
|
||||
**`--merge|-m <KEY>`** (default: `[]`)
|
||||
Creates an output attribute named `merged_KEY` that maps each observed value
|
||||
of the `KEY` attribute to the count of input sequences carrying that value
|
||||
within the group. Repeat to track multiple attributes. <!-- corrected: actual attribute name is merged_KEY (not KEY); value is a map of attribute values to counts, not a list of sequence IDs -->
|
||||
Useful for tracking how each sample or category contributed to a collapsed group.
|
||||
|
||||
**`--na-value <NA_NAME>`** (default: `"NA"`)
|
||||
Value assigned to a category attribute when a sequence record does not carry
|
||||
that attribute. All sequences lacking the attribute are grouped together under
|
||||
this placeholder, rather than being treated as incomparable.
|
||||
|
||||
**`--no-singleton`** (default: `false`)
|
||||
Discards all output records whose abundance count is exactly one — i.e.,
|
||||
sequences that occur only once across the entire input. Removing singletons
|
||||
is a standard heuristic for excluding sequencing errors from further analysis.
|
||||
|
||||
## Input Options
|
||||
|
||||
**`--batch-mem <string>`** (default: `""`, env: `OBIBATCHMEM`)
|
||||
Maximum memory budget per processing batch (e.g. `128K`, `64M`, `1G`). Set
|
||||
to `0` to disable the memory ceiling. Overrides `--batch-size-max` when
|
||||
both are set.
|
||||
|
||||
**`--batch-size <int>`** (default: `10`, env: `OBIBATCHSIZE`)
|
||||
Minimum number of sequences per batch (floor).
|
||||
|
||||
**`--batch-size-max <int>`** (default: `2000`, env: `OBIBATCHSIZEMAX`)
|
||||
Maximum number of sequences per batch (ceiling).
|
||||
|
||||
**`--csv`** (default: `false`)
|
||||
Parse input as CSV format.
|
||||
|
||||
**`--ecopcr`** (default: `false`)
|
||||
Parse input as ecoPCR output format.
|
||||
|
||||
**`--embl`** (default: `false`)
|
||||
Parse input as EMBL flatfile format.
|
||||
|
||||
**`--fasta`** (default: `false`)
|
||||
Parse input as FASTA format.
|
||||
|
||||
**`--fastq`** (default: `false`)
|
||||
Parse input as FASTQ format.
|
||||
|
||||
**`--genbank`** (default: `false`)
|
||||
Parse input as GenBank flatfile format.
|
||||
|
||||
**`--input-OBI-header`** (default: `false`)
|
||||
Treat FASTA/FASTQ title line annotations as OBI-format key=value pairs.
|
||||
|
||||
**`--input-json-header`** (default: `false`)
|
||||
Treat FASTA/FASTQ title line annotations as JSON objects.
|
||||
|
||||
**`--no-order`** (default: `false`)
|
||||
When multiple input files are provided, indicates that there is no ordering
|
||||
relationship among them.
|
||||
|
||||
**`--skip-empty`** (default: `false`)
|
||||
Suppress sequences of length zero from the output.
|
||||
|
||||
**`--solexa`** (default: `false`, env: `OBISOLEXA`)
|
||||
Decode quality strings according to the Solexa specification rather than the
|
||||
standard Phred encoding.
|
||||
|
||||
**`--u-to-t`** (default: `false`)
|
||||
Convert uracil (U) to thymine (T) in all input sequences, normalising RNA to
|
||||
DNA representation.
|
||||
|
||||
## Output Options
|
||||
|
||||
**`--compress|-Z`** (default: `false`)
|
||||
Compress output using gzip.
|
||||
|
||||
**`--fasta-output`** (default: `false`)
|
||||
Write output in FASTA format (the default for `obiuniq`; see OUTPUT).
|
||||
|
||||
**`--fastq-output`** (default: `false`)
|
||||
Write output in FASTQ format. Because `obiuniq` defaults to FASTA even for FASTQ input (see OUTPUT), this format must be requested explicitly.
|
||||
|
||||
**`--json-output`** (default: `false`)
|
||||
Write output in JSON format.
|
||||
|
||||
**`--out|-o <FILENAME>`** (default: `"-"`)
|
||||
Write output to the specified file instead of standard output.
|
||||
|
||||
**`--output-OBI-header|-O`** (default: `false`)
|
||||
Write FASTA/FASTQ title line annotations in OBI format.
|
||||
|
||||
**`--output-json-header`** (default: `false`)
|
||||
Write FASTA/FASTQ title line annotations in JSON format.
|
||||
|
||||
## Taxonomy Options
|
||||
|
||||
**`--fail-on-taxonomy`** (default: `false`)
|
||||
Cause `obiuniq` to exit with an error if a taxid in the data is not a
|
||||
currently valid taxon in the loaded taxonomy.
|
||||
|
||||
**`--raw-taxid`** (default: `false`)
|
||||
Print taxids in output without supplementary information (taxon name and rank).
|
||||
|
||||
**`--taxonomy|-t <string>`** (default: `""`)
|
||||
Path to the taxonomy database used to validate or update taxids.
|
||||
|
||||
**`--update-taxid`** (default: `false`)
|
||||
Automatically replace merged taxids with the most recent valid taxid.
|
||||
|
||||
**`--with-leaves`** (default: `false`)
|
||||
When taxonomy is extracted from a sequence file, add sequences as leaves of
|
||||
their taxid annotation.
|
||||
|
||||
## Execution Options
|
||||
|
||||
**`--max-cpu <int>`** (default: `16`, env: `OBIMAXCPU`)
|
||||
Number of parallel threads used to compute the result.
|
||||
|
||||
**`--debug`** (default: `false`, env: `OBIDEBUG`)
|
||||
Enable debug mode by setting the log level to debug.
|
||||
|
||||
**`--no-progressbar`** (default: `false`)
|
||||
Disable the progress bar.
|
||||
|
||||
**`--silent-warning`** (default: `false`, env: `OBIWARNING`)
|
||||
Suppress warning messages.
|
||||
|
||||
**`--pprof`** (default: `false`)
|
||||
Enable the pprof profiling server (address logged at startup).
|
||||
|
||||
**`--pprof-goroutine <int>`** (default: `6060`, env: `OBIPPROFGOROUTINE`)
|
||||
Port for the goroutine blocking profile endpoint.
|
||||
|
||||
**`--pprof-mutex <int>`** (default: `10`, env: `OBIPPROFMUTEX`)
|
||||
Rate for the mutex contention profile.
|
||||
|
||||
**`--version`** (default: `false`)
|
||||
Print the version string and exit.
|
||||
|
||||
**`--help|-h|-?`** (default: `false`)
|
||||
Print usage information and exit.
|
||||
|
||||
---
|
||||
|
||||
# EXAMPLES
|
||||
|
||||
```bash
|
||||
# Dereplicate a FASTQ file of amplicon reads; write unique sequences to a FASTA output file.
|
||||
obiuniq reads.fastq -o out_basic.fastq
|
||||
```
|
||||
|
||||
**Expected output:** 4 sequences written to `out_basic.fastq`.
|
||||
|
||||
```bash
|
||||
# Dereplicate keeping sequences separate per sample (category attribute),
|
||||
# and discard singletons to remove likely sequencing errors.
|
||||
obiuniq -c sample --no-singleton reads.fastq -o out_no_singleton.fastq
|
||||
```
|
||||
|
||||
**Expected output:** 2 sequences written to `out_no_singleton.fastq`.
|
||||
|
||||
```bash
|
||||
# Dereplicate per sample, recording the sample distribution in 'merged_sample',
|
||||
# and use 'UNKNOWN' for reads missing the sample attribute.
|
||||
obiuniq -c sample --merge sample --na-value UNKNOWN reads.fastq -o out_merge.fastq
|
||||
```
|
||||
|
||||
**Expected output:** 5 sequences written to `out_merge.fastq`.
|
||||
|
||||
```bash
|
||||
# Process a dataset entirely in memory using 200 internal partitions,
|
||||
# writing gzip-compressed output.
|
||||
obiuniq --in-memory --chunk-count 200 --compress -o out_inmemory.fastq.gz reads.fastq
|
||||
```
|
||||
|
||||
**Expected output:** 4 sequences written to `out_inmemory.fastq.gz`.
|
||||
|
||||
```bash
|
||||
# Dereplicate reads from two sample files with no assumed ordering between them,
|
||||
# grouping by both sample and primer attributes.
|
||||
obiuniq --no-order -c sample -c primer sample1.fastq sample2.fastq -o out_multifile.fastq
|
||||
```
|
||||
|
||||
**Expected output:** 4 sequences written to `out_multifile.fastq`.
|
||||
|
||||
---
|
||||
|
||||
# SEE ALSO
|
||||
|
||||
- `obigrep` — filter dereplicated sequences by abundance, length, or annotation
|
||||
- `obiannotate` — add or modify annotations on dereplicated records
|
||||
- `obicount` — count sequences or groups in a dataset
|
||||
- `obiclean` — remove sequencing artefacts from a dereplicated dataset
|
||||
- `obisummary` — summarise annotation distributions across a sequence set
|
||||
|
||||
---
|
||||
|
||||
# NOTES
|
||||
|
||||
For datasets that do not fit in RAM, `obiuniq` uses temporary disk-backed
|
||||
chunk files by default. The number of chunks is controlled by `--chunk-count`
|
||||
(default 100). Increasing this value lowers per-chunk memory requirements;
|
||||
decreasing it reduces I/O at the cost of higher peak memory. Use `--in-memory`
|
||||
only when the full working set fits in available RAM, as exceeding memory will
|
||||
degrade performance or cause out-of-memory failures.
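
To see why partitioning keeps memory bounded without changing the result, consider the generic technique of hash partitioning, sketched below in Python. Whether `obiuniq` partitions its input exactly this way is an assumption; the `chunk_index` and `dereplicate_in_chunks` helpers are hypothetical, and in-memory lists stand in for the temporary chunk files.

```python
import zlib
from collections import Counter


def chunk_index(sequence, chunk_count):
    """Assign a sequence to a chunk; identical sequences share a chunk."""
    return zlib.crc32(sequence.encode("ascii")) % chunk_count


def dereplicate_in_chunks(sequences, chunk_count=100):
    """Count identical sequences one chunk at a time."""
    # Lists stand in for the temporary chunk files (assumption).
    chunks = [[] for _ in range(chunk_count)]
    for seq in sequences:
        chunks[chunk_index(seq, chunk_count)].append(seq)

    totals = {}
    for chunk in chunks:
        # Only one chunk needs to be held in memory while it is counted.
        # Chunks never share a sequence, so their results can simply be merged.
        totals.update(Counter(chunk))
    return totals


sequences = ["atcg", "gcta", "atcg", "tttt", "atcg", "gcta"]
result = dereplicate_in_chunks(sequences, chunk_count=4)
print(result)
```

The counts are the same for any `chunk_count`; only the size of each partition, and therefore the peak memory per partition, changes.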
|
||||
|
||||
Singletons (sequences with abundance = 1) are a common source of noise in
|
||||
amplicon sequencing, often arising from PCR or sequencing errors. The
|
||||
`--no-singleton` flag is therefore recommended for most metabarcoding
|
||||
workflows, unless the study design requires retaining all observed variants.
|
||||
|
||||
When the `--category-attribute` option is used, records that lack the
|
||||
specified attribute are grouped together under the `--na-value` placeholder
|
||||
(default `"NA"`). This ensures that all records participate in dereplication
|
||||
without being silently dropped, but users should be aware that heterogeneous
|
||||
records with different missing attributes may be unintentionally merged.
|
||||