⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
+300
@@ -0,0 +1,300 @@
# NAME
obicomplement — reverse complement of sequences
---
# SYNOPSIS
```
obicomplement [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
[--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order]
[--no-progressbar] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header]
[--paired-with <FILENAME>] [--raw-taxid] [--silent-warning]
[--skip-empty] [--solexa] [--taxonomy|-t <string>] [--u-to-t]
[--update-taxid] [--with-leaves] [<args>]
```
---
# DESCRIPTION
`obicomplement` computes the reverse complement of every sequence in the
input. For each input sequence, the nucleotides are first reversed, then
each base is replaced by its Watson-Crick complement (A↔T, C↔G), yielding
the strand that would pair with the original sequence read in the opposite
direction.
When quality scores are present (FASTQ data), they are reversed together with
the sequence so that each quality value remains associated with its
corresponding base. Ambiguous IUPAC characters are complemented according to
the IUPAC rules (e.g. `R`↔`Y`, while `N` stays `N`).
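The transformation described above can be sketched in a few lines of Python. This is only an illustration of the operation, not the code used by `obicomplement`: the function name `reverse_complement`, the lower-casing of the result, and the choice to leave unknown symbols (such as `-`) untouched are assumptions made for the example.

```python
# Complement table for the standard bases and the IUPAC ambiguity codes.
_COMPLEMENT = {
    "a": "t", "c": "g", "g": "c", "t": "a",
    "r": "y", "y": "r", "s": "s", "w": "w", "k": "m", "m": "k",
    "b": "v", "v": "b", "d": "h", "h": "d", "n": "n",
}


def reverse_complement(sequence, qualities=None):
    """Return (reverse-complemented sequence, reversed qualities or None).

    Hypothetical helper: symbols missing from the table (gaps, for
    instance) are copied unchanged, and the result is lower-cased.
    """
    rc = "".join(_COMPLEMENT.get(base, base) for base in reversed(sequence.lower()))
    # Qualities are only reversed, never complemented, so that each value
    # stays attached to the base it was measured for.
    rq = qualities[::-1] if qualities is not None else None
    return rc, rq
```

For instance, `reverse_complement("aacg", "IIH#")` yields the pair `("cgtt", "#HII")`.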
This operation is commonly needed when sequences have been sequenced on the
wrong strand, when a primer is designed on the reverse strand, or when
preparing sequences for strand-aware downstream analyses.
The command reads from standard input or from one or more files, processes
sequences in parallel, and writes the result to standard output or to the
file specified with `--out`.
---
# INPUT
`obicomplement` accepts biological sequence data in FASTA, FASTQ, EMBL,
GenBank, ecoPCR output, and CSV formats. When no format flag is given, the
format is inferred automatically from the file contents or extension.
Input is read from standard input when no filename argument is provided, or
from one or more files passed as positional arguments. Gzip-compressed files
are handled transparently.
Paired-end data can be provided with `--paired-with`, which specifies the
file containing the second mate. Both mates are reverse-complemented and
written to separate output files.
---
# OUTPUT
The output is a sequence file in which every sequence is the reverse
complement of the corresponding input sequence. The output format matches
the input by default (FASTA if no quality data, FASTQ if quality data are
present), and can be overridden with `--fasta-output`, `--fastq-output`, or
`--json-output`.
All annotations (attributes stored in the sequence header) are preserved
unchanged. Quality scores, when present, are reversed to stay aligned with
their bases.
## Observed output example
```
>seq001 {"definition":"basic DNA sequence"}
cgatcgatcgatcgatcgat
>seq002 {"definition":"GC-rich sequence"}
gcgcgcgcgcgcgcgcgcgc
>seq003 {"definition":"AT-rich sequence"}
atatatatatatatatatat
>seq004 {"definition":"palindromic sequence"}
aattccggaattccggaatt
>seq005 {"definition":"mixed sequence"}
agctagcatgcatagccgat
```
---
# OPTIONS
## Input format
**`--fasta`**
: Default: false. Force parsing of input as FASTA format.
**`--fastq`**
: Default: false. Force parsing of input as FASTQ format.
**`--embl`**
: Default: false. Force parsing of input as EMBL flatfile format.
**`--genbank`**
: Default: false. Force parsing of input as GenBank flatfile format.
**`--ecopcr`**
: Default: false. Force parsing of input as ecoPCR output format.
**`--csv`**
: Default: false. Force parsing of input as CSV format.
**`--solexa`**
: Default: false. Decode quality scores using the Solexa/Illumina pre-1.3
convention instead of the standard Phred+33 encoding.
**`--input-OBI-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using the OBI
key=value format.
**`--input-json-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using JSON
format.
**`--no-order`**
: Default: false. When several input files are given, declare that no
ordering relationship exists among them, allowing the reader to interleave
records freely.
**`--paired-with <FILENAME>`**
: Default: none. File containing the paired (R2) reads. When set,
`obicomplement` processes both mates and writes them to separate output
files.
## Sequence preprocessing
**`--u-to-t`**
: Default: false. Convert Uracil (U) to Thymine (T) before computing the
reverse complement. Useful when processing RNA sequences that must be
treated as DNA.
**`--skip-empty`**
: Default: false. Discard sequences of length zero from the output.
## Output format
**`--fasta-output`**
: Default: false. Write output in FASTA format regardless of whether quality
scores are present.
**`--fastq-output`**
: Default: false. Write output in FASTQ format (requires quality data).
**`--json-output`**
: Default: false. Write output in JSON format.
**`--out|-o <FILENAME>`**
: Default: `-` (standard output). File used to save the output.
**`--output-OBI-header|-O`**
: Default: false. Write FASTA/FASTQ header annotations in OBI key=value
format.
**`--output-json-header`**
: Default: false. Write FASTA/FASTQ header annotations in JSON format.
**`--compress|-Z`**
: Default: false. Compress the output with gzip.
## Taxonomy
**`--taxonomy|-t <string>`**
: Default: none. Path to a taxonomy database. Required only when the input
sequences carry taxid annotations that need to be validated or updated.
**`--fail-on-taxonomy`**
: Default: false. Cause `obicomplement` to exit with an error if a taxid
referenced in the data is not a currently valid node in the loaded
taxonomy.
**`--update-taxid`**
: Default: false. Automatically replace taxids that have been declared
merged into a newer node by the taxonomy database.
**`--raw-taxid`**
: Default: false. Print taxids without appending the taxon name and rank.
**`--with-leaves`**
: Default: false. When the taxonomy is extracted from the sequence file,
attach sequences as leaves of their taxid node.
## Performance and diagnostics
**`--max-cpu <int>`**
: Default: 16 (env: `OBIMAXCPU`). Number of parallel threads used to
process sequences.
**`--batch-size <int>`**
: Default: 1 (env: `OBIBATCHSIZE`). Minimum number of sequences per
processing batch.
**`--batch-size-max <int>`**
: Default: 2000 (env: `OBIBATCHSIZEMAX`). Maximum number of sequences per
processing batch.
**`--batch-mem <string>`**
: Default: `128M` (env: `OBIBATCHMEM`). Maximum memory allocated per batch
(e.g. `128K`, `64M`, `1G`). Set to `0` to disable the memory limit.
**`--no-progressbar`**
: Default: false. Disable the progress bar printed to stderr.
**`--silent-warning`**
: Default: false (env: `OBIWARNING`). Suppress warning messages.
**`--debug`**
: Default: false (env: `OBIDEBUG`). Enable debug logging.
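To make the interplay of the three batch options easier to picture, here is a rough Python sketch of one way such limits could combine. The source does not describe the actual batching algorithm, so everything here is an assumption: the function `make_batches` is hypothetical, record length stands in for memory footprint, and the rule "the memory limit only closes a batch that already holds the minimum number of records" is a guess at how a floor and a ceiling might cooperate.

```python
def make_batches(records, batch_size=1, batch_size_max=2000,
                 batch_mem=128 * 1024 * 1024):
    """Group records (strings) into batches under count and memory limits.

    Hypothetical illustration only. batch_mem == 0 disables the memory limit.
    """
    batches, current, used = [], [], 0
    for record in records:
        size = len(record)  # crude stand-in for the real memory footprint
        over_mem = batch_mem > 0 and used + size > batch_mem
        # Close the batch when the ceiling is reached, or when the memory
        # limit would be exceeded and the floor is already satisfied.
        if current and (len(current) >= batch_size_max
                        or (over_mem and len(current) >= batch_size)):
            batches.append(current)
            current, used = [], 0
        current.append(record)
        used += size
    if current:
        batches.append(current)
    return batches
```

Under these assumptions, a large `--batch-size` can make a batch exceed `--batch-mem`, which is why the two are described as a floor and a limit rather than as a single fixed size.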
---
# EXAMPLES
```bash
# Reverse complement all sequences in a FASTA file
obicomplement sequences.fasta > out_default.fasta
```
**Expected output:** 5 sequences written to `out_default.fasta`.
```bash
# Reverse complement a FASTQ file, preserving quality scores
obicomplement reads.fastq --fastq-output --out out_fastq.fastq
```
**Expected output:** 5 sequences written to `out_fastq.fastq`.
```bash
# Convert RNA sequences to their reverse complement DNA strand
obicomplement --u-to-t rna_sequences.fasta > out_rna_rc.fasta
```
**Expected output:** 3 sequences written to `out_rna_rc.fasta`.
```bash
# Reverse complement paired-end reads into two separate output files
obicomplement R1.fastq --paired-with R2.fastq --out out_paired.fastq
```
**Expected output:** 3 sequences written to `out_paired_R1.fastq` and 3 sequences to `out_paired_R2.fastq`.
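The file names in the paired-end example follow a simple pattern: `_R1` and `_R2` are inserted before the extension of the `--out` value. The Python helper below reproduces that pattern for the names shown in this page. It is a hypothetical illustration; how the tool treats names with several extensions (such as `.fastq.gz`) is not stated in the source and is not modelled here.

```python
from pathlib import Path


def paired_output_names(out):
    """Derive the two mate file names from the value passed to --out.

    Hypothetical helper matching the names shown in the example above.
    """
    path = Path(out)
    r1 = path.with_name(f"{path.stem}_R1{path.suffix}")
    r2 = path.with_name(f"{path.stem}_R2{path.suffix}")
    return str(r1), str(r2)
```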
```bash
# Reverse complement and compress output, skipping any empty sequences
obicomplement --skip-empty --compress sequences.fasta --out out_compressed.fasta.gz
```
**Expected output:** 5 sequences written to `out_compressed.fasta.gz` (gzip-compressed FASTA).
```bash
# Reverse complement with OBI-format header output
obicomplement --output-OBI-header sequences.fasta --out out_obi.fasta
```
**Expected output:** 5 sequences written to `out_obi.fasta`.
```bash
# Reverse complement with explicit JSON-format header output
obicomplement --output-json-header sequences.fasta --out out_jsonheader.fasta
```
**Expected output:** 5 sequences written to `out_jsonheader.fasta`.
```bash
# Reverse complement and write full JSON output format
obicomplement --json-output sequences.fasta --out out_json.json
```
**Expected output:** 5 sequences written to `out_json.json`.
---
# SEE ALSO
- `obiconvert` — format conversion and sequence filtering pipeline
- `obipairing` — paired-end read merging (uses reverse complement internally)
- `obigrep` — sequence filtering and selection
---
# NOTES
Quality scores (Phred-scaled) are reversed in lock-step with the sequence
so that positional quality information remains valid after the reverse
complement operation. This is essential for downstream tools that rely on
per-base quality for alignment or variant calling.
Ambiguous IUPAC characters and gap symbols (`-`) are handled gracefully:
standard ambiguous bases are complemented according to IUPAC rules, while
gap and missing-data symbols are preserved unchanged.
+188
@@ -0,0 +1,188 @@
# obiconsensus(1) — OBITools4 Manual
## NAME
`obiconsensus` — denoise Oxford Nanopore Technology (ONT) reads by building consensus sequences
## SYNOPSIS
```
obiconsensus [OPTIONS] [FILE...]
```
## DESCRIPTION
`obiconsensus` is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technology (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. `obiconsensus` groups these related reads and builds a single, more reliable consensus sequence for each group.
The tool works by constructing a *difference graph*: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by `--distance`). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a *de Bruijn graph* approach. The result is a set of high-quality representative sequences, one per cluster.
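The difference graph can be pictured with a small Python sketch. It is a simplification, not the implementation used by `obiconsensus`: the helper names are hypothetical, and the distance is taken to be the Hamming distance between reads of equal length, whereas the tool's notion of "difference" may also cover insertions and deletions.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions, or None when lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def difference_graph(reads, distance=1):
    """Map each unique read to the set of unique reads within `distance`.

    Assumption: only substitutions are counted (Hamming distance).
    """
    unique = sorted(set(reads))
    graph = {read: set() for read in unique}
    for a, b in combinations(unique, 2):
        d = hamming(a, b)
        if d is not None and d <= distance:
            graph[a].add(b)
            graph[b].add(a)
    return graph
```

The all-pairs comparison is quadratic in the number of unique reads, which is acceptable for a sketch but would need indexing for real datasets.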
Two denoising strategies are available:
- **Standard mode** (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
- **Clustering mode** (`--cluster`): groups reads around local abundance maxima and builds a consensus from each neighbourhood.
Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with `--out`.
The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: `sample`). Each sample's reads are denoised independently.
## INPUT FORMATS
`obiconsensus` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |
Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
## OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip
Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.
## DENOISING OPTIONS
`--distance INT`, `-d INT`
: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.
`--cluster`, `-C`
: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.
`--kmer-size SIZE`
: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of `-1` means the size is estimated automatically from the data. Manual adjustment is rarely needed.
`--no-singleton`
: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.
`--low-coverage FLOAT`
: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.
`--sample ATTRIBUTE`, `-s ATTRIBUTE`
: Name of the sequence annotation attribute that identifies the sample of origin. Default: `sample`. Each unique value of this attribute is treated as an independent sample during denoising.
## OUTPUT ANNOTATION OPTIONS
`--unique`, `-U`
: After denoising, dereplicate the output sequences (equivalent to running `obiuniq`). Identical consensus sequences across samples are merged into a single record carrying abundance information.
`--save-graph DIRECTORY`
: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.
`--save-ratio FILE`
: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.
## PERFORMANCE OPTIONS
`--max-cpu INT`
: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.
`--batch-size INT`
: Minimum number of sequences processed together in a single batch. Default: 1.
`--batch-size-max INT`
: Maximum number of sequences in a single batch. Default: 2000.
`--batch-mem STRING`
: Maximum memory allocated per batch (e.g., `128M`, `1G`). Default: `128M`. Set to `0` to disable the memory limit.
`--no-progressbar`
: Disable the progress bar.
`--no-order`
: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.
## OTHER OPTIONS
`--u-to-t`
: Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context.
`--skip-empty`
: Remove sequences of length zero from the output.
`--solexa`
: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.
`--silent-warning`
: Suppress warning messages.
`--debug`
: Enable detailed logging for troubleshooting.
`--version`
: Print the version number and exit.
`--help`, `-h`
: Display a brief help message and exit.
## OUTPUT ATTRIBUTES
Each output consensus sequence carries several annotation attributes describing how it was built:
| Attribute | Description |
|-----------|-------------|
| `consensus` | Boolean flag: `true` if the sequence is a true consensus, `false` if it was kept unchanged (e.g., isolated singleton) |
| `merged_sample` | Map of sample names to read counts contributing to this consensus |
| `count` | Total number of reads merged into this consensus across all samples |
| `kmer_size` | Size of the k-mers used to build the de Bruijn graph for this consensus |
| `seq_length` | Length of the consensus sequence |
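As an illustration of how `merged_sample`, `count`, and `seq_length` relate to one another, the following Python sketch dereplicates a list of `(sequence, sample)` pairs. It is a hypothetical helper written for this page, not the tool's code, and it ignores the `consensus` and `kmer_size` attributes, which depend on the assembly step.

```python
from collections import Counter, defaultdict


def dereplicate(records):
    """Merge identical sequences from an iterable of (sequence, sample) pairs.

    Returns {sequence: attributes}, using the attribute names from the
    table above.
    """
    per_sample = defaultdict(Counter)
    for sequence, sample in records:
        per_sample[sequence][sample] += 1
    return {
        sequence: {
            "merged_sample": dict(samples),   # sample name -> read count
            "count": sum(samples.values()),   # total over all samples
            "seq_length": len(sequence),
        }
        for sequence, samples in per_sample.items()
    }
```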
## EXAMPLES
**Basic denoising of a FASTQ file:**
```sh
obiconsensus reads.fastq > denoised.fastq
```
**Increase the allowed distance between reads to 2:**
```sh
obiconsensus --distance 2 reads.fastq > denoised.fastq
```
**Use clustering mode and remove singletons:**
```sh
obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq
```
**Denoise, then dereplicate the output:**
```sh
obiconsensus --unique reads.fastq > denoised_uniq.fastq
```
**Save denoising graphs for inspection:**
```sh
obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq
```
**Specify the sample annotation attribute:**
```sh
obiconsensus --sample library reads.fastq > denoised.fastq
```
## SEE ALSO
`obiuniq`(1), `obiclean`(1), `obigrep`(1), `obiconvert`(1)
## NOTES
`obiconsensus` was designed primarily for Oxford Nanopore Technology amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, `obiclean` may be more appropriate.
The automatic k-mer size selection (`--kmer-size -1`) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.
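The idea of raising the k-mer size until the de Bruijn graph no longer contains a cycle can be sketched as follows. This is a generic textbook construction under stated assumptions, not the strategy implemented in `obiconsensus`: the function names, the starting value of 3, and the stopping rule are all hypothetical.

```python
def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
            graph.setdefault(kmer[1:], set())
    return graph


def has_cycle(graph):
    """Depth-first search looking for a back edge (recursive, for small graphs)."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {node: WHITE for node in graph}

    def visit(node):
        state[node] = GREY
        for nxt in graph[node]:
            if state[nxt] == GREY or (state[nxt] == WHITE and visit(nxt)):
                return True
        state[node] = BLACK
        return False

    return any(state[node] == WHITE and visit(node) for node in graph)


def smallest_acyclic_k(reads, k_start=3):
    """Return the first k >= k_start giving an acyclic graph, or None."""
    k_max = min(len(read) for read in reads)
    for k in range(k_start, k_max + 1):
        if not has_cycle(de_bruijn(reads, k)):
            return k
    return None
```

A short tandem repeat such as `acacacgt` produces a cycle for small k and only becomes acyclic once the k-mer spans the repeat.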
+179
@@ -0,0 +1,179 @@
# NAME
obiconvert — conversion of sequence files to various formats
---
# SYNOPSIS
```
obiconvert [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
[--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
[--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header] [--paired-with <FILENAME>] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--raw-taxid]
[--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
[--with-leaves] [<args>]
```
---
# DESCRIPTION
obiconvert is a versatile command-line tool that converts biological sequence data between multiple standard bioinformatics formats. It enables biologists to process large datasets by reading from one format and writing to another, with support for quality scores, taxonomic annotations, and various input/output combinations. The tool is optimized for high-performance processing with configurable batching, parallel execution, and memory management.
Biologists use obiconvert to standardize sequence data for compatibility with different bioinformatics tools, to extract quality information from FASTQ files into more readable formats, or to convert between FASTA and FASTQ when working with DNA/RNA sequences that have associated quality data. The tool automatically detects the input format and selects the output format from the data (FASTQ when quality scores exist, FASTA otherwise). The output filename extension does not influence this choice: without `--fasta-output`, a FASTQ input with quality scores stays in FASTQ format even when the output file is named with a `.fasta` extension. To force a specific output format regardless of input content, use the explicit output flags (`--fasta-output`, `--fastq-output`, `--json-output`).
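The selection rule can be summarised by a small decision function. The Python sketch below is only a reading of the paragraph above: `select_output_format` is a hypothetical name, and the precedence applied when several output flags are given together is an assumption, since the source does not specify it.

```python
def select_output_format(has_quality, fasta_output=False,
                         fastq_output=False, json_output=False):
    """Pick the output format from the data and the explicit flags.

    The output file name is deliberately not a parameter: it plays no role.
    Assumed precedence when several flags are set: json, fasta, fastq.
    """
    if json_output:
        return "json"
    if fasta_output:
        return "fasta"
    if fastq_output:
        return "fastq"
    # No explicit flag: follow the data.
    return "fastq" if has_quality else "fasta"
```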
---
# INPUT
obiconvert accepts input in multiple biological sequence formats:
- **FASTA**: Standard text-based format with `>` headers and sequence data
- **FASTQ**: Text-based format carrying sequences together with per-base quality scores (default when both sequence and quality data are present)
- **GenBank**: Comprehensive biological record format with annotations
- **EMBL**: EMBL flatfile format for sequence and feature information
- **ecoPCR**: Specialized output format from ecoPCR amplification tools
- **CSV**: Tabular sequence data with configurable delimiters
Input is provided as positional arguments (file paths or `-` for stdin). The tool automatically detects the input format based on file content and can handle multiple files in sequence. When paired-end sequencing is used, the `--paired-with` option specifies the mate read file.
---
# OUTPUT
obiconvert produces sequence data in several output formats depending on input content and user selection:
- **FASTA**: Text format with sequence only (use `--fasta-output` to force)
- **FASTQ**: Format including quality scores (default when quality data present; use `--fastq-output` to force)
- **JSON**: Structured output with all sequence metadata and attributes (use `--json-output`)
The tool preserves all sequence annotations (taxonomic information, custom attributes) during conversion. When converting to FASTA/FASTQ formats, title line annotations can be formatted as OBI structured data or JSON using the `--output-OBI-header`/`--output-json-header` flags. Sequences of zero length can be suppressed with `--skip-empty`.
## Observed output example
```
>seq001 {"definition":"DNA sequence with quality scores for FASTQ to FASTA conversion"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002 {"definition":"Second sequence with moderate quality scores"}
gctagctagctagctagctagctagctagctagctagct
>seq003 {"definition":"Third sequence with high quality scores"}
ttaaccggttaaccggttaaccggttaaccggttaaccg
>seq004 {"definition":"Fourth sequence with variable quality scores"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacg
```
---
# OPTIONS
## Input Format Options
- **--fasta**: Read data following the fasta format. (default: false)
- **--fastq**: Read data following the fastq format. (default: false)
- **--genbank**: Read data following the GenBank flatfile format. (default: false)
- **--embl**: Read data following the EMBL flatfile format. (default: false)
- **--ecopcr**: Read data following the ecoPCR output format. (default: false)
- **--csv**: Read data following the CSV format. (default: false)
## Input Header Options
- **--input-OBI-header**: FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--input-json-header**: FASTA/FASTQ title line annotations follow json format. (default: false)
## Output Format Options
- **--fasta-output**: Write sequence in fasta format (default if no quality data available). (default: false)
- **--fastq-output**: Write sequence in fastq format (default if quality data available). (default: false)
- **--json-output**: Write sequence in json format. (default: false)
## Output Header Options
- **--output-OBI-header|-O**: output FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--output-json-header**: output FASTA/FASTQ title line annotations follow json format. (default: false)
## Processing Options
- **--skip-empty**: Sequences of length equal to zero are suppressed from the output (default: false)
- **--no-order**: When several input files are provided, indicates that there is no order among them. (default: false)
- **--u-to-t**: Convert Uracil to Thymine. (default: false)
- **--update-taxid**: Automatically update taxids that have been declared merged into a newer one. (default: false)
- **--raw-taxid**: When set, taxids are printed without any supplementary information (taxon name and rank). (default: false)
- **--fail-on-taxonomy**: Fail with an error if a taxid in use is not currently valid. (default: false)
- **--with-leaves**: If the taxonomy is extracted from a sequence file, sequences are added as leaves of their taxid annotation. (default: false)
## File Options
- **--out|-o <FILENAME>**: Filename used for saving the output (default: "-")
- **--paired-with <FILENAME>**: Filename containing the paired reads (default: "")
## Performance Options
- **--batch-mem <string>**: Maximum memory per batch (e.g. 128K, 64M, 1G). Set to 0 to disable. (default: 128M, env: OBIBATCHMEM)
- **--batch-size <int>**: Minimum number of sequences per batch. (default: 1, env: OBIBATCHSIZE)
- **--batch-size-max <int>**: Maximum number of sequences per batch. (default: 2000, env: OBIBATCHSIZEMAX)
- **--max-cpu <int>**: Number of parallel threads computing the result. (default: 16, env: OBIMAXCPU)
- **--compress|-Z**: Compress the output using gzip. (default: false)
## Debug Options
- **--debug**: Enable debug mode, by setting log level to debug. (default: false, env: OBIDEBUG)
- **--silent-warning**: Suppress warning messages. (default: false, env: OBIWARNING)
- **--no-progressbar**: Disable the progress bar printing (default: false)
## Profiling Options
- **--pprof**: Enable pprof server. Look at the log for details. (default: false)
- **--pprof-goroutine <int>**: Enable profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
- **--pprof-mutex <int>**: Enable profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
## Utility Options
- **--taxonomy|-t <string>**: Path to the taxonomy database. (default: "")
- **--solexa**: Decodes quality string according to the Solexa specification. (default: false, env: OBISOLEXA)
- **--help|-h|-?**: Show help message (default: false)
- **--version**: Prints the version and exits. (default: false)
---
# EXAMPLES
## Convert FASTQ to FASTA
```bash
# Convert quality-score data from FASTQ to readable FASTA format
obiconvert --fastq --fasta-output input.fastq -o output.fasta
```
**Expected output:** 4 sequences written to `output.fasta`.
## Convert FASTA to JSON
```bash
# Convert sequences to structured JSON format preserving all annotations
obiconvert --fasta --json-output input.fasta -o output.json
```
**Expected output:** 3 sequences written to `output.json`.
## Process paired-end sequencing data
```bash
# Convert paired FASTQ files preserving read pairing
obiconvert --fastq --fasta-output forward.fastq --paired-with reverse.fastq -o merged_sequences.fasta
```
**Expected output:** 4 sequences written to `merged_sequences_R1.fasta` and `merged_sequences_R2.fasta`.
---
# SEE ALSO
- obiannotate: Add taxonomic and functional annotations to sequences
- obicount: Count sequences in files
- obigrep: Filter sequences based on attributes or patterns
- obisummary: Generate statistics from sequence files
- obiuniq: Remove duplicate sequences
---
# NOTES
obiconvert selects the output format from the input data, preferring FASTQ when quality scores are available and FASTA otherwise. The output format is not determined by the output filename extension; to force a specific format, use `--fasta-output`, `--fastq-output`, or `--json-output` explicitly.
Memory usage is controlled through batch processing, with configurable memory limits per batch to handle large datasets efficiently. Progress reporting can be disabled for scripting purposes using `--no-progressbar`.
When working with taxonomic data, ensure the taxonomy database is accessible and properly formatted to avoid failures during sequence annotation processing.
+190
@@ -0,0 +1,190 @@
# NAME
obicount — counts the sequences present in a file of sequences
---
# SYNOPSIS
```
obicount [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
[--csv] [--debug] [--ecopcr] [--embl] [--fasta] [--fastq]
[--genbank] [--help|-h|-?] [--input-OBI-header]
[--input-json-header] [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--reads|-r]
[--silent-warning] [--solexa] [--symbols|-s] [--u-to-t]
[--variants|-v] [--version] [<args>]
```
---
# DESCRIPTION
obicount is a command-line tool designed to count biological sequences from various input formats. It helps biologists quickly obtain quantitative metrics about sequence collections, which is essential for quality control, data assessment, and pipeline monitoring. The tool can count reads (total sequences), variants (unique sequence strings), or symbols (sum of character lengths), providing flexibility to focus on specific aspects of sequence data depending on the analysis needs.
---
# INPUT
obicount accepts input from files or stdin, supporting multiple biological sequence formats:
- FASTA (.fasta[.gz])
- FASTQ (.fastq|.fq[.gz])
- GenBank/EMBL (.gb|.gbff|.dat[.gz])
- ecoPCR format (.ecopcr[.gz])
- CSV format (--csv flag)
Input can be provided as multiple filenames or read from stdin. The tool automatically detects file formats and parses sequences accordingly.
---
# OUTPUT
obicount outputs one or more of the following metrics, depending on the flags used:
- **Read counts**: Total number of sequences in the input
- **Variant counts**: Number of unique sequence strings (distinct sequences)
- **Symbol counts**: Sum of all character lengths across all sequences
When no specific counting flags are provided (-r, -v, -s), all three metrics are reported by default. Output is printed to stdout in CSV format with headers: `entities,n` for the type of entity counted, followed by the count value.
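The three metrics, as defined in the list above, can be computed from FASTA text with a few lines of Python. This is a sketch under stated assumptions rather than a description of the tool's internals: the helper names are hypothetical, sequences are compared as exact strings, and any abundance annotations carried in the headers are ignored.

```python
def count_fasta(text):
    """Return (variants, reads, symbols) for FASTA-formatted text.

    reads    = number of sequence records
    variants = number of distinct sequence strings
    symbols  = sum of the sequence lengths
    """
    sequences, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                sequences.append("".join(current))
            current = []
        elif line and current is not None:
            current.append(line)
    if current is not None:
        sequences.append("".join(current))
    return len(set(sequences)), len(sequences), sum(len(s) for s in sequences)


def to_csv(variants, reads, symbols):
    """Format the counts with the `entities,n` header described above."""
    return (f"entities,n\nvariants,{variants}\n"
            f"reads,{reads}\nsymbols,{symbols}\n")
```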
---
# OPTIONS
## General Options
- --help|-h|-?
Show help message and exit.
- --max-cpu <int>
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU).
- --debug
Enable debug mode, by setting log level to debug. (default: false, env: OBIDEBUG)
- --silent-warning
Suppress warning messages. (default: false, env: OBIWARNING)
## Input Format Options
- --fasta
Read data following the fasta format. (default: false)
- --fastq
Read data following the fastq format. (default: false)
- --genbank
Read data following the GenBank flatfile format. (default: false)
- --embl
Read data following the EMBL flatfile format. (default: false)
- --ecopcr
Read data following the ecoPCR output format. (default: false)
- --csv
Read data following the CSV format. (default: false)
## Input Header Options
- --input-OBI-header
FASTA/FASTQ title line annotations follow OBI format. (default: false)
- --input-json-header
FASTA/FASTQ title line annotations follow json format. (default: false)
## Counting Mode Options
- --reads|-r
Prints read counts. (default: false)
- --variants|-v
Prints variant counts. (default: false)
- --symbols|-s
Prints symbol counts. (default: false)
## Processing Options
- --u-to-t
Convert Uracil to Thymine. (default: false)
- --solexa
Decodes quality string according to the Solexa specification. (default: false, env: OBISOLEXA)
- --no-order
When several input files are provided, indicates that there is no order among them. (default: false)
## Performance Options
- --batch-mem <string>
Maximum memory per batch (e.g. 128K, 64M, 1G). Set to 0 to disable. (default: 128M, env: OBIBATCHMEM)
- --batch-size <int>
Minimum number of sequences per batch. (default: 1, env: OBIBATCHSIZE)
- --batch-size-max <int>
Maximum number of sequences per batch. (default: 2000, env: OBIBATCHSIZEMAX)
- --max-cpu <int>
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU)
## Profiling Options
- --pprof
Enable pprof server. Look at the log for details. (default: false)
- --pprof-goroutine <int>
Enable profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
- --pprof-mutex <int>
Enable profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
- --version
Prints the version and exits. (default: false)
---
# EXAMPLES
```bash
# Count total number of sequences in a FASTA file
# Useful for quick assessment of dataset size
obicount input.fasta
```
**Expected output:** 4 sequences, out_default.txt
```bash
# Count only the number of unique sequence variants
# Helpful for identifying genetic diversity in population data
obicount --variants input.fasta
```
**Expected output:** 4 sequences, out_variants.txt
```bash
# Count sum of all sequence symbol lengths (nucleotides/amino acids)
# Useful for estimating total data volume or computing average read length
obicount --symbols input.fasta
```
**Expected output:** 4 sequences, out_symbols.txt
```bash
# Count reads from FASTQ format with quality scores
# Essential for assessing read throughput in sequencing data
obicount --fastq --reads input.fastq
```
**Expected output:** 4 sequences, out_fastq_reads.txt
---
# OUTPUT
## Observed output example
```
time="2026-04-02T19:33:11+02:00" level=info msg="Number of workers set 16"
time="2026-04-02T19:33:11+02:00" level=info msg="Found 1 files to process"
time="2026-04-02T19:33:11+02:00" level=info msg="input.fasta mime type: text/fasta"
entities,n
variants,5
reads,5
symbols,435
```
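The three totals in the output above can be reproduced with a short sketch. It assumes that `variants` is the number of sequence records, `reads` is the sum of their `count` annotations (taken as 1 when absent), and `symbols` is the summed sequence length. The `Record` class and `count_summary` function are hypothetical names used only for illustration, not part of OBITools.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for one sequence record."""
    seq_id: str
    sequence: str
    count: int = 1  # assumption: a record without a count annotation is one read


def count_summary(records):
    """Return the variant, read and symbol totals for an iterable of records."""
    variants = 0
    reads = 0
    symbols = 0
    for rec in records:
        variants += 1                 # one variant per record
        reads += rec.count            # reads represented by this record
        symbols += len(rec.sequence)  # nucleotides or amino acids
    return {"variants": variants, "reads": reads, "symbols": symbols}


records = [
    Record("seq1", "acgtacgt", count=3),
    Record("seq2", "ttgg"),
    Record("seq3", "acgtac", count=2),
]
summary = count_summary(records)

# Same layout as the CSV block printed by the command
print("entities,n")
for key, value in summary.items():
    print(f"{key},{value}")
```

Under these assumptions, a file whose records carry no `count` annotation yields identical `variants` and `reads` values, as in the observed output.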
---
# SEE ALSO
- obiconvert - Convert between biological sequence file formats
- obiuniq - Remove duplicate sequences from files
---
# NOTES
_(not available)_
+315
@@ -0,0 +1,315 @@
# NAME
obicsv — converts sequence files to CSV format
---
# SYNOPSIS
```
obicsv [--auto] [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--count] [--csv] [--debug]
[--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fastq] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
[--input-json-header] [--keep|-k <KEY>]... [--max-cpu <int>]
[--na-value <NAVALUE>] [--no-order] [--no-progressbar] [--obipairing]
[--out|-o <FILENAME>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--quality|-q] [--raw-taxid] [--sequence|-s]
[--silent-warning] [--solexa] [--taxon] [--taxonomy|-t <string>]
[--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]
```
---
# DESCRIPTION
obicsv converts biological sequence data into CSV format for easy inspection, spreadsheet analysis, or integration with other tools. A biologist might use it to export sequences from OBITools for quality control, taxonomic inspection, or downstream analysis in R or Python.
Columns must be explicitly selected: use `--ids` for the identifier, `--sequence` for the nucleotide sequence, `--quality` for quality scores, `--taxon` for taxonomic information, `--auto` to auto-detect annotation attributes, or `--keep` for specific named attributes. Multiple flags can be combined freely.
The command uses parallel workers to process large datasets efficiently and can write output to stdout or directly to a file.
---
# INPUT
obicsv accepts input from files or stdin. The input format is automatically detected based on the file extension, but can be explicitly specified using format flags.
Supported input formats:
- FASTA (`--fasta`)
- FASTQ (`--fastq`)
- GenBank (`--genbank`)
- EMBL (`--embl`)
- ecoPCR output (`--ecopcr`)
- CSV (`--csv`)
Input sources:
- Local files (specified as arguments)
- stdin (when no input file is provided)
- Remote URLs (`http://`, `https://`, `ftp://`)
- Directories (automatically scanned for valid files)
Header formats:
- OBI format (`--input-OBI-header`)
- JSON format (`--input-json-header`)
- Auto-detection (default)
Taxonomy database can be provided with `--taxonomy|-t`.
---
# OUTPUT
The output is a CSV file with one row per sequence. The columns included depend on the flags used:
| Column | Flag | Description |
|--------|------|-------------|
| id | `--ids\|-i` | Sequence identifier |
| sequence | `--sequence\|-s` | DNA/RNA sequence |
| qualities | `--quality\|-q` | Quality scores (ASCII-encoded) |
| definition | `--definition\|-d` | Sequence description/annotation |
| count | `--count` | Number of reads represented by this sequence |
| taxid | `--taxon` | NCBI taxonomy identifier |
| scientific_name | `--taxon` | Taxonomic scientific name |
| custom attributes | `--keep\|-k` | Any attribute stored in sequence annotations |
If `--auto` is used, columns are automatically determined based on the attributes present in the first batch of sequences.
Missing values are written as the NA value (default: "NA").
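The column selection and NA substitution described above can be pictured with Python's standard `csv` module. This is only a sketch of the behaviour, not the tool's implementation: records are modelled as plain dictionaries, and `write_csv` is a hypothetical helper.

```python
import csv
import io


def write_csv(records, columns, na_value="NA"):
    """Write only the requested columns; missing attributes become na_value."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval=na_value,       # value written when a record lacks a column
        extrasaction="ignore",  # attributes that were not selected are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return buffer.getvalue()


records = [
    {"id": "seq001", "sequence": "atgc", "sample": "A"},
    {"id": "seq002", "sequence": "ggcc"},  # no "sample" attribute
]
csv_text = write_csv(records, ["id", "sample"])
print(csv_text, end="")
```

Passing a different `na_value` plays the role of `--na-value`.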
## Observed output example
```csv
id,sequence
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
```
---
# OPTIONS
## Output Columns
These flags control which columns appear in the CSV output.
- **`--ids|-i`**
- Default: `false`
- Meaning: Include the sequence identifier column. Useful for tracking or linking sequences.
- **`--sequence|-s`**
- Default: `false`
- Meaning: Include the nucleotide or amino acid sequence. This is the main biological data.
- **`--quality|-q`**
- Default: `false`
- Meaning: Include quality scores for each position. Essential for quality control and filtering.
- **`--definition|-d`**
- Default: `false`
- Meaning: Include the sequence description or definition from the source file.
- **`--count`**
- Default: `false`
- Meaning: Include the count attribute, representing how many original reads were collapsed into this sequence (e.g., from clustering or demultiplexing).
- **`--taxon`**
- Default: `false`
- Meaning: Include taxonomic information. Outputs both the NCBI taxid and the scientific name. Requires a taxonomy database (see `--taxonomy`).
- **`--obipairing`**
- Default: `false`
- Meaning: Include attributes that were added by the `obipairing` command (pairing scores, mismatches, etc.).
- **`--auto`**
- Default: `false`
- Meaning: Automatically detect which columns to output by examining the first batch of sequences. Outputs all annotation attributes found in the headers. Can be combined with `--ids`, `--sequence`, etc. to add those columns on top of the auto-detected ones.
- **`--keep|-k <KEY>`**
- Default: `none`
- Meaning: Keep only the specified attribute(s). Can be used multiple times to keep several columns. Useful for extracting specific annotations.
- **`--na-value <NAVALUE>`**
- Default: `"NA"`
- Meaning: String to use for missing or unavailable values in the CSV. Customize for compatibility with other tools (e.g., empty string, "NA", "null").
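As an illustration of `--auto`, the sketch below collects the attribute names found in a first batch of annotations and appends them to the explicitly requested columns. The function name `auto_columns` is hypothetical, and the alphabetical ordering of detected columns is an assumption made for determinism; the order used by the tool is not stated here.

```python
def auto_columns(first_batch, fixed_columns=()):
    """Return the fixed columns followed by every attribute seen in the batch."""
    detected = set()
    for annotations in first_batch:
        detected.update(annotations)  # add the attribute names of this record
    extra = sorted(detected - set(fixed_columns))
    return list(fixed_columns) + extra


first_batch = [
    {"sample": "A", "taxid": "9606"},
    {"sample": "B"},
]
columns = auto_columns(first_batch, fixed_columns=["id"])
print(columns)
```

An attribute that first appears after the first batch would not get a column under this model, which is worth keeping in mind for heterogeneous files.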
## Input/Output Files
- **`--out|-o <FILENAME>`**
- Default: `"-"` (stdout)
- Meaning: Write output to the specified file instead of stdout.
- **`--compress|-Z`**
- Default: `false`
- Meaning: Compress the output using gzip.
## Input Format
- **`--fasta`**, **`--fastq`**, **`--genbank`**, **`--embl`**, **`--ecopcr`**, **`--csv`**
- Default: auto-detection
- Meaning: Explicitly specify the input format.
- **`--input-OBI-header`**, **`--input-json-header`**
- Default: auto-detection
- Meaning: Specify the header format in FASTA/FASTQ files (OBI or JSON annotations).
- **`--u-to-t`**
- Default: `false`
- Meaning: Convert Uracil to Thymine. Useful for RNA sequences.
- **`--solexa`**
- Default: `false`
- Meaning: Decode quality strings according to the Solexa specification instead of Phred.
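For background on `--solexa`: in the general Solexa convention, quality characters use an ASCII offset of 64 and encode a log-odds score rather than a Phred score. The sketch below shows the standard conversion between the two scales. Whether OBITools applies this conversion or only changes the offset is not stated in this page, so treat the code as an explanation of the convention, not of the tool's internals.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa log-odds score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_solexa(quality_string, offset=64):
    """Decode a Solexa-encoded quality string into rounded Phred scores."""
    return [round(solexa_to_phred(ord(ch) - offset)) for ch in quality_string]


# '@' encodes Solexa score 0, 'h' encodes Solexa score 40
phred = decode_solexa("@h")
print(phred)
```

The two scales agree closely for high scores and diverge only for low-quality bases.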
## Taxonomy
- **`--taxonomy|-t <string>`**
- Default: `""`
- Meaning: Path to the taxonomy database directory. Required for `--taxon` output.
- **`--fail-on-taxonomy`**
- Default: `false`
- Meaning: Make OBITools fail if a used taxid is not currently valid.
- **`--update-taxid`**
- Default: `false`
- Meaning: Automatically update taxids that have been merged to their newest valid taxid.
- **`--raw-taxid`**
- Default: `false`
- Meaning: Print only taxids without supplementary information (name and rank).
- **`--with-leaves`**
- Default: `false`
- Meaning: Add sequences as leaves of their taxid annotation when taxonomy is extracted from a sequence file.
## Performance
- **`--max-cpu <int>`**
- Default: `16`
- Meaning: Number of parallel threads for processing.
- **`--batch-size <int>`**
- Default: `1`
- Meaning: Minimum number of sequences per batch.
- **`--batch-size-max <int>`**
- Default: `2000`
- Meaning: Maximum number of sequences per batch.
- **`--batch-mem <string>`**
- Default: `"128M"`
- Meaning: Maximum memory per batch (e.g., 128K, 64M, 1G).
- **`--no-order`**
- Default: `false`
- Meaning: When multiple input files are provided, indicates there is no order among them.
- **`--no-progressbar`**
- Default: `false`
- Meaning: Disable the progress bar.
## Other Options
- **`--debug`**
- Default: `false`
- Meaning: Enable debug mode by setting log level to debug.
- **`--pprof`**
- Default: `false`
- Meaning: Enable pprof server.
- **`--pprof-goroutine <int>`**
- Default: `6060`
- Meaning: Enable profiling of goroutine blocking.
- **`--pprof-mutex <int>`**
- Default: `10`
- Meaning: Enable profiling of mutex lock.
- **`--silent-warning`**
- Default: `false`
- Meaning: Suppress warning messages.
- **`--version`**
- Default: `false`
- Meaning: Print version information and exit.
- **`--help|-h|-?`**
- Default: `false`
- Meaning: Print help information.
---
# EXAMPLES
**Export sequences with identifiers to CSV**
Extracts sequence IDs and sequences from a FASTQ file.
```bash
obicsv --ids --sequence sequences.fastq -o output1.csv
```
**Expected output:** 3 sequences written to `output1.csv`.
**Export sequences with quality scores**
Useful for quality control and filtering in downstream tools.
```bash
obicsv --ids --sequence --quality sequences.fastq -o output2.csv
```
**Expected output:** 3 sequences written to `output2.csv`.
**Export with taxonomic information**
Includes taxid and scientific name for taxonomic analysis.
```bash
obicsv --ids --sequence --taxon --taxonomy /path/to/taxonomy sequences.fasta -o output.csv
```
**Auto-detect annotation columns from sequence headers**
Automatically discovers all annotation attributes present in the sequence headers and outputs them as CSV columns. Combined with `--ids` to also include the sequence identifier.
```bash
obicsv --auto --ids sequences.fasta -o output4.csv
```
**Expected output:** 3 rows in `output4.csv` with columns `id`, `sample`, `taxid` (attributes found in sequence headers).
**Extract specific attributes**
Keeps only the specified attributes as columns. Attributes not present in a sequence are written as the NA value.
```bash
obicsv --keep sample --keep taxid sequences.fasta -o output5.csv
```
**Expected output:** 3 rows in `output5.csv` with columns `taxid`, `sample`.
**Export with compression**
Writes gzip-compressed CSV output for large datasets.
```bash
obicsv --ids --sequence -Z sequences.fasta -o output6.csv.gz
```
**Expected output:** 3 sequences written to `output6.csv.gz`.
---
# SEE ALSO
- `obiconvert` — input/output handling framework
- `obipairing` — pairing information (used with `--obipairing`)
- Other export commands: `obifasta`, `obifastq`, `obijson`
---
# NOTES
- Without any column selection flag (`--ids`, `--sequence`, `--quality`, `--definition`, `--count`, `--taxon`, `--obipairing`, `--auto`, `--keep`), the output contains no columns and no data.
- The `--taxon` option requires a valid taxonomy database specified with `--taxonomy`.
- Output is written to stdout by default; use `--out` to write to a file.
- Missing attributes are written as the NA value (customizable with `--na-value`).
- Input sequences are processed using streaming iterators to minimize memory footprint, even for large files.
+321
@@ -0,0 +1,321 @@
# obidemerge
## NAME
`obidemerge` — split merged sequence records back into individual, sample-annotated copies
## SYNOPSIS
```
obidemerge [options] [input_files...]
```
## DESCRIPTION
In a typical metabarcoding workflow, `obiuniq` or similar tools collapse identical sequences
from multiple samples into a single representative record. That record carries a statistics
attribute (for example `merged_sample`) that stores, for every original sample, how many
times the sequence was observed. This compact representation is convenient for clustering
and denoising, but some downstream analyses need the original, per-sample view.
`obidemerge` reverses that merging step. For each input sequence, it reads the statistics
stored under a chosen attribute (by default `sample`) and produces one output sequence per
entry in that statistics map. Each output sequence is a copy of the original, but:
- its `sample` attribute (or whichever slot you chose) is set to the name of the individual
sample,
- its read count is set to the abundance recorded for that sample.
The original statistics attribute is removed from all output sequences.
Sequences that carry no statistics for the chosen slot are passed through unchanged.
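The expansion described above can be sketched as follows. Records are modelled as dictionaries with an `annotations` map, and the statistics are assumed to live under `merged_<slot>`, following the `merged_sample` example. The function `demerge` is a hypothetical illustration, not the tool's code.

```python
import copy


def demerge(record, slot="sample"):
    """Yield one copy of `record` per entry of its merged_<slot> statistics."""
    stats_key = f"merged_{slot}"
    stats = record["annotations"].get(stats_key)
    if not stats:
        # No statistics for this slot: pass the record through unchanged
        yield record
        return
    for sample_name, abundance in stats.items():
        clone = copy.deepcopy(record)
        del clone["annotations"][stats_key]        # drop the statistics map
        clone["annotations"][slot] = sample_name   # name of the individual sample
        clone["annotations"]["count"] = abundance  # reads seen in that sample
        yield clone


merged = {
    "id": "seq001",
    "sequence": "acgt",
    "annotations": {"count": 8, "merged_sample": {"sampleA": 5, "sampleB": 3}},
}
demerged = list(demerge(merged))
for rec in demerged:
    print(rec["id"], rec["annotations"])
```

The input record is left untouched because each output is a deep copy.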
The command reads sequences from one or more files, or from standard input when no file is
given, and writes the results to standard output or to the file specified with `--out`.
## INPUT FORMATS
`obidemerge` accepts all sequence formats supported by OBITools4:
| Format | Description |
|--------|-------------|
| FASTA | Plain nucleotide sequences with annotation in the title line |
| FASTQ | Sequences with per-base quality scores |
| EMBL | European Nucleotide Archive flat-file format |
| GenBank | NCBI GenBank flat-file format |
| ecoPCR | Output produced by the ecoPCR tool |
| CSV | Comma-separated values with sequence and metadata columns |
The format is detected automatically from the file extension or content. You can override
detection with the format flags listed under **Input format options** below.
Annotations embedded in FASTA/FASTQ title lines can follow the OBI key=value style
(`--input-OBI-header`) or JSON style (`--input-json-header`).
## OUTPUT FORMATS
By default, the output format mirrors the input:
- If the input contains quality scores, output is FASTQ.
- Otherwise, output is FASTA with JSON-style annotations (use `--output-OBI-header` for OBI-style).
You can force a specific format with `--fasta-output`, `--fastq-output`, or `--json-output`.
## OPTIONS
### Demerge option
`--demerge <slot>`, `-d <slot>`
: Name of the sequence attribute that holds the per-sample statistics to expand.
Each key in that statistics map becomes a separate output sequence.
**Default:** `sample`
### Output options
`--out <FILENAME>`, `-o <FILENAME>`
: Write output to this file instead of standard output. Use `-` for standard output.
**Default:** `-` (standard output)
`--fasta-output`
: Write output in FASTA format, even when quality scores are available.
**Default:** false
`--fastq-output`
: Write output in FASTQ format (requires quality scores in the input).
**Default:** false
`--json-output`
: Write output in JSON format, one record per line.
**Default:** false
`--output-OBI-header`, `-O`
: Write FASTA/FASTQ title lines in OBI key=value annotation style.
**Default:** false (JSON-style headers)
`--output-json-header`
: Write FASTA/FASTQ title lines in JSON annotation style.
**Default:** false
`--compress`, `-Z`
: Compress the output with gzip.
**Default:** false
`--skip-empty`
: Discard sequences of length zero from the output.
**Default:** false
### Input format options
`--fasta`
: Force reading in FASTA format.
`--fastq`
: Force reading in FASTQ format.
`--embl`
: Force reading in EMBL flat-file format.
`--genbank`
: Force reading in GenBank flat-file format.
`--ecopcr`
: Force reading in ecoPCR output format.
`--csv`
: Force reading in CSV format.
`--input-OBI-header`
: Parse FASTA/FASTQ title lines as OBI-style key=value annotations.
`--input-json-header`
: Parse FASTA/FASTQ title lines as JSON annotations.
`--solexa`
: Decode quality scores using the Solexa/Illumina 1.0 convention instead of the standard
Phred scale. Use this only for very old sequencing data.
**Default:** false
`--u-to-t`
: Convert uracil (U) to thymine (T) in all sequences. Useful when working with RNA-derived
data that should be treated as DNA.
**Default:** false
`--no-order`
: When reading from several input files, do not attempt to preserve the order of records
across files. May improve speed when order does not matter.
**Default:** false
### Taxonomy options
`--taxonomy <path>`, `-t <path>`
: Path to the OBITools4 taxonomy database. Required only if taxonomic identifiers need to
be resolved or validated during output.
**Default:** none
`--fail-on-taxonomy`
: Stop with an error if a taxonomic identifier in the data is not found in the loaded
taxonomy database.
**Default:** false
`--raw-taxid`
: Print taxonomic identifiers as plain numbers, without appending the taxon name and rank.
**Default:** false
`--update-taxid`
: Automatically replace deprecated taxonomic identifiers with their current equivalents,
as declared in the taxonomy database.
**Default:** false
`--with-leaves`
: When a taxonomy is extracted from the sequence file itself, treat each sequence as a
leaf node under its annotated taxonomic identifier.
**Default:** false
### Performance options
`--max-cpu <int>`
: Maximum number of parallel processing threads. Increase for faster processing on
multi-core machines.
**Default:** 16 (or the value of the `OBIMAXCPU` environment variable)
`--batch-size <int>`
: Minimum number of sequences processed together as a group.
**Default:** 1
`--batch-size-max <int>`
: Maximum number of sequences processed together as a group.
**Default:** 2000
`--batch-mem <size>`
: Maximum memory used per processing group (e.g. `64M`, `1G`). Set to `0` to disable the
memory limit and rely on `--batch-size-max` alone.
**Default:** `128M`
### Display options
`--no-progressbar`
: Hide the progress bar.
**Default:** false
`--silent-warning`
: Suppress warning messages.
**Default:** false
`--debug`
: Enable verbose debug logging.
**Default:** false
`--version`
: Print the OBITools4 version and exit.
`--help`, `-h`, `-?`
: Print this help message and exit.
## EXAMPLES
### Example 1 — basic demerge with an explicit slot name
After running `obiuniq`, the file `unique.fasta` contains merged sequences whose
`merged_sample` attribute records abundance per sample. Demerge back to one
sequence per sample:
```bash
obidemerge -d sample unique.fasta > per_sample_merged.fasta
```
**Expected output:** 7 sequences written to `per_sample_merged.fasta`.
### Example 2 — demerge with the default `sample` slot
If the per-sample statistics were recorded for the default slot `sample` (that is, stored
under the `merged_sample` attribute), no `-d` flag is needed:
```bash
obidemerge unique.fasta > per_sample_default.fasta
```
**Expected output:** 7 sequences written to `per_sample_default.fasta`.
### Example 3 — write compressed output to a file
```bash
obidemerge -d sample -o per_sample.fasta.gz --compress unique.fasta
```
**Expected output:** 7 sequences written (compressed) to `per_sample.fasta.gz`.
### Example 4 — pipeline use: denoise, then demerge
Collapse identical sequences, denoise them with `obiclean`, then expand the remaining
records back to individual sample records for ecological analysis:
```bash
obiuniq -m sample reads.fastq \
| obiclean ... \
| obidemerge -d sample -o demerged.fasta
```
### Example 5 — process multiple input files
```bash
obidemerge -d sample run1_unique.fasta run2_unique.fasta > combined_demerged.fasta
```
**Expected output:** 6 sequences written to `combined_demerged.fasta`.
## SEE ALSO
`obiuniq(1)` — collapses identical sequences and records per-sample counts (the inverse operation)
`obiclean(1)` — removes PCR/sequencing artefacts from a set of unique sequences
`obiannotate(1)` — adds or modifies sequence attributes
`obigrep(1)` — filters sequences by attributes or sequence content
`obicount(1)` — counts sequences and total reads in a file
## NOTES
**Relationship to `obiuniq`.**
`obiuniq --merge sample` stores per-sample counts under an attribute named `merged_sample`.
When you later call `obidemerge`, you must therefore pass `-d sample` to match that
attribute name. The `-d` option takes the **logical** slot name (here `sample`), not the
internal storage name (`merged_sample`).
**Read counts after demerging.**
Each output sequence has its read count set to the value recorded in the statistics map for
that sample. If you sum the counts of all output sequences that share the same identifier,
you recover the total count of the original merged record.
**Order of output sequences.**
The order in which the per-sample copies of a single merged sequence appear in the output
is not guaranteed. If a stable order is required, pipe the output through `obisort`.
## OUTPUT
`obidemerge` writes one sequence record per sample entry found in the statistics attribute.
Each output record is a copy of the input sequence, with:
- the statistics attribute (`merged_<slot>`) removed,
- the `<slot>` attribute set to the sample name,
- the `count` attribute set to the abundance for that sample.
Sequences with no statistics for the chosen slot are passed through unchanged.
## Observed output example
```
>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc
```
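A quick way to check demerged output is to sum the `count` values per identifier, which should give back the totals of the original merged records. The sketch below parses FASTA title lines with JSON annotations, as in the example above. The helper `total_counts` is hypothetical, and the sequences are shortened for readability.

```python
import json
from collections import defaultdict


def total_counts(fasta_text):
    """Sum the `count` annotation of every record, grouped by identifier."""
    totals = defaultdict(int)
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        seq_id, _, header_json = line[1:].partition(" ")
        annotations = json.loads(header_json) if header_json else {}
        totals[seq_id] += annotations.get("count", 1)  # assumption: default count is 1
    return dict(totals)


demerged_fasta = """\
>seq001 {"count":5,"sample":"sampleA"}
acgt
>seq001 {"count":3,"sample":"sampleB"}
acgt
>seq001 {"count":1,"sample":"sampleC"}
acgt
>seq004 {"count":6}
aaaa
"""
totals = total_counts(demerged_fasta)
print(totals)
```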
+296
@@ -0,0 +1,296 @@
# NAME
obidistribute — divides an input set of sequences into subsets
---
# SYNOPSIS
```
obidistribute --pattern|-p <string> [--append|-A] [--batch-mem <string>]
[--batch-size <int>] [--batch-size-max <int>]
[--batches|-n <int>] [--classifier|-c <string>] [--compress|-Z]
[--csv] [--debug] [--directory|-d <string>] [--ecopcr] [--embl]
[--fasta] [--fasta-output] [--fastq] [--fastq-output]
[--genbank] [--hash|-H <int>] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--na-value <string>] [--no-order]
[--no-progressbar] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>]
[--silent-warning] [--skip-empty] [--solexa] [--u-to-t]
[--version] [<args>]
```
---
# DESCRIPTION
`obidistribute` splits a set of biological sequences into multiple output files according to one of three distribution strategies: annotation-based classification, round-robin batch assignment, or hash-based sharding.
The most common use case in metabarcoding is demultiplexing: sequences carry a tag annotation (e.g., `sample_id`) and `obidistribute` writes each sample's sequences into its own file. The output filename for each group is built from a user-supplied pattern containing `%s`, which is replaced by the classifier value or batch index.
When no classifier is specified, sequences can be split into a fixed number of batches (`--batches`) for parallel downstream processing, or sharded deterministically by hash (`--hash`) to ensure reproducible partitioning regardless of input order.
Output files can be organised into subdirectories (one per classifier value) using `--directory`, and existing files can be extended rather than overwritten with `--append`. Sequences lacking the classifier annotation are assigned to a file whose name uses the NA value (default: `"NA"`).
---
# INPUT
`obidistribute` reads biological sequences from one or more files supplied as positional arguments, or from standard input when no files are given. All major NGS and flat-file formats are supported and auto-detected:
- FASTA / FASTQ (plain or gzip-compressed)
- GenBank and EMBL flat files
- ecoPCR output
- CSV
Format can be forced with `--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, or `--csv`. Header annotation style can be specified with `--input-OBI-header` or `--input-json-header`.
---
# OUTPUT
Each distribution group produces a separate output file named according to the `--pattern` template. The `%s` placeholder in the pattern is replaced by the classifier value, batch index, or hash shard index, depending on the chosen distribution mode.
Output format follows the same rules as other OBITools commands: FASTQ is used when quality scores are present, FASTA otherwise. The format can be forced with `--fasta-output`, `--fastq-output`, or `--json-output`. All annotations present in the input sequences are preserved in the output files.
When `--directory` is used together with `--classifier`, output files are placed in subdirectories named after the classifier values, allowing hierarchical organisation of results.
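The filename construction for classifier-based distribution can be pictured with the sketch below. It models records as dictionaries and groups record identifiers by the output path they would be written to. The helper names `output_path` and `dispatch_by_classifier` are hypothetical.

```python
from collections import defaultdict


def output_path(pattern, value):
    """Replace the %s placeholder of the pattern with the group label."""
    return pattern.replace("%s", str(value))


def dispatch_by_classifier(records, classifier, pattern, na_value="NA"):
    """Group record identifiers by the output file they would be written to."""
    files = defaultdict(list)
    for rec in records:
        # Records lacking the classifier annotation fall back to the NA value
        value = rec["annotations"].get(classifier, na_value)
        files[output_path(pattern, value)].append(rec["id"])
    return dict(files)


records = [
    {"id": "seq001", "annotations": {"sample_id": "sampleA"}},
    {"id": "seq002", "annotations": {"sample_id": "sampleA"}},
    {"id": "seq010", "annotations": {}},
]
groups = dispatch_by_classifier(records, "sample_id", "out_%s.fastq")
print(groups)
```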
## Observed output example
```
@seq001 {"sample_id":"sampleA"}
atcgatcgatcgatcgatcg
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample_id":"sampleA"}
gctagctagctagctagcta
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample_id":"sampleA"}
ttagctaatcggtaatcggt
+
IIIIIIIIIIIIIIIIIIII
@seq009 {"sample_id":"sampleA"}
atgatgatgatgatgatgat
+
IIIIIIIIIIIIIIIIIIII
```
---
# OPTIONS
## Distribution mode
- **`--pattern|-p <string>`** — _(required)_
Default: none.
The template used to build output filenames. The variable part is represented by `%s`. Example: `toto_%s.fastq`.
- **`--classifier|-c <string>`**
Default: `""`.
The name of an annotation tag on the sequences. Sequences are dispatched into separate files based on the value of this tag. The tag value must be a string, integer, or boolean.
- **`--batches|-n <int>`**
Default: `0`.
Splits the input into exactly *N* batches by round-robin assignment, regardless of sequence metadata.
- **`--hash|-H <int>`**
Default: `0`.
Splits the input into at most *N* batches using a hash of the sequence. Produces deterministic, reproducible sharding.
- **`--directory|-d <string>`**
Default: `""`.
Used together with `--classifier`: organises output files into subdirectories named after classifier values.
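To make the difference between `--batches` and `--hash` concrete, here is a minimal sketch of both assignment rules. It assumes round-robin assignment per record with 1-based labels and uses CRC32 as a stand-in hash with 0-based labels. The hash function used by the tool is not specified here, so shard membership will differ from real output.

```python
import zlib
from collections import Counter


def round_robin_batch(index, n_batches):
    """Batch label (1-based) for the record at 0-based position `index`."""
    return index % n_batches + 1


def hash_shard(sequence, n_shards):
    """Shard label (0-based) derived only from the sequence content."""
    # CRC32 is an illustrative choice, not the tool's actual hash
    return zlib.crc32(sequence.encode("ascii")) % n_shards


# Ten records split into three batches
batch_sizes = Counter(round_robin_batch(i, 3) for i in range(10))
print(dict(batch_sizes))

# The same sequence always lands in the same shard, whatever the input order
print(hash_shard("acgtacgt", 4) == hash_shard("acgtacgt", 4))
```

Round-robin depends on the position of a record in the input, while the hash rule depends only on the sequence. That is why hashing can leave some shards empty ("at most *N* batches") but stays reproducible when the input is reordered.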
## Output file handling
- **`--append|-A`**
Default: `false`.
Appends sequences to output files if they already exist, instead of overwriting them.
- **`--na-value <string>`**
Default: `"NA"`.
Value used as the filename component when a sequence does not have the classifier tag defined.
- **`--compress|-Z`**
Default: `false`.
Compresses all output files using gzip.
## Input format
- **`--fasta`**
Default: `false`.
Read data following the FASTA format.
- **`--fastq`**
Default: `false`.
Read data following the FASTQ format.
- **`--embl`**
Default: `false`.
Read data following the EMBL flatfile format.
- **`--genbank`**
Default: `false`.
Read data following the GenBank flatfile format.
- **`--ecopcr`**
Default: `false`.
Read data following the ecoPCR output format.
- **`--csv`**
Default: `false`.
Read data following the CSV format.
- **`--input-OBI-header`**
Default: `false`.
FASTA/FASTQ title line annotations follow OBI format.
- **`--input-json-header`**
Default: `false`.
FASTA/FASTQ title line annotations follow JSON format.
- **`--solexa`**
Default: `false`.
Decodes quality string according to the Solexa specification.
- **`--u-to-t`**
Default: `false`.
Convert Uracil to Thymine.
- **`--skip-empty`**
Default: `false`.
Sequences of length equal to zero are suppressed from the output.
- **`--no-order`**
Default: `false`.
When several input files are provided, indicates that there is no order among them.
## Output format
- **`--fasta-output`**
Default: `false`.
Write sequences in FASTA format (default if no quality data available).
- **`--fastq-output`**
Default: `false`.
Write sequences in FASTQ format (default if quality data available).
- **`--json-output`**
Default: `false`.
Write sequences in JSON format.
- **`--output-OBI-header|-O`**
Default: `false`.
Output FASTA/FASTQ title line annotations follow OBI format.
- **`--output-json-header`**
Default: `false`.
Output FASTA/FASTQ title line annotations follow JSON format.
- **`--out|-o <FILENAME>`**
Default: `"-"`.
Filename used for saving the output.
## Performance
- **`--max-cpu <int>`**
Default: `16`.
Number of parallel threads computing the result.
- **`--batch-size <int>`**
Default: `1`.
Minimum number of sequences per batch.
- **`--batch-size-max <int>`**
Default: `2000`.
Maximum number of sequences per batch.
- **`--batch-mem <string>`**
Default: `""` (128M).
Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
## Diagnostic & debug
- **`--debug`**
Default: `false`.
Enable debug mode, by setting log level to debug.
- **`--no-progressbar`**
Default: `false`.
Disable the progress bar printing.
- **`--silent-warning`**
Default: `false`.
Stop printing of warning messages.
- **`--pprof`**
Default: `false`.
Enable pprof server. Look at the log for details.
- **`--pprof-goroutine <int>`**
Default: `6060`.
Enable profiling of goroutine blocking profile.
- **`--pprof-mutex <int>`**
Default: `10`.
Enable profiling of mutex lock.
---
# EXAMPLES
```bash
# Demultiplex sequences by sample_id annotation into per-sample FASTQ files
obidistribute --classifier sample_id --pattern out_ex1_%s.fastq --no-progressbar --input-json-header reads.fastq
```
**Expected output:** 10 sequences written to 4 files: `out_ex1_sampleA.fastq` (4 sequences), `out_ex1_sampleB.fastq` (3 sequences), `out_ex1_sampleC.fastq` (2 sequences), `out_ex1_NA.fastq` (1 sequence).
```bash
# Demultiplex into subdirectories, one directory per sample
obidistribute --classifier sample_id --directory --pattern %s/reads.fastq reads.fastq
```
```bash
# Split a large dataset into 3 batches of near-equal size for parallel processing
obidistribute --batches 3 --pattern chunk_%s.fasta --fasta-output --no-progressbar sequences.fasta
```
**Expected output:** 10 sequences written to 3 files: `chunk_1.fasta` (4 sequences), `chunk_2.fasta` (3 sequences), `chunk_3.fasta` (3 sequences). Batch indices are 1-based.
```bash
# Hash-based sharding into 4 reproducible shards
obidistribute --hash 4 --pattern shard_%s.fastq --no-progressbar reads.fastq
```
**Expected output:** 10 sequences written to 4 files: `shard_0.fastq` through `shard_3.fastq`. Shard indices are 0-based.
```bash
# Append new sequences to existing per-sample files (incremental demultiplexing)
obidistribute --classifier sample_id --pattern samples_%s.fastq --append new_reads.fastq
```
```bash
# Demultiplex sequences, replacing the NA label for unclassified sequences
obidistribute --classifier sample_id --na-value unclassified --pattern out_ex6_%s.fastq --no-progressbar --input-json-header reads.fastq
```
**Expected output:** 10 sequences written to 4 files including `out_ex6_unclassified.fastq` (1 sequence without `sample_id` annotation).
---
# SEE ALSO
`obiconvert`, `obisplit`, `obigrep`
---
# NOTES
- Sequences that lack the annotation specified by `--classifier` are written to the file whose name is built using the `--na-value` (default: `"NA"`).
- The three distribution modes (`--classifier`, `--batches`, `--hash`) are mutually exclusive.
- When using `--directory` together with `--classifier`, subdirectories are created automatically if they do not exist.
- Batch indices produced by `--batches` are 1-based; hash shard indices produced by `--hash` are 0-based.
+326
@@ -0,0 +1,326 @@
# obigrep(1) — OBITools4 Manual
## NAME
`obigrep` — select a subset of sequence records on various criteria
## SYNOPSIS
```
obigrep [OPTIONS] [FILE...]
```
## DESCRIPTION
`obigrep` filters a set of biological sequence records (FASTA, FASTQ, or any other supported input format) and writes only those matching all specified criteria to the output. Its name is modelled on the Unix `grep` command, but instead of filtering lines in a text file, it filters sequence records.
Filtering criteria can be combined freely: only sequence records satisfying **all** specified conditions are retained. The selection can be inverted with `--inverse-match` to keep the records that would otherwise be discarded.
Sequences are read from one or more files, or from standard input if no file is given. Results are written to standard output or to a file specified with `--out`. Records that do not pass the filters can optionally be saved to a separate file with `--save-discarded`.
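The selection logic described above (all criteria must hold, optional inversion, optional capture of rejected records) can be sketched with plain predicates. The names `make_filter` and `split_records` are hypothetical, and records are modelled as dictionaries.

```python
def make_filter(predicates, inverse=False):
    """Build a function telling whether a record is kept."""
    def keep(record):
        matched = all(pred(record) for pred in predicates)  # logical AND
        return matched != inverse  # inversion flips the decision
    return keep


def split_records(records, predicates, inverse=False):
    """Return (kept, discarded), mirroring the main and discarded outputs."""
    keep = make_filter(predicates, inverse)
    kept, discarded = [], []
    for rec in records:
        (kept if keep(rec) else discarded).append(rec)
    return kept, discarded


records = [
    {"id": "s1", "sequence": "acgtacgtac", "count": 5},
    {"id": "s2", "sequence": "acg", "count": 9},
    {"id": "s3", "sequence": "acgtacgtacgt", "count": 1},
]
predicates = [
    lambda r: len(r["sequence"]) >= 10,  # like --min-length 10
    lambda r: r["count"] >= 2,           # like --min-count 2
]
kept, discarded = split_records(records, predicates)
print([r["id"] for r in kept], [r["id"] for r in discarded])
```

With `inverse=True` the two lists swap, which corresponds to `--inverse-match`.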
## INPUT FORMATS
`obigrep` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |
Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
## OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip
Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.
## FILTERING OPTIONS
### By sequence length
- `--min-length LENGTH` / `-l LENGTH`
Keep only sequences at least *LENGTH* bases long.
- `--max-length LENGTH` / `-L LENGTH`
Keep only sequences at most *LENGTH* bases long.
### By read abundance
Sequence records can carry a `count` attribute recording how many times the sequence was observed. The following options filter on that count:
- `--min-count COUNT` / `-c COUNT`
Keep only sequences observed at least *COUNT* times (default: 1).
- `--max-count COUNT` / `-C COUNT`
Keep only sequences observed at most *COUNT* times.
### By sequence pattern
- `--sequence PATTERN` / `-s PATTERN`
Keep records whose nucleotide sequence matches the regular expression *PATTERN* (case-insensitive). This option can be repeated; all patterns must match.
- `--approx-pattern PATTERN`
Keep records whose sequence contains an approximate match to *PATTERN*. The number of allowed differences is controlled by `--pattern-error`. This option can be repeated.
- `--pattern-error N`
Maximum number of mismatches (or indels, if `--allows-indels` is set) tolerated when using `--approx-pattern` (default: 0, i.e. exact match).
- `--allows-indels`
Allow insertions and deletions (in addition to substitutions) when performing approximate pattern matching.
- `--only-forward`
Search patterns on the forward strand only. By default both strands are searched.
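As a rough illustration of what approximate matching with IUPAC codes means, the following Python sketch counts substitutions only and searches both strands unless told otherwise. It does not model `--allows-indels`, and the function names are hypothetical; the actual `obigrep` implementation may differ.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(pattern):
    """Reverse-complement a pattern, IUPAC codes included."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def mismatches(pattern, window):
    """Count positions where the base is not allowed by the pattern code."""
    return sum(base not in IUPAC[code] for code, base in zip(pattern, window))


def approx_match(pattern, sequence, max_errors=0, only_forward=False):
    """True if some window of the sequence matches within max_errors substitutions."""
    sequence = sequence.upper()
    patterns = [pattern.upper()]
    if not only_forward:
        patterns.append(reverse_complement(pattern))
    for pat in patterns:
        for start in range(len(sequence) - len(pat) + 1):
            if mismatches(pat, sequence[start:start + len(pat)]) <= max_errors:
                return True
    return False
```

Here `max_errors` corresponds to `--pattern-error` and `only_forward` to `--only-forward`.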
### By identifier or definition
- `--identifier PATTERN` / `-I PATTERN`
Keep records whose identifier matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
- `--id-list FILENAME`
Keep only records whose identifier appears in *FILENAME*, a plain-text file with one identifier per line.
- `--definition PATTERN` / `-D PATTERN`
Keep records whose definition line matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
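A file suitable for `--id-list` can be produced with any tool that writes one identifier per line. The Python sketch below derives such a list from FASTA text, assuming the identifier is the first whitespace-delimited word after `>`; the helper names are hypothetical.

```python
def identifiers(fasta_text):
    """Return the identifier of every FASTA title line, in order."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # ignore a bare ">" line
                ids.append(fields[0])
    return ids


def id_list_text(fasta_text):
    """Format the identifiers as a one-per-line plain-text list."""
    return "".join(identifier + "\n" for identifier in identifiers(fasta_text))


example = '>seq001 {"count":15}\nacgt\n>BOLD_005 {"count":8}\ncgat\n'
listing = id_list_text(example)
```

Writing `listing` to a file gives something that could be passed as *FILENAME*.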
### By attribute (metadata)
Sequence records can carry arbitrary key/value annotations:
- `--has-attribute KEY` / `-A KEY`
Keep records that possess an attribute named *KEY*, regardless of its value. Can be repeated.
- `--attribute KEY=PATTERN` / `-a KEY=PATTERN`
Keep records for which the value of attribute *KEY* matches the regular expression *PATTERN* (case-sensitive). Can be repeated; all constraints must be satisfied.
### By custom boolean expression
- `--predicate EXPRESSION` / `-p EXPRESSION`
Keep records for which the boolean expression *EXPRESSION* evaluates to true. Attributes are accessed via the `annotations` map (e.g. `annotations["count"]`). The special variable `sequence` refers to the sequence object; its length can be obtained with `len(sequence)`. Can be repeated; all expressions must be true.
Example: `-p 'annotations["count"] >= 10 && len(sequence) < 200'`
### By taxonomy
Taxonomy-based filtering requires a taxonomy database to be provided with `--taxonomy`.
- `--taxonomy PATH` / `-t PATH`
Path to the taxonomy database.
- `--restrict-to-taxon TAXID` / `-r TAXID`
Keep only records whose taxon belongs to the lineage of *TAXID* (i.e. is *TAXID* itself or a descendant). Can be repeated; a record is kept if its taxon falls under at least one of the provided taxids.
- `--ignore-taxon TAXID` / `-i TAXID`
Discard records whose taxon belongs to the lineage of *TAXID*. Can be repeated.
- `--valid-taxid`
Keep only records that carry a valid, recognised taxonomic identifier.
- `--require-rank RANK_NAME`
Keep only records whose taxon has a defined ancestor at the given rank (e.g. *species*, *genus*, *family*). Can be repeated.
- `--update-taxid`
Automatically update merged taxids to their current valid equivalent.
- `--fail-on-taxonomy`
Exit with an error if a taxid referenced in the data is not valid.
- `--with-leaves`
When the taxonomy is extracted from a sequence file, attach each sequence as a leaf node under its annotated taxid.
- `--raw-taxid`
Print taxids in output files without supplementary information (taxon name and rank).
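The lineage test behind `--restrict-to-taxon` and `--ignore-taxon` can be pictured as a walk up the taxonomy tree. The sketch below uses a deliberately simplified toy parent map (real NCBI lineages contain more intermediate ranks) and hypothetical helper names; it only illustrates the "at least one restriction, no ignored ancestor" logic described above.

```python
# Toy child -> parent map (simplified, NOT the real NCBI tree):
# 9606 Homo sapiens, 9605 Homo, 9598 Pan troglodytes, 9596 Pan,
# 9604 Hominidae, 1 root.
TOY_PARENT = {9606: 9605, 9605: 9604, 9598: 9596, 9596: 9604, 9604: 1, 1: 1}


def lineage(taxid, parent):
    """Return the path from taxid up to the root (taxid included)."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def keep_by_taxon(taxid, parent, restrict_to=(), ignore=()):
    """Apply restrict/ignore rules to one record's taxid."""
    path = set(lineage(taxid, parent))
    if restrict_to and not path.intersection(restrict_to):
        return False  # under none of the requested taxa
    return not path.intersection(ignore)  # discarded if under an ignored taxon
```

An unknown taxid raises `KeyError` in this sketch; how the real tool reacts depends on options such as `--fail-on-taxonomy`.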
### Inversion
- `--inverse-match` / `-v`
Invert the selection: output the records that would otherwise be discarded.
## PAIRED-END OPTIONS
When paired-end sequencing data are provided (forward and reverse reads stored in two files), `obigrep` can apply filters taking both reads into account.
- `--paired-with FILENAME`
File containing the reverse (paired) reads.
- `--paired-mode MODE`
How to combine the filter result from the forward and reverse reads. *MODE* is one of:
| Mode | Meaning |
|------|---------|
| `forward` | Keep the pair if the **forward** read passes (default) |
| `reverse` | Keep the pair if the **reverse** read passes |
| `and` | Keep the pair if **both** reads pass |
| `or` | Keep the pair if **at least one** read passes |
| `andnot` | Keep the pair if the **forward** passes and the **reverse** does not |
| `xor` | Keep the pair if **exactly one** read passes |
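The table above amounts to a small truth table. A minimal Python rendering, with a hypothetical `keep_pair` helper, could look like this:

```python
# Each mode combines the filter result of the forward read (f)
# with that of the reverse read (r).
PAIRED_MODES = {
    "forward": lambda f, r: f,
    "reverse": lambda f, r: r,
    "and": lambda f, r: f and r,
    "or": lambda f, r: f or r,
    "andnot": lambda f, r: f and not r,
    "xor": lambda f, r: f != r,
}


def keep_pair(mode, forward_passes, reverse_passes):
    """Decide whether a read pair is kept under the given --paired-mode."""
    return PAIRED_MODES[mode](forward_passes, reverse_passes)
```

For instance, `keep_pair("andnot", True, False)` is true, whereas `keep_pair("andnot", True, True)` is false.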
## OUTPUT CONTROL
- `--save-discarded FILENAME`
Write sequence records that do **not** pass the filters to *FILENAME*.
- `--out FILENAME` / `-o FILENAME`
Write the selected records to *FILENAME* (default: standard output).
- `--skip-empty`
Suppress sequences of length zero from the output.
## PERFORMANCE OPTIONS
- `--max-cpu N`
Number of parallel processing threads (default: number of available CPUs).
- `--batch-size N`
Minimum number of sequences per processing batch (default: 1).
- `--batch-size-max N`
Maximum number of sequences per processing batch (default: 2000).
- `--batch-mem SIZE`
Maximum memory per batch (e.g. `128M`, `1G`). Overrides `--batch-size-max` when memory is the limiting factor. Can also be set via the environment variable `OBIBATCHMEM`.
- `--no-order`
When multiple input files are provided, indicates that no ordering is assumed between them, which can improve throughput.
- `--no-progressbar`
Disable the progress bar.
## MISCELLANEOUS OPTIONS
- `--u-to-t`
Convert uracil (U) to thymine (T) in all sequences (useful for RNA data).
- `--solexa`
Decode quality scores according to the legacy Solexa specification instead of the standard Phred encoding.
- `--silent-warning`
Suppress warning messages.
- `--debug`
Enable verbose debug logging.
- `--version`
Print version information and exit.
- `--help` / `-h` / `-?`
Display the help message and exit.
## EXAMPLES
Keep all sequences at least 100 bases long:
```bash
obigrep --min-length 100 input.fasta > out_min_length.fasta
```
**Expected output:** 6 sequences written to `out_min_length.fasta`.
Select sequences observed at least 10 times:
```bash
obigrep --min-count 10 input.fasta > out_min_count.fasta
```
**Expected output:** 4 sequences written to `out_min_count.fasta`.
Keep sequences whose identifier starts with `BOLD`:
```bash
obigrep --identifier '^BOLD' input.fasta > out_bold.fasta
```
**Expected output:** 2 sequences written to `out_bold.fasta`.
Select only sequences carrying the IUPAC primer motif `GGGCWATGTTTCATAAYGGG` with up to 2 mismatches:
```bash
obigrep --approx-pattern GGGCWATGTTTCATAAYGGG --pattern-error 2 input.fasta > out_primer.fasta
```
**Expected output:** 2 sequences written to `out_primer.fasta`.
Retain sequences belonging to the genus *Homo* (taxid 9605) in an NCBI taxonomy:
```bash
obigrep --taxonomy /data/ncbi_tax --restrict-to-taxon 9605 input.fasta
```
Keep sequences that have a `sample` attribute equal to `lake1` and save the rest to a separate file:
```bash
obigrep --attribute sample='^lake1$' --save-discarded discarded.fasta \
input.fasta > lake1.fasta
```
**Expected output:** 5 sequences written to `lake1.fasta`, 5 sequences written to `discarded.fasta`.
Invert a length filter (keep only sequences shorter than 50 bases):
```bash
obigrep --min-length 50 --inverse-match input.fasta > out_short.fasta
```
**Expected output:** 1 sequence written to `out_short.fasta`.
Apply a custom predicate (sequences with count ≥ 5):
```bash
obigrep -p 'annotations["count"] >= 5' input.fasta > out_predicate.fasta
```
**Expected output:** 6 sequences written to `out_predicate.fasta`.
## OUTPUT
### Attribute table
Attributes present on sequence records are preserved unchanged in the output. No new attributes are added by `obigrep` itself — only filtering occurs.
| Attribute | Type | Description |
|-----------|------|-------------|
| `count` | integer | Number of times the sequence was observed (read from input) |
| `sample` | string | Sample identifier (read from input) |
Any other annotations present in the input are carried through to the output unmodified.
### Observed output example
```
>seq001 {"count":15,"sample":"lake1"}
acgtacgtacgtacgtacgtgggcaatgtttcataatgggacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":3,"sample":"lake1"}
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
>seq004 {"count":2,"sample":"lake1"}
aaacccgggtttagctagctagctagctagctagctagctagctagctagctagctagct
agctagctagctagctagctagctagctagctagctagctagctagctagctagctagct
atacgtatcgatcg
>BOLD_005 {"count":8,"sample":"pond1"}
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq008 {"count":7,"sample":"river2"}
ttacgatcgatcgatcgatcgggcaatgtttcataaggggacgatcgatcgatcgatcga
tcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
```
## SEE ALSO
`obiannotate`(1), `obiuniq`(1), `obiconvert`(1), `obitag`(1), `obisplit`(1)
## OBITools4
`obigrep` is part of the **OBITools4** suite for analysing DNA metabarcoding and environmental DNA data.
+257
View File
@@ -0,0 +1,257 @@
# NAME
obijoin — merge annotations contained in a file to another file
---
# SYNOPSIS
```
obijoin --join-with|-j <string> [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--by|-b <string>]... [--compress|-Z]
[--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-id|-i]
[--update-quality|-q] [--update-sequence|-s] [--update-taxid]
[--version] [--with-leaves] [<args>]
```
---
# DESCRIPTION
`obijoin` merges annotations from a secondary file into a primary sequence dataset. For each sequence in the primary input, it looks up matching records in the secondary file based on one or more shared attribute keys, then copies all annotations from matched partner records onto the primary sequence.
The join is a **left outer join**: every sequence in the primary dataset is preserved in the output, whether or not a match is found in the secondary file. Unmatched sequences simply receive no additional annotations. Key matching is exact string equality.
A common use case is enriching amplicon or read sequences with external sample metadata. The secondary file (the *annotation source*) can be a FASTA/FASTQ sequence file, a CSV table, an EMBL or GenBank flat file, or any other format accepted by OBITools4. A simple spreadsheet with sample identifiers and metadata columns can therefore be saved as CSV and merged directly into a sequence dataset: the CSV format is auto-detected, so no format conversion or extra flag is needed.
In addition to transferring annotations, `obijoin` can optionally replace the sequence identifier, nucleotide sequence, or quality scores of each primary sequence with values from its matched partner, controlled by the `--update-id`, `--update-sequence`, and `--update-quality` flags.
---
# INPUT
`obijoin` accepts a primary sequence dataset on standard input or as one or more file arguments. The supported formats are automatically detected and include FASTA, FASTQ, EMBL, GenBank, ecoPCR output, CSV, and JSON. Format-specific flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) can force a specific parser when auto-detection is ambiguous.
The secondary file, supplied via `--join-with`, is loaded entirely into memory before processing begins. It supports the same set of formats, including CSV, and its format is auto-detected.
When multiple primary input files are provided and their ordering across files is irrelevant, `--no-order` allows the reader to return batches in whichever order they complete, improving throughput.
---
# OUTPUT
The output is a sequence file in FASTA or FASTQ format (determined automatically by the presence of quality data), written to standard output or to the file specified by `--out`. Alternative output formats can be requested with `--fasta-output`, `--fastq-output`, or `--json-output`. The output can be gzip-compressed with `--compress`.
Each output sequence carries all annotations from the primary dataset, enriched with every annotation attribute copied from the matched partner record. If a field name exists in both, the partner value overwrites the primary value. When `--update-id`, `--update-sequence`, or `--update-quality` are set, the corresponding sequence-level fields are also replaced with the partner's values.
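The join semantics described above (left outer join, exact string keys, partner values overwriting primary values) can be sketched as follows. Records are modelled as plain annotation dictionaries and all helper names are hypothetical. What happens when several partner records share a key is an assumption here (later partners overwrite earlier ones), as is skipping records that lack a key attribute.

```python
def parse_by(specs):
    """Turn ['sample=well', 'barcode'] into [(primary_key, secondary_key), ...]."""
    pairs = []
    for spec in specs:
        primary_key, _, secondary_key = spec.partition("=")
        pairs.append((primary_key, secondary_key or primary_key))
    return pairs


def key_of(record, names):
    """Build the join key as a tuple of strings, or None if an attribute is missing."""
    values = []
    for name in names:
        if name not in record:
            return None
        values.append(str(record[name]))
    return tuple(values)


def left_join(primary, secondary, by):
    """Copy annotations of matching secondary records onto every primary record."""
    pairs = parse_by(by)
    index = {}
    for partner in secondary:  # the secondary file is indexed in memory first
        key = key_of(partner, [s for _, s in pairs])
        if key is not None:
            index.setdefault(key, []).append(partner)
    joined = []
    for record in primary:
        merged = dict(record)
        key = key_of(record, [p for p, _ in pairs])
        for partner in index.get(key, []):
            merged.update(partner)  # partner values win on name clashes
        joined.append(merged)  # unmatched records are still emitted
    return joined
```

With a primary record `{"sample": "S1", "barcode": "ATGC"}` and a metadata row `{"sample": "S1", "location": "Paris"}`, `left_join(..., by=["sample"])` yields a record carrying `barcode`, `sample`, and `location`, while a record with `sample` equal to `S3` would come out unchanged.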
## Observed output example
```
>seq001 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
atgcatgcatgcatgcatgc
>seq002 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
>seq004 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
aaaaatttttcccccggggg
>seq005 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gggggaaaaatttttccccc
>seq006 {"barcode":"AAAA","sample":"S4"}
ccccccgggggtttttaaaaa
```
---
# OPTIONS
## Required
`--join-with|-j <string>`
: Path to the secondary file whose records are joined onto the primary sequences. This parameter is mandatory. The file can be in any format accepted by OBITools4 (FASTA, FASTQ, CSV, EMBL, GenBank, ecoPCR); the format is auto-detected. Default: none.
## Join control
`--by|-b <string>`
: Declares a join key as an attribute name or a `primary_attr=secondary_attr` mapping. Repeat the flag to join on multiple keys simultaneously; all keys must match for a record pair to be considered a hit (intersection semantics). When omitted, the join defaults to matching by sequence identifier (`id`). Default: `[]`.
`--update-id|-i`
: Replace the identifier of each primary sequence with the identifier from its matched partner record. Default: `false`.
`--update-sequence|-s`
: Replace the nucleotide or amino acid sequence of each primary sequence with the sequence from its matched partner. Default: `false`.
`--update-quality|-q`
: Replace the per-base quality scores of each primary sequence with the quality scores from its matched partner. Relevant only when both datasets carry quality information (FASTQ). Default: `false`.
## Input format
`--csv`
: Read the primary input data in OBITools CSV format (e.g., sequences exported by `obicsv`). This flag applies to the primary input only; secondary files supplied via `--join-with` are always auto-detected. Default: `false`.
`--ecopcr`
: Read data following the ecoPCR output format. Default: `false`.
`--embl`
: Read data following the EMBL flatfile format. Default: `false`.
`--fasta`
: Read data following the FASTA format. Default: `false`.
`--fastq`
: Read data following the FASTQ format. Default: `false`.
`--genbank`
: Read data following the GenBank flatfile format. Default: `false`.
`--input-OBI-header`
: Treat FASTA/FASTQ title line annotations as OBI format. Default: `false`.
`--input-json-header`
: Treat FASTA/FASTQ title line annotations as JSON format. Default: `false`.
`--solexa`
: Decode the quality string according to the Solexa specification. Default: `false`.
`--u-to-t`
: Convert uracil (U) to thymine (T) in input sequences. Default: `false`.
`--skip-empty`
: Suppress sequences of length zero from the output. Default: `false`.
`--no-order`
: When several input files are provided, indicates that there is no order among them. Default: `false`.
## Output format
`--out|-o <FILENAME>`
: Filename used for saving the output. Default: `-` (standard output).
`--fasta-output`
: Write sequences in FASTA format (default when no quality data are available). Default: `false`.
`--fastq-output`
: Write sequences in FASTQ format (default when quality data are available). Default: `false`.
`--json-output`
: Write sequences in JSON format. Default: `false`.
`--output-OBI-header|-O`
: Output FASTA/FASTQ title line annotations in OBI format. Default: `false`.
`--output-json-header`
: Output FASTA/FASTQ title line annotations in JSON format. Default: `false`.
`--compress|-Z`
: Compress the output using gzip. Default: `false`.
## Taxonomy
`--taxonomy|-t <string>`
: Path to the taxonomy database. Default: `""`.
`--fail-on-taxonomy`
: Cause `obijoin` to fail with an error if a taxid encountered is not currently valid. Default: `false`.
`--raw-taxid`
: Print taxids in files without supplementary information (taxon name and rank). Default: `false`.
`--update-taxid`
: Automatically update taxids that are declared as merged to a newer one. Default: `false`.
`--with-leaves`
: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation. Default: `false`.
## Performance
`--max-cpu <int>`
: Number of parallel threads used to compute the result. Default: `16`.
`--batch-size <int>`
: Minimum number of sequences per processing batch. Default: `1`.
`--batch-size-max <int>`
: Maximum number of sequences per processing batch. Default: `2000`.
`--batch-mem <string>`
: Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable. Default: `128M`.
## Diagnostics
`--no-progressbar`
: Disable the progress bar. Default: `false`.
`--silent-warning`
: Stop printing warning messages. Default: `false`.
`--debug`
: Enable debug mode by setting the log level to debug. Default: `false`.
---
# EXAMPLES
```bash
# Annotate amplicon sequences with sample metadata from a CSV table,
# matching on the sample attribute. CSV format is auto-detected.
obijoin --join-with metadata.csv --by sample input.fasta > out_basic.fasta
```
**Expected output:** 6 sequences written to `out_basic.fasta`.
```bash
# Join using a cross-attribute key: primary sequences have a 'sample' attribute,
# while the annotation CSV uses 'well' for the same identifier.
obijoin --join-with well_metadata.csv --by sample=well input.fasta > out_crosskey.fasta
```
**Expected output:** 6 sequences written to `out_crosskey.fasta`.
```bash
# Join on two keys simultaneously: match only when both sample and barcode agree,
# then update sequence identifiers with those from the reference file.
obijoin --join-with references.fasta \
--by sample --by barcode \
--update-id \
input.fasta > out_multikey.fasta
```
**Expected output:** 6 sequences written to `out_multikey.fasta`.
```bash
# Replace sequences and quality scores of reads with values from a corrected FASTQ file,
# joining by sequence ID (default when no --by is specified).
obijoin --join-with corrected.fastq \
--update-sequence --update-quality \
input.fastq > out_updated.fastq
```
**Expected output:** 3 sequences written to `out_updated.fastq`.
```bash
# Use an OBITools CSV file as primary input (--csv flag), join with a metadata CSV,
# then write compressed FASTA output without showing the progress bar.
obijoin --join-with metadata.csv --by sample \
--csv --fasta-output --compress \
--no-progressbar \
--out out_compressed.fasta.gz \
primary.csv
```
**Expected output:** 3 sequences written to `out_compressed.fasta.gz`.
---
# NOTES
- The secondary file supplied via `--join-with` is loaded entirely into memory before the join begins. For very large secondary files this may require significant RAM.
- Key matching is based on exact string equality; no regular expression or fuzzy matching is applied.
- The join is a left outer join: primary sequences without a matching partner in the secondary file are still emitted, unchanged, in the output.
- When the annotation source is a plain CSV spreadsheet (columns = attributes, rows = records), the format is auto-detected — no `--csv` flag is needed. The `--csv` flag applies exclusively to the primary input and is intended for sequences stored in OBITools CSV format.
+205
View File
@@ -0,0 +1,205 @@
# NAME
obimicrosat — looks for microsatellite sequences in a sequence file
---
# SYNOPSIS
```
obimicrosat [options] [<filename>...]
```
---
# DESCRIPTION
`obimicrosat` scans DNA sequences for simple sequence repeats (SSRs), also called
microsatellites: tandem repetitions of a short motif (1–6 bp by default). For each
sequence containing a qualifying repeat, the command annotates it with the location,
unit sequence, repeat count, and flanking regions, then writes it to output. Sequences
with no detected microsatellite are silently discarded.
The detection works in two passes. A first regular expression finds any tandem repeat
satisfying the unit-length and repeat-count constraints. The true minimal repeat unit
is then determined, and a second scan refines the exact boundaries. The repeat unit is
normalized to the lexicographically smallest string among all rotations of the unit
and of its reverse complement, which allows equivalent loci to be grouped consistently
across samples.
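The normalization step can be illustrated with a short Python sketch. It assumes lower-case `acgt` units and resolves ties in favour of the direct strand, which is an assumption rather than documented behaviour; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(unit):
    """Reverse-complement a lower-case DNA string."""
    return unit.lower().translate(_COMPLEMENT)[::-1]


def rotations(unit):
    """All cyclic rotations of the unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def normalize_unit(unit):
    """Return (canonical_unit, orientation) for a repeat unit.

    The canonical unit is the smallest string among the rotations of the
    unit and the rotations of its reverse complement.
    """
    unit = unit.lower()
    direct = min(rotations(unit))
    reverse = min(rotations(reverse_complement(unit)))
    if direct <= reverse:  # tie-break towards "direct" is an assumption
        return direct, "direct"
    return reverse, "reverse"
```

With this sketch, `normalize_unit("gt")` gives `("ac", "reverse")` and `normalize_unit("ca")` gives `("ac", "direct")`, so AC, CA, GT, and TG repeats all end up under the same canonical unit.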
By default, when the canonical form of a unit requires the reverse complement, the
whole sequence is reoriented so that the microsatellite is always reported on the
direct strand of the normalized unit. This behaviour can be disabled with
`--not-reoriented`.
A common use case is identifying polymorphic SSR markers for population genetics, or
flagging repeat-rich regions before designing PCR primers.
---
# INPUT
Accepts one or more sequence files on the command line. If no file is given, sequences
are read from standard input. Supported formats include FASTA, FASTQ, JSON/OBI, GenBank,
EMBL, ecoPCR output, and CSV. Compressed files (gzip) are handled transparently.
Format is detected automatically unless overridden by input flags.
---
# OUTPUT
Outputs only the sequences in which a microsatellite was found. Each retained sequence
carries the following additional attributes:
| Attribute | Content |
|---|---|
| `microsat` | Full repeat region as a string |
| `microsat_from` | 1-based start position of the repeat |
| `microsat_to` | End position of the repeat (inclusive) |
| `microsat_unit` | Repeat unit as observed in the sequence |
| `microsat_unit_normalized` | Lexicographically smallest canonical form |
| `microsat_unit_orientation` | `direct` or `reverse` |
| `microsat_unit_length` | Length of the repeat unit (bp) |
| `microsat_unit_count` | Number of complete unit repetitions |
| `seq_length` | Total length of the (possibly reoriented) sequence |
| `microsat_left` | Flanking sequence to the left of the repeat |
| `microsat_right` | Flanking sequence to the right of the repeat |
When a sequence is reoriented (reverse-complemented), `_cmp` is appended to its
identifier.
The output format follows the same rules as the rest of OBITools4: FASTQ when quality
scores are present, FASTA or JSON/OBI otherwise, configurable via output flags.
## Observed output example
```
>seq001 {"definition":"dinucleotide AC repeat 16x with 40bp non-repetitive flanks","microsat":"acacacacacacacacacacacacacacacac","microsat_from":40,"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg","microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca","microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"direct","seq_length":109}
agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca
cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca
>seq006_cmp {"definition":"GT repeat 16x with 40bp non-repetitive flanks canonical form is AC","microsat":"acacacacacacacacacacacacacacacac","microsat_from":39,"microsat_left":"tggtaacgatctatgccgcaagacttgcatgactatcg","microsat_right":"cgtaagctagacttgccctgaaggcatgcaagttcgact","microsat_to":70,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"reverse","seq_length":109}
tggtaacgatctatgccgcaagacttgcatgactatcgacacacacacacacacacacac
acacacacaccgtaagctagacttgccctgaaggcatgcaagttcgact
```
---
# OPTIONS
## Microsatellite detection
**`--min-unit-length` / `-m`**
- Default: `1`
- Minimum length in base pairs of the repeated motif. Set to `2` to exclude
mononucleotide repeats, `3` to also exclude dinucleotide repeats, etc.
**`--max-unit-length` / `-M`**
- Default: `6`
- Maximum length in base pairs of the repeated motif. Increasing this value detects
longer repeat units (minisatellites) at the cost of more complex patterns.
**`--min-unit-count`**
- Default: `5`
- Minimum number of times the motif must be repeated. A value of `5` with a 2 bp unit
requires at least 10 bp of pure repeat.
**`--min-length` / `-l`**
- Default: `20`
- Minimum total length (in bp) of the repeat region. This filter applies after the
unit-count filter and is useful to exclude very short but technically qualifying
repeats.
**`--min-flank-length` / `-f`**
- Default: `0`
- Minimum length of the flanking sequence on each side of the repeat. Sequences with
flanks shorter than this threshold are discarded, which is useful when the output
will feed a primer-design step.
**`--not-reoriented` / `-n`**
- Default: `false` (reorientation is active by default)
- When set, sequences are never reverse-complemented to match the canonical orientation
of the repeat unit. The microsatellite is reported as found, in its original
orientation.
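To see how the detection parameters interact, here is a Python sketch of a two-pass search in the spirit of the description: a back-referencing regular expression finds the first tandem repeat, the minimal unit is extracted, and a second pattern refines the boundaries. The exact expressions used by `obimicrosat` are not shown in this manual, so the regexes, the rejection of units shorter than `min_unit`, and the function names are assumptions. Normalization and reorientation are left out.

```python
import re


def minimal_unit(unit):
    """Return the shortest prefix whose repetition rebuilds `unit`."""
    for size in range(1, len(unit) + 1):
        if len(unit) % size == 0 and unit[:size] * (len(unit) // size) == unit:
            return unit[:size]


def find_microsat(sequence, min_unit=1, max_unit=6, min_count=5,
                  min_length=20, min_flank=0):
    """Return attributes of the first qualifying repeat, or None."""
    seq = sequence.lower()
    # Pass 1: any unit of allowed length repeated at least min_count times.
    first_pass = re.compile(
        r"([acgt]{%d,%d})\1{%d,}" % (min_unit, max_unit, min_count - 1))
    hit = first_pass.search(seq)
    if hit is None:
        return None
    unit = minimal_unit(hit.group(1))
    if len(unit) < min_unit:
        return None  # assumption: a shorter true unit disqualifies the hit
    # Pass 2: rescan with the minimal unit to get the exact boundaries.
    refined = re.compile(r"(?:%s){%d,}" % (unit, min_count)).search(seq, hit.start())
    start, end = refined.span()
    if end - start < min_length:
        return None
    if start < min_flank or len(seq) - end < min_flank:
        return None
    return {
        "microsat": seq[start:end],
        "microsat_from": start + 1,  # 1-based start
        "microsat_to": end,          # inclusive end
        "microsat_unit": unit,
        "microsat_unit_length": len(unit),
        "microsat_unit_count": (end - start) // len(unit),
        "microsat_left": seq[:start],
        "microsat_right": seq[end:],
    }
```

Applied to the `seq001` record of the output example above, this sketch reports the same `microsat_from`, `microsat_to`, unit, and unit count as shown there.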
## Input / output
Inherited from the standard OBITools4 conversion layer. Common flags include:
**`--input-OBI-header`** — parse OBI-style FASTA/FASTQ headers.
**`--input-json-header`** — parse JSON-encoded headers.
**`--skip-empty`** — skip sequences with no nucleotides.
**`--u-to-t`** — convert U to T (RNA → DNA).
**`--output-json-header`** — write JSON-encoded headers.
**`--output-OBI-header`** / **`-O`** — write OBI-style headers.
**`--compress`** / **`-Z`** — compress output with gzip.
**`--max-cpu`** — number of parallel processing threads.
---
# EXAMPLES
```bash
# Detect default microsatellites (unit 1–6 bp, ≥5 repeats, ≥20 bp total)
obimicrosat sequences.fasta > out_default.fasta
```
**Expected output:** 6 sequences written to `out_default.fasta`.
```bash
# Restrict to di- and trinucleotide repeats only
obimicrosat -m 2 -M 3 sequences.fasta > out_dinucleotide.fasta
```
**Expected output:** 4 sequences written to `out_dinucleotide.fasta`
(mononucleotide and tetranucleotide repeats excluded).
```bash
# Require at least 30 bp flanking sequence on each side (for primer design)
obimicrosat -f 30 sequences.fasta > out_primer_ready.fasta
```
**Expected output:** 3 sequences written to `out_primer_ready.fasta`
(sequences with flanks shorter than 30 bp are discarded).
```bash
# Keep sequences in their original orientation (no reverse-complement)
obimicrosat --not-reoriented sequences.fasta > out_no_reorient.fasta
```
**Expected output:** 6 sequences written to `out_no_reorient.fasta`
(GT-repeat sequence kept as-is without `_cmp` suffix; `microsat_unit_orientation` is `reverse`).
```bash
# Require at least 8 repeat units and a minimum repeat length of 30 bp
obimicrosat --min-unit-count 8 -l 30 sequences.fasta > out_strict.fasta
```
**Expected output:** 4 sequences written to `out_strict.fasta`
(short or low-count repeats excluded).
---
# SEE ALSO
`obigrep` — filter sequences by annotation after microsatellite detection.
`obiannotate` — add or modify sequence annotations.
`obiconvert` — format conversion for sequence files.
---
# NOTES
- Only sequences with at least one qualifying microsatellite are written to output;
all others are silently filtered out.
- The normalization algorithm considers all rotations of the unit and their reverse
complements, selecting the lexicographically smallest string. This ensures consistent
grouping of loci regardless of which strand was sequenced.
- When reorientation is active (the default), sequences whose canonical unit falls on
the reverse strand are reverse-complemented and their ID receives the suffix `_cmp`.
Coordinate attributes (`microsat_from`, `microsat_to`) always refer to the
(possibly reoriented) output sequence.
- Repetitive low-complexity sequences may match multiple overlapping patterns; only the
first match is reported per sequence.
- Flanking sequences must be **non-repetitive** to avoid the tool detecting a tandem
repeat within the flank instead of the intended SSR. When designing synthetic test
data, ensure flanking regions do not contain tandem repeat motifs of their own.
+384
View File
@@ -0,0 +1,384 @@
# NAME
obiscript — executes a Lua script on the input sequences
---
# SYNOPSIS
```
obiscript [--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-mem <string>]
[--batch-size <int>] [--batch-size-max <int>] [--compress|-Z]
[--csv] [--debug] [--definition|-D <PATTERN>]... [--ecopcr]
[--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
[--fastq-output] [--genbank] [--has-attribute|-A <KEY>]...
[--help|-h|-?] [--id-list <FILENAME>]
[--identifier|-I <PATTERN>]... [--ignore-taxon|-i <TAXID>]...
[--input-OBI-header] [--input-json-header] [--inverse-match|-v]
[--json-output] [--max-count|-C <COUNT>] [--max-cpu <int>]
[--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
[--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
[--only-forward] [--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--pattern-error <int>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--predicate|-p <EXPRESSION>]...
[--raw-taxid] [--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--script|-S <string>]
[--sequence|-s <PATTERN>]... [--silent-warning] [--skip-empty]
[--solexa] [--taxonomy|-t <string>] [--template] [--u-to-t]
[--update-taxid] [--valid-taxid] [--version] [--with-leaves]
[<args>]
```
---
# DESCRIPTION
`obiscript` applies a user-provided Lua script to a stream of biological sequences. For each input sequence record, the script's `worker(sequence)` function is called, giving the user full programmatic access to the sequence's identifier, data, quality scores, and metadata attributes. This makes it possible to implement custom annotation logic, computed filters, or record transformations that go beyond what fixed-function OBITools commands offer.
The Lua script may also define two optional lifecycle hooks: `begin()`, called once before any sequence is processed (useful for initialising counters or opening files), and `finish()`, called after the last sequence (useful for printing summary statistics or flushing output). A thread-safe shared table `obicontext` is available across all workers, allowing aggregation across parallel executions.
Sequences are read from files or standard input in any format supported by OBITools4 (FASTA, FASTQ, EMBL, GenBank, ecoPCR, CSV). The sequence filtering flags (such as `--min-length`, `--predicate`, etc.) select which sequences the Lua script is applied to; sequences that do not match the filter pass through to the output unchanged without the script being executed on them. All sequences, scripted or not, are written to the output.
To get started, use `--template` to print a minimal Lua script skeleton with stubs for all three hooks and inline documentation.
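
The control flow described above can be modelled in a few lines. The following Python sketch is only an illustration of the documented behaviour, not the OBITools4 implementation: `run_script`, `selected`, and the dictionary-based records are hypothetical names, and a Python `None` stands in for a Lua `nil`.

```python
from typing import Callable, Iterable, Optional

Record = dict  # hypothetical stand-in for a sequence record


def run_script(records: Iterable[Record],
               selected: Callable[[Record], bool],
               begin: Callable[[], None],
               worker: Callable[[Record], Optional[Record]],
               finish: Callable[[], None]) -> list:
    """Call begin once, worker on each selected record, then finish once."""
    output = []
    begin()
    for record in records:
        if not selected(record):
            output.append(record)   # non-matching records pass through unchanged
            continue
        result = worker(record)
        if result is not None:      # a worker returning None drops the record
            output.append(result)
    finish()
    return output


events = []
records = [{"id": "s1", "sequence": "acgtacgt"}, {"id": "s2", "sequence": "acg"}]


def annotate(record: Record) -> Record:
    """Toy worker: add the sequence length as a new attribute."""
    return {**record, "length": len(record["sequence"])}


result = run_script(records,
                    selected=lambda r: len(r["sequence"]) >= 5,  # plays the role of --min-length 5
                    begin=lambda: events.append("begin"),
                    worker=annotate,
                    finish=lambda: events.append("finish"))
print(result)
```

In this model the short record `s2` reaches the output without a `length` attribute, because the filter excluded it from scripting rather than from the output.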
---
# INPUT
`obiscript` reads biological sequences from one or more files supplied as positional arguments, or from standard input if no files are given. All formats supported by OBITools4 are accepted: FASTA, FASTQ, EMBL flatfile, GenBank flatfile, ecoPCR output, and CSV. Format auto-detection is used by default; explicit format flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) override it. Header annotation style can be forced with `--input-OBI-header` or `--input-json-header`.
---
# OUTPUT
Output sequences are written to standard output, or to the file given by `--out`. Any modifications made to sequence records inside `worker()` (identifier, sequence, attributes) are reflected in the output. The output format defaults to FASTA when no quality data are present and to FASTQ otherwise; use `--fasta-output`, `--fastq-output`, or `--json-output` to override. Header annotation style in FASTA/FASTQ output can be set with `--output-OBI-header` or `--output-json-header`. Output can be gzip-compressed with `--compress`.
## Observed output example
```
>sample1_seq001 {"definition":"control sequence for annotation test","sample":"sample1"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>sample1_seq002 {"definition":"another control sequence from sample1","sample":"sample1"}
gctagctagctagctagctagctagctagctagctagctagctagcta
>sample2_seq003 {"definition":"second sample sequence","sample":"sample2"}
ttaattaattaattaattaattaattaattaattaattaattaattaa
>sample2_seq004 {"definition":"second sample another sequence","sample":"sample2"}
ccggccggccggccggccggccggccggccggccggccggccggccgg
>sample3_seq005 {"definition":"third sample first sequence","sample":"sample3"}
aaaattttccccggggaaaattttccccggggaaaattttccccgggg
>sample3_seq006 {"definition":"third sample second sequence","sample":"sample3"}
ttttaaaaccccggggttttaaaaccccggggttttaaaaccccgggg
```
---
# OPTIONS
## Script
### `--script|-S <string>`
- Default: `""`
- Path to the Lua script file to execute. The file must exist and be syntactically valid Lua. The script should define a `worker(sequence)` function, and optionally `begin()` and `finish()`.
### `--template`
- Default: `false`
- Print a minimal Lua script template to standard output, with stubs for `begin()`, `worker()`, and `finish()` and inline documentation, then exit. Use this to bootstrap a new script.
## Sequence filtering (selects sequences on which the script is applied; non-matching sequences pass through unchanged)
### `--predicate|-p <EXPRESSION>`
- Default: `[]`
- Boolean expression evaluated for each sequence record. Attribute keys are accessible as variable names; `sequence` refers to the record itself. Multiple `-p` options are combined with AND.
### `--sequence|-s <PATTERN>`
- Default: `[]`
- Regular expression matched against the nucleotide sequence. Case-insensitive. Multiple patterns are combined with AND.
### `--identifier|-I <PATTERN>`
- Default: `[]`
- Regular expression matched against the sequence identifier. Case-insensitive.
### `--definition|-D <PATTERN>`
- Default: `[]`
- Regular expression matched against the sequence definition line. Case-insensitive.
### `--approx-pattern <PATTERN>`
- Default: `[]`
- Pattern matched approximately against the sequence. Use `--pattern-error` to set the maximum number of errors allowed.
### `--pattern-error <int>`
- Default: `0`
- Maximum number of errors (mismatches) allowed during approximate pattern matching.
### `--allows-indels`
- Default: `false`
- Allow insertions and deletions (in addition to mismatches) during approximate pattern matching.
### `--only-forward`
- Default: `false`
- Restrict pattern matching to the forward strand only.
### `--has-attribute|-A <KEY>`
- Default: `[]`
- Apply the script only to records that have an attribute with key `<KEY>`; others pass through.
### `--attribute|-a <KEY=VALUE>`
- Default: `{}`
- Apply the script only to records where the attribute `KEY` matches the regular expression `VALUE`. Case-sensitive. Multiple `-a` options are combined with AND.
### `--id-list <FILENAME>`
- Default: `""`
- Path to a text file containing one sequence identifier per line. The script is applied only to records whose identifier appears in the file; others pass through.
### `--min-length|-l <LENGTH>`
- Default: `1`
- Apply the script only to sequences whose length is at least `LENGTH`; shorter sequences pass through unchanged.
### `--max-length|-L <LENGTH>`
- Default: `2000000000`
- Apply the script only to sequences whose length is at most `LENGTH`; longer sequences pass through unchanged.
### `--min-count|-c <COUNT>`
- Default: `1`
- Apply the script only to sequences with a count (abundance) of at least `COUNT`; others pass through unchanged.
### `--max-count|-C <COUNT>`
- Default: `2000000000`
- Apply the script only to sequences with a count (abundance) of at most `COUNT`; others pass through unchanged.
### `--inverse-match|-v`
- Default: `false`
- Invert the selection: apply the script to records that do NOT match the filter criteria; matching records pass through unchanged.
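
As a rough illustration of how these criteria could combine, the sketch below evaluates a length window, case-insensitive sequence patterns, and case-sensitive `KEY=VALUE` attribute patterns, then applies the inversion. It rests on several assumptions that are not confirmed by this page: criteria of different kinds are combined with AND, patterns are searched anywhere in the value rather than anchored, and a missing attribute counts as a non-match. The function `is_selected` and the record layout are hypothetical.

```python
import re


def is_selected(record, *, min_length=1, max_length=2_000_000_000,
                sequence_patterns=(), attributes=None, inverse=False):
    """Return True when the script should be applied to `record`."""
    attributes = attributes or {}
    seq = record["sequence"]
    annotations = record.get("annotations", {})

    # Length window, as with --min-length / --max-length.
    checks = [min_length <= len(seq) <= max_length]
    # Sequence patterns are case-insensitive and must all match.
    checks += [re.search(p, seq, re.IGNORECASE) is not None
               for p in sequence_patterns]
    # Attribute patterns are case-sensitive; a missing key fails (assumption).
    checks += [key in annotations
               and re.search(rx, str(annotations[key])) is not None
               for key, rx in attributes.items()]

    matched = all(checks)
    return not matched if inverse else matched


rec = {"sequence": "ACGTTTGCA", "annotations": {"sample": "sample1"}}
print(is_selected(rec, min_length=5, sequence_patterns=["ttt"],
                  attributes={"sample": "^sample"}))
```

With `inverse=True` the same record would be left untouched by the script and simply passed through.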
## Taxonomic filtering
### `--taxonomy|-t <string>`
- Default: `""`
- Path to the taxonomy database. Required for taxonomy-based options.
### `--restrict-to-taxon|-r <TAXID>`
- Default: `[]`
- Apply the script only to sequences whose taxid belongs to the specified taxon; others pass through unchanged.
### `--ignore-taxon|-i <TAXID>`
- Default: `[]`
- Apply the script only to sequences whose taxid does not belong to the specified taxon; others pass through unchanged.
### `--require-rank <RANK_NAME>`
- Default: `[]`
- Apply the script only to sequences whose taxon has the specified rank (e.g., `species`, `genus`); others pass through unchanged.
### `--valid-taxid`
- Default: `false`
- Apply the script only to sequences that carry a currently valid NCBI taxid; others pass through unchanged.
### `--fail-on-taxonomy`
- Default: `false`
- Abort with an error if a taxid used during filtering is not currently valid.
### `--update-taxid`
- Default: `false`
- Automatically replace taxids declared as merged with their current equivalent.
### `--raw-taxid`
- Default: `false`
- Print taxids in output without supplementary information (taxon name and rank).
### `--with-leaves`
- Default: `false`
- When extracting taxonomy from a sequence file, attach sequences as leaves of their taxid annotation.
## Paired-end mode
### `--paired-mode <forward|reverse|and|or|andnot|xor>`
- Default: `"forward"`
- When paired reads are provided, determines how filter conditions are applied to both reads of a pair.
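
The six mode names suggest a boolean combination of the filter result on each read of the pair. The sketch below spells out one plausible reading; the exact semantics are an assumption, and `pair_selected` is a hypothetical helper, not part of OBITools4.

```python
def pair_selected(mode: str, forward_ok: bool, reverse_ok: bool) -> bool:
    """Combine the filter result of the forward and reverse reads of a pair."""
    combine = {
        "forward": lambda f, r: f,            # only the forward read decides
        "reverse": lambda f, r: r,            # only the reverse read decides
        "and": lambda f, r: f and r,          # both reads must match
        "or": lambda f, r: f or r,            # at least one read must match
        "andnot": lambda f, r: f and not r,   # forward matches, reverse does not
        "xor": lambda f, r: f != r,           # exactly one read matches
    }
    return combine[mode](forward_ok, reverse_ok)


for mode in ("forward", "reverse", "and", "or", "andnot", "xor"):
    print(mode, pair_selected(mode, True, False))
```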
## Input format
### `--fasta`
- Default: `false`
- Force FASTA format parsing.
### `--fastq`
- Default: `false`
- Force FASTQ format parsing.
### `--embl`
- Default: `false`
- Force EMBL flatfile format parsing.
### `--genbank`
- Default: `false`
- Force GenBank flatfile format parsing.
### `--ecopcr`
- Default: `false`
- Force ecoPCR output format parsing.
### `--csv`
- Default: `false`
- Force CSV format parsing.
### `--input-OBI-header`
- Default: `false`
- Parse FASTA/FASTQ title line annotations as OBI format.
### `--input-json-header`
- Default: `false`
- Parse FASTA/FASTQ title line annotations as JSON format.
### `--solexa`
- Default: `false`
- Decode quality strings according to the Solexa specification.
### `--u-to-t`
- Default: `false`
- Convert uracil (U) to thymine (T) in sequences.
### `--skip-empty`
- Default: `false`
- Suppress sequences of length zero from the output.
### `--no-order`
- Default: `false`
- When multiple input files are provided, indicates that no ordering is assumed among them.
## Output format
### `--out|-o <FILENAME>`
- Default: `"-"` (standard output)
- File path for saving the output.
### `--fasta-output`
- Default: `false`
- Write output in FASTA format.
### `--fastq-output`
- Default: `false`
- Write output in FASTQ format.
### `--json-output`
- Default: `false`
- Write output in JSON format.
### `--output-OBI-header|-O`
- Default: `false`
- Write FASTA/FASTQ title line annotations in OBI format.
### `--output-json-header`
- Default: `false`
- Write FASTA/FASTQ title line annotations in JSON format.
### `--compress|-Z`
- Default: `false`
- Compress output using gzip.
## Performance
### `--max-cpu <int>`
- Default: `16` (env: `OBIMAXCPU`)
- Number of parallel threads used for processing.
### `--batch-size <int>`
- Default: `1` (env: `OBIBATCHSIZE`)
- Minimum number of sequences per processing batch.
### `--batch-size-max <int>`
- Default: `2000` (env: `OBIBATCHSIZEMAX`)
- Maximum number of sequences per processing batch.
### `--batch-mem <string>`
- Default: `""` (128M effective; env: `OBIBATCHMEM`)
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
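
To make the size notation concrete, here is a small sketch that converts such strings to a number of bytes. It assumes binary multiples (1K = 1024 bytes) and that an empty value falls back to `128M`; both are assumptions, and `parse_mem` is a hypothetical helper rather than the parser used by the tool.

```python
def parse_mem(value: str, default: str = "128M") -> int:
    """Convert '128K', '64M', '1G' or a plain number to bytes; 0 means no limit."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = value.strip().upper() or default   # empty string: use the default
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


for example in ("128K", "64M", "1G", "0", ""):
    print(repr(example), parse_mem(example))
```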
## Diagnostics
### `--debug`
- Default: `false` (env: `OBIDEBUG`)
- Enable debug logging.
### `--no-progressbar`
- Default: `false`
- Disable the progress bar.
### `--silent-warning`
- Default: `false` (env: `OBIWARNING`)
- Suppress warning messages.
### `--pprof`
- Default: `false`
- Enable the pprof profiling HTTP server (see log for address).
### `--pprof-goroutine <int>`
- Default: `6060` (env: `OBIPPROFGOROUTINE`)
- Port for goroutine blocking profile.
### `--pprof-mutex <int>`
- Default: `10` (env: `OBIPPROFMUTEX`)
- Rate for mutex lock profiling.
---
# EXAMPLES
```bash
# Print a starter script template and save it to my_script.lua
obiscript --template > my_script.lua
```
**Expected output:** Lua template with `begin()`, `worker()`, and `finish()` stubs written to `my_script.lua`.
```bash
# Add a custom annotation to every sequence record
# (the script sets a new attribute 'sample' from the identifier prefix)
obiscript --script annotate.lua --fasta-output sequences.fasta > annotated.fasta
```
**Expected output:** 6 sequences written to `annotated.fasta`.
```bash
# Count reads per taxon using the finish() hook, applying the script only to sequences of a specific taxon
obiscript --script count_taxa.lua \
--restrict-to-taxon 6231 \
--taxonomy /data/ncbi_tax \
sequences.fasta > filtered_annotated.fasta
```
```bash
# Apply a script to FASTQ sequences with a length filter
obiscript --script process_pairs.lua \
--min-length 100 \
--out result.fastq \
reads.fastq
```
**Expected output:** 4 sequences written to `result.fastq`.
```bash
# Run on FASTQ input, output JSON, using 4 CPU threads
obiscript --script enrich.lua \
--json-output \
--max-cpu 4 \
sequences.fastq > enriched.json
```
**Expected output:** 4 sequences written to `enriched.json`.
---
# SEE ALSO
`obigrep` — filter sequences using the same selection criteria without scripting.
`obiannotate` — add or modify sequence attributes without scripting.
---
# NOTES
- The Lua `worker(sequence)` function is called in parallel across multiple goroutines. Use the thread-safe `obicontext` table (with `obicontext:lock()` / `obicontext:unlock()`) for any shared mutable state accessed across workers.
- The `begin()` and `finish()` hooks each run in a single goroutine and do not need locking for their own internal state.
- Sequence records modified inside `worker()` must be returned (or the original returned unmodified) for the record to appear in the output. Returning `nil` drops the sequence.
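
The reason for locking shared state can be shown with a language-neutral analogy. The Python sketch below uses threads and a `threading.Lock` in place of goroutines and the `obicontext` lock; `context`, `context_lock`, and the record layout are hypothetical, and the code does not reproduce the Lua API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

context = {"reads": 0}            # plays the role of the shared table
context_lock = threading.Lock()   # plays the role of the lock/unlock pair


def worker(record):
    """Add the record's count to a shared total, then return the record."""
    with context_lock:            # guard the read-modify-write on shared state
        context["reads"] += record.get("count", 1)
    return record                 # returning the record keeps it in the output


records = [{"id": f"seq{i}", "count": 2} for i in range(1000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    output = list(pool.map(worker, records))

print(context["reads"])  # 2000
```

Without the lock, two workers could read the same total and each write back its own increment, losing one of the updates.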
@@ -0,0 +1,271 @@
# NAME
obisummary — summarise the main information from a sequence file
---
# SYNOPSIS
```
obisummary [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
[--fasta] [--fastq] [--genbank] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
[--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]
```
---
# DESCRIPTION
`obisummary` reads a set of biological sequences and computes a statistical
summary of their content and annotations. Rather than producing a new sequence
file, it outputs a single structured record describing the dataset as a whole.
The summary covers three main areas. First, global counts: the total number of
reads (sequences weighted by their `count` attribute), the number of distinct
sequence variants, and the total sequence length across all records. Second,
annotation profiling: `obisummary` inspects every annotation key present in
the dataset and classifies it as a scalar attribute (single value per
sequence), a map attribute (key-to-count mapping), or a vector attribute
(multi-value per sequence). Third, per-sample statistics: when sequences carry
sample information (via `merged_sample` or equivalent per-sample annotations),
`obisummary` reports for each sample the number of reads, the number of
variants, and the number of singletons. If `obiclean` has been run previously,
the summary also captures `obiclean_status` and related quality flags per
sample.
The output is a single JSON record by default, or YAML when `--yaml-output` is
requested.
`obisummary` is typically used after processing steps such as
`obiclean` or `obiuniq` to quickly validate the state of a dataset before
downstream analysis.
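
The global counts and per-sample statistics described above can be approximated with a short sketch. It is only an illustration under stated assumptions: a record without a `count` attribute counts as one read, `total_length` is not weighted by abundance, and a singleton is a variant seen exactly once in a sample. The function `summarise` and the record layout are hypothetical.

```python
def summarise(records):
    """Compute global counts and per-sample statistics from annotated records."""
    counts = {"reads": 0, "variants": 0, "total_length": 0}
    samples = {}
    for rec in records:
        ann = rec.get("annotations", {})
        counts["variants"] += 1                      # one record = one variant
        counts["reads"] += ann.get("count", 1)       # weighted by abundance
        counts["total_length"] += len(rec["sequence"])
        # merged_sample maps each sample name to a read count for this variant.
        for sample, n in ann.get("merged_sample", {}).items():
            stats = samples.setdefault(
                sample, {"reads": 0, "variants": 0, "singletons": 0})
            stats["reads"] += n
            stats["variants"] += 1
            stats["singletons"] += 1 if n == 1 else 0
    return {"count": counts,
            "samples": {"sample_count": len(samples), "sample_stats": samples}}


records = [
    {"sequence": "acgt" * 5,
     "annotations": {"count": 3, "merged_sample": {"s1": 2, "s2": 1}}},
    {"sequence": "ttgca", "annotations": {"merged_sample": {"s1": 1}}},
]
summary = summarise(records)
print(summary["count"])
print(summary["samples"]["sample_stats"])
```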
---
# INPUT
`obisummary` accepts biological sequence data from one or more files supplied
as positional arguments, or from standard input when no files are given.
Supported formats include FASTA, FASTQ, GenBank flatfile, EMBL flatfile,
ecoPCR output, and CSV. By default the format is detected automatically; use
the format flags (`--fasta`, `--fastq`, `--genbank`, `--embl`, `--ecopcr`,
`--csv`) to force a specific parser.
FASTA/FASTQ annotation headers may follow the OBI format (`--input-OBI-header`)
or JSON format (`--input-json-header`). RNA sequences can be read as DNA by
converting uracil to thymine with `--u-to-t`. Quality strings encoded according
to the Solexa specification are handled with `--solexa`.
When multiple input files are provided, `obisummary` assumes they are ordered;
use `--no-order` to indicate that no ordering exists among them.
---
# OUTPUT
`obisummary` writes a single structured record to standard output. The default
format is JSON; use `--yaml-output` to obtain YAML instead.
The record contains three top-level sections:
- **`count`**: global metrics including `variants` (distinct sequences),
`reads` (total weighted count), and `total_length` (sum of all sequence
lengths).
- **`annotations`**: a breakdown of all annotation keys found in the dataset,
classified as `scalar_attributes`, `map_attributes`, or `vector_attributes`,
together with the observed keys and their occurrence counts within each
category.
- **`samples`**: when sample information is present, `sample_count` and a
per-sample `sample_stats` table with `reads`, `variants`, and `singletons`
fields. If `obiclean` data is present, an `obiclean_bad` field is also
reported per sample.
When `--map` is used, the record additionally details each map attribute named
with that option.
## Observed output example
```
{
"annotations": {
"keys": {
"scalar": {
"count": 5
}
},
"map_attributes": 0,
"scalar_attributes": 1,
"vector_attributes": 0
},
"count": {
"reads": 21,
"total_length": 100,
"variants": 5
}
}
```
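
Because the record is plain JSON, it can be consumed by any JSON parser. The sketch below loads the record shown above and derives two simple ratios; it assumes that `total_length` is not weighted by abundance, and it reads the `samples` section defensively because that section is absent from this example.

```python
import json

# The observed record, embedded as text so that the example is self-contained.
summary_text = """
{
  "annotations": {
    "keys": {"scalar": {"count": 5}},
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
"""

summary = json.loads(summary_text)
count = summary["count"]
mean_length = count["total_length"] / count["variants"]
reads_per_variant = count["reads"] / count["variants"]
samples = summary.get("samples")   # None when no sample information is present

print(mean_length, reads_per_variant)  # 20.0 4.2
```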
---
# OPTIONS
## Summary output
**`--json-output`**
- Default: `false`
- Print the result as a JSON record (this is the default behaviour; this flag
makes the choice explicit).
**`--yaml-output`**
- Default: `false`
- Print the result as a YAML record instead of the default JSON format.
**`--map <string>`**
- Default: `[]` (none)
- Name of a map attribute to include in the summary. This option may be
repeated to request multiple map attributes. Each named attribute will be
detailed in the `map_attributes` section of the output.
## Input format
**`--fasta`**
- Default: `false`
- Read data following the FASTA format.
**`--fastq`**
- Default: `false`
- Read data following the FASTQ format.
**`--genbank`**
- Default: `false`
- Read data following the GenBank flatfile format.
**`--embl`**
- Default: `false`
- Read data following the EMBL flatfile format.
**`--ecopcr`**
- Default: `false`
- Read data following the ecoPCR output format.
**`--csv`**
- Default: `false`
- Read data following the CSV format.
**`--input-OBI-header`**
- Default: `false`
- FASTA/FASTQ title line annotations follow OBI format.
**`--input-json-header`**
- Default: `false`
- FASTA/FASTQ title line annotations follow JSON format.
**`--solexa`**
- Default: `false`
- Decode quality strings according to the Solexa specification.
**`--u-to-t`**
- Default: `false`
- Convert uracil (U) to thymine (T) when reading RNA sequences.
## Batch control
**`--batch-size <int>`**
- Default: `1`
- Minimum number of sequences per processing batch.
**`--batch-size-max <int>`**
- Default: `2000`
- Maximum number of sequences per processing batch.
**`--batch-mem <string>`**
- Default: `""` (128M effective)
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable
the memory limit.
## Processing
**`--max-cpu <int>`**
- Default: `16`
- Number of parallel threads used to compute the result.
**`--no-order`**
- Default: `false`
- When several input files are provided, indicates that there is no order
among them.
## General
**`--debug`**
- Default: `false`
- Enable debug mode by setting the log level to debug.
**`--silent-warning`**
- Default: `false`
- Stop printing warning messages.
**`--version`**
- Default: `false`
- Print the version and exit.
**`--help` / `-h` / `-?`**
- Default: `false`
- Display help and exit.
**`--pprof`**
- Default: `false`
- Enable the pprof profiling server. Consult the log for the server address.
**`--pprof-goroutine <int>`**
- Default: `6060`
- Port for goroutine blocking profile.
**`--pprof-mutex <int>`**
- Default: `10`
- Rate for mutex lock profiling.
---
# EXAMPLES
```bash
# Get a JSON summary of a FASTA file produced by obiclean
obisummary cleaned.fasta > out_default.json
```
**Expected output:** a JSON summary record in `out_default.json`.
```bash
# Get the summary as an explicit JSON record for programmatic processing
obisummary --json-output cleaned.fasta > out_json.json
```
**Expected output:** a JSON summary record in `out_json.json`.
```bash
# Get a YAML record from a FASTQ file
obisummary --yaml-output --fastq reads.fastq > out_yaml.yaml
```
**Expected output:** a YAML summary record in `out_yaml.yaml`.
```bash
# Summarise data read from standard input, forcing FASTA format
obigrep -p 'annotations.count > 1' sequences.fasta | obisummary --fasta > out_pipeline.json
```
**Expected output:** a JSON summary record in `out_pipeline.json` (3 variants, 10 reads).
---
# SEE ALSO
`obiclean`, `obiuniq`, `obicount`
@@ -0,0 +1,347 @@
# NAME
obiuniq — dereplicate sequence data sets
---
# SYNOPSIS
```
obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
[--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
[--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
[--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
[--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
[--no-order] [--no-progressbar] [--no-singleton]
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
[--with-leaves] [<args>]
```
---
# DESCRIPTION
`obiuniq` groups identical sequences together and replaces them with a single
representative, recording the total number of original occurrences as an
abundance count. This process — called dereplication — is a standard step in
amplicon sequencing workflows: it dramatically reduces the number of sequence
records to process, while preserving exact counts needed for downstream
statistical analyses.
By default, two sequences are considered identical if and only if their
nucleotide strings are the same. Using `--category-attribute` (repeatable),
additional metadata fields can be included in the identity criterion. For
example, grouping by sample name keeps the same sequence as separate records
when it occurs in different samples, enabling per-sample abundance tracking.
For each group of identical sequences, `obiuniq` emits one output record
carrying the merged metadata of all members. The `--merge` option (repeatable)
instructs the command to also record, in an attribute named `merged_<KEY>`, the
distribution of `KEY` attribute values across the sequences collapsed into each
group, which is useful for provenance tracking and quality control.
Sequences that appear only once in the entire dataset (singletons) can be
removed with `--no-singleton`. Singletons often represent sequencing errors
rather than genuine biological variants, so their removal is a common
noise-reduction step.
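
The grouping rules described above can be sketched as follows. This is an illustration, not the OBITools4 algorithm: `dereplicate` and the record layout are hypothetical, an input record without a `count` attribute is assumed to weigh one read, and a missing merged attribute is assumed to be recorded under the `--na-value` placeholder.

```python
from collections import Counter


def dereplicate(records, categories=(), merge=(), na_value="NA",
                no_singleton=False):
    """Group identical sequences, optionally split by category attributes."""
    groups = {}
    for rec in records:
        ann = rec.get("annotations", {})
        # Identity criterion: the sequence plus every category attribute value.
        key = (rec["sequence"],) + tuple(
            str(ann.get(c, na_value)) for c in categories)
        group = groups.setdefault(key, {
            "id": rec["id"], "sequence": rec["sequence"], "count": 0,
            "merged": {m: Counter() for m in merge}})
        weight = ann.get("count", 1)
        group["count"] += weight
        for m in merge:   # distribution of attribute values within the group
            group["merged"][m][str(ann.get(m, na_value))] += weight

    output = []
    for group in groups.values():
        if no_singleton and group["count"] == 1:
            continue      # discard groups seen only once
        annotations = {"count": group["count"]}
        for m, dist in group["merged"].items():
            annotations[f"merged_{m}"] = dict(dist)
        output.append({"id": group["id"], "sequence": group["sequence"],
                       "annotations": annotations})
    return output


reads = [
    {"id": "r1", "sequence": "acgt", "annotations": {"sample": "s1"}},
    {"id": "r2", "sequence": "acgt", "annotations": {"sample": "s1"}},
    {"id": "r3", "sequence": "acgt", "annotations": {"sample": "s2"}},
    {"id": "r4", "sequence": "ttaa", "annotations": {}},
]
plain = dereplicate(reads, merge=["sample"])
by_sample = dereplicate(reads, categories=["sample"], no_singleton=True)
print(plain)
print(by_sample)
```

In this toy dataset, plain dereplication yields two records, while grouping by `sample` and discarding singletons leaves only the `acgt` variant from sample `s1`.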
---
# INPUT
`obiuniq` accepts biological sequence data in FASTA, FASTQ, EMBL, GenBank,
ecoPCR, or CSV format (auto-detected by default, or forced with format flags
such as `--fasta`, `--fastq`, `--embl`, etc.). Input is read from one or more
files given as positional arguments, or from standard input when no files are
provided.
When multiple input files are provided, `obiuniq` assumes they are ordered
(e.g., paired-end reads in the same read order). If no such ordering exists,
use `--no-order` to signal that files can be consumed independently.
FASTA/FASTQ header annotations are parsed heuristically by default. Use
`--input-OBI-header` for OBI-formatted headers or `--input-json-header` for
JSON-formatted headers. RNA sequences can be normalised to DNA on the fly with
`--u-to-t`.
---
# OUTPUT
`obiuniq` writes dereplicated sequences to standard output or to the file
specified by `--out`. Each output record represents one group of identical
sequences (identical under the chosen grouping criterion). The output carries
the merged metadata from all input records in the group.
The output format defaults to FASTA. Even when the input contains quality
scores (FASTQ), quality information is not preserved across merged sequences,
so the output is written in FASTA format unless `--fastq-output` is explicitly
requested.
Output annotations follow the OBI header format when `--output-OBI-header` is
set, or JSON when `--output-json-header` is set. The output can be
gzip-compressed with `--compress`.
For each output record:
- The abundance count reflects how many input sequences were merged into the
group.
- Attributes created by `--merge KEY` are named `merged_KEY` and map each
observed value of the `KEY` attribute to the count of input sequences
  carrying that value within the group.
- All other attributes are merged from the contributing records according to
the standard OBITools4 merging rules.
## Observed output example
```
>seq008 {"count":1,"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt
```
---
# OPTIONS
## Dereplication Options
**`--category-attribute|-c <CATEGORY>`** (default: `[]`)
Adds one metadata attribute to the grouping criterion. Two sequences are
placed in the same group only when they are nucleotide-identical **and** share
the same value for every attribute listed with `-c`. This option can be
repeated to combine multiple attributes (e.g., `-c sample -c primer`).
Records that lack a listed attribute receive the value set by `--na-value`.
**`--chunk-count <int>`** (default: `100`)
Controls how many internal partitions the dataset is split into during
processing. A higher value reduces per-partition memory usage at the cost of
more temporary files; a lower value increases per-partition memory but reduces
I/O overhead. Tune this when processing very large or very small datasets.
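
One common way to partition a dataset for dereplication is to hash each sequence, so that identical sequences always land in the same partition and each partition can be processed independently. Whether `obiuniq` partitions this way is an assumption; the sketch below only illustrates the idea, and `chunk_index` and `partition` are hypothetical names.

```python
import zlib


def chunk_index(sequence: str, chunk_count: int = 100) -> int:
    """Map a sequence to a partition; equal sequences get equal indices."""
    return zlib.crc32(sequence.encode("ascii")) % chunk_count


def partition(records, chunk_count: int = 100):
    """Distribute records over `chunk_count` partitions by sequence hash."""
    chunks = [[] for _ in range(chunk_count)]
    for rec in records:
        chunks[chunk_index(rec["sequence"], chunk_count)].append(rec)
    return chunks


reads = [{"id": "r1", "sequence": "acgt"},
         {"id": "r2", "sequence": "ttaa"},
         {"id": "r3", "sequence": "acgt"}]
chunks = partition(reads, chunk_count=4)
print([[r["id"] for r in chunk] for chunk in chunks])
```

With more partitions, each one holds fewer records at a time, which is the memory versus I/O trade-off described for `--chunk-count`.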
**`--in-memory`** (default: `false`)
Stores intermediate data chunks in RAM rather than in temporary disk files.
Speeds up processing on datasets that fit comfortably in available memory;
omit this flag (the default) for large datasets that exceed available RAM.
**`--merge|-m <KEY>`** (default: `[]`)
Creates an output attribute named `merged_KEY` that maps each observed value
of the `KEY` attribute to the count of input sequences carrying that value
within the group. Repeat to track multiple attributes.
Useful for tracking which sample or category contributions were collapsed into each group.
**`--na-value <NA_NAME>`** (default: `"NA"`)
Value assigned to a category attribute when a sequence record does not carry
that attribute. All sequences lacking the attribute are grouped together under
this placeholder, rather than being treated as incomparable.
**`--no-singleton`** (default: `false`)
Discards all output records whose abundance count is exactly one — i.e.,
sequences that occur only once across the entire input. Removing singletons
is a standard heuristic for excluding sequencing errors from further analysis.
## Input Options
**`--batch-mem <string>`** (default: `""`, env: `OBIBATCHMEM`)
Maximum memory budget per processing batch (e.g. `128K`, `64M`, `1G`). Set
to `0` to disable the memory ceiling. Overrides `--batch-size-max` when
both are set.
**`--batch-size <int>`** (default: `10`, env: `OBIBATCHSIZE`)
Minimum number of sequences per batch (floor).
**`--batch-size-max <int>`** (default: `2000`, env: `OBIBATCHSIZEMAX`)
Maximum number of sequences per batch (ceiling).
**`--csv`** (default: `false`)
Parse input as CSV format.
**`--ecopcr`** (default: `false`)
Parse input as ecoPCR output format.
**`--embl`** (default: `false`)
Parse input as EMBL flatfile format.
**`--fasta`** (default: `false`)
Parse input as FASTA format.
**`--fastq`** (default: `false`)
Parse input as FASTQ format.
**`--genbank`** (default: `false`)
Parse input as GenBank flatfile format.
**`--input-OBI-header`** (default: `false`)
Treat FASTA/FASTQ title line annotations as OBI-format key=value pairs.
**`--input-json-header`** (default: `false`)
Treat FASTA/FASTQ title line annotations as JSON objects.
**`--no-order`** (default: `false`)
When multiple input files are provided, indicates that there is no ordering
relationship among them.
**`--skip-empty`** (default: `false`)
Suppress sequences of length zero from the output.
**`--solexa`** (default: `false`, env: `OBISOLEXA`)
Decode quality strings according to the Solexa specification rather than the
standard Phred encoding.
**`--u-to-t`** (default: `false`)
Convert uracil (U) to thymine (T) in all input sequences, normalising RNA to
DNA representation.
## Output Options
**`--compress|-Z`** (default: `false`)
Compress output using gzip.
**`--fasta-output`** (default: `false`)
Write output in FASTA format. This is the default for `obiuniq`, because
quality scores are not preserved when sequences are merged.
**`--fastq-output`** (default: `false`)
Write output in FASTQ format. This must be requested explicitly, even when the
input is FASTQ.
**`--json-output`** (default: `false`)
Write output in JSON format.
**`--out|-o <FILENAME>`** (default: `"-"`)
Write output to the specified file instead of standard output.
**`--output-OBI-header|-O`** (default: `false`)
Write FASTA/FASTQ title line annotations in OBI format.
**`--output-json-header`** (default: `false`)
Write FASTA/FASTQ title line annotations in JSON format.
## Taxonomy Options
**`--fail-on-taxonomy`** (default: `false`)
Cause `obiuniq` to exit with an error if a taxid in the data is not a
currently valid taxon in the loaded taxonomy.
**`--raw-taxid`** (default: `false`)
Print taxids in output without supplementary information (taxon name and rank).
**`--taxonomy|-t <string>`** (default: `""`)
Path to the taxonomy database used to validate or update taxids.
**`--update-taxid`** (default: `false`)
Automatically replace merged taxids with the most recent valid taxid.
**`--with-leaves`** (default: `false`)
When taxonomy is extracted from a sequence file, add sequences as leaves of
their taxid annotation.
## Execution Options
**`--max-cpu <int>`** (default: `16`, env: `OBIMAXCPU`)
Number of parallel threads used to compute the result.
**`--debug`** (default: `false`, env: `OBIDEBUG`)
Enable debug mode by setting the log level to debug.
**`--no-progressbar`** (default: `false`)
Disable the progress bar.
**`--silent-warning`** (default: `false`, env: `OBIWARNING`)
Suppress warning messages.
**`--pprof`** (default: `false`)
Enable the pprof profiling server (address logged at startup).
**`--pprof-goroutine <int>`** (default: `6060`, env: `OBIPPROFGOROUTINE`)
Port for the goroutine blocking profile endpoint.
**`--pprof-mutex <int>`** (default: `10`, env: `OBIPPROFMUTEX`)
Rate for the mutex contention profile.
**`--version`** (default: `false`)
Print the version string and exit.
**`--help|-h|-?`** (default: `false`)
Print usage information and exit.
---
# EXAMPLES
```bash
# Dereplicate a FASTQ file of amplicon reads. The output is written in FASTA format (quality scores are dropped), despite the .fastq extension of the output file.
obiuniq reads.fastq -o out_basic.fastq
```
**Expected output:** 4 sequences written to `out_basic.fastq`.
```bash
# Dereplicate keeping sequences separate per sample (category attribute),
# and discard singletons to remove likely sequencing errors.
obiuniq -c sample --no-singleton reads.fastq -o out_no_singleton.fastq
```
**Expected output:** 2 sequences written to `out_no_singleton.fastq`.
```bash
# Dereplicate per sample, recording the sample distribution in 'merged_sample',
# and use 'UNKNOWN' for reads missing the sample attribute.
obiuniq -c sample --merge sample --na-value UNKNOWN reads.fastq -o out_merge.fastq
```
**Expected output:** 5 sequences written to `out_merge.fastq`.
```bash
# Process a dataset entirely in memory using 200 internal partitions,
# writing gzip-compressed output.
obiuniq --in-memory --chunk-count 200 --compress -o out_inmemory.fastq.gz reads.fastq
```
**Expected output:** 4 sequences written to `out_inmemory.fastq.gz`.
```bash
# Dereplicate reads from two sample files with no assumed ordering between them,
# grouping by both sample and primer attributes.
obiuniq --no-order -c sample -c primer sample1.fastq sample2.fastq -o out_multifile.fastq
```
**Expected output:** 4 sequences written to `out_multifile.fastq`.
---
# SEE ALSO
- `obigrep` — filter dereplicated sequences by abundance, length, or annotation
- `obiannotate` — add or modify annotations on dereplicated records
- `obicount` — count sequences or groups in a dataset
- `obiclean` — remove sequencing artefacts from a dereplicated dataset
- `obisummary` — summarise annotation distributions across a sequence set
---
# NOTES
For datasets that do not fit in RAM, `obiuniq` uses temporary disk-backed
chunk files by default. The number of chunks is controlled by `--chunk-count`
(default 100). Increasing this value lowers per-chunk memory requirements;
decreasing it reduces I/O at the cost of higher peak memory. Use `--in-memory`
only when the full working set fits in available RAM, as exceeding memory will
degrade performance or cause out-of-memory failures.
Singletons (sequences with abundance = 1) are a common source of noise in
amplicon sequencing, often arising from PCR or sequencing errors. The
`--no-singleton` flag is therefore recommended for most metabarcoding
workflows, unless the study design requires retaining all observed variants.
When the `--category-attribute` option is used, records that lack the
specified attribute are grouped together under the `--na-value` placeholder
(default `"NA"`). This ensures that all records participate in dereplication
without being silently dropped, but users should be aware that otherwise
unrelated records lacking the same attribute share that placeholder and may
therefore be merged unintentionally.