NAME

obicsv — converts sequence files to CSV format


SYNOPSIS

obicsv [--auto] [--batch-mem <string>] [--batch-size <int>]
       [--batch-size-max <int>] [--compress|-Z] [--count] [--csv] [--debug]
       [--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
       [--fastq] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
       [--input-json-header] [--keep|-k <KEY>]... [--max-cpu <int>]
       [--na-value <NAVALUE>] [--no-order] [--no-progressbar] [--obipairing]
       [--out|-o <FILENAME>] [--pprof] [--pprof-goroutine <int>]
       [--pprof-mutex <int>] [--quality|-q] [--raw-taxid] [--sequence|-s]
       [--silent-warning] [--solexa] [--taxon] [--taxonomy|-t <string>]
       [--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]

DESCRIPTION

obicsv converts biological sequence data into CSV format for easy inspection, spreadsheet analysis, or integration with other tools. A biologist might use it to export sequences from OBITools for quality control, taxonomic inspection, or downstream analysis in R or Python.

Columns must be explicitly selected: use --ids for the identifier, --sequence for the nucleotide sequence, --quality for quality scores, --definition for the sequence description, --count for the read count, --taxon for taxonomic information, --obipairing for attributes added by obipairing, --auto to auto-detect annotation attributes, or --keep for specific named attributes. Multiple flags can be combined freely.

The command uses parallel workers to process large datasets efficiently and can write output to stdout or directly to a file.


INPUT

obicsv accepts input from files or stdin. The input format is automatically detected based on the file extension, but can be explicitly specified using format flags.

Supported input formats:

  • FASTA (--fasta)
  • FASTQ (--fastq)
  • GenBank (--genbank)
  • EMBL (--embl)
  • ecoPCR output (--ecopcr)
  • CSV (--csv)

Input sources:

  • Local files (specified as arguments)
  • stdin (when no input file is provided)
  • Remote URLs (http://, https://, ftp://)
  • Directories (automatically scanned for valid files)

Header formats:

  • OBI format (--input-OBI-header)
  • JSON format (--input-json-header)
  • Auto-detection (default)

A taxonomy database can be provided with --taxonomy|-t.


OUTPUT

The output is a CSV file with one row per sequence. The columns included depend on the flags used:

Column              Flag              Description
id                  --ids|-i          Sequence identifier
sequence            --sequence|-s     DNA/RNA sequence
qualities           --quality|-q      Quality scores (ASCII-encoded)
definition          --definition|-d   Sequence description/annotation
count               --count           Number of reads represented by this sequence
taxid               --taxon           NCBI taxonomy identifier
scientific_name     --taxon           Taxonomic scientific name
custom attributes   --keep|-k         Any attribute stored in sequence annotations

If --auto is used, columns are automatically determined based on the attributes present in the first batch of sequences.

Missing values are written as the NA value (default: "NA").
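
The relationship between selected columns and the NA value can be illustrated with a short Python sketch. It is not obicsv's implementation: the records, the attribute names, and the to_csv helper are hypothetical and only mimic the behaviour described above (one row per sequence, absent attributes replaced by the NA value).

```python
import csv
import io

# Hypothetical in-memory records; attribute names are illustrative only.
records = [
    {"id": "seq001", "sample": "A", "taxid": "9606"},
    {"id": "seq002", "sample": "B"},  # no taxid annotation
]


def to_csv(records, columns, na_value="NA"):
    """Write one row per record, filling absent attributes with na_value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)  # header row
    for record in records:
        writer.writerow([record.get(column, na_value) for column in columns])
    return buffer.getvalue()


csv_text = to_csv(records, ["id", "sample", "taxid"])
print(csv_text)
```

Passing na_value="" to the hypothetical helper leaves the cell empty, which corresponds to what --na-value is documented to control.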

Observed output example

id,sequence
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

OPTIONS

Output Columns

These flags control which columns appear in the CSV output.

  • --ids|-i

    • Default: false
    • Meaning: Include the sequence identifier column. Useful for tracking or linking sequences.
  • --sequence|-s

    • Default: false
    • Meaning: Include the nucleotide or amino acid sequence. This is the main biological data.
  • --quality|-q

    • Default: false
    • Meaning: Include quality scores for each position. Essential for quality control and filtering.
  • --definition|-d

    • Default: false
    • Meaning: Include the sequence description or definition from the source file.
  • --count

    • Default: false
    • Meaning: Include the count attribute, representing how many original reads were collapsed into this sequence (e.g., by dereplication or clustering).
  • --taxon

    • Default: false
    • Meaning: Include taxonomic information. Outputs both the NCBI taxid and the scientific name. Requires a taxonomy database (see --taxonomy).
  • --obipairing

    • Default: false
    • Meaning: Include attributes that were added by the obipairing command (pairing scores, mismatches, etc.).
  • --auto

    • Default: false
    • Meaning: Automatically detect which columns to output by examining the first batch of sequences. Outputs all annotation attributes found in the headers. Can be combined with --ids, --sequence, etc. to add those columns on top of the auto-detected ones.
  • --keep|-k <KEY>

    • Default: none
    • Meaning: Keep only the specified attribute(s). Can be used multiple times to keep several columns. Useful for extracting specific annotations.
  • --na-value <NAVALUE>

    • Default: "NA"
    • Meaning: String to use for missing or unavailable values in the CSV. Customize for compatibility with other tools (e.g., empty string, "NA", "null").

Input/Output Files

  • --out|-o <FILENAME>

    • Default: "-" (stdout)
    • Meaning: Write output to the specified file instead of stdout.
  • --compress|-Z

    • Default: false
    • Meaning: Compress the output using gzip.

Input Format

  • --fasta, --fastq, --genbank, --embl, --ecopcr, --csv

    • Default: auto-detection
    • Meaning: Explicitly specify the input format.
  • --input-OBI-header, --input-json-header

    • Default: auto-detection
    • Meaning: Specify the header format in FASTA/FASTQ files (OBI or JSON annotations).
  • --u-to-t

    • Default: false
    • Meaning: Convert Uracil to Thymine. Useful for RNA sequences.
  • --solexa

    • Default: false
    • Meaning: Decode quality strings according to the Solexa specification instead of Phred.
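
As background for --solexa, the sketch below shows the standard conversion from Solexa-scaled scores to Phred scores. It assumes the historical Solexa ASCII offset of 64; whether obicsv uses exactly this offset and rounding is an assumption, and the function names are hypothetical.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_solexa(quality_string, offset=64):
    """Decode an ASCII quality string into rounded Phred scores.

    offset=64 is the historical Solexa encoding (an assumption here).
    """
    return [round(solexa_to_phred(ord(ch) - offset)) for ch in quality_string]


# 'h' is Solexa 40, '@' is Solexa 0, ';' is Solexa -5
print(decode_solexa("h@;"))
```

The two scales nearly coincide for high scores and diverge for low ones, which is why decoding low-quality Solexa data as Phred gives misleading values.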

Taxonomy

  • --taxonomy|-t <string>

    • Default: ""
    • Meaning: Path to the taxonomy database directory. Required for --taxon output.
  • --fail-on-taxonomy

    • Default: false
    • Meaning: Make OBITools fail if a used taxid is not currently valid.
  • --update-taxid

    • Default: false
    • Meaning: Automatically update taxids that have been merged to their newest valid taxid.
  • --raw-taxid

    • Default: false
    • Meaning: Print only taxids without supplementary information (name and rank).
  • --with-leaves

    • Default: false
    • Meaning: Add sequences as leaves of their taxid annotation when taxonomy is extracted from a sequence file.

Performance

  • --max-cpu <int>

    • Default: 16
    • Meaning: Number of parallel threads for processing.
  • --batch-size <int>

    • Default: 1
    • Meaning: Minimum number of sequences per batch.
  • --batch-size-max <int>

    • Default: 2000
    • Meaning: Maximum number of sequences per batch.
  • --batch-mem <string>

    • Default: "128M"
    • Meaning: Maximum memory per batch (e.g., 128K, 64M, 1G).
  • --no-order

    • Default: false
    • Meaning: When multiple input files are provided, indicates that their relative order does not need to be preserved.
  • --no-progressbar

    • Default: false
    • Meaning: Disable the progress bar.
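
The size strings accepted by --batch-mem (128K, 64M, 1G) can be read as a number followed by a unit suffix. The sketch below converts such a string to bytes, assuming binary multiples (1K = 1024 bytes); that interpretation, and the parse_mem helper itself, are assumptions rather than obicsv's documented behaviour.

```python
import re

# Assumed binary multipliers; obicsv may interpret the suffixes differently.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_mem(text):
    """Convert a size string such as '128M' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


print(parse_mem("128M"))
```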

Other Options

  • --debug

    • Default: false
    • Meaning: Enable debug mode by setting log level to debug.
  • --pprof

    • Default: false
    • Meaning: Enable pprof server.
  • --pprof-goroutine <int>

    • Default: 6060
    • Meaning: Enable profiling of goroutine blocking.
  • --pprof-mutex <int>

    • Default: 10
    • Meaning: Enable profiling of mutex lock.
  • --silent-warning

    • Default: false
    • Meaning: Suppress warning messages.
  • --version

    • Default: false
    • Meaning: Print version information and exit.
  • --help|-h|-?

    • Default: false
    • Meaning: Print help information.

EXAMPLES

Export sequences with identifiers to CSV

Extracts sequence IDs and sequences from a FASTQ file.

obicsv --ids --sequence sequences.fastq -o output1.csv

Expected output: 3 sequences written to output1.csv.

Export sequences with quality scores

Useful for quality control and filtering in downstream tools.

obicsv --ids --sequence --quality sequences.fastq -o output2.csv

Expected output: 3 sequences written to output2.csv.
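
A downstream filter on the qualities column might look like the following Python sketch. The CSV content is made up for illustration, and the sketch assumes the quality string is Phred-encoded with an ASCII offset of 33, which should be checked against real obicsv output.

```python
import csv
import io

# Hypothetical content shaped like the output of the command above.
csv_text = "id,sequence,qualities\nseq001,acgt,IIII\nseq002,acgt,I5I5\n"


def mean_phred(quality_string, offset=33):
    """Mean Phred score of an ASCII-encoded quality string (offset assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


# Keep identifiers of sequences whose mean quality is at least 35.
kept = [
    row["id"]
    for row in csv.DictReader(io.StringIO(csv_text))
    if mean_phred(row["qualities"]) >= 35
]
print(kept)
```

Quality strings can contain commas and double quotes, so parsing with a real CSV reader (as here) is safer than splitting lines on commas.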

Export with taxonomic information

Includes taxid and scientific name for taxonomic analysis.

obicsv --ids --sequence --taxon --taxonomy /path/to/taxonomy sequences.fasta -o output3.csv

Auto-detect annotation columns from sequence headers

Automatically discovers all annotation attributes present in the sequence headers and outputs them as CSV columns. Combined with --ids to also include the sequence identifier.

obicsv --auto --ids sequences.fasta -o output4.csv

Expected output: 3 rows in output4.csv with columns id, sample, taxid (attributes found in sequence headers).
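
The idea behind --auto, collecting the attribute names seen in the first batch, can be sketched in Python. This is an illustration under stated assumptions, not the actual algorithm: the headers are hypothetical, they are assumed to carry only a JSON object after the identifier, and columns are ordered by first appearance, which may differ from what obicsv does.

```python
import json


def parse_header(line):
    """Split '>id {json}' into (id, annotations); annotations may be absent."""
    identifier, _, rest = line[1:].partition(" ")
    rest = rest.strip()
    annotations = json.loads(rest) if rest.startswith("{") else {}
    return identifier, annotations


def detect_columns(header_lines):
    """Collect annotation keys in order of first appearance."""
    columns = []
    for line in header_lines:
        _, annotations = parse_header(line)
        for key in annotations:
            if key not in columns:
                columns.append(key)
    return columns


# Hypothetical first batch of FASTA headers with JSON annotations.
first_batch = [
    '>seq001 {"sample":"A","taxid":9606}',
    '>seq002 {"sample":"B"}',
    '>seq003 {"sample":"A","taxid":9606}',
]
columns = ["id"] + detect_columns(first_batch)
print(columns)
```

Because detection is limited to the first batch, an attribute that first appears later in the file would not get a column under this model.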

Extract specific attributes

Keeps only the specified attributes as columns. Attributes not present in a sequence are written as the NA value.

obicsv --keep sample --keep taxid sequences.fasta -o output5.csv

Expected output: 3 rows in output5.csv with columns taxid, sample.
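
When loading such a file in Python, the NA marker can be mapped back to a missing value. The sketch below uses made-up CSV content and a hypothetical read_rows helper; the na_value argument should match whatever was passed to --na-value.

```python
import csv
import io

# Hypothetical content: the second sequence has no taxid annotation.
csv_text = "taxid,sample\n9606,A\nNA,B\n"


def read_rows(handle, na_value="NA"):
    """Yield rows as dicts, replacing the NA marker with None."""
    for row in csv.DictReader(handle):
        yield {key: (None if value == na_value else value) for key, value in row.items()}


rows = list(read_rows(io.StringIO(csv_text)))
print(rows)
```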

Export with compression

Writes gzip-compressed CSV output for large datasets.

obicsv --ids --sequence -Z sequences.fasta -o output6.csv.gz

Expected output: 3 sequences written to output6.csv.gz.
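
A gzip-compressed CSV can be read in Python without decompressing it to disk first. In this sketch a small temporary file with made-up content stands in for output6.csv.gz.

```python
import csv
import gzip
import os
import tempfile

# Create a small gzip file standing in for real obicsv output.
path = os.path.join(tempfile.mkdtemp(), "example.csv.gz")
with gzip.open(path, "wt", newline="") as handle:
    handle.write("id,sequence\nseq001,acgt\nseq002,ggcc\n")

# Open in text mode ("rt") so csv receives strings, not bytes.
with gzip.open(path, "rt", newline="") as handle:
    ids = [row["id"] for row in csv.DictReader(handle)]

print(ids)
```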


SEE ALSO

  • obiconvert — input/output handling framework
  • obipairing — pairing information (used with --obipairing)
  • Other export commands: obifasta, obifastq, obijson

NOTES

  • Without any column selection flag (--ids, --sequence, --quality, --taxon, --auto, --keep), the output contains no columns and no data.
  • The --taxon option requires a valid taxonomy database specified with --taxonomy.
  • Output is written to stdout by default; use --out to write to a file.
  • Missing attributes are written as the NA value (customizable with --na-value).
  • Input sequences are processed using streaming iterators to minimize memory footprint, even for large files.