NAME

obisummary - summarize the main information of a sequence file


SYNOPSIS

obisummary [--batch-mem <string>] [--batch-size <int>]
           [--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
           [--fasta] [--fastq] [--genbank] [--help|-h|-?]
           [--input-OBI-header] [--input-json-header] [--json-output]
           [--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
           [--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
           [--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]

DESCRIPTION

obisummary reads a set of biological sequences and computes a statistical summary of their content and annotations. Rather than producing a new sequence file, it outputs a single structured record describing the dataset as a whole.

The summary covers three main areas. First, global counts: the total number of reads (sequences weighted by their count attribute), the number of distinct sequence variants, and the total sequence length across all records. Second, annotation profiling: obisummary inspects every annotation key present in the dataset and classifies it as a scalar attribute (single value per sequence), a map attribute (key-to-count mapping), or a vector attribute (multi-value per sequence). Third, per-sample statistics: when sequences carry sample information (via merged_sample or equivalent per-sample annotations), obisummary reports for each sample the number of reads, the number of variants, and the number of singletons. If obiclean has been run previously, the summary also captures obiclean_status and related quality flags per sample.
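The three global counts can be illustrated with a short Python sketch. This is not obisummary's implementation: the toy records and the `global_counts` helper are hypothetical, and the sketch assumes the records are already dereplicated, so that each record stands for one variant.

```python
# Toy illustration of the global counts (not obisummary's own code).
# Each record is a (sequence, count) pair; data and helper are hypothetical.

def global_counts(records):
    """Return reads, variants and total_length for (sequence, count) pairs."""
    reads = sum(count for _, count in records)          # sequences weighted by count
    variants = len(records)                             # assumes one record per variant
    total_length = sum(len(seq) for seq, _ in records)  # unweighted sum of lengths
    return {"reads": reads, "variants": variants, "total_length": total_length}


records = [
    ("ACGT" * 5, 10),
    ("ACGA" * 5, 5),
    ("ACGC" * 5, 3),
    ("ACGG" * 5, 2),
    ("TCGT" * 5, 1),
]

print(global_counts(records))
# prints {'reads': 21, 'variants': 5, 'total_length': 100}
```

Five variants of 20 nucleotides each give a total length of 100, while the reads figure is the sum of the counts (10 + 5 + 3 + 2 + 1 = 21).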

The output is a single JSON record by default, or YAML when --yaml-output is requested. obisummary is typically used after processing steps such as obiclean or obiuniq to quickly validate the state of a dataset before downstream analysis.


INPUT

obisummary accepts biological sequence data from one or more files supplied as positional arguments, or from standard input when no files are given. Supported formats include FASTA, FASTQ, GenBank flatfile, EMBL flatfile, ecoPCR output, and CSV. By default the format is detected automatically; use the format flags (--fasta, --fastq, --genbank, --embl, --ecopcr, --csv) to force a specific parser.

FASTA/FASTQ annotation headers may follow the OBI format (--input-OBI-header) or JSON format (--input-json-header). RNA sequences can be read as DNA by converting uracil to thymine with --u-to-t. Quality strings encoded according to the Solexa specification are handled with --solexa.

When multiple input files are provided, obisummary assumes they are ordered; use --no-order to indicate that no ordering exists among them.


OUTPUT

obisummary writes a single structured record to standard output. The default format is JSON; use --yaml-output to obtain YAML instead.

The record contains three top-level sections:

  • count: global metrics including variants (distinct sequences), reads (total weighted count), and total_length (sum of all sequence lengths).

  • annotations: a breakdown of all annotation keys found in the dataset, classified as scalar_attributes, map_attributes, or vector_attributes, together with the observed keys and their occurrence counts within each category.

  • samples: when sample information is present, sample_count and a per-sample sample_stats table with reads, variants, and singletons fields. If obiclean data is present, an obiclean_bad field is also reported per sample.

When --map is used, the annotations section additionally details each map attribute named on the command line.
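As a rough sketch of how the per-sample section might be consumed, the Python snippet below lists samples whose read count falls under a threshold. The exact nesting is an assumption: the sketch supposes that sample_stats maps each sample name to a record with reads, variants, and singletons fields. The sample names, the numbers, and the `low_depth_samples` helper are hypothetical.

```python
# Hypothetical post-processing of the samples section (layout is assumed).

def low_depth_samples(summary, min_reads):
    """Return the sorted names of samples with fewer than min_reads reads."""
    # An absent samples section simply yields an empty list.
    stats = summary.get("samples", {}).get("sample_stats", {})
    return sorted(name for name, s in stats.items() if s["reads"] < min_reads)


# Invented record following the assumed layout, for illustration only.
example = {
    "samples": {
        "sample_count": 3,
        "sample_stats": {
            "S1": {"reads": 1200, "variants": 40, "singletons": 12},
            "S2": {"reads": 15, "variants": 6, "singletons": 5},
            "S3": {"reads": 800, "variants": 35, "singletons": 9},
        },
    }
}

print(low_depth_samples(example, 100))
# prints ['S2']
```

If the real record nests these fields differently, only the two `get` calls need adjusting.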

Observed output example

{
  "annotations": {
    "keys": {
      "scalar": {
        "count": 5
      }
    },
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {
    "reads": 21,
    "total_length": 100,
    "variants": 5
  }
}
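Because the record is plain JSON, it can be loaded with any JSON parser. The Python sketch below parses the record shown above and derives two convenience figures from the count section. The `derived_metrics` helper and the names of its result keys are hypothetical and not part of obisummary.

```python
import json

# Record copied from the observed output example.
summary_text = """
{
  "annotations": {
    "keys": {"scalar": {"count": 5}},
    "map_attributes": 0,
    "scalar_attributes": 1,
    "vector_attributes": 0
  },
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
"""


def derived_metrics(summary):
    """Compute mean variant length and mean reads per variant."""
    count = summary["count"]
    variants = count["variants"]
    if variants == 0:
        # Empty dataset: the ratios are undefined.
        return {"mean_length": None, "reads_per_variant": None}
    return {
        "mean_length": count["total_length"] / variants,
        "reads_per_variant": count["reads"] / variants,
    }


summary = json.loads(summary_text)
metrics = derived_metrics(summary)
print(metrics)
# prints {'mean_length': 20.0, 'reads_per_variant': 4.2}
```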

OPTIONS

Summary output

--json-output

  • Default: false
  • Print the result as a JSON record (this is the default behaviour; this flag makes the choice explicit).

--yaml-output

  • Default: false
  • Print the result as a YAML record instead of the default JSON format.

--map <string>

  • Default: [] (none)
  • Name of a map attribute to include in the summary. This option may be repeated to request multiple map attributes. Each named attribute will be detailed in the map_attributes section of the output.

Input format

--fasta

  • Default: false
  • Read data following the FASTA format.

--fastq

  • Default: false
  • Read data following the FASTQ format.

--genbank

  • Default: false
  • Read data following the GenBank flatfile format.

--embl

  • Default: false
  • Read data following the EMBL flatfile format.

--ecopcr

  • Default: false
  • Read data following the ecoPCR output format.

--csv

  • Default: false
  • Read data following the CSV format.

--input-OBI-header

  • Default: false
  • FASTA/FASTQ title line annotations follow OBI format.

--input-json-header

  • Default: false
  • FASTA/FASTQ title line annotations follow JSON format.

--solexa

  • Default: false
  • Decode quality strings according to the Solexa specification.

--u-to-t

  • Default: false
  • Convert uracil (U) to thymine (T) when reading RNA sequences.

Batch control

--batch-size <int>

  • Default: 1
  • Minimum number of sequences per processing batch.

--batch-size-max <int>

  • Default: 2000
  • Maximum number of sequences per processing batch.

--batch-mem <string>

  • Default: "" (128M effective)
  • Maximum memory per batch (e.g. 128K, 64M, 1G). Set to 0 to disable the memory limit.
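To make the size notation concrete, the Python sketch below converts strings such as 128K, 64M, or 1G to a byte count. It assumes binary multiples (1K = 1024 bytes), which this page does not state, so obisummary's own parser may interpret the suffixes differently. The `parse_size` helper is hypothetical.

```python
# Hypothetical parser for size strings such as "128K", "64M", "1G" or "0".
# Assumption: K, M and G are binary multiples (1024, 1024**2, 1024**3).

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size string to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # plain integer, e.g. "0" to disable the limit


for value in ("128K", "64M", "1G", "0"):
    print(value, parse_size(value))
```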

Processing

--max-cpu <int>

  • Default: 16
  • Number of parallel threads used to compute the result.

--no-order

  • Default: false
  • When several input files are provided, indicates that there is no order among them.

General

--debug

  • Default: false
  • Enable debug mode by setting the log level to debug.

--silent-warning

  • Default: false
  • Stop printing warning messages.

--version

  • Default: false
  • Print the version and exit.

--help / -h / -?

  • Default: false
  • Display help and exit.

--pprof

  • Default: false
  • Enable the pprof profiling server. Consult the log for the server address.

--pprof-goroutine <int>

  • Default: 6060
  • Enable profiling of goroutine blocking, tuned by the given integer value.

--pprof-mutex <int>

  • Default: 10
  • Enable profiling of mutex locks, tuned by the given integer value.

EXAMPLES

# Get a JSON summary of a FASTA file produced by obiclean
obisummary cleaned.fasta > out_default.json

Expected output: a JSON summary record in out_default.json.

# Get the summary as an explicit JSON record for programmatic processing
obisummary --json-output cleaned.fasta > out_json.json

Expected output: a JSON summary record in out_json.json.

# Get a YAML record from a FASTQ file
obisummary --yaml-output --fastq reads.fastq > out_yaml.yaml

Expected output: a YAML summary record in out_yaml.yaml.

# Summarise data read from standard input, forcing FASTA format
obigrep -p 'annotations.count > 1' sequences.fasta | obisummary --fasta > out_pipeline.json

Expected output: a JSON summary record in out_pipeline.json (3 variants, 10 reads).
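For scripted use, the command lines above can be assembled from Python. The sketch below is one possible wrapper, not an official interface: `build_command` and `run_summary` are hypothetical helpers, and they use only flags listed in the OPTIONS section. `run_summary` requires obisummary to be available on the PATH.

```python
import subprocess

# Format flags documented in the OPTIONS section.
FORMATS = ("fasta", "fastq", "genbank", "embl", "ecopcr", "csv")


def build_command(files, yaml_output=False, maps=(), input_format=None):
    """Assemble an obisummary command line as a list of arguments."""
    cmd = ["obisummary"]
    if yaml_output:
        cmd.append("--yaml-output")
    for name in maps:
        cmd += ["--map", name]  # --map may be repeated
    if input_format is not None:
        if input_format not in FORMATS:
            raise ValueError(f"unsupported format: {input_format}")
        cmd.append("--" + input_format)
    cmd += list(files)
    return cmd


def run_summary(files, **kwargs):
    """Run obisummary and return its standard output as text."""
    result = subprocess.run(
        build_command(files, **kwargs),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


print(build_command(["reads.fastq"], yaml_output=True, input_format="fastq"))
# prints ['obisummary', '--yaml-output', '--fastq', 'reads.fastq']
```

Passing the arguments as a list avoids shell quoting issues with unusual file names.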


SEE ALSO

obiclean, obiuniq, obicount