Files
obitools4/autodoc/docmd/pkg/obiformats/universal_read.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

1.8 KiB

OBIFORMATS Package: Semantic Description

The obiformats package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.

It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:

  • FASTA (text/fasta): identified by lines starting with >.
  • FASTQ (text/fastq): detected via leading @ characters.
  • ecoPCR2: recognized by the header line #@ecopcr-v2.
  • EMBL (text/embl): detected by lines starting with ID .
  • GenBank (text/genbank): identified by either LOCUS or legacy "Genetic Sequence Data Bank" headers.
  • CSV (text/csv): generic tabular support.

Core functionality is exposed through:

  • OBIMimeTypeGuesser(): inspects the first ~1 MiB of an input stream to infer MIME type using github.com/gabriel-vasile/mimetype, while preserving unread data for downstream processing.
  • ReadSequencesFromFile(): reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., ReadFasta, ReadFastq).
  • ReadSequencesFromStdin(): convenience wrapper to read from stdin, treating "-" as filename and auto-closing the stream.

Internally leverages:

  • obiutils.Ropen() for unified file opening (including stdin handling).
  • Path extension stripping and source tagging via OptionsSource().
  • Logging (logrus) for format diagnostics.
  • Iterator interface (obiiter.IBioSequence) to abstract sequential access over sequences.

The package ensures extensibility: new formats can be added by extending the switch dispatch in ReadSequencesFromFile() and registering corresponding MIME types.

Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.