mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
OBIFORMATS Package: Semantic Description
The obiformats package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.
It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:
- FASTA (`text/fasta`): identified by lines starting with `>`.
- FASTQ (`text/fastq`): detected via leading `@` characters.
- ecoPCR2: recognized by the header line `#@ecopcr-v2`.
- EMBL (`text/embl`): detected by lines starting with `ID`.
- GenBank (`text/genbank`): identified by either `LOCUS` or legacy `"Genetic Sequence Data Bank"` headers.
- CSV (`text/csv`): generic tabular support.
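The signature rules listed above can be illustrated with a small, standard-library-only sketch. It is not the package's implementation, which relies on `github.com/gabriel-vasile/mimetype`. The function name `guessFormat`, the `text/ecopcr2` label, and the `unknown` fallback are assumptions made for this example, and the sketch only inspects the very beginning of the buffer rather than every line.

```go
package main

import (
	"bytes"
	"fmt"
)

// guessFormat is a hypothetical helper (not part of obiformats) that maps the
// first bytes of an input to a format label using the signatures listed above.
func guessFormat(head []byte) string {
	switch {
	case bytes.HasPrefix(head, []byte("#@ecopcr-v2")):
		return "text/ecopcr2" // assumed label; the description gives no MIME type for ecoPCR2
	case bytes.HasPrefix(head, []byte(">")):
		return "text/fasta"
	case bytes.HasPrefix(head, []byte("@")):
		return "text/fastq"
	case bytes.HasPrefix(head, []byte("ID")):
		return "text/embl"
	case bytes.HasPrefix(head, []byte("LOCUS")),
		bytes.Contains(head, []byte("Genetic Sequence Data Bank")):
		return "text/genbank"
	default:
		// CSV has no fixed signature, so this sketch does not try to detect it.
		return "unknown"
	}
}

func main() {
	fmt.Println(guessFormat([]byte(">seq1\nACGT\n")))
	fmt.Println(guessFormat([]byte("@read1\nACGT\n+\nIIII\n")))
}
```

Checking the ecoPCR2 header before the other signatures keeps the most specific rule first, although in this sketch no two prefixes actually overlap.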
Core functionality is exposed through:
- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
- `ReadSequencesFromFile()`: reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as filename and auto-closing the stream.
Internally leverages:
- `obiutils.Ropen()` for unified file opening (including stdin handling).
- Path extension stripping and source tagging via `OptionsSource()`.
- Logging (`logrus`) for format diagnostics.
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.
The package is designed to be extensible: a new format is added by registering its MIME type for detection and adding a matching case to the switch dispatch in `ReadSequencesFromFile()`.
Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.