mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
GenBank Parser Module (obiformats)
This Go package provides high-performance parsing of GenBank flat files, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.
Core Functionalities
- State-machine parser: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
- Rope-aware parsing (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations, which is critical for chromosomal-scale sequences.
- Sequence extraction: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (u) to thymine (t).
- Metadata extraction: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
- Optional feature table support: When enabled, stores raw FEATURES section content for downstream annotation processing.
- Parallel streaming I/O:
  - `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
  - Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.
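
As an illustration of the state-machine and sequence-extraction ideas listed above, here is a minimal, self-contained sketch. It is not the obiformats implementation: the `record` type, `parseGenbank`, `compactBases`, and the `inSequence` state name are hypothetical, and the sketch returns an error on an unexpected line where the real parser is described as logging a fatal error.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parserState names the section the parser is currently in.
// inSequence is an assumed name; the text only mentions inHeader, inEntry, inFeature.
type parserState int

const (
	inHeader   parserState = iota // before a LOCUS line
	inEntry                       // inside a record, reading metadata
	inFeature                     // inside the FEATURES table
	inSequence                    // inside the ORIGIN block
)

// record is a hypothetical, minimal result type for this sketch.
type record struct {
	ID, Definition string
	Sequence       []byte
}

// compactBases appends the ASCII letters of one ORIGIN line to dst,
// dropping position numbers and spaces, and optionally mapping u to t.
func compactBases(dst []byte, line string, uToT bool) []byte {
	for i := 0; i < len(line); i++ {
		c := line[i]
		if (c < 'a' || c > 'z') && (c < 'A' || c > 'Z') {
			continue
		}
		if uToT && c == 'u' {
			c = 't'
		}
		dst = append(dst, c)
	}
	return dst
}

// parseGenbank walks the text line by line and rejects lines that are
// not allowed in the current state.
func parseGenbank(text string, uToT bool) ([]record, error) {
	var out []record
	var cur record
	state := inHeader
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.HasPrefix(line, "LOCUS"):
			fields := strings.Fields(line)
			if state != inHeader || len(fields) < 2 {
				return nil, fmt.Errorf("unexpected LOCUS line: %q", line)
			}
			cur = record{ID: fields[1]}
			state = inEntry
		case strings.HasPrefix(line, "DEFINITION"):
			if state != inEntry {
				return nil, fmt.Errorf("DEFINITION outside entry state: %q", line)
			}
			cur.Definition = strings.TrimSpace(strings.TrimPrefix(line, "DEFINITION"))
		case strings.HasPrefix(line, "FEATURES"):
			if state != inEntry {
				return nil, fmt.Errorf("FEATURES outside entry state: %q", line)
			}
			state = inFeature
		case strings.HasPrefix(line, "ORIGIN"):
			if state != inEntry && state != inFeature {
				return nil, fmt.Errorf("ORIGIN outside a record: %q", line)
			}
			state = inSequence
		case line == "//":
			if state == inHeader {
				return nil, fmt.Errorf("record terminator before LOCUS")
			}
			out = append(out, cur)
			state = inHeader
		case state == inSequence:
			cur.Sequence = compactBases(cur.Sequence, line, uToT)
		}
	}
	return out, sc.Err()
}

// sample is a tiny made-up record used only for this demonstration.
var sample = strings.Join([]string{
	"LOCUS       AB000001   12 bp    RNA",
	"DEFINITION  Example record.",
	"FEATURES             Location/Qualifiers",
	"     source          1..12",
	"ORIGIN",
	"        1 acgu acgu ac",
	"       11 gu",
	"//",
}, "\n")

func main() {
	recs, err := parseGenbank(sample, true)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range recs {
		fmt.Printf("%s %q %s\n", r.ID, r.Definition, r.Sequence)
	}
	// prints: AB000001 "Example record." acgtacgtacgt
}
```

Lines that match no case (continuation lines, feature rows, unhandled fields) are simply skipped here; a fuller parser would route them according to the current state.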
Key Design Decisions
- Zero-copy where possible: Rope parser avoids `Pack()` to prevent expensive reallocation.
- Strict state validation: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
- Fallback parsing: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
- U-to-T conversion: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
- Error resilience: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.
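
To make the zero-copy point concrete, the sketch below scans lines across a chain of byte chunks without first concatenating them. The `piece` type and the `ropeLines`, `demoRope`, and `collectLines` helpers are hypothetical stand-ins invented for this example; the actual `PieceOfChunk` type and its methods may look quite different.

```go
package main

import (
	"bytes"
	"fmt"
)

// piece is a hypothetical rope node: one chunk of bytes plus a link to the next.
type piece struct {
	data []byte
	next *piece
}

// ropeLines calls visit once per line. A line that lies entirely inside one
// piece is passed as a sub-slice of that piece (no copy); only a line that
// straddles a piece boundary is assembled in a small reusable buffer.
// The slice given to visit is only valid for the duration of the call.
func ropeLines(head *piece, visit func(line []byte)) {
	var carry []byte // partial line carried over from the previous piece
	for p := head; p != nil; p = p.next {
		rest := p.data
		for {
			i := bytes.IndexByte(rest, '\n')
			if i < 0 {
				break
			}
			if len(carry) > 0 {
				carry = append(carry, rest[:i]...)
				visit(carry)
				carry = carry[:0]
			} else {
				visit(rest[:i])
			}
			rest = rest[i+1:]
		}
		carry = append(carry, rest...)
	}
	if len(carry) > 0 {
		visit(carry) // final line without a trailing newline
	}
}

// demoRope builds a three-piece rope whose boundaries cut through lines.
func demoRope() *piece {
	p3 := &piece{data: []byte("GIN\n//\n")}
	p2 := &piece{data: []byte("TION  demo\nORI"), next: p3}
	return &piece{data: []byte("LOCUS  X1\nDEFINI"), next: p2}
}

// collectLines copies each visited line into a string slice.
func collectLines(head *piece) []string {
	var lines []string
	ropeLines(head, func(line []byte) {
		lines = append(lines, string(line))
	})
	return lines
}

func main() {
	for _, l := range collectLines(demoRope()) {
		fmt.Printf("%q\n", l)
	}
}
```

The memory cost of this approach is bounded by the longest line that crosses a boundary, rather than by the size of the whole record, which is the property that matters for chromosome-sized sequences.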
Output
Returns a batched iterator of BioSequence objects, each containing:
- Identifier (`id`)
- Compact nucleotide sequence
- Definition line (as description)
- Source file origin
- Optional feature table bytes
- Annotations: `scientific_name`, `taxid`
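
The `taxid` annotation comes from a `/db_xref="taxon:..."` qualifier in the feature table. A minimal sketch of that extraction is shown below; `extractTaxid` is a hypothetical helper, and returning the identifier as an `int` is an assumption made for this example (the package may store it differently).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// extractTaxid returns the numeric identifier from the first
// /db_xref="taxon:N" qualifier found in the given feature-table text.
// The boolean is false when no well-formed qualifier is present.
func extractTaxid(features string) (int, bool) {
	const marker = `/db_xref="taxon:`
	i := strings.Index(features, marker)
	if i < 0 {
		return 0, false
	}
	rest := features[i+len(marker):]
	j := strings.IndexByte(rest, '"')
	if j < 0 {
		return 0, false
	}
	id, err := strconv.Atoi(rest[:j])
	if err != nil {
		return 0, false
	}
	return id, true
}

func main() {
	features := "     source          1..12\n" +
		"                     /organism=\"Homo sapiens\"\n" +
		"                     /db_xref=\"taxon:9606\"\n"
	if id, ok := extractTaxid(features); ok {
		fmt.Println("taxid:", id) // prints: taxid: 9606
	}
}
```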
Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).