- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
2.0 KiB
FASTA Parser Module (obiformats)
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
Core Functionalities
-
FastaChunkParser(UtoT bool)
Returns a parser function for in-memory byte streams. ConvertsU→Tif enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries. -
FastaChunkParserRope(...)
Parses FASTA directly from aPieceOfChunkrope structure, avoiding full data materialization. Optimized for large files. -
ReadFasta(reader io.Reader, ...)
High-level API to parse FASTA from anyio.Reader. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing. -
ReadFastaFromFile(...)/ReadFastaFromStdin(...)
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling. -
EndOfLastFastaEntry(...)
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
Key Features
- Strict validation: Ensures entries start with
>, contain valid identifiers, and only use allowed sequence characters (a-z,- . [ ]). - Case normalization: Converts uppercase to lowercase; optional
U→Tconversion. - Whitespace handling: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
- Parallel processing: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- Integration with
obiseq/obiiter: Yields typed sequence objects (BioSequence) and batched iterators compatible with OBITools4 pipelines.
Design Highlights
- Minimal allocations via rope-based parsing (
extractFastaSeq). - Graceful error reporting with context (source, identifier, invalid char position).
- Extensible via
WithOptionpattern for header parsing and batching behavior.