Files
obitools4/autodoc/docmd/pkg/obiformats/fastaseq_read.md
T

35 lines
2.0 KiB
Markdown
Raw Normal View History

2026-04-07 08:36:50 +02:00
# FASTA Parser Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
## Core Functionalities
- **`FastaChunkParser(UtoT bool)`**
Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
- **`FastaChunkParserRope(...)`**
Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
- **`ReadFasta(reader io.Reader, ...)`**
High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
- **`EndOfLastFastaEntry(...)`**
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
## Key Features
- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
## Design Highlights
- Minimal allocations via rope-based parsing (`extractFastaSeq`).
- Graceful error reporting with context (source, identifier, invalid char position).
- Extensible via `WithOption` pattern for header parsing and batching behavior.