mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
35 lines
2.0 KiB
Markdown
35 lines
2.0 KiB
Markdown
# FASTA Parser Module (`obiformats`)
|
|
|
|
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
|
|
|
|
## Core Functionalities
|
|
|
|
- **`FastaChunkParser(UtoT bool)`**
|
|
Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
|
|
|
|
- **`FastaChunkParserRope(...)`**
|
|
Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
|
|
|
|
- **`ReadFasta(reader io.Reader, ...)`**
|
|
High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
|
|
|
|
- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
|
|
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
|
|
|
|
- **`EndOfLastFastaEntry(...)`**
|
|
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
|
|
|
|
## Key Features
|
|
|
|
- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
|
|
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
|
|
- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
|
|
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
|
|
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
|
|
|
|
## Design Highlights
|
|
|
|
- Minimal allocations via rope-based parsing (`extractFastaSeq`).
|
|
- Graceful error reporting with context (source, identifier, invalid char position).
|
|
- Extensible via `WithOption` pattern for header parsing and batching behavior.
|