Files
obitools4/autodoc/docmd/pkg/obiformats/fastaseq_read.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

2.0 KiB

FASTA Parser Module (obiformats)

This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.

Core Functionalities

  • FastaChunkParser(UtoT bool)
    Returns a parser function for in-memory byte streams. Converts U→T if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.

  • FastaChunkParserRope(...)
    Parses FASTA directly from a PieceOfChunk rope structure, avoiding full data materialization. Optimized for large files.

  • ReadFasta(reader io.Reader, ...)
    High-level API to parse FASTA from any io.Reader. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.

  • ReadFastaFromFile(...) / ReadFastaFromStdin(...)
    Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.

  • EndOfLastFastaEntry(...)
    Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.

Key Features

  • Strict validation: Ensures entries start with >, contain valid identifiers, and only use allowed sequence characters (a-z, - . [ ]).
  • Case normalization: Converts uppercase to lowercase; optional U→T conversion.
  • Whitespace handling: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
  • Parallel processing: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
  • Integration with obiseq/obiiter: Yields typed sequence objects (BioSequence) and batched iterators compatible with OBITools4 pipelines.

Design Highlights

  • Minimal allocations via rope-based parsing (extractFastaSeq).
  • Graceful error reporting with context (source, identifier, invalid char position).
  • Extensible via WithOption pattern for header parsing and batching behavior.