

# FASTQ Parsing Module (`obiformats`)
This Go package provides streaming parsing of FASTQ files, a standard text format that stores nucleotide sequences together with their per-base quality scores.
## Core Functionalities
- **`EndOfLastFastqEntry(buffer []byte) int`**
  Locates the start position (the `@` character) of the last complete FASTQ entry in a byte buffer, using a state machine that scans from the end of the buffer toward the beginning. Returns `-1` if no valid entry is found.
- **`FastqChunkParser(...)`**
  Returns a parser function for processing FASTQ data from an `io.Reader`. It handles:
  - Header parsing (`@id [definition]`)
  - Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
  - Quality score shifting (`quality_shift`)
  - Strict validation (e.g., presence of the `+` line, matching sequence and quality lengths)
- **`FastqChunkParserRope(...)`**
  Optimized parser for rope-based input (`PieceOfChunk`) that avoids unnecessary memory copies by scanning the pieces directly, line by line.
- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
  Enables concurrent, chunked parsing of large files:
  - Splits the input into chunks using `ReadFileChunk`
  - Uses a configurable number of parallel workers (`nworker`)
  - Pushes parsed batches to an iterator interface
- **Convenience I/O Wrappers**
  - `ReadFastqFromFile(filename, ...)`: Parses a file given its name.
  - `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.
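
The chunk-and-workers pattern described under batched file parsing can be sketched as follows. This is a minimal illustration, not the package's code: `batch`, `parseChunk`, and `parseConcurrently` are hypothetical names, a plain channel stands in for the iterator interface, and each chunk is assumed to contain only whole four-line records.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// batch is a hypothetical stand-in for a batch of parsed sequences.
type batch struct {
	order int      // position of the source chunk in the input
	ids   []string // record identifiers found in the chunk
}

// job pairs a chunk with its position in the input.
type job struct {
	order int
	chunk string
}

// parseChunk collects the header text of every four-line record in a chunk.
func parseChunk(order int, chunk string) batch {
	lines := strings.Split(strings.TrimSpace(chunk), "\n")
	b := batch{order: order}
	for i := 0; i+3 < len(lines); i += 4 {
		b.ids = append(b.ids, strings.TrimPrefix(lines[i], "@"))
	}
	return b
}

// parseConcurrently fans chunks out to nworker goroutines, gathers the
// resulting batches, and sorts them back into chunk order.
func parseConcurrently(chunks []string, nworker int) []batch {
	if nworker < 1 {
		nworker = 1
	}
	jobs := make(chan job)
	out := make(chan batch)

	var wg sync.WaitGroup
	for w := 0; w < nworker; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				out <- parseChunk(j.order, j.chunk)
			}
		}()
	}
	// Feed the workers, then signal that no more chunks will come.
	go func() {
		for i, c := range chunks {
			jobs <- job{order: i, chunk: c}
		}
		close(jobs)
	}()
	// Close the output once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var batches []batch
	for b := range out {
		batches = append(batches, b)
	}
	sort.Slice(batches, func(i, j int) bool { return batches[i].order < batches[j].order })
	return batches
}

func main() {
	chunks := []string{
		"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n",
		"@r3\nTT\n+\nII\n",
	}
	for _, b := range parseConcurrently(chunks, 2) {
		fmt.Println(b.order, b.ids)
	}
}
```

Workers may finish in any order, so each batch carries the index of its chunk and the collector sorts on it. A streaming consumer would instead read from the output channel as batches arrive.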
## Key Options & Features
- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation
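
To make the effect of these options concrete, here is a minimal sketch of a single-record parser. It is an illustration under stated assumptions rather than the package's implementation: the `options` and `record` types and the `parseRecord` function are hypothetical, the field names only mirror the documented `with_quality`, `quality_shift`, and `UtoT` switches, and the record is assumed to arrive as exactly four lines.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// options mirrors the documented switches under hypothetical field names.
type options struct {
	withQuality  bool
	qualityShift byte
	uToT         bool
}

// record is a hypothetical container for one parsed FASTQ entry.
type record struct {
	id, definition string
	sequence       []byte
	qualities      []byte
}

// parseRecord validates one four-line record and applies the options.
func parseRecord(lines [4][]byte, opt options) (record, error) {
	header, seq, plus, qual := lines[0], lines[1], lines[2], lines[3]
	if len(header) < 2 || header[0] != '@' {
		return record{}, errors.New("header must start with '@' followed by an identifier")
	}
	if len(plus) == 0 || plus[0] != '+' {
		return record{}, errors.New("third line must start with '+'")
	}
	if len(seq) != len(qual) {
		return record{}, fmt.Errorf("sequence length %d differs from quality length %d", len(seq), len(qual))
	}

	// Split "@id definition" at the first space.
	id, def, _ := bytes.Cut(header[1:], []byte(" "))
	r := record{id: string(id), definition: string(def)}

	// Normalize: lowercase every letter, then optionally turn u into t.
	r.sequence = make([]byte, len(seq))
	for i, c := range seq {
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A'
		}
		if opt.uToT && c == 'u' {
			c = 't'
		}
		r.sequence[i] = c
	}

	// Decode qualities only when requested, subtracting the ASCII shift.
	if opt.withQuality {
		r.qualities = make([]byte, len(qual))
		for i, c := range qual {
			if c < opt.qualityShift {
				return record{}, fmt.Errorf("quality character %q at position %d is below the shift", c, i)
			}
			r.qualities[i] = c - opt.qualityShift
		}
	}
	return r, nil
}

func main() {
	lines := [4][]byte{[]byte("@r1 sample=A"), []byte("ACGU"), []byte("+"), []byte("I5!#")}
	r, err := parseRecord(lines, options{withQuality: true, qualityShift: 33, uToT: true})
	fmt.Println(r.id, r.definition, string(r.sequence), r.qualities, err)
	// prints: r1 sample=A acgt [40 20 0 2] <nil>
}
```

A shift of 33 corresponds to the common Phred+33 encoding. Skipping quality decoding when it is not needed avoids one allocation per record.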
## Design Highlights
- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ input (e.g., invalid characters, sequence/quality length mismatch)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
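
Boundary detection is the subtle part of chunking, because `@` is also a valid quality character and can therefore open a quality line. The sketch below shows one way to find a safe cut point. It is a deliberately simplified stand-in for the state machine in `EndOfLastFastqEntry`, not a copy of it: `lastRecordStart` is a hypothetical helper that assumes unwrapped four-line records and `\n` line endings.

```go
package main

import "fmt"

// lastRecordStart returns the offset of the '@' that opens the last record
// whose header, sequence and '+' lines are all visible in buf, or -1 if
// there is none. Bytes before the offset form whole records; the rest can
// be carried over to the next chunk.
func lastRecordStart(buf []byte) int {
	// Record the offset at which every line starts.
	starts := []int{0}
	for i, c := range buf {
		if c == '\n' && i+1 < len(buf) {
			starts = append(starts, i+1)
		}
	}
	// Walk the lines from last to first. A true header starts with '@'
	// and is followed, two lines later, by the '+' separator. A quality
	// line that starts with '@' fails this test, because two lines after
	// it comes a sequence line.
	for k := len(starts) - 3; k >= 0; k-- {
		if buf[starts[k]] == '@' && buf[starts[k+2]] == '+' {
			return starts[k]
		}
	}
	return -1
}

func main() {
	// The second record is cut in the middle of its quality line.
	buf := []byte("@r1\nACGT\n+\n@III\n@r2\nGGCC\n+\nII")
	cut := lastRecordStart(buf)
	fmt.Println(cut) // prints: 16
	fmt.Printf("%q\n", buf[:cut])
}
```

Cutting at the start of the last confirmed record, rather than at the end of the buffer, means a chunk never ends inside a record, at the cost of re-reading one record's worth of bytes in the next chunk.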