mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,41 @@
# FASTQ Parsing Module (`obiformats`)

This Go package provides robust, streaming-capable parsing of FASTQ files, a standard format for storing nucleotide sequences along with their quality scores.

## Core Functionalities

- **`EndOfLastFastqEntry(buffer []byte) int`**
  Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer by running a state machine backward from the end of the buffer. Returns `-1` if no valid entry is found.

- **`FastqChunkParser(...)`**
  Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
  - Header parsing (`@id [definition]`)
  - Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
  - Quality score shifting (`quality_shift`)
  - Strict validation (e.g., presence of the `+` line, matching sequence and quality lengths)

- **`FastqChunkParserRope(...)`**
  Optimized parser for rope-based input (`PieceOfChunk`) that avoids unnecessary memory copies. Uses direct line-by-line scanning.

- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
  Enables concurrent, chunked parsing of large files:
  - Splits input into chunks using `ReadFileChunk`
  - Uses a configurable number of parallel workers (`nworker`)
  - Pushes parsed batches to an iterator interface

- **Convenience I/O Wrappers**
  - `ReadFastqFromFile(filename, ...)`: Parses a file by name.
  - `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.

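The per-record work listed for `FastqChunkParser` can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: the `Record` type, the `parseRecord` function, and its parameters are hypothetical names chosen for this example, and it assumes strict four-line records and an alphabet restricted to letters.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Record is a hypothetical stand-in for a parsed FASTQ entry.
// It is not the obiseq sequence type.
type Record struct {
	ID         string
	Definition string
	Sequence   []byte
	Quality    []byte // decoded scores; nil when quality is not requested
}

// parseRecord parses one four-line record: header, sequence, '+', quality.
// Assumption: letters are the only valid sequence characters.
func parseRecord(lines [4][]byte, qualityShift byte, withQuality, uToT bool) (Record, error) {
	header, seq, plus, qual := lines[0], lines[1], lines[2], lines[3]

	if len(header) < 2 || header[0] != '@' {
		return Record{}, errors.New("header must be '@' followed by an identifier")
	}
	// Split "@id [definition]" at the first space.
	id, def, _ := bytes.Cut(header[1:], []byte(" "))
	if len(id) == 0 {
		return Record{}, errors.New("empty identifier")
	}
	if len(plus) == 0 || plus[0] != '+' {
		return Record{}, errors.New("third line must start with '+'")
	}
	if len(qual) != len(seq) {
		return Record{}, fmt.Errorf("sequence length %d != quality length %d", len(seq), len(qual))
	}

	rec := Record{ID: string(id), Definition: string(def), Sequence: make([]byte, len(seq))}
	for i, c := range seq {
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A' // normalize to lowercase
		}
		if c < 'a' || c > 'z' {
			return Record{}, fmt.Errorf("invalid sequence character %q at position %d", c, i)
		}
		if uToT && c == 'u' {
			c = 't' // RNA to DNA normalization
		}
		rec.Sequence[i] = c
	}

	if withQuality {
		rec.Quality = make([]byte, len(qual))
		for i, q := range qual {
			if q < qualityShift {
				return Record{}, fmt.Errorf("quality character %q is below the shift", q)
			}
			rec.Quality[i] = q - qualityShift // e.g. shift 33 for Phred+33
		}
	}
	return rec, nil
}

func main() {
	lines := [4][]byte{
		[]byte("@seq1 sample=A"),
		[]byte("ACGU"),
		[]byte("+"),
		[]byte("IIII"),
	}
	rec, err := parseRecord(lines, 33, true, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.ID, rec.Definition, string(rec.Sequence), rec.Quality)
}
```

Returning an error on the first violation mirrors the fail-fast behaviour described in this document; the real parser may accept a wider alphabet than this sketch does.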
## Key Options & Features

- **Quality handling**: Optional quality extraction (`with_quality`) with a configurable offset (`quality_shift`)
- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation

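To make the interplay between parallel workers and batch sorting concrete, here is a small sketch using only the standard library. The `Batch` type and the `parseChunk` and `parseAll` functions are hypothetical and do not reflect the `obiiter` API; the sketch assumes each chunk already holds whole four-line records, and it aggregates every batch before returning, in the spirit of the full-file mode.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Batch is a hypothetical parsed batch tagged with the order of its chunk.
type Batch struct {
	Order int
	IDs   []string
}

// parseChunk is a toy chunk parser: it only collects record identifiers,
// assuming well-formed four-line records.
func parseChunk(order int, chunk string) Batch {
	lines := strings.Split(strings.TrimSpace(chunk), "\n")
	b := Batch{Order: order}
	for i := 0; i+3 < len(lines); i += 4 {
		header := strings.Fields(lines[i])[0]
		b.IDs = append(b.IDs, strings.TrimPrefix(header, "@"))
	}
	return b
}

// parseAll parses chunks with nworker goroutines and returns the batches
// sorted by their original chunk order.
func parseAll(chunks []string, nworker int) []Batch {
	if nworker < 1 {
		nworker = 1
	}
	type job struct {
		order int
		chunk string
	}
	jobs := make(chan job)
	out := make(chan Batch)

	var wg sync.WaitGroup
	for w := 0; w < nworker; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				out <- parseChunk(j.order, j.chunk)
			}
		}()
	}
	// Feed the workers, then signal that no more chunks are coming.
	go func() {
		for i, c := range chunks {
			jobs <- job{i, c}
		}
		close(jobs)
	}()
	// Close the output once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var batches []Batch
	for b := range out { // batches arrive in completion order
		batches = append(batches, b)
	}
	sort.Slice(batches, func(i, j int) bool { return batches[i].Order < batches[j].Order })
	return batches
}

func main() {
	chunks := []string{
		"@a x\nAC\n+\nII\n@b\nGT\n+\nII\n",
		"@c\nAA\n+\nII\n",
		"@d\nCC\n+\nII\n",
	}
	for _, b := range parseAll(chunks, 2) {
		fmt.Println(b.Order, b.IDs)
	}
}
```

A streaming consumer would read from the output channel directly instead of collecting and sorting, trading deterministic order for lower latency and memory use.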
## Design Highlights

- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid characters, length mismatch)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
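Boundary detection matters because a quality line may itself begin with `@`, so a chunk cannot simply be cut at the last `@` in the buffer. The sketch below shows one simplified way to find a safe cut point. It is not the state machine used by `EndOfLastFastqEntry`: the `lastEntryStart` function is a hypothetical name, it works on whole lines rather than byte by byte, and it assumes four-line records whose sequence lines never start with `+`.

```go
package main

import "fmt"

// lastEntryStart returns the offset of the '@' that opens the last entry
// whose header, sequence and '+' lines are all visible in buf, or -1 if
// there is none. Simplified heuristic, assuming four-line records.
func lastEntryStart(buf []byte) int {
	// Record the offset at which each line begins.
	starts := []int{0}
	for i, c := range buf {
		if c == '\n' && i+1 < len(buf) {
			starts = append(starts, i+1)
		}
	}
	// Walk the lines from last to first. A true header starts with '@'
	// and is followed two lines later by the '+' separator; a quality
	// line starting with '@' is followed two lines later by a sequence.
	for k := len(starts) - 3; k >= 0; k-- {
		if buf[starts[k]] == '@' && buf[starts[k+2]] == '+' {
			return starts[k]
		}
	}
	return -1
}

func main() {
	// Two full records and a truncated third; the second quality line
	// deliberately starts with '@'.
	buf := []byte("@a\nACGT\n+\nIIII\n@b\nGGCC\n+\n@@II\n@c\nAC")
	cut := lastEntryStart(buf)
	fmt.Println(cut)
	if cut >= 0 {
		fmt.Printf("parse now: %q\ncarry over: %q\n", buf[:cut], buf[cut:])
	}
}
```

Everything before the returned offset can be handed to a parser, while the bytes from the offset onward are carried over and prepended to the next read.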