- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
2.0 KiB
Semantic Description of obiformats Package Functionalities
The obiformats package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.
-
PieceOfChunk: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (NewPieceOfChunk,Next()) and final packing into a contiguous slice viaPack(). -
FileChunk: Encapsulates one chunk of raw data (*bytes.Buffer) or its rope representation, tagged with source file name and positional order for ordered downstream processing. -
ChannelFileChunk: A typed channel (chan FileChunk) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows. -
LastSeqRecord: A callback type (func([]byte) int) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries. -
ReadFileChunk(): Core function that:- Reads from an
io.Readerin configurable chunks (fileChunkSize); - Uses a probe string (e.g.,
"@M0"for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing; - Extends chunks incrementally (e.g., +1 MB) until a full record boundary is found via
splitter; - Returns data as an ordered stream of
FileChunks on a channel, closing it upon EOF; - Optionally packs rope buffers to contiguous memory (
packflag), balancing speed vs. RAM usage.
- Reads from an
-
Key semantics:
- Chunking by record integrity, not fixed byte size — prevents splitting biological entries.
- Lazy evaluation: only reads ahead when needed to find record boundaries.
- Streaming-first design — supports large files without full loading into memory.
This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.