Files
obitools4/autodoc/docmd/pkg/obiformats/file_chunk_read.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

2.0 KiB
Raw Blame History

Semantic Description of obiformats Package Functionalities

The obiformats package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.

  • PieceOfChunk: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (NewPieceOfChunk, Next()) and final packing into a contiguous slice via Pack().

  • FileChunk: Encapsulates one chunk of raw data (*bytes.Buffer) or its rope representation, tagged with source file name and positional order for ordered downstream processing.

  • ChannelFileChunk: A typed channel (chan FileChunk) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.

  • LastSeqRecord: A callback type (func([]byte) int) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries.

  • ReadFileChunk(): Core function that:

    • Reads from an io.Reader in configurable chunks (fileChunkSize);
    • Uses a probe string (e.g., "@M0" for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
    • Extends chunks incrementally (e.g., +1MB) until a full record boundary is found via splitter;
    • Returns data as an ordered stream of FileChunks on a channel, closing it upon EOF;
    • Optionally packs rope buffers to contiguous memory (pack flag), balancing speed vs. RAM usage.
  • Key semantics:

    • Chunking by record integrity, not fixed byte size — prevents splitting biological entries.
    • Lazy evaluation: only reads ahead when needed to find record boundaries.
    • Streaming-first design — supports large files without full loading into memory.

This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.