obitools4/autodoc/docmd/pkg/obiiter/batchiterator.md

obiiter: Stream-Based Biosequence Iterator Library

This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (BioSequence), designed for high-throughput NGS data pipelines.

Core Functionality

  • Batched Streaming: Reads sequences in configurable batches (BioSequenceBatch) via a channel-based iterator.
  • Thread Safety: Uses sync.WaitGroup, RWMutex, and atomic flags for safe concurrent access.
  • Lazy Evaluation: Iteration is on-demand via Next()/Get(), supporting memory-efficient processing.
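
The channel-based, on-demand iteration described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `Batch`, `BatchIterator`, and `NewBatchIterator` are hypothetical stand-ins, and sequences are plain strings rather than `BioSequence` values. Only the `Next()`/`Get()` calling pattern is taken from the text.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for BioSequenceBatch: an order number
// plus a slice of sequences (plain strings here).
type Batch struct {
	Order int
	Seqs  []string
}

// BatchIterator is a simplified, assumed design: a producer goroutine pushes
// batches into a channel and the consumer pulls them with Next()/Get().
type BatchIterator struct {
	ch      chan Batch
	current Batch
}

// NewBatchIterator cuts seqs into batches of at most size items (size must
// be positive) and streams them from a background goroutine.
func NewBatchIterator(seqs []string, size int) *BatchIterator {
	it := &BatchIterator{ch: make(chan Batch)}
	go func() {
		defer close(it.ch)
		order := 0
		for start := 0; start < len(seqs); start += size {
			end := start + size
			if end > len(seqs) {
				end = len(seqs)
			}
			it.ch <- Batch{Order: order, Seqs: seqs[start:end]}
			order++
		}
	}()
	return it
}

// Next blocks until a batch is available and returns false once the
// producer has closed the channel and every batch has been consumed.
func (it *BatchIterator) Next() bool {
	b, ok := <-it.ch
	if ok {
		it.current = b
	}
	return ok
}

// Get returns the batch fetched by the last successful call to Next.
func (it *BatchIterator) Get() Batch { return it.current }

func main() {
	it := NewBatchIterator([]string{"ACGT", "TTGA", "GGCC", "ATAT", "CGCG"}, 2)
	for it.Next() {
		b := it.Get()
		fmt.Println(b.Order, b.Seqs)
	}
}
```

Because the channel is unbuffered, the producer only prepares the next batch when the consumer asks for it, which is what keeps memory use bounded in this sketch.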

Iterator Management

  • Construction: MakeIBioSequence() initializes a new iterator with default settings.
  • Lifecycle Control:
    • Add(n), Done(): Track the goroutines still feeding the iterator, in the manner of sync.WaitGroup.
    • Lock/RLock and Unlock/RUnlock: Explicit synchronization.
    • Wait() / Close(), WaitAndClose(): Graceful shutdown.
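
One way to picture the lifecycle calls above is a channel guarded by a `sync.WaitGroup`: producers register with `Add`, signal completion with `Done`, and the channel is closed only after all of them have finished. The `Stream` type and `produce` helper below are hypothetical; only the method names `Add`, `Done`, and `WaitAndClose` come from the text, and their real signatures may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Stream is a hypothetical stand-in for the iterator: a channel of batches
// plus a WaitGroup counting the workers still writing to it.
type Stream struct {
	ch chan []string
	wg sync.WaitGroup
}

func NewStream() *Stream { return &Stream{ch: make(chan []string)} }

// Add and Done mirror the sync.WaitGroup methods of the same name.
func (s *Stream) Add(n int) { s.wg.Add(n) }
func (s *Stream) Done()     { s.wg.Done() }

// WaitAndClose blocks until every registered worker has called Done, then
// closes the channel so that the consumer's range loop terminates.
func (s *Stream) WaitAndClose() {
	s.wg.Wait()
	close(s.ch)
}

// produce starts the given number of workers, each pushing one batch, and
// arranges for the stream to be closed once they are all finished.
func produce(s *Stream, workers int) {
	s.Add(workers)
	for w := 0; w < workers; w++ {
		go func(id int) {
			defer s.Done()
			s.ch <- []string{fmt.Sprintf("seq_from_worker_%d", id)}
		}(w)
	}
	go s.WaitAndClose()
}

func main() {
	s := NewStream()
	produce(s, 3)
	n := 0
	for batch := range s.ch {
		n += len(batch)
	}
	fmt.Println("sequences received:", n) // prints "sequences received: 3"
}
```

Closing from a separate goroutine after `Wait` avoids the classic failure of closing a channel while a worker is still sending on it.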

Batch Transformation & Reorganization

  • Rebatch(size): Redistributes sequences into fixed-size batches (requires sorting).
  • RebatchBySize(maxBytes, maxCount): Dynamic batching respecting memory and count limits.
  • SortBatches(): Ensures batches are emitted in strict order (by order field).
  • Concatenation & Pooling:
    • Concat(...): Sequentially merges multiple iterators.
    • Pool(...): Interleaves batches from several sources (preserves order via renumbering).
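
The strict ordering attributed to SortBatches() can be illustrated with a small reordering stage. This is a sketch under stated assumptions, not the library's code: `Batch` and `sortBatches` are hypothetical, and batch numbers are assumed to run from 0 upward without gaps. Batches that arrive early are parked in a map until the next expected number shows up.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for BioSequenceBatch.
type Batch struct {
	Order int
	Seqs  []string
}

// sortBatches re-emits batches in increasing Order. It assumes orders are
// 0, 1, 2, ... with no gaps; a missing number would hold back later batches.
func sortBatches(in <-chan Batch) <-chan Batch {
	out := make(chan Batch)
	go func() {
		defer close(out)
		pending := make(map[int]Batch) // batches that arrived too early
		next := 0                      // order number to emit next
		for b := range in {
			pending[b.Order] = b
			// Flush every batch that is now contiguous with the output.
			for {
				ready, ok := pending[next]
				if !ok {
					break
				}
				delete(pending, next)
				out <- ready
				next++
			}
		}
	}()
	return out
}

func main() {
	in := make(chan Batch)
	go func() {
		defer close(in)
		for _, o := range []int{2, 0, 1, 3} {
			in <- Batch{Order: o, Seqs: []string{fmt.Sprintf("seq%d", o)}}
		}
	}()
	for b := range sortBatches(in) {
		fmt.Println(b.Order, b.Seqs)
	}
}
```

The buffer grows only as large as the worst out-of-order gap, so the stage stays cheap when upstream workers finish roughly in sequence.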

Filtering & Predicate-Based Processing

  • FilterOn(pred, size): Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
  • FilterAnd(pred, size): Same as FilterOn, but also checks paired-end consistency.
  • DivideOn(pred, size): Splits input into two iterators (true, false) based on predicate.
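
A parallel predicate filter of the kind described above might look like the sketch below. The names `Batch` and `filterOn`, the `workers` parameter, and the use of plain strings as sequences are all assumptions made for illustration; the real FilterOn signature and its recycling of discarded sequences are not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// Batch is a hypothetical stand-in for BioSequenceBatch.
type Batch struct {
	Order int
	Seqs  []string
}

// filterOn starts the given number of worker goroutines. Each one takes
// batches from in, keeps the sequences for which pred returns true, and
// forwards the reduced batch with its original Order. Output batches may
// arrive in any order because the workers run concurrently.
func filterOn(in <-chan Batch, pred func(string) bool, workers int) <-chan Batch {
	out := make(chan Batch)
	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for b := range in {
				kept := make([]string, 0, len(b.Seqs))
				for _, s := range b.Seqs {
					if pred(s) {
						kept = append(kept, s)
					}
				}
				out <- Batch{Order: b.Order, Seqs: kept}
			}
		}()
	}
	// Close the output once every worker has drained the input.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	in := make(chan Batch)
	go func() {
		defer close(in)
		in <- Batch{Order: 0, Seqs: []string{"ACGT", "AC"}}
		in <- Batch{Order: 1, Seqs: []string{"T", "GGCCA"}}
	}()
	longEnough := func(s string) bool { return len(s) >= 4 }
	kept := 0
	for b := range filterOn(in, longEnough, 2) {
		kept += len(b.Seqs)
	}
	fmt.Println("kept:", kept) // prints "kept: 2"
}
```

Keeping the original order number on each filtered batch is what lets a later ordering stage restore the input sequence.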

Utility & Analysis

  • Load(): Collects all sequences into a single slice (for small datasets).
  • Count(recycle): Returns (variants, reads, nucleotides).
  • Consume() / Recycle(): Drains iterator, optionally triggering sequence recycling.
  • CompleteFileIterator(): Reads entire remaining file as one batch.
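
The three values returned by Count can be sketched as a single pass over all batches. The interpretation used here is an assumption: variants is taken to be the number of sequence records, reads the sum of the per-record read counts, and nucleotides the sum of the sequence lengths. `Seq` and `count` are hypothetical names, and the `recycle` argument is left out.

```go
package main

import "fmt"

// Seq is a hypothetical stand-in for BioSequence: the nucleotide string
// plus the number of reads this record represents.
type Seq struct {
	Bases string
	Reads int
}

// count drains the channel and accumulates, under the assumptions stated
// above, the number of records, the total read count, and the total length.
func count(batches <-chan []Seq) (variants, reads, nucleotides int) {
	for batch := range batches {
		for _, s := range batch {
			variants++
			reads += s.Reads
			nucleotides += len(s.Bases)
		}
	}
	return variants, reads, nucleotides
}

func main() {
	batches := make(chan []Seq)
	go func() {
		defer close(batches)
		batches <- []Seq{{Bases: "ACGT", Reads: 3}, {Bases: "AC", Reads: 1}}
		batches <- []Seq{{Bases: "GGG", Reads: 2}}
	}()
	v, r, n := count(batches)
	fmt.Println(v, r, n) // prints "3 6 9"
}
```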

Additional Features

  • Supports paired-end data via MarkAsPaired() / IsPaired().
  • Batch ordering preserved for downstream reproducibility.
  • Integrates with OBITools4's obidefault and obiutils packages for configuration and resource management.

Designed for scalability, low memory footprint, and composability in bioinformatics workflows.