obitools4/autodoc/docmd/pkg/obiiter/batchiterator.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, RWMutex, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
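The channel-backed `Next()`/`Get()` pattern described above can be illustrated with a minimal, self-contained sketch. It is not the `obiiter` implementation: `Batch`, `Iterator`, `NewIterator` and `CountSeqs` are hypothetical names, and sequences are reduced to plain strings.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for BioSequenceBatch: an order number
// plus a slice of sequences (plain strings in this sketch).
type Batch struct {
	Order int
	Seqs  []string
}

// Iterator is a hypothetical, simplified batch iterator backed by a channel.
type Iterator struct {
	ch      chan Batch
	current Batch
}

// NewIterator starts a producer goroutine that cuts seqs into batches of at
// most size elements and sends them on an unbuffered channel. Nothing is
// produced ahead of the consumer, which is what makes iteration lazy.
func NewIterator(seqs []string, size int) *Iterator {
	if size < 1 {
		size = 1 // guard against a non-advancing loop
	}
	it := &Iterator{ch: make(chan Batch)}
	go func() {
		defer close(it.ch)
		order := 0
		for start := 0; start < len(seqs); start += size {
			end := start + size
			if end > len(seqs) {
				end = len(seqs)
			}
			it.ch <- Batch{Order: order, Seqs: seqs[start:end]}
			order++
		}
	}()
	return it
}

// Next blocks until a batch is available and reports false once the
// producer has closed the channel.
func (it *Iterator) Next() bool {
	b, ok := <-it.ch
	if ok {
		it.current = b
	}
	return ok
}

// Get returns the batch fetched by the last successful call to Next.
func (it *Iterator) Get() Batch { return it.current }

// CountSeqs drains the iterator and reports how many batches and
// sequences it delivered.
func CountSeqs(it *Iterator) (batches, seqs int) {
	for it.Next() {
		batches++
		seqs += len(it.Get().Seqs)
	}
	return batches, seqs
}

func main() {
	it := NewIterator([]string{"ACGT", "TTGA", "GGCC", "ATAT", "CCGG"}, 2)
	for it.Next() {
		b := it.Get()
		fmt.Println(b.Order, b.Seqs)
	}
}
```

Because the channel is unbuffered, the producer only builds the next batch when the consumer asks for it; a real pipeline would more likely use a small buffer to overlap I/O and processing.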
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
  - `Add(n)`, `Done()`: Track active worker goroutines, in the manner of `sync.WaitGroup`.
  - `Lock()` / `RLock()` and `Unlock()` / `RUnlock()`: Explicit synchronization.
  - `Wait()`, `Close()`, `WaitAndClose()`: Graceful shutdown.
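The lifecycle calls listed above follow a common Go idiom: register producers, let each signal completion, and close the channel only once all of them are done. The sketch below shows that idiom under stated assumptions; `Stream`, `NewStream` and `SumWithWorkers` are hypothetical names, and the method bodies are a guess at the general pattern rather than the package's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// Stream is a hypothetical iterator shell: a data channel plus a
// WaitGroup that counts the producers still writing to it.
type Stream struct {
	ch chan int
	wg sync.WaitGroup
}

func NewStream() *Stream { return &Stream{ch: make(chan int)} }

// Add registers n producers; Done signals that one of them has finished.
func (s *Stream) Add(n int) { s.wg.Add(n) }
func (s *Stream) Done()     { s.wg.Done() }

// WaitAndClose blocks until every registered producer has called Done,
// then closes the channel so that consumers see the end of the stream.
func (s *Stream) WaitAndClose() {
	s.wg.Wait()
	close(s.ch)
}

// SumWithWorkers starts nworkers producers that each send 1..perWorker,
// and returns the sum of everything received.
func SumWithWorkers(nworkers, perWorker int) int {
	s := NewStream()
	s.Add(nworkers) // register before starting the goroutines
	for w := 0; w < nworkers; w++ {
		go func() {
			defer s.Done()
			for i := 1; i <= perWorker; i++ {
				s.ch <- i
			}
		}()
	}
	// Close from a separate goroutine so the consumer below can drain
	// the channel while the producers are still running.
	go s.WaitAndClose()

	total := 0
	for v := range s.ch {
		total += v
	}
	return total
}

func main() {
	fmt.Println(SumWithWorkers(4, 10)) // 4 * (1 + ... + 10) = 220
}
```

Calling `Add` before launching the goroutines matters: if a producer registered itself after starting, `WaitAndClose` could observe a zero counter and close the channel too early.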
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
  - `Concat(...)`: Sequentially merges multiple iterators.
  - `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
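To make the relationship between ordering and rebatching concrete, here is a small sketch of both steps on plain channels. It assumes batches carry consecutive order numbers starting at 0; the `Batch` type and the free functions `SortBatches`, `Rebatch`, `source` and `Describe` are hypothetical illustrations, not the methods of the `obiiter` iterator.

```go
package main

import (
	"fmt"
	"strings"
)

// Batch is a hypothetical stand-in for BioSequenceBatch.
type Batch struct {
	Order int
	Seqs  []string
}

// SortBatches re-emits batches in increasing Order, buffering any batch
// that arrives early. Assumes orders 0..n-1 each occur exactly once; a
// gap would leave later batches buffered and never emitted.
func SortBatches(in <-chan Batch) <-chan Batch {
	out := make(chan Batch)
	go func() {
		defer close(out)
		pending := make(map[int]Batch)
		next := 0
		for b := range in {
			pending[b.Order] = b
			for {
				nb, ok := pending[next]
				if !ok {
					break
				}
				delete(pending, next)
				out <- nb
				next++
			}
		}
	}()
	return out
}

// Rebatch regroups sequences into batches of exactly size elements (the
// last one may be shorter) and renumbers them from 0.
func Rebatch(in <-chan Batch, size int) <-chan Batch {
	if size < 1 {
		size = 1
	}
	out := make(chan Batch)
	go func() {
		defer close(out)
		buf := make([]string, 0, size)
		order := 0
		for b := range in {
			for _, s := range b.Seqs {
				buf = append(buf, s)
				if len(buf) == size {
					out <- Batch{Order: order, Seqs: buf}
					order++
					buf = make([]string, 0, size)
				}
			}
		}
		if len(buf) > 0 {
			out <- Batch{Order: order, Seqs: buf}
		}
	}()
	return out
}

// source returns a closed, pre-filled channel holding the given batches.
func source(batches ...Batch) <-chan Batch {
	ch := make(chan Batch, len(batches))
	for _, b := range batches {
		ch <- b
	}
	close(ch)
	return ch
}

// Describe drains a stream and renders it as "A,B|C,D|E".
func Describe(in <-chan Batch) string {
	var parts []string
	for b := range in {
		parts = append(parts, strings.Join(b.Seqs, ","))
	}
	return strings.Join(parts, "|")
}

func main() {
	in := source(
		Batch{Order: 2, Seqs: []string{"E"}},
		Batch{Order: 0, Seqs: []string{"A", "B", "C"}},
		Batch{Order: 1, Seqs: []string{"D"}},
	)
	fmt.Println(Describe(Rebatch(SortBatches(in), 2))) // A,B|C,D|E
}
```

Sorting before rebatching is what keeps the output deterministic: without it, the content of each fixed-size batch would depend on the arrival order of the upstream batches.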
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
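A parallel predicate filter of the kind described above can be sketched with a small worker pool. This is an illustration only: `Batch`, `Predicate`, the free function `FilterOn` and the helper `KeptSequences` are hypothetical, the worker count is passed explicitly, and the recycling of discarded sequences mentioned above is left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Batch is a hypothetical stand-in for BioSequenceBatch.
type Batch struct {
	Order int
	Seqs  []string
}

// Predicate decides whether a sequence is kept.
type Predicate func(seq string) bool

// FilterOn runs nworkers goroutines that each read batches from in, keep
// the sequences satisfying pred, and emit the filtered batch with its
// original Order so that downstream code can restore the ordering.
func FilterOn(in <-chan Batch, pred Predicate, nworkers int) <-chan Batch {
	if nworkers < 1 {
		nworkers = 1
	}
	out := make(chan Batch)
	var wg sync.WaitGroup
	wg.Add(nworkers)
	for w := 0; w < nworkers; w++ {
		go func() {
			defer wg.Done()
			for b := range in {
				kept := make([]string, 0, len(b.Seqs))
				for _, s := range b.Seqs {
					if pred(s) {
						kept = append(kept, s)
					}
				}
				out <- Batch{Order: b.Order, Seqs: kept}
			}
		}()
	}
	// Close the output once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// KeptSequences feeds batches through FilterOn, restores batch order,
// and returns the surviving sequences joined by commas.
func KeptSequences(batches []Batch, pred Predicate, nworkers int) string {
	in := make(chan Batch)
	go func() {
		defer close(in)
		for _, b := range batches {
			in <- b
		}
	}()
	var results []Batch
	for b := range FilterOn(in, pred, nworkers) {
		results = append(results, b)
	}
	sort.Slice(results, func(i, j int) bool {
		return results[i].Order < results[j].Order
	})
	var kept []string
	for _, b := range results {
		kept = append(kept, b.Seqs...)
	}
	return strings.Join(kept, ",")
}

func main() {
	batches := []Batch{
		{Order: 0, Seqs: []string{"ACGT", "AC"}},
		{Order: 1, Seqs: []string{"GGGTTT", "T"}},
		{Order: 2, Seqs: []string{"CCCC"}},
	}
	longEnough := func(s string) bool { return len(s) >= 4 }
	fmt.Println(KeptSequences(batches, longEnough, 3)) // ACGT,GGGTTT,CCCC
}
```

Workers finish in arbitrary order, so the filtered batches leave the pool unsorted; keeping the `Order` field intact is what allows a later sorting step to restore a reproducible stream.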
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.
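The three values returned by `Count` can be read as follows, assuming (this is an interpretation, not confirmed by the text above) that `variants` is the number of distinct sequence records, `reads` the sum of their abundance counts, and `nucleotides` the sum of their lengths. The sketch uses a hypothetical `Seq` type and free functions `Count` and `CountBatches`, and omits the `recycle` flag.

```go
package main

import "fmt"

// Seq is a hypothetical, reduced sequence record: the bases plus the
// number of reads that were collapsed into this record.
type Seq struct {
	Bases string
	Count int
}

// Count drains a stream of batches and returns the number of records
// (variants), the total abundance (reads), and the total length in
// bases (nucleotides).
func Count(in <-chan []Seq) (variants, reads, nucleotides int) {
	for batch := range in {
		for _, s := range batch {
			variants++
			reads += s.Count
			nucleotides += len(s.Bases)
		}
	}
	return variants, reads, nucleotides
}

// CountBatches is a convenience wrapper that feeds the given batches
// through a channel and counts them.
func CountBatches(batches ...[]Seq) (variants, reads, nucleotides int) {
	ch := make(chan []Seq, len(batches))
	for _, b := range batches {
		ch <- b
	}
	close(ch)
	return Count(ch)
}

func main() {
	v, r, n := CountBatches(
		[]Seq{{Bases: "ACGT", Count: 3}, {Bases: "TTGACA", Count: 1}},
		[]Seq{{Bases: "GG", Count: 5}},
	)
	fmt.Println(v, r, n) // 3 9 12
}
```

Like any draining operation, counting consumes the stream: once it returns, the batches are gone unless they were collected elsewhere first.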
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering is preserved for downstream reproducibility.
- Integrates with OBITools4's `obidefault` and `obiutils` packages for configuration and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.