mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, `sync.RWMutex`, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
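
The batched, channel-backed iteration described above can be sketched with the standard library alone. This is a minimal illustration, not the `obiiter` implementation: `Batch`, `Iterator`, and `NewIterator` are hypothetical names, and sequences are reduced to plain strings. Only the `Next()`/`Get()` calling pattern is taken from the text.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for a sequence batch: an order number
// plus the sequences it carries (plain strings in this sketch).
type Batch struct {
	Order int
	Seqs  []string
}

// Iterator wraps a channel of batches and remembers the current one.
type Iterator struct {
	ch      chan Batch
	current Batch
}

// NewIterator starts a producer goroutine that cuts seqs into batches of
// at most size elements and sends them on a small buffered channel.
// size must be positive.
func NewIterator(seqs []string, size int) *Iterator {
	it := &Iterator{ch: make(chan Batch, 2)}
	go func() {
		defer close(it.ch)
		order := 0
		for start := 0; start < len(seqs); start += size {
			end := start + size
			if end > len(seqs) {
				end = len(seqs)
			}
			it.ch <- Batch{Order: order, Seqs: seqs[start:end]}
			order++
		}
	}()
	return it
}

// Next blocks until a batch is available. It returns false once the
// producer has closed the channel and every batch has been read.
func (it *Iterator) Next() bool {
	b, ok := <-it.ch
	if ok {
		it.current = b
	}
	return ok
}

// Get returns the batch fetched by the last successful call to Next.
func (it *Iterator) Get() Batch { return it.current }

func main() {
	it := NewIterator([]string{"ACGT", "TTGA", "GGCC", "ATAT", "CCGG"}, 2)
	for it.Next() {
		b := it.Get()
		fmt.Println(b.Order, b.Seqs)
	}
}
```

Because the channel buffer is small, the producer runs at most a few batches ahead of the consumer, which is what keeps memory use bounded in this pattern.
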
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
  - `Add(n)`, `Done()`: Track active worker goroutines, as with `sync.WaitGroup`.
  - `Lock()` / `RLock()` and `Unlock()` / `RUnlock()`: Explicit synchronization.
  - `Wait()`, `Close()`, `WaitAndClose()`: Graceful shutdown.
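
The lifecycle calls listed above follow a common Go producer pattern: register workers, let each signal completion, and close the channel only after the last one finishes. The sketch below shows that pattern under stated assumptions. The `Iterator` type, its fields, and the `produce` helper are hypothetical, and only the method names `Add`, `Done`, and `WaitAndClose` come from the text.

```go
package main

import (
	"fmt"
	"sync"
)

// Iterator is a hypothetical batch stream fed by several workers.
type Iterator struct {
	ch chan []string
	wg sync.WaitGroup
}

func NewIterator() *Iterator { return &Iterator{ch: make(chan []string)} }

// Add registers n workers that will write to the iterator.
func (it *Iterator) Add(n int) { it.wg.Add(n) }

// Done signals that one registered worker has finished writing.
func (it *Iterator) Done() { it.wg.Done() }

// WaitAndClose blocks until every registered worker has called Done,
// then closes the channel so that consumers see the end of the stream.
func (it *Iterator) WaitAndClose() {
	it.wg.Wait()
	close(it.ch)
}

// produce starts nworkers goroutines, each pushing a single batch.
func produce(nworkers int) *Iterator {
	it := NewIterator()
	it.Add(nworkers) // register before the goroutines start
	for w := 0; w < nworkers; w++ {
		go func(id int) {
			defer it.Done()
			it.ch <- []string{fmt.Sprintf("seq_from_worker_%d", id)}
		}(w)
	}
	go it.WaitAndClose() // close only after the last worker is done
	return it
}

func main() {
	n := 0
	for batch := range produce(4).ch {
		n += len(batch)
	}
	fmt.Println("sequences received:", n)
}
```

Calling `Add` before launching the goroutines matters: if a worker could call `Done` before its registration, the wait could return early and the channel would be closed while writers are still active.
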
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
  - `Concat(...)`: Sequentially merges multiple iterators.
  - `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
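
To make the link between ordering and rebatching concrete, here is a small sketch of both steps on plain channels. It is an illustration only, assuming batches are numbered consecutively from 0. The names `Batch`, `sortBatches`, and `rebatch` are hypothetical and do not reproduce the package's signatures.

```go
package main

import "fmt"

// Batch is a hypothetical batch: an order number and its sequences.
type Batch struct {
	Order int
	Seqs  []string
}

// sortBatches re-emits batches in strict Order sequence (0, 1, 2, ...),
// parking early arrivals in a map until their turn comes. Batches that
// follow a gap in the numbering are never emitted.
func sortBatches(in <-chan Batch) <-chan Batch {
	out := make(chan Batch)
	go func() {
		defer close(out)
		pending := make(map[int]Batch)
		next := 0
		for b := range in {
			pending[b.Order] = b
			for {
				p, ok := pending[next]
				if !ok {
					break
				}
				delete(pending, next)
				out <- p
				next++
			}
		}
	}()
	return out
}

// rebatch re-cuts an ordered stream into batches of size sequences
// (the last one may be shorter) and renumbers them from 0.
func rebatch(in <-chan Batch, size int) <-chan Batch {
	out := make(chan Batch)
	go func() {
		defer close(out)
		buf := make([]string, 0, size)
		order := 0
		for b := range in {
			for _, s := range b.Seqs {
				buf = append(buf, s)
				if len(buf) == size {
					out <- Batch{Order: order, Seqs: buf}
					order++
					buf = make([]string, 0, size)
				}
			}
		}
		if len(buf) > 0 {
			out <- Batch{Order: order, Seqs: buf}
		}
	}()
	return out
}

func main() {
	in := make(chan Batch, 3)
	in <- Batch{Order: 2, Seqs: []string{"e"}}
	in <- Batch{Order: 0, Seqs: []string{"a", "b", "c"}}
	in <- Batch{Order: 1, Seqs: []string{"d"}}
	close(in)
	for b := range rebatch(sortBatches(in), 2) {
		fmt.Println(b.Order, b.Seqs)
	}
}
```

Sorting first is what makes the re-cut deterministic: without it, the content of each fixed-size batch would depend on the arrival order of the upstream batches.
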
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
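
A predicate-driven split over a stream of batches might look like the following sketch. It assumes plain string sequences and uses hypothetical names (`Predicate`, `split`, `divideOn`). The real methods take other parameters, and this version does not model sequence recycling or paired-end checks.

```go
package main

import (
	"fmt"
	"sync"
)

// Predicate reports whether a sequence should be kept.
type Predicate func(seq string) bool

// split partitions one batch into the sequences that satisfy pred and
// those that do not, preserving their relative order.
func split(batch []string, pred Predicate) (kept, rejected []string) {
	for _, s := range batch {
		if pred(s) {
			kept = append(kept, s)
		} else {
			rejected = append(rejected, s)
		}
	}
	return kept, rejected
}

// divideOn fans batches out to nworkers goroutines and routes the two
// halves of each batch to separate output channels. Batch order across
// workers is not preserved in this sketch.
func divideOn(in <-chan []string, pred Predicate, nworkers int) (yes, no <-chan []string) {
	y, n := make(chan []string), make(chan []string)
	var wg sync.WaitGroup
	wg.Add(nworkers)
	for w := 0; w < nworkers; w++ {
		go func() {
			defer wg.Done()
			for batch := range in {
				kept, rejected := split(batch, pred)
				y <- kept
				n <- rejected
			}
		}()
	}
	go func() {
		wg.Wait()
		close(y)
		close(n)
	}()
	return y, n
}

func main() {
	in := make(chan []string, 2)
	in <- []string{"ACGT", "AC", "GGGTTT"}
	in <- []string{"T", "ACGTACGT"}
	close(in)

	longEnough := func(s string) bool { return len(s) >= 4 }
	yes, no := divideOn(in, longEnough, 2)

	// Both outputs must be drained concurrently, otherwise workers block.
	nRejected := 0
	done := make(chan struct{})
	go func() {
		for b := range no {
			nRejected += len(b)
		}
		close(done)
	}()
	nKept := 0
	for b := range yes {
		nKept += len(b)
	}
	<-done
	fmt.Println("kept:", nKept, "rejected:", nRejected)
}
```

A plain filter is the same pipeline with the rejected half dropped instead of forwarded, which removes the need to drain a second channel.
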
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.
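
The draining utilities above can be pictured as simple folds over the batch stream. The sketch below rests on assumptions: `Seq`, `count`, `load`, and `stream` are hypothetical, "variants" is taken to mean the number of sequence records, "reads" the sum of their abundances, and "nucleotides" the sum of their lengths. The package may define these totals differently.

```go
package main

import "fmt"

// Seq is a hypothetical record: a sequence string and the number of
// reads it represents (for example after dereplication).
type Seq struct {
	Bases string
	Reads int
}

// count drains the stream and returns the number of records (variants),
// the total number of reads, and the total number of nucleotides.
func count(in <-chan []Seq) (variants, reads, nucleotides int) {
	for batch := range in {
		for _, s := range batch {
			variants++
			reads += s.Reads
			nucleotides += len(s.Bases)
		}
	}
	return variants, reads, nucleotides
}

// load drains the stream into a single slice. Memory use grows with the
// input, so this only suits small datasets.
func load(in <-chan []Seq) []Seq {
	var all []Seq
	for batch := range in {
		all = append(all, batch...)
	}
	return all
}

// stream is a test helper that turns batches into a closed channel.
func stream(batches ...[]Seq) <-chan []Seq {
	ch := make(chan []Seq, len(batches))
	for _, b := range batches {
		ch <- b
	}
	close(ch)
	return ch
}

func main() {
	b1 := []Seq{{Bases: "ACGT", Reads: 3}, {Bases: "AC", Reads: 1}}
	b2 := []Seq{{Bases: "GGG", Reads: 5}}

	v, r, n := count(stream(b1, b2))
	fmt.Println(v, r, n)

	fmt.Println(len(load(stream(b1, b2))))
}
```

Both helpers consume the stream, so the same channel cannot be counted and then loaded. The example builds a fresh stream for each call.
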
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering is preserved for downstream reproducibility.
- Integrates with OBITools4’s `obidefault` and `obiutils` packages for configuration and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.