mirror of
https://github.com/metabarcoding/obitools4.git
# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, `sync.RWMutex`, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
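
The batched, channel-based iteration described above can be sketched with the standard library alone. This is a minimal illustration, not the `obiiter` implementation: `Batch`, `Iter`, and `NewIter` are hypothetical names, and plain strings stand in for `BioSequence` values.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for BioSequenceBatch.
type Batch struct {
	Order int      // position of the batch in the stream
	Seqs  []string // sequences carried by the batch
}

// Iter is a hypothetical channel-backed iterator with Next/Get semantics.
type Iter struct {
	ch      chan Batch
	current Batch
}

// NewIter starts a producer goroutine that emits seqs in batches of at most size.
func NewIter(seqs []string, size int) *Iter {
	if size < 1 {
		size = 1
	}
	it := &Iter{ch: make(chan Batch)}
	go func() {
		defer close(it.ch)
		order := 0
		for start := 0; start < len(seqs); start += size {
			end := start + size
			if end > len(seqs) {
				end = len(seqs)
			}
			it.ch <- Batch{Order: order, Seqs: seqs[start:end]}
			order++
		}
	}()
	return it
}

// Next blocks until a batch is available and reports false once the stream ends.
func (it *Iter) Next() bool {
	b, ok := <-it.ch
	if ok {
		it.current = b
	}
	return ok
}

// Get returns the batch fetched by the last successful call to Next.
func (it *Iter) Get() Batch { return it.current }

func main() {
	it := NewIter([]string{"ACGT", "TTGA", "GGC", "AT", "CCCGA"}, 2)
	for it.Next() {
		b := it.Get()
		fmt.Printf("batch %d: %v\n", b.Order, b.Seqs)
	}
}
```

Because the channel is unbuffered, the producer only advances when the consumer calls `Next()`, which is what keeps memory use bounded to roughly one batch at a time in this sketch.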
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
  - `Add(n)`, `Done()`: Track active workers (goroutines), in the manner of `sync.WaitGroup`.
  - `Lock()`/`RLock()` and `Unlock()`/`RUnlock()`: Explicit synchronization.
  - `Wait()`, `Close()`, `WaitAndClose()`: Graceful shutdown.
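
The lifecycle calls listed above follow a common Go pattern: producers register with `Add`, signal completion with `Done`, and a separate goroutine closes the channel once all of them have finished. The sketch below illustrates that pattern under stated assumptions; `Stream`, `NewStream`, and `Produce` are hypothetical names, and integers stand in for sequence batches.

```go
package main

import (
	"fmt"
	"sync"
)

// Stream is a hypothetical iterator core: a channel plus a WaitGroup counting producers.
type Stream struct {
	ch chan int
	wg sync.WaitGroup
}

func NewStream() *Stream { return &Stream{ch: make(chan int)} }

// Add registers n producers; Done marks one producer as finished.
func (s *Stream) Add(n int) { s.wg.Add(n) }
func (s *Stream) Done()     { s.wg.Done() }

// WaitAndClose blocks until every registered producer has called Done,
// then closes the channel so that consumers stop iterating.
func (s *Stream) WaitAndClose() {
	s.wg.Wait()
	close(s.ch)
}

// Produce starts `workers` goroutines, each pushing `perWorker` values.
func Produce(workers, perWorker int) *Stream {
	s := NewStream()
	s.Add(workers)
	for w := 0; w < workers; w++ {
		go func(id int) {
			defer s.Done()
			for i := 0; i < perWorker; i++ {
				s.ch <- id*perWorker + i
			}
		}(w)
	}
	go s.WaitAndClose()
	return s
}

func main() {
	sum, n := 0, 0
	for v := range Produce(3, 4).ch {
		sum += v
		n++
	}
	fmt.Println("values:", n, "sum:", sum)
}
```

Calling `Add` before the worker goroutines start, rather than inside them, avoids a race in which `WaitAndClose` observes a zero counter and closes the channel too early.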
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires the input batches to be sorted first).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
  - `Concat(...)`: Sequentially merges multiple iterators.
  - `Pool(...)`: Interleaves batches from several sources, renumbering them so the merged stream keeps a consistent order.
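
To make the relationship between ordering and rebatching concrete, here is a minimal sketch of both steps. It is an illustration only, not the package's code: `sortBatches`, `rebatch`, and `source` are hypothetical helpers, and the sketch assumes batch orders form a gap-free sequence starting at 0.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for BioSequenceBatch.
type Batch struct {
	Order int
	Seqs  []string
}

// sortBatches re-emits batches in increasing Order, buffering early arrivals.
// Assumption: orders are 0, 1, 2, ... with no gaps.
func sortBatches(in <-chan Batch) <-chan Batch {
	out := make(chan Batch)
	go func() {
		defer close(out)
		pending := map[int]Batch{}
		next := 0
		for b := range in {
			pending[b.Order] = b
			for {
				nb, ok := pending[next]
				if !ok {
					break
				}
				delete(pending, next)
				out <- nb
				next++
			}
		}
	}()
	return out
}

// rebatch regroups an ordered stream into batches of size sequences;
// only the final batch may be shorter.
func rebatch(in <-chan Batch, size int) <-chan Batch {
	if size < 1 {
		size = 1
	}
	out := make(chan Batch)
	go func() {
		defer close(out)
		var buf []string
		order := 0
		for b := range sortBatches(in) {
			buf = append(buf, b.Seqs...)
			for len(buf) >= size {
				out <- Batch{Order: order, Seqs: append([]string(nil), buf[:size]...)}
				buf = buf[size:]
				order++
			}
		}
		if len(buf) > 0 {
			out <- Batch{Order: order, Seqs: buf}
		}
	}()
	return out
}

// source returns a closed, pre-filled channel of batches for demonstration.
func source(batches ...Batch) <-chan Batch {
	ch := make(chan Batch, len(batches))
	for _, b := range batches {
		ch <- b
	}
	close(ch)
	return ch
}

func main() {
	in := source(
		Batch{1, []string{"s3", "s4", "s5"}},
		Batch{0, []string{"s1", "s2"}},
		Batch{2, []string{"s6"}},
	)
	for b := range rebatch(in, 4) {
		fmt.Println(b.Order, b.Seqs)
	}
}
```

Sorting before regrouping is what lets the output keep the original sequence order even when upstream workers finish their batches out of turn.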
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
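
Predicate-based splitting of the kind `DivideOn` performs can be sketched as a single routing goroutine feeding two output channels. The code below is a simplified, single-worker illustration with hypothetical names (`divideOn`, `split`); it works on plain strings and omits batching, parallel workers, and recycling.

```go
package main

import (
	"fmt"
	"sync"
)

// divideOn routes each sequence to the first channel when pred is true,
// and to the second channel otherwise. Both outputs must be consumed.
func divideOn(in <-chan string, pred func(string) bool) (<-chan string, <-chan string) {
	yes, no := make(chan string), make(chan string)
	go func() {
		defer close(yes)
		defer close(no)
		for s := range in {
			if pred(s) {
				yes <- s
			} else {
				no <- s
			}
		}
	}()
	return yes, no
}

// split feeds seqs through divideOn and collects both outputs concurrently.
func split(seqs []string, pred func(string) bool) (kept, dropped []string) {
	in := make(chan string)
	go func() {
		defer close(in)
		for _, s := range seqs {
			in <- s
		}
	}()
	yes, no := divideOn(in, pred)

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for s := range yes {
			kept = append(kept, s)
		}
	}()
	go func() {
		defer wg.Done()
		for s := range no {
			dropped = append(dropped, s)
		}
	}()
	wg.Wait()
	return kept, dropped
}

func main() {
	longEnough := func(s string) bool { return len(s) >= 6 }
	kept, dropped := split([]string{"ACGTAC", "AT", "GGGCCC", "TTA"}, longEnough)
	fmt.Println("kept:", kept, "dropped:", dropped)
}
```

With unbuffered channels, draining only one of the two outputs would stall the router as soon as a sequence lands on the other side, which is why `split` reads both concurrently.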
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.
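
As an illustration of the triple returned by `Count`, the sketch below drains a stream and accumulates three totals. It assumes, without confirmation from this document, that "variants" is the number of sequence records, "reads" is the sum of per-record abundance counts, and "nucleotides" is the summed sequence length. `Seq` and `count` are hypothetical names.

```go
package main

import "fmt"

// Seq is a hypothetical record: a sequence and its observed read count.
type Seq struct {
	Bases string
	Count int
}

// count drains the channel and returns (variants, reads, nucleotides)
// under the assumptions stated in the text above.
func count(in <-chan []Seq) (variants, reads, nucleotides int) {
	for batch := range in {
		for _, s := range batch {
			variants++                  // one record per distinct sequence
			reads += s.Count            // abundance carried by the record
			nucleotides += len(s.Bases) // sequence length in bases
		}
	}
	return variants, reads, nucleotides
}

func main() {
	in := make(chan []Seq, 2)
	in <- []Seq{{"ACGT", 10}, {"ACGA", 1}}
	in <- []Seq{{"TTGACA", 3}}
	close(in)

	v, r, n := count(in)
	fmt.Println(v, r, n) // prints: 3 14 14
}
```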
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering is preserved for downstream reproducibility.
- Integrates with the OBITools4 `obidefault` and `obiutils` packages for configuration and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.