mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
80 lines
4.5 KiB
Markdown
80 lines
4.5 KiB
Markdown
# Bioinformatics Sequence Processing Pipeline — Public API Overview
|
||
|
||
The `obiiter` package provides a high-performance, concurrent framework for processing biological sequence data (e.g., FASTQ/FASTA) in batched, streaming fashion. Built around the `IBioSequence` iterator interface and value-type batches (`BioSequenceBatch`), it supports scalable, traceable workflows with built-in memory control, threading safety, and functional composition.
|
||
|
||
## Core Abstractions
|
||
|
||
- **`IBioSequence`**: A concurrent iterator over `BioSequenceBatch`, enabling lazy, batched consumption.
|
||
- **`BioSequenceBatch`**: An immutable-friendly container holding ordered sequences with metadata (`source`, `order`). Supports FIFO popping, slicing, and pairing.
|
||
- **`Pipeable`**: A function type `func(IBioSequence) IBioSequence`, enabling composable transformations.
|
||
|
||
## Batch & Iterator Management
|
||
|
||
- `MakeIBioSequence(...)`: Constructs a new iterator (e.g., from files or slices).
|
||
- `Concat(...IBioSequence)`: Sequentially merges multiple iterators.
|
||
- `Pool(...)`: Interleaves batches from several sources, preserving global order via renumbering.
|
||
- `Rebatch(size)` / `RebatchBySize(maxBytes, maxCount)`: Dynamically regroups sequences into fixed or memory-bound batches.
|
||
- `SortBatches()`: Ensures strict ordering by batch metadata (`order` field).
|
||
- `CompleteFileIterator()`: Reads remaining file content as a single batch.
|
||
|
||
## Functional Transformations
|
||
|
||
- `MakeIWorker(...)`, `WorkerPipe(...)`: Applies per-sequence workers in parallel.
|
||
- `MakeISliceWorker(...)`, `SliceWorkerPipe(...)`: Applies batch-level (`SeqSliceWorker`) transformations.
|
||
- `MakeIConditionalWorker(...)`: Conditional worker application based on a predicate.
|
||
|
||
## Filtering & Splitting
|
||
|
||
- `FilterOn(pred, size)`: Parallel filtering with sequence recycling.
|
||
- `DivideOn(pred, size)`: Splits input into two independent iterators (`true`/`false` branches).
|
||
- `FilterAnd(pred, size)`: Same as above but enforces paired-end consistency.
|
||
|
||
## Memory & Performance Control
|
||
|
||
- `LimitMemory(fraction)`: Enforces heap usage ≤ fraction × total RAM via backpressure (uses `runtime.ReadMemStats()`).
|
||
- Parallel workers (`nworkers`) and batch sizes are configurable via defaults or variadic args.
|
||
|
||
## Paired-End Data Handling
|
||
|
||
All operations preserve pairing semantics:
|
||
- `IsPaired()`, `MarkAsPaired()` on iterators and batches.
|
||
- `PairTo(other)`: Synchronizes two batch/iterator pairs (same order required).
|
||
- `PairedWith()`, `UnPair()` for mate extraction and unpairing.
|
||
|
||
## Sequence Numbering & Annotation
|
||
|
||
- `NumberSequences(start, forceReordering)`: Assigns unique sequential IDs to sequences (same ID for mates in paired mode). Supports parallel or deterministic ordering.
|
||
- `MakeSetAttributeWorker(rank)`: Returns a worker that annotates each sequence with taxon at specified rank (e.g., `"species"`).
|
||
|
||
## Taxonomic Profiling
|
||
|
||
- `ExtractTaxonomy(iterator, seqAsTaxa)`: Aggregates taxonomy across all sequences via `.Slice().ExtractTaxonomy()` calls. Implements map-reduce semantics for scalable taxonomic summarization.
|
||
|
||
## Fragmentation
|
||
|
||
- `IFragments(minsize, length, overlap)`: Splits long sequences into overlapping fragments (fusion mode for remainder), with parallel workers and memory-efficient recycling.
|
||
|
||
## Utility & Analysis
|
||
|
||
- `Load()`: Collects all sequences into a slice (for small data).
|
||
- `Count(recycle)`: Returns `(variants, reads, nucleotides)` counts.
|
||
- `Consume()` / `Recycle()`: Drains iterator and optionally triggers sequence recycling.
|
||
|
||
## Pipeline & Teeing
|
||
|
||
- `Pipeline(start, parts...)`: Composes a chain of `Pipeable` transformations.
|
||
- `(IBioSequence).Pipe(...)`: Fluent method chaining for pipelines.
|
||
- `Teeable` / `(IBioSequence).CopyTee()`: Duplicates stream into two independent, concurrently readable iterators (preserves pairing).
|
||
|
||
## Progress & Logging
|
||
|
||
- `Speed()`, `SpeedPipe()`: Adds a non-intrusive progress bar (stderr only, terminal-aware). Updates per batch and respects `--no-progressbar` flag.
|
||
|
||
## Distribution & Routing
|
||
|
||
- **`IDistribute(classifier, batchSize)`**: Routes sequences to classified outputs based on a classifier function. Batches per class key are flushed when size or memory thresholds are reached.
|
||
- `News()` channel notifies on new output streams (i.e., newly seen class keys).
|
||
- Thread-safe, async distribution via goroutines.
|
||
|
||
All public APIs assume interoperability with `obiseq`, `obitax`, and OBITools4’s config modules (`obidefault`, `obilog`). Design emphasizes immutability-by-copy, safe concurrent access (via mutexes/atomics), and composability for reproducible bioinformatics pipelines.
|