Files
obitools4/autodoc/docmd/pkg_obiiter.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

80 lines
4.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Bioinformatics Sequence Processing Pipeline — Public API Overview
The `obiiter` package provides a high-performance, concurrent framework for processing biological sequence data (e.g., FASTQ/FASTA) in batched, streaming fashion. Built around the `IBioSequence` iterator interface and value-type batches (`BioSequenceBatch`), it supports scalable, traceable workflows with built-in memory control, threading safety, and functional composition.
## Core Abstractions
- **`IBioSequence`**: A concurrent iterator over `BioSequenceBatch`, enabling lazy, batched consumption.
- **`BioSequenceBatch`**: An immutable-friendly container holding ordered sequences with metadata (`source`, `order`). Supports FIFO popping, slicing, and pairing.
- **`Pipeable`**: A function type `func(IBioSequence) IBioSequence`, enabling composable transformations.
## Batch & Iterator Management
- `MakeIBioSequence(...)`: Constructs a new iterator (e.g., from files or slices).
- `Concat(...IBioSequence)`: Sequentially merges multiple iterators.
- `Pool(...)`: Interleaves batches from several sources, preserving global order via renumbering.
- `Rebatch(size)` / `RebatchBySize(maxBytes, maxCount)`: Dynamically regroups sequences into fixed or memory-bound batches.
- `SortBatches()`: Ensures strict ordering by batch metadata (`order` field).
- `CompleteFileIterator()`: Reads remaining file content as a single batch.
## Functional Transformations
- `MakeIWorker(...)`, `WorkerPipe(...)`: Applies per-sequence workers in parallel.
- `MakeISliceWorker(...)`, `SliceWorkerPipe(...)`: Applies batch-level (`SeqSliceWorker`) transformations.
- `MakeIConditionalWorker(...)`: Conditional worker application based on a predicate.
## Filtering & Splitting
- `FilterOn(pred, size)`: Parallel filtering with sequence recycling.
- `DivideOn(pred, size)`: Splits input into two independent iterators (`true`/`false` branches).
- `FilterAnd(pred, size)`: Same as above but enforces paired-end consistency.
## Memory & Performance Control
- `LimitMemory(fraction)`: Enforces heap usage ≤ fraction × total RAM via backpressure (uses `runtime.ReadMemStats()`).
- Parallel workers (`nworkers`) and batch sizes are configurable via defaults or variadic args.
## Paired-End Data Handling
All operations preserve pairing semantics:
- `IsPaired()`, `MarkAsPaired()` on iterators and batches.
- `PairTo(other)`: Synchronizes two batch/iterator pairs (same order required).
- `PairedWith()`, `UnPair()` for mate extraction and unpairing.
## Sequence Numbering & Annotation
- `NumberSequences(start, forceReordering)`: Assigns unique sequential IDs to sequences (same ID for mates in paired mode). Supports parallel or deterministic ordering.
- `MakeSetAttributeWorker(rank)`: Returns a worker that annotates each sequence with taxon at specified rank (e.g., `"species"`).
## Taxonomic Profiling
- `ExtractTaxonomy(iterator, seqAsTaxa)`: Aggregates taxonomy across all sequences via `.Slice().ExtractTaxonomy()` calls. Implements map-reduce semantics for scalable taxonomic summarization.
## Fragmentation
- `IFragments(minsize, length, overlap)`: Splits long sequences into overlapping fragments (fusion mode for remainder), with parallel workers and memory-efficient recycling.
## Utility & Analysis
- `Load()`: Collects all sequences into a slice (for small data).
- `Count(recycle)`: Returns `(variants, reads, nucleotides)` counts.
- `Consume()` / `Recycle()`: Drains iterator and optionally triggers sequence recycling.
## Pipeline & Teeing
- `Pipeline(start, parts...)`: Composes a chain of `Pipeable` transformations.
- `(IBioSequence).Pipe(...)`: Fluent method chaining for pipelines.
- `Teeable` / `(IBioSequence).CopyTee()`: Duplicates stream into two independent, concurrently readable iterators (preserves pairing).
## Progress & Logging
- `Speed()`, `SpeedPipe()`: Adds a non-intrusive progress bar (stderr only, terminal-aware). Updates per batch and respects `--no-progressbar` flag.
## Distribution & Routing
- **`IDistribute(classifier, batchSize)`**: Routes sequences to classified outputs based on a classifier function. Batches per class key are flushed when size or memory thresholds are reached.
- `News()` channel notifies on new output streams (i.e., newly seen class keys).
- Thread-safe, async distribution via goroutines.
All public APIs assume interoperability with `obiseq`, `obitax`, and OBITools4s config modules (`obidefault`, `obilog`). Design emphasizes immutability-by-copy, safe concurrent access (via mutexes/atomics), and composability for reproducible bioinformatics pipelines.