mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
Bioinformatics Sequence Processing Pipeline — Public API Overview
The obiiter package provides a high-performance, concurrent framework for processing biological sequence data (e.g., FASTQ/FASTA) in a batched, streaming fashion. Built around the IBioSequence iterator interface and value-type batches (BioSequenceBatch), it supports scalable, traceable workflows with built-in memory control, thread safety, and functional composition.
Core Abstractions
- IBioSequence: A concurrent iterator over BioSequenceBatch, enabling lazy, batched consumption.
- BioSequenceBatch: An immutable-friendly container holding ordered sequences with metadata (source, order). Supports FIFO popping, slicing, and pairing.
- Pipeable: A function type func(IBioSequence) IBioSequence, enabling composable transformations.
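To make the relationship between the three abstractions concrete, here is a minimal sketch built on plain Go channels. The names Batch, Iter, FromSlice, Upper, and Collect are hypothetical stand-ins invented for this illustration; they are not the obiiter types or their signatures, only an approximation of the shapes described above.

```go
package main

import (
	"fmt"
	"strings"
)

// Batch is a hypothetical stand-in for BioSequenceBatch: an ordered group
// of sequences tagged with its source and its position in the stream.
type Batch struct {
	Source string
	Order  int
	Seqs   []string
}

// Iter is a hypothetical stand-in for IBioSequence: a stream of batches.
type Iter <-chan Batch

// Pipeable mirrors the shape described above: iterator in, iterator out.
type Pipeable func(Iter) Iter

// FromSlice emits seqs lazily as batches of at most size sequences.
func FromSlice(source string, seqs []string, size int) Iter {
	if size < 1 {
		size = 1
	}
	out := make(chan Batch)
	go func() {
		defer close(out)
		order := 0
		for start := 0; start < len(seqs); start += size {
			end := start + size
			if end > len(seqs) {
				end = len(seqs)
			}
			out <- Batch{Source: source, Order: order, Seqs: seqs[start:end]}
			order++
		}
	}()
	return out
}

// Upper has the Pipeable shape: it upper-cases every sequence and copies
// each batch instead of mutating it.
func Upper(in Iter) Iter {
	out := make(chan Batch)
	go func() {
		defer close(out)
		for b := range in {
			seqs := make([]string, len(b.Seqs))
			for i, s := range b.Seqs {
				seqs[i] = strings.ToUpper(s)
			}
			out <- Batch{Source: b.Source, Order: b.Order, Seqs: seqs}
		}
	}()
	return out
}

// Collect drains the iterator and returns all sequences in arrival order.
func Collect(it Iter) []string {
	var all []string
	for b := range it {
		all = append(all, b.Seqs...)
	}
	return all
}

func main() {
	var step Pipeable = Upper
	for b := range step(FromSlice("demo", []string{"acgt", "ttga", "ccat"}, 2)) {
		fmt.Println(b.Source, b.Order, b.Seqs)
	}
}
```

Because each stage only pulls a batch when the next stage is ready to receive it, nothing is materialised beyond the batches currently in flight.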
Batch & Iterator Management
- MakeIBioSequence(...): Constructs a new iterator (e.g., from files or slices).
- Concat(...IBioSequence): Sequentially merges multiple iterators.
- Pool(...): Interleaves batches from several sources, preserving global order via renumbering.
- Rebatch(size) / RebatchBySize(maxBytes, maxCount): Dynamically regroups sequences into fixed or memory-bound batches.
- SortBatches(): Ensures strict ordering by batch metadata (order field).
- CompleteFileIterator(): Reads remaining file content as a single batch.
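The regrouping idea behind Rebatch(size) can be sketched on plain slices. The function rebatch below is a hypothetical helper written for this illustration, not the library's implementation; it assumes that only the final batch may be shorter than the requested size.

```go
package main

import "fmt"

// rebatch regroups the sequences of all input batches into batches of
// exactly size sequences; only the last batch may be shorter.
func rebatch(batches [][]string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	current := make([]string, 0, size)
	for _, b := range batches {
		for _, s := range b {
			current = append(current, s)
			if len(current) == size {
				out = append(out, current)
				current = make([]string, 0, size)
			}
		}
	}
	if len(current) > 0 {
		out = append(out, current)
	}
	return out
}

func main() {
	uneven := [][]string{{"a"}, {"b", "c", "d"}, {"e"}}
	for i, b := range rebatch(uneven, 2) {
		fmt.Println(i, b)
	}
}
```

A memory-bound variant in the spirit of RebatchBySize would close the current batch when either a byte budget or a sequence count is reached, rather than on count alone.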
Functional Transformations
- MakeIWorker(...), WorkerPipe(...): Applies per-sequence workers in parallel.
- MakeISliceWorker(...), SliceWorkerPipe(...): Applies batch-level (SeqSliceWorker) transformations.
- MakeIConditionalWorker(...): Conditional worker application based on a predicate.
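Applying a per-sequence worker in parallel while keeping batch order recoverable can be sketched as follows. The types batch and worker and the function applyParallel are hypothetical names for this illustration; the sketch assumes that batches carry an order field and that results are sorted on it after the workers finish.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// batch is a hypothetical ordered group of sequences.
type batch struct {
	order int
	seqs  []string
}

// worker is a hypothetical per-sequence transformation.
type worker func(string) string

// upper is a sample worker.
func upper(s string) string { return strings.ToUpper(s) }

// applyParallel runs w over every sequence using nworkers goroutines,
// then restores the original batch order.
func applyParallel(in []batch, w worker, nworkers int) []batch {
	if nworkers < 1 {
		nworkers = 1
	}
	jobs := make(chan batch)
	results := make(chan batch)

	var wg sync.WaitGroup
	for i := 0; i < nworkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range jobs {
				out := make([]string, len(b.seqs))
				for j, s := range b.seqs {
					out[j] = w(s)
				}
				results <- batch{order: b.order, seqs: out}
			}
		}()
	}

	// Feed the workers, then close results once all of them are done.
	go func() {
		for _, b := range in {
			jobs <- b
		}
		close(jobs)
	}()
	go func() {
		wg.Wait()
		close(results)
	}()

	var done []batch
	for b := range results {
		done = append(done, b)
	}
	sort.Slice(done, func(i, j int) bool { return done[i].order < done[j].order })
	return done
}

func main() {
	in := []batch{{0, []string{"ac", "gt"}}, {1, []string{"gg"}}, {2, []string{"tt", "ca"}}}
	for _, b := range applyParallel(in, upper, 2) {
		fmt.Println(b.order, b.seqs)
	}
}
```

Workers may finish in any order, which is why the final sort on the order field is needed before the output can be treated as deterministic.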
Filtering & Splitting
- FilterOn(pred, size): Parallel filtering with sequence recycling.
- DivideOn(pred, size): Splits input into two independent iterators (true/false branches).
- FilterAnd(pred, size): Same as above but enforces paired-end consistency.
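The difference between filtering and dividing is easiest to see on plain slices. The helpers divideOn, filterOn, and atLeast below are hypothetical and sequential; they only illustrate the true/false split, not the parallel, iterator-based behaviour or the recycling mentioned above.

```go
package main

import "fmt"

// divideOn splits seqs into those accepted and those rejected by pred,
// preserving the input order within each branch.
func divideOn(seqs []string, pred func(string) bool) (accepted, rejected []string) {
	for _, s := range seqs {
		if pred(s) {
			accepted = append(accepted, s)
		} else {
			rejected = append(rejected, s)
		}
	}
	return accepted, rejected
}

// filterOn keeps only the accepted branch and discards the rest.
func filterOn(seqs []string, pred func(string) bool) []string {
	accepted, _ := divideOn(seqs, pred)
	return accepted
}

// atLeast builds a sample predicate on sequence length.
func atLeast(n int) func(string) bool {
	return func(s string) bool { return len(s) >= n }
}

func main() {
	seqs := []string{"ACGT", "AC", "GGTCA"}
	long, short := divideOn(seqs, atLeast(4))
	fmt.Println("accepted:", long)
	fmt.Println("rejected:", short)
	fmt.Println("filtered:", filterOn(seqs, atLeast(4)))
}
```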
Memory & Performance Control
- LimitMemory(fraction): Enforces heap usage ≤ fraction × total RAM via backpressure (uses runtime.ReadMemStats()).
- Parallel workers (nworkers) and batch sizes are configurable via defaults or variadic args.
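A backpressure check based on runtime.ReadMemStats might look like the sketch below. The helpers budget, heapAlloc, and waitForHeadroom are hypothetical names for this illustration. The Go standard library does not report total RAM, so the sketch assumes the caller supplies that figure; how the package obtains it is not stated here.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// budget converts a fraction of total RAM into a byte limit.
// totalRAM is supplied by the caller (an assumption of this sketch).
func budget(fraction float64, totalRAM uint64) uint64 {
	return uint64(fraction * float64(totalRAM))
}

// heapAlloc returns the bytes of allocated heap objects as reported
// by the Go runtime.
func heapAlloc() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// waitForHeadroom polls up to maxTries times and reports whether the heap
// was found at or below limit. Between polls it requests a garbage
// collection and sleeps, which is what holds the producer back.
func waitForHeadroom(limit uint64, maxTries int, pause time.Duration) bool {
	for i := 0; i < maxTries; i++ {
		if heapAlloc() <= limit {
			return true
		}
		runtime.GC()
		time.Sleep(pause)
	}
	return false
}

func main() {
	const totalRAM = 8 << 30 // hypothetical 8 GiB machine
	limit := budget(0.5, totalRAM)
	fmt.Println("heap bytes:", heapAlloc(), "limit:", limit)
	fmt.Println("headroom:", waitForHeadroom(limit, 5, 10*time.Millisecond))
}
```

A producer would call waitForHeadroom before emitting each batch, so that downstream stages get time to release memory before more is allocated.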
Paired-End Data Handling
All operations preserve pairing semantics:
- IsPaired(), MarkAsPaired() on iterators and batches.
- PairTo(other): Synchronizes two batch/iterator pairs (same order required).
- PairedWith(), UnPair() for mate extraction and unpairing.
Sequence Numbering & Annotation
- NumberSequences(start, forceReordering): Assigns unique sequential IDs to sequences (same ID for mates in paired mode). Supports parallel or deterministic ordering.
- MakeSetAttributeWorker(rank): Returns a worker that annotates each sequence with the taxon at the specified rank (e.g., "species").
Taxonomic Profiling
- ExtractTaxonomy(iterator, seqAsTaxa): Aggregates taxonomy across all sequences via .Slice().ExtractTaxonomy() calls. Implements map-reduce semantics for scalable taxonomic summarization.
Fragmentation
- IFragments(minsize, length, overlap): Splits long sequences into overlapping fragments (fusion mode for remainder), with parallel workers and memory-efficient recycling.
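One plausible reading of overlapping fragmentation with a fused remainder is sketched below. The function fragments is hypothetical, and its interpretation of the three parameters is an assumption: windows of length bases, consecutive windows sharing overlap bases, and a trailing piece shorter than minsize merged into the last fragment. The actual IFragments semantics may differ.

```go
package main

import "fmt"

// fragments cuts seq into windows of length bases that share overlap bases.
// If the piece left after a window would be shorter than minsize, it is
// fused into that window instead of being emitted on its own.
func fragments(seq string, minsize, length, overlap int) []string {
	if length < 1 || overlap < 0 || length <= overlap {
		return []string{seq} // degenerate settings: leave the sequence whole
	}
	step := length - overlap
	var out []string
	for start := 0; start < len(seq); start += step {
		end := start + length
		// The next window would begin at end-overlap; check what is left.
		if end >= len(seq) || len(seq)-(end-overlap) < minsize {
			out = append(out, seq[start:])
			break
		}
		out = append(out, seq[start:end])
	}
	return out
}

func main() {
	fmt.Println(fragments("ACGTACGTAC", 3, 4, 1))
	fmt.Println(fragments("ACGTACGTACG", 3, 4, 1))
}
```

In the second call the last two bases cannot form a fragment of at least minsize, so they are appended to the preceding window rather than dropped.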
Utility & Analysis
- Load(): Collects all sequences into a slice (for small data).
- Count(recycle): Returns (variants, reads, nucleotides) counts.
- Consume() / Recycle(): Drains iterator and optionally triggers sequence recycling.
Pipeline & Teeing
- Pipeline(start, parts...): Composes a chain of Pipeable transformations.
- (IBioSequence).Pipe(...): Fluent method chaining for pipelines.
- Teeable / (IBioSequence).CopyTee(): Duplicates stream into two independent, concurrently readable iterators (preserves pairing).
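Function composition of this kind can be sketched with a channel of strings standing in for the iterator. Everything here (Iter, Pipeable, Pipeline, Pipe, Map, Keep, Source, Collect, demo) is a hypothetical simplification for illustration, and the assumption that start is itself a Pipeable is mine; the real signatures may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Iter and Pipeable are hypothetical stand-ins, not the obiiter types.
type Iter <-chan string
type Pipeable func(Iter) Iter

// Pipeline chains start and parts, left to right, into a single Pipeable.
func Pipeline(start Pipeable, parts ...Pipeable) Pipeable {
	return func(in Iter) Iter {
		out := start(in)
		for _, p := range parts {
			out = p(out)
		}
		return out
	}
}

// Pipe is the fluent form: it.Pipe(a, b) is equivalent to b(a(it)).
func (it Iter) Pipe(parts ...Pipeable) Iter {
	for _, p := range parts {
		it = p(it)
	}
	return it
}

// Map builds a Pipeable that applies f to every sequence.
func Map(f func(string) string) Pipeable {
	return func(in Iter) Iter {
		out := make(chan string)
		go func() {
			defer close(out)
			for s := range in {
				out <- f(s)
			}
		}()
		return out
	}
}

// Keep builds a Pipeable that forwards only sequences accepted by pred.
func Keep(pred func(string) bool) Pipeable {
	return func(in Iter) Iter {
		out := make(chan string)
		go func() {
			defer close(out)
			for s := range in {
				if pred(s) {
					out <- s
				}
			}
		}()
		return out
	}
}

// Source emits the given sequences and then closes the stream.
func Source(seqs ...string) Iter {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, s := range seqs {
			out <- s
		}
	}()
	return out
}

// Collect drains the iterator into a slice.
func Collect(it Iter) []string {
	var all []string
	for s := range it {
		all = append(all, s)
	}
	return all
}

// demo upper-cases three sequences and keeps those of length >= 4.
func demo() []string {
	longEnough := func(s string) bool { return len(s) >= 4 }
	p := Pipeline(Map(strings.ToUpper), Keep(longEnough))
	return Collect(p(Source("acgt", "tt", "ggcca")))
}

func main() {
	fmt.Println(demo())
	fmt.Println(Collect(Source("acgt", "tt").Pipe(Map(strings.ToUpper))))
}
```

Since a composed pipeline is itself a Pipeable, it can be named, reused, and nested inside larger pipelines without special handling.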
Progress & Logging
- Speed(), SpeedPipe(): Adds a non-intrusive progress bar (stderr only, terminal-aware). Updates per batch and respects the --no-progressbar flag.
Distribution & Routing
- IDistribute(classifier, batchSize): Routes sequences to classified outputs based on a classifier function. Batches per class key are flushed when size or memory thresholds are reached.
- News() channel notifies on new output streams (i.e., newly seen class keys).
- Thread-safe, async distribution via goroutines.
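The routing logic can be sketched sequentially, without goroutines, to show when a class key is announced and when a batch is flushed. The functions distribute and firstBase are hypothetical helpers for this illustration; the sketch flushes on batch size only and returns the announced keys as a slice, where the description above mentions a memory threshold and a channel.

```go
package main

import "fmt"

// firstBase is a sample classifier: it routes a sequence by its first base.
func firstBase(s string) string {
	if s == "" {
		return ""
	}
	return s[:1]
}

// distribute routes each sequence to the pending batch of its class key and
// flushes that batch once it holds batchSize sequences. It returns the
// flushed batches per key and the keys in the order they were first seen.
func distribute(seqs []string, classify func(string) string, batchSize int) (map[string][][]string, []string) {
	if batchSize < 1 {
		batchSize = 1
	}
	pending := map[string][]string{}
	flushed := map[string][][]string{}
	seen := map[string]bool{}
	var news []string

	for _, s := range seqs {
		key := classify(s)
		if !seen[key] {
			seen[key] = true
			news = append(news, key) // a new output stream appears
		}
		pending[key] = append(pending[key], s)
		if len(pending[key]) >= batchSize {
			flushed[key] = append(flushed[key], pending[key])
			pending[key] = nil
		}
	}
	// Flush whatever is left, in first-seen order for determinism.
	for _, key := range news {
		if len(pending[key]) > 0 {
			flushed[key] = append(flushed[key], pending[key])
		}
	}
	return flushed, news
}

func main() {
	seqs := []string{"AAA", "CCC", "AAT", "ACG", "CGT"}
	flushed, news := distribute(seqs, firstBase, 2)
	for _, key := range news {
		fmt.Println(key, flushed[key])
	}
}
```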
All public APIs assume interoperability with obiseq, obitax, and OBITools4’s config modules (obidefault, obilog). Design emphasizes immutability-by-copy, safe concurrent access (via mutexes/atomics), and composability for reproducible bioinformatics pipelines.