Files
obitools4/autodoc/docmd/pkg_obiiter.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

4.5 KiB
Raw Blame History

Bioinformatics Sequence Processing Pipeline — Public API Overview

The obiiter package provides a high-performance, concurrent framework for processing biological sequence data (e.g., FASTQ/FASTA) in batched, streaming fashion. Built around the IBioSequence iterator interface and value-type batches (BioSequenceBatch), it supports scalable, traceable workflows with built-in memory control, threading safety, and functional composition.

Core Abstractions

  • IBioSequence: A concurrent iterator over BioSequenceBatch, enabling lazy, batched consumption.
  • BioSequenceBatch: An immutable-friendly container holding ordered sequences with metadata (source, order). Supports FIFO popping, slicing, and pairing.
  • Pipeable: A function type func(IBioSequence) IBioSequence, enabling composable transformations.

Batch & Iterator Management

  • MakeIBioSequence(...): Constructs a new iterator (e.g., from files or slices).
  • Concat(...IBioSequence): Sequentially merges multiple iterators.
  • Pool(...): Interleaves batches from several sources, preserving global order via renumbering.
  • Rebatch(size) / RebatchBySize(maxBytes, maxCount): Dynamically regroups sequences into fixed or memory-bound batches.
  • SortBatches(): Ensures strict ordering by batch metadata (order field).
  • CompleteFileIterator(): Reads remaining file content as a single batch.

Functional Transformations

  • MakeIWorker(...), WorkerPipe(...): Applies per-sequence workers in parallel.
  • MakeISliceWorker(...), SliceWorkerPipe(...): Applies batch-level (SeqSliceWorker) transformations.
  • MakeIConditionalWorker(...): Conditional worker application based on a predicate.

Filtering & Splitting

  • FilterOn(pred, size): Parallel filtering with sequence recycling.
  • DivideOn(pred, size): Splits input into two independent iterators (true/false branches).
  • FilterAnd(pred, size): Same as above but enforces paired-end consistency.

Memory & Performance Control

  • LimitMemory(fraction): Enforces heap usage ≤ fraction × total RAM via backpressure (uses runtime.ReadMemStats()).
  • Parallel workers (nworkers) and batch sizes are configurable via defaults or variadic args.

Paired-End Data Handling

All operations preserve pairing semantics:

  • IsPaired(), MarkAsPaired() on iterators and batches.
  • PairTo(other): Synchronizes two batch/iterator pairs (same order required).
  • PairedWith(), UnPair() for mate extraction and unpairing.

Sequence Numbering & Annotation

  • NumberSequences(start, forceReordering): Assigns unique sequential IDs to sequences (same ID for mates in paired mode). Supports parallel or deterministic ordering.
  • MakeSetAttributeWorker(rank): Returns a worker that annotates each sequence with taxon at specified rank (e.g., "species").

Taxonomic Profiling

  • ExtractTaxonomy(iterator, seqAsTaxa): Aggregates taxonomy across all sequences via .Slice().ExtractTaxonomy() calls. Implements map-reduce semantics for scalable taxonomic summarization.

Fragmentation

  • IFragments(minsize, length, overlap): Splits long sequences into overlapping fragments (fusion mode for remainder), with parallel workers and memory-efficient recycling.

Utility & Analysis

  • Load(): Collects all sequences into a slice (for small data).
  • Count(recycle): Returns (variants, reads, nucleotides) counts.
  • Consume() / Recycle(): Drains iterator and optionally triggers sequence recycling.

Pipeline & Teeing

  • Pipeline(start, parts...): Composes a chain of Pipeable transformations.
  • (IBioSequence).Pipe(...): Fluent method chaining for pipelines.
  • Teeable / (IBioSequence).CopyTee(): Duplicates stream into two independent, concurrently readable iterators (preserves pairing).

Progress & Logging

  • Speed(), SpeedPipe(): Adds a non-intrusive progress bar (stderr only, terminal-aware). Updates per batch and respects --no-progressbar flag.

Distribution & Routing

  • IDistribute(classifier, batchSize): Routes sequences to classified outputs based on a classifier function. Batches per class key are flushed when size or memory thresholds are reached.
    • News() channel notifies on new output streams (i.e., newly seen class keys).
    • Thread-safe, async distribution via goroutines.

All public APIs assume interoperability with obiseq, obitax, and OBITools4s config modules (obidefault, obilog). Design emphasizes immutability-by-copy, safe concurrent access (via mutexes/atomics), and composability for reproducible bioinformatics pipelines.