⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "Release 4.5.0"
- Update version.txt from 4.4.29 → 4.5.0
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
# BioSequenceBatch: A Container for Ordered Biological Sequences
`BioSequenceBatch` is a structured data type encapsulating an ordered collection of biological sequences (`obiseq.BioSequenceSlice`) along with metadata: a `source` identifier and an integer `order`. It serves as a lightweight, immutable-friendly container for batch processing in bioinformatics pipelines.
## Core Properties
- **`source`**: String identifying the origin (e.g., file, pipeline stage).
- **`order`**: Integer defining processing sequence or priority.
- **`slice`**: Holds the actual sequences via `obiseq.BioSequenceSlice`.
## Key Functionalities
- **Construction**:
`MakeBioSequenceBatch(source, order, sequences)` creates a new batch.
- **Accessors**:
`Source()`, `Order()` return metadata; `Slice()` exposes the sequence slice.
- **Mutation (via copy)**:
`Reorder(newOrder)` returns a new batch with updated order.
- **Size & emptiness**:
`Len()` gives sequence count; `NotEmpty()` checks non-emptiness.
- **Consumption**:
`Pop0()` removes and returns the first sequence (FIFO behavior).
- **Safety**:
`IsNil()` detects uninitialized batches; a global `NilBioSequenceBatch` sentinel exists.
## Design Notes
- Instances are value types (struct), enabling safe copying.
- Operations follow Go idioms: methods return updated values rather than mutating in place (except internal slice mutation via `Pop0`).
- Designed for interoperability with the OBITools4 ecosystem (`obiseq` package).
This abstraction supports modular, traceable sequence processing workflows—ideal for pipeline stages where ordering and provenance matter.
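A minimal usage sketch (the import paths and the empty-slice construction are assumptions based on the OBITools4 repository layout, not confirmed API):
```go
package main

import (
	"fmt"

	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
)

func main() {
	seqs := obiseq.BioSequenceSlice{} // would be populated with *obiseq.BioSequence values

	batch := obiiter.MakeBioSequenceBatch("sample.fasta", 0, seqs)
	fmt.Println(batch.Source(), batch.Order(), batch.Len())

	later := batch.Reorder(3) // returns a copy carrying order = 3
	for later.NotEmpty() {
		seq := later.Pop0() // FIFO: removes and returns the first remaining sequence
		_ = seq
	}
}
```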
# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, `sync.RWMutex`, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
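A consumption-loop sketch built only from the calls listed above (imports as in the `BioSequenceBatch` example earlier on this page):
```go
// drain consumes an iterator batch by batch and returns the sequence count.
func drain(iter obiiter.IBioSequence) int {
	n := 0
	for iter.Next() { // blocks until a batch is ready; false once the stream closes
		batch := iter.Get() // the batch made current by the last Next()
		n += batch.Len()
		for _, seq := range batch.Slice() {
			_ = seq // per-sequence processing goes here
		}
	}
	return n
}
```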
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
- `Add(n)`, `Done()`: Track active producer goroutines (`sync.WaitGroup`-style accounting).
- `Lock/RLock` and `Unlock/RUnlock`: Explicit synchronization.
- `Wait()`, `Close()`, `WaitAndClose()`: Graceful shutdown.
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
- `Concat(...)`: Sequentially merges multiple iterators.
- `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
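A splitting sketch; the predicate shape `func(*obiseq.BioSequence) bool` and the `Len()` accessor are assumptions:
```go
// splitByLength divides a stream on a length predicate; both branches remain
// batched, concurrently readable iterators.
func splitByLength(input obiiter.IBioSequence, min int) (long, short obiiter.IBioSequence) {
	isLong := func(s *obiseq.BioSequence) bool { return s.Len() >= min }
	return input.DivideOn(isLong, 1000) // 1000: requested output batch size
}
```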
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads the entire remaining file as one batch.
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering preserved for downstream reproducibility.
- Integrates with OBITools4's `obidefault` and `obiutils` packages for configuration and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.
# `IDistribute`: Semantic Description of Biosequence Distribution Functionality
The `IDistribute` type implements a thread-safe mechanism for distributing biosequences into classified, batched outputs.
- **Core Purpose**: Enables concurrent processing of sequences by routing them to dedicated output channels based on classification keys.
- **Key Fields**:
- `outputs`: A map from integer class codes to output streams (`IBioSequence`).
- `news`: An unbuffered channel emitting class codes when new output streams are created.
- `classifier`: A pointer to a sequence classifier used to assign sequences to keys during distribution.
- **Thread Safety**: All access to shared state (`outputs`, `slices`) is synchronized via a mutex.
- **Batching Strategy**:
- Sequences are accumulated per class key until either `BatchSizeMax()` sequences or `BatchMem()` bytes (per key) are reached.
- Batches are flushed automatically and on finalization.
- **Asynchronous Processing**:
- The `Distribute()` method launches a goroutine that consumes the input iterator, classifies each sequence, and feeds batches to per-key outputs.
- Outputs are closed only after all sequences have been processed.
- **Notifications**:
- The `News()` channel allows consumers to be notified of newly created output streams (i.e., when a new class key appears).
- **Error Handling**:
- `Outputs(key)` returns an error if the requested key has no associated output.
- **Integration**:
- Leverages `obidefault.BatchSizeMax()` and `BatchMem()` for configurable batch limits.
- Uses `SortBatches()` on the input iterator to ensure ordered processing.
In summary, `IDistribute` provides a scalable, concurrent pipeline for classifying and batching biosequences based on user-defined classification logic.
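A consumer-side sketch; the `Distribute` entry point on `IBioSequence` and the classifier type are inferred from this summary, not confirmed signatures:
```go
// consumeByClass drains every per-class output stream as soon as it is
// announced on News(); each stream gets its own goroutine.
func consumeByClass(input obiiter.IBioSequence, classifier *obiseq.BioSequenceClassifier) {
	dist := input.Distribute(classifier) // assumed entry point returning an IDistribute

	for key := range dist.News() { // one notification per newly created class key
		out, err := dist.Outputs(key)
		if err != nil {
			log.Fatalf("no output stream for class %d: %v", key, err) // standard "log" package
		}
		go func(s obiiter.IBioSequence) {
			for s.Next() {
				batch := s.Get()
				_ = batch // write the batch to its per-class destination
			}
		}(out)
	}
}
```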
# `ExtractTaxonomy` Function — Semantic Description
The `ExtractTaxonomy` method is a core utility in the `obiiter` package, designed to aggregate taxonomic information across biological sequences processed by an iterator.
- **Input**:
- A pointer to `IBioSequence`, representing a sequence iterator over biological data.
- A boolean flag `seqAsTaxa`: if true, each full sequence is treated as a single taxonomic unit; otherwise, individual elements within slices are processed separately.
- **Process**:
- Iterates through all sequences via `iterator.Next()` and retrieves each current slice using `Get().Slice()`.
- For every slice, it calls the underlying `.ExtractTaxonomy()` method (from `obitax`), progressively building or updating a shared `*obitax.Taxonomy` object.
- Stops and returns immediately upon encountering the first error during taxonomy extraction.
- **Output**:
- Returns a fully populated `*obitax.Taxonomy` object (or partial result if early failure occurs).
- Returns `nil` error on success; otherwise, returns the first encountered error.
- **Semantic Role**:
Enables scalable taxonomic profiling of high-throughput sequencing data by delegating per-slice extraction logic to the `obitax` module, while ensuring robust iteration and error handling.
- **Dependencies**:
Relies on `obitax.Taxonomy` for structured taxonomic representation and assumes slices implement the `.ExtractTaxonomy()` interface.
This function exemplifies a *map-reduce*-style pattern: mapping taxonomy extraction over slices, and reducing results into a unified taxonomic summary.
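A sketch of that loop, assuming the slice-level `ExtractTaxonomy(taxonomy, seqAsTaxa)` signature implied above:
```go
// extractTaxonomy merges per-slice taxonomies into one shared object,
// failing fast on the first extraction error.
func extractTaxonomy(iterator *obiiter.IBioSequence, seqAsTaxa bool) (*obitax.Taxonomy, error) {
	var taxonomy *obitax.Taxonomy
	var err error
	for iterator.Next() {
		slice := iterator.Get().Slice()
		taxonomy, err = slice.ExtractTaxonomy(taxonomy, seqAsTaxa) // assumed slice-level API
		if err != nil {
			return taxonomy, err // partial result plus the first error
		}
	}
	return taxonomy, nil
}
```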
# `IFragments` Functionality Overview
The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.
## Core Parameters
- `minsize`: Sequences up to this length are passed through without fragmentation.
- `length`: Desired fragment size (in bases/amino acids).
- `overlap`: Number of overlapping residues between consecutive fragments.
- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).
## Workflow
1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
2. **Parallel Fragmentation**:
- Each worker processes a subset of batches independently using goroutines.
- For each sequence longer than `minsize`, it is split into overlapping fragments of length `length` with step size = `length - overlap`.
- The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces; see the boundary sketch after this list.
3. **Resource Management**:
- Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
- Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
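A sketch of one plausible reading of that boundary rule (helper name hypothetical; assumes `0 <= overlap < length`):
```go
// fragmentBounds emits [from, to) windows every (length - overlap) positions;
// the last window absorbs the remainder rather than leaving a short tail.
func fragmentBounds(seqLen, minsize, length, overlap int) [][2]int {
	if seqLen <= minsize {
		return [][2]int{{0, seqLen}} // short sequences pass through unfragmented
	}
	step := length - overlap
	bounds := [][2]int{}
	from := 0
	for from+step+length <= seqLen {
		bounds = append(bounds, [2]int{from, from + length})
		from += step
	}
	// Fusion mode: the final fragment extends to the end of the sequence.
	return append(bounds, [2]int{from, seqLen})
}
```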
## Key Features
- **Overlap handling**: Ensures contiguous coverage without gaps.
- **Memory efficiency**: Uses recycling and batched output.
- **Scalability**: Leverages Go concurrency via `nworkers`.
- **Error safety**: Panics on subsequence errors (e.g., invalid indices).
## Use Case
Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).
# Memory-Limited Biosequence Iterator
This Go function extends an `IBioSequence` iterator with memory-aware throttling to prevent excessive heap allocation during data processing.
## Core Functionality
- **`LimitMemory(fraction float64)`**
Returns a new iterator that respects an upper bound on heap usage relative to total system memory.
- **Memory Monitoring**
Uses `runtime.ReadMemStats()` and `github.com/pbnjay/memory.TotalMemory()` to compute the current heap fraction (`Alloc / TotalMemory`) dynamically.
- **Backpressure Mechanism**
While the memory fraction exceeds `fraction`, the producer goroutine yields control (`runtime.Gosched()`) until sufficient memory becomes available.
- **Logging**
Warns via `obilog.Warnf` when:
- Memory pressure persists (every ~1000 yields),
- Or wait duration becomes unusually long (>10,000 yielding cycles).
- **Concurrency Model**
- A producer goroutine consumes from the original iterator and pushes items to `newIter`, pausing as needed.
- A dedicated consumer goroutine calls `WaitAndClose()` to ensure graceful termination and resource cleanup.
## Semantic Behavior
- **Non-blocking consumer**: Downstream consumers are not stalled; they read from an internal buffered channel (`newIter`).
- **Adaptive rate control**: The iterator automatically slows down when memory pressure rises, avoiding OOM conditions.
- **Predictable resource use**: Ensures heap usage stays below the specified `fraction` (e.g., 0.5 → ≤ 50% of total RAM).
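A sketch of the producer-side throttle, reduced to the two calls named above (warning cadence omitted):
```go
package sketch

import (
	"runtime"

	"github.com/pbnjay/memory"
)

// waitForHeadroom yields repeatedly while the heap fraction exceeds the bound.
func waitForHeadroom(fraction float64) {
	var m runtime.MemStats
	total := float64(memory.TotalMemory())
	for {
		runtime.ReadMemStats(&m)
		if float64(m.Alloc)/total <= fraction {
			return // enough headroom: resume pushing batches downstream
		}
		runtime.Gosched() // yield so consumers and the GC can make progress
	}
}
```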
# Semantic Description of `IMergeSequenceBatch` and `MergePipe`
This code defines two related functions in the `obiiter` package for batch-wise merging of biological sequences during iteration.
- **`IMergeSequenceBatch(na, statsOn, sizes...)`** (an `IBioSequence → IBioSequence` transformation)
- Consumes an input sequence iterator (`IBioSequence`) and returns a new one.
- Groups incoming sequences into batches (default size: `100`, configurable via variadic argument).
- For each batch:
- Collects up to `batchsize` sequences via the input iterator.
- Applies `.Merge(na, statsOn)` on each sequence group (presumably merging reads based on `na`, e.g., nucleotide alignment or overlap).
- Wraps merged results into a `BioSequenceBatch` with ordering metadata.
- Emits batches asynchronously via goroutines; the output iterator is closed when input finishes.
- **`MergePipe(na, statsOn, sizes...) → Pipeable`**
- A *pipeline combinator* (higher-order function), enabling functional composition.
- Returns a `Pipeable` — i.e., a transformation function compatible with iterator pipelines.
**Semantic Purpose**:
Enables efficient, memory-smoothed merging of biological sequence reads (e.g., paired-end merges) in streaming fashion, with optional statistics tracking (`statsOn`) and configurable batching.
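A composition sketch; the concrete types of `na` and `statsOn` are not spelled out in this summary, so the parameter types below are placeholders:
```go
// mergeStage applies the merge transformation with a batch size of 200
// (overriding the default of 100). A Pipeable is itself a function, so it
// can be applied directly to an iterator.
func mergeStage(input obiiter.IBioSequence, na bool, statsOn []string) obiiter.IBioSequence {
	// na (bool) and statsOn ([]string) are placeholder types, not confirmed.
	return obiiter.MergePipe(na, statsOn, 200)(input)
}
```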
# `NumberSequences` Function — Semantic Description
The `NumberSequences` method assigns a unique sequential identifier (`seq_number`) to each biological sequence in an `IBioSequence` iterator, preserving consistency for paired-end reads.
## Core Functionality
- **Sequential numbering**: Assigns integers incrementally across sequences, starting from `start` (0 by default, or user-defined).
- **Thread-safe**: Uses `sync.Mutex` and `atomic.Int64` to safely manage the global counter during concurrent processing.
- **Paired-read support**: When input is paired (`IsPaired()`), both reads in a pair receive the *same* `seq_number`, ensuring alignment between mates.
## Parallelization Strategy
- **Default mode**: Uses multiple workers (`ParallelWorkers()`) for performance; batches are processed concurrently.
- **Reordering mode**: If `forceReordering` is true:
- Input iterator is batch-sorted (`SortBatches()`).
- Parallelism disabled (1 worker) to ensure deterministic numbering order.
## Implementation Details
- Each goroutine processes its own split of the input iterator.
- A shared `next_first` counter tracks the next available sequence number globally.
- Locking ensures atomic increment and assignment, preventing race conditions.
## Output
Returns a new `IBioSequence` iterator:
- Contains the same sequence batches (possibly reordered if sorted).
- Each `BioSequence` object now carries a `"seq_number"` attribute.
- Paired sequences are co-numbered and marked accordingly.
## Use Cases
- Preparing data for downstream tools requiring unique sequence IDs.
- Maintaining cross-read identity in paired-end workflows (e.g., assembly, mapping).
- Reproducible numbering across pipeline stages or restarts.
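A usage sketch, with the argument order assumed from the description above:
```go
// renumber assigns seq_number deterministically: numbering starts at 1, and
// forceReordering=true trades parallelism for a reproducible order.
func renumber(input obiiter.IBioSequence) obiiter.IBioSequence {
	return input.NumberSequences(1, true)
}
```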
# Paired-End Sequence Handling in `obiiter`
This Go package provides semantic functionality for managing **paired-end biological sequences** within batched iterators.
- `BioSequenceBatch` methods:
- **`IsPaired()`**: Checks whether the batch contains paired reads.
- **`PairedWith()`**: Returns a new batch containing only the mate (partner) of each read in the current batch.
- **`PairTo(*BioSequenceBatch)`**: Synchronizes and pairs reads between two batches *of identical order*; fails if orders differ.
- **`UnPair()`**: Removes pairing metadata, treating reads as unpaired.
- `IBioSequence` (iterator) methods:
- **`MarkAsPaired()`**: Marks the iterator as producing paired-end data.
- **`PairTo(IBioSequence)`**: Combines two iterators into a new paired-end iterator by aligning corresponding batches and calling `PairTo` on each pair.
- **`PairedWith()`**: Generates a new iterator yielding only the mate reads (i.e., second ends) from an existing paired-end stream.
- **`IsPaired()`**: Returns whether the iterator was explicitly marked as paired.
All operations preserve batched processing and concurrency via goroutines, ensuring efficient handling of large NGS datasets while maintaining semantic correctness for paired-end workflows.
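A pairing sketch built from the iterator-level methods above (signatures inferred from this summary):
```go
// pairStreams attaches each reverse read to its forward mate, batch by batch,
// then derives a mate-only stream from the paired result.
func pairStreams(forward, reverse obiiter.IBioSequence) obiiter.IBioSequence {
	paired := forward.PairTo(reverse) // corresponding batches must share the same order
	mates := paired.PairedWith()      // stream yielding only the second ends
	_ = mates
	return paired
}
```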
# Semantic Description of `obiiter` Package Features
This Go package provides functional-style utilities for processing biological sequence data (e.g., FASTQ/FASTA), modeled via the `IBioSequence` interface.
- **`Pipeable`**: A function type representing a unary transformation on an `IBioSequence`.
- **`Pipeline(start, parts...)`**: Composes a sequence of `Pipeable` operations into a single executable pipeline. It applies transformations sequentially: input → start → part₁ → … → output.
- **`(IBioSequence).Pipe(start, parts...)`**: A convenience method enabling fluent chaining of transformations directly on a sequence object.
- **`Teeable`**: A function type for operations that split input into two independent output streams (e.g., filtering + logging).
- **`(IBioSequence).CopyTee()`**: A high-level tee operation that duplicates the input stream into two identical, concurrently readable `IBioSequence` instances.
- Uses goroutines to ensure non-blocking parallel consumption.
- Ensures proper lifecycle management: closing the second stream when the first is closed.
- Preserves paired-end status (`MarkAsPaired`) if applicable.
Together, these features support modular, composable, and concurrent biosequence processing pipelines—ideal for scalable NGS data workflows.
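A composition sketch tying these pieces together; `WorkerPipe` and `SpeedPipe` are described in the worker and `Speed` docs of this same set, and the two-value return of `CopyTee()` is inferred:
```go
// process chains two Pipeable stages, then tees the result into two
// independently consumable streams.
func process(input obiiter.IBioSequence, annotate obiiter.SeqWorker) (obiiter.IBioSequence, obiiter.IBioSequence) {
	out := input.Pipe(
		obiiter.WorkerPipe(annotate, true), // per-sequence stage, abort on error
		obiiter.SpeedPipe(),                // progress-reporting stage
	)
	return out.CopyTee() // assumed to return the two duplicate streams
}
```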
# `MakeSetAttributeWorker` Functionality Overview
The function `MakeSetAttributeWorker(rank string) obiiter.SeqWorker` constructs a reusable sequence-processing worker for taxonomic annotation.
- **Input validation**: It first verifies that the provided `rank` is part of a predefined taxonomic hierarchy (`taxonomy.RankList()`). If invalid, it terminates execution with an informative error.
- **Worker construction**: It returns a closure (`obiiter.SeqWorker`) — essentially a function that transforms biological sequences.
- **Core behavior**: For each input `*obiseq.BioSequence`, it calls `taxonomy.SetTaxonAtRank(sequence, rank)`. This likely assigns or updates the taxonomic label (e.g., species, genus) at the specified rank in the sequence's metadata.
- **Purpose**: Enables modular, pipeline-friendly taxonomic annotation — e.g., in bioinformatics workflows where sequences must be annotated hierarchically (e.g., from phylum down to species).
- **Design pattern**: Follows the *functional factory* and *worker interface* patterns, promoting composability in sequence processing pipelines.
- **Side effects**: Modifies the input `BioSequence` *in-place* (via mutation of its taxonomic metadata), then returns it.
- **Use case example**:
```go
worker := MakeSetAttributeWorker("species")
seq = worker(seq) // annotates `seq` with species-level taxon
```
- **Assumptions**:
- `taxonomy.SetTaxonAtRank` exists and handles rank-specific taxon assignment.
- Taxonomic ranks are ordered, finite, and validated (e.g., `["domain", "phylum", ..., "species"]`).
- Sequences carry mutable taxonomic metadata.
- **Error handling**: Fails fast on invalid rank input, preventing silent misannotation.
# `Speed` Functionality Description
The provided Go code defines a method and helper function to add **real-time progress tracking** to biosequence iterators in the OBITools4 framework.
## Core Features
- **Non-intrusive progress bar**:
The `Speed()` method wraps an existing iterator and displays a visual progress indicator on stderr, using the [`progressbar`](https://github.com/schollz/progressbar) library.
- **Conditional rendering**:
The progress bar is only shown when:
- `--no-progressbar` flag is *not* set (via `obidefault.ProgressBar()`),
- stderr is connected to a terminal (`os.ModeCharDevice`),
- stdout is *not* piped (to avoid interfering with file output).
- **Batch-aware counting**:
Progress is updated per batch (`batch.Len()`), not item-by-item, for efficiency and smoother UI updates (throttled to ≥100ms).
- **Paired-end support**:
If the input iterator is paired (`IsPaired()`), this property is preserved in the returned iterator.
- **Pipeable wrapper**:
`SpeedPipe()` enables integration into functional pipelines (e.g., `.Map(...).Filter(...)`) by returning a `Pipeable` function.
## Implementation Highlights
- Uses goroutines to decouple iteration and progress updates.
- Automatically closes the output iterator when input ends (`WaitAndClose()`).
- Prints a final newline to stderr upon completion.
This utility enhances user experience during long-running sequence processing (e.g., FASTQ parsing, alignment), without affecting correctness or performance in non-interactive contexts.
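A minimal pipeline sketch using the `Pipeable` form:
```go
// withProgress inserts the progress-bar stage into an existing pipeline.
func withProgress(input obiiter.IBioSequence) obiiter.IBioSequence {
	return input.Pipe(obiiter.SpeedPipe())
}
```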
# Semantic Description of `obiiter` Package Functionalities
This Go package (`obiiter`) provides utilities for applying functional transformations to biological sequence iterators, supporting parallel execution and modular piping.
- **`MakeIWorker(worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` (sequence-to-sequence transformation) to each sequence in the iterator. Supports configurable parallelism (`nworkers`) and optional channel buffering via `sizes`. Uses internal conversion to slice-based workers.
- **`MakeIConditionalWorker(predicate, worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` only to sequences satisfying a given boolean `predicate`. Enables conditional, parallelized processing while preserving iterator semantics.
- **`MakeISliceWorker(worker, breakOnError bool, sizes ...int)`**:
Core method applying a `SeqSliceWorker` (batch-level transformation) across slices of sequences. Implements multi-goroutine parallelism using `nworkers`. Handles errors optionally via fatal logging (`breakOnError`). Preserves paired-end metadata.
- **`WorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Returns a `Pipeable` closure wrapping `MakeIWorker`, enabling composition in pipeline chains (e.g., for CLI or DSL-style workflows).
- **`SliceWorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Similar to `WorkerPipe`, but for slice-level workers (`SeqSliceWorker`). Facilitates modular, reusable pipeline stages.
All methods support optional size arguments to override default parallelism (from `obidefault`). Internally, they rely on Go concurrency primitives (`go`, channels) and structured batch processing via `IBioSequence` interface.
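A worker sketch; the `SeqWorker` signature is inferred from the `MakeSetAttributeWorker` usage example earlier on this page, and `SetAttribute` is an assumed setter:
```go
// annotateAll tags every sequence in the stream, running the worker with the
// default parallelism and stopping on the first error (breakOnError = true).
func annotateAll(input obiiter.IBioSequence) obiiter.IBioSequence {
	tag := func(s *obiseq.BioSequence) *obiseq.BioSequence {
		s.SetAttribute("processed", true) // assumed attribute setter
		return s
	}
	return input.MakeIWorker(tag, true)
}
```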