⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "Release 4.5.0"
- Update version.txt from 4.4.29 → 4.5.0
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
# BioSequenceBatch: A Container for Ordered Biological Sequences
`BioSequenceBatch` is a structured data type encapsulating an ordered collection of biological sequences (`obiseq.BioSequenceSlice`) along with metadata: a `source` identifier and an integer `order`. It serves as a lightweight, immutable-friendly container for batch processing in bioinformatics pipelines.
## Core Properties
- **`source`**: String identifying the origin (e.g., file, pipeline stage).
- **`order`**: Integer defining processing sequence or priority.
- **`slice`**: Holds the actual sequences via `obiseq.BioSequenceSlice`.
## Key Functionalities
- **Construction**:
`MakeBioSequenceBatch(source, order, sequences)` creates a new batch.
- **Accessors**:
`Source()`, `Order()` return metadata; `Slice()` exposes the sequence slice.
- **Mutation (via copy)**:
`Reorder(newOrder)` returns a new batch with updated order.
- **Size & emptiness**:
`Len()` gives sequence count; `NotEmpty()` checks non-emptiness.
- **Consumption**:
`Pop0()` removes and returns the first sequence (FIFO behavior).
- **Safety**:
`IsNil()` detects uninitialized batches; a global `NilBioSequenceBatch` sentinel exists.
## Design Notes
- Instances are value types (struct), enabling safe copying.
- Operations follow Go idioms: methods return updated values rather than mutating in place (except internal slice mutation via `Pop0`).
- Designed for interoperability with the OBITools4 ecosystem (`obiseq` package).
This abstraction supports modular, traceable sequence processing workflows—ideal for pipeline stages where ordering and provenance matter.
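A minimal usage sketch (the import paths and the empty-slice construction are assumptions based on the OBITools4 repository layout, not confirmed API):
```go
package main

import (
	"fmt"

	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
)

func main() {
	seqs := obiseq.BioSequenceSlice{} // would be populated with *obiseq.BioSequence values

	batch := obiiter.MakeBioSequenceBatch("sample.fasta", 0, seqs)
	fmt.Println(batch.Source(), batch.Order(), batch.Len())

	later := batch.Reorder(3) // returns a copy carrying order = 3
	for later.NotEmpty() {
		seq := later.Pop0() // FIFO: removes and returns the first remaining sequence
		_ = seq
	}
}
```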
# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, `sync.RWMutex`, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
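A consumption-loop sketch built only from the calls listed above (imports as in the `BioSequenceBatch` example earlier on this page):
```go
// drain consumes an iterator batch by batch and returns the sequence count.
func drain(iter obiiter.IBioSequence) int {
	n := 0
	for iter.Next() { // blocks until a batch is ready; false once the stream closes
		batch := iter.Get() // the batch made current by the last Next()
		n += batch.Len()
		for _, seq := range batch.Slice() {
			_ = seq // per-sequence processing goes here
		}
	}
	return n
}
```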
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
- `Add(n)`, `Done()`: Track active producer goroutines (`sync.WaitGroup`-style accounting).
- `Lock/RLock` and `Unlock/RUnlock`: Explicit synchronization.
- `Wait()`, `Close()`, `WaitAndClose()`: Graceful shutdown.
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
- `Concat(...)`: Sequentially merges multiple iterators.
- `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
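A splitting sketch; the predicate shape `func(*obiseq.BioSequence) bool` and the `Len()` accessor are assumptions:
```go
// splitByLength divides a stream on a length predicate; both branches remain
// batched, concurrently readable iterators.
func splitByLength(input obiiter.IBioSequence, min int) (long, short obiiter.IBioSequence) {
	isLong := func(s *obiseq.BioSequence) bool { return s.Len() >= min }
	return input.DivideOn(isLong, 1000) // 1000: requested output batch size
}
```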
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads the entire remaining file as one batch.
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering preserved for downstream reproducibility.
- Integrates with OBITools4's `obidefault` and `obiutils` packages for configuration and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.
# `IDistribute`: Semantic Description of Biosequence Distribution Functionality
The `IDistribute` type implements a thread-safe mechanism for distributing biosequences into classified, batched outputs.
- **Core Purpose**: Enables concurrent processing of sequences by routing them to dedicated output channels based on classification keys.
- **Key Fields**:
- `outputs`: A map from integer class codes to output streams (`IBioSequence`).
- `news`: An unbuffered channel emitting class codes when new output streams are created.
- `classifier`: A pointer to a sequence classifier used to assign sequences to keys during distribution.
- **Thread Safety**: All access to shared state (`outputs`, `slices`) is synchronized via a mutex.
- **Batching Strategy**:
- Sequences are accumulated per class key until either `BatchSizeMax()` sequences or `BatchMem()` bytes (per key) are reached.
- Batches are flushed automatically and on finalization.
- **Asynchronous Processing**:
- The `Distribute()` method launches a goroutine that consumes the input iterator, classifies each sequence, and feeds batches to per-key outputs.
- Outputs are closed only after all sequences have been processed.
- **Notifications**:
- The `News()` channel allows consumers to be notified of newly created output streams (i.e., when a new class key appears).
- **Error Handling**:
- `Outputs(key)` returns an error if the requested key has no associated output.
- **Integration**:
- Leverages `obidefault.BatchSizeMax()` and `BatchMem()` for configurable batch limits.
- Uses `SortBatches()` on the input iterator to ensure ordered processing.
In summary, `IDistribute` provides a scalable, concurrent pipeline for classifying and batching biosequences based on user-defined classification logic.
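A consumer-side sketch; the `Distribute` entry point on `IBioSequence` and the classifier type are inferred from this summary, not confirmed signatures:
```go
// consumeByClass drains every per-class output stream as soon as it is
// announced on News(); each stream gets its own goroutine.
func consumeByClass(input obiiter.IBioSequence, classifier *obiseq.BioSequenceClassifier) {
	dist := input.Distribute(classifier) // assumed entry point returning an IDistribute

	for key := range dist.News() { // one notification per newly created class key
		out, err := dist.Outputs(key)
		if err != nil {
			log.Fatalf("no output stream for class %d: %v", key, err) // standard "log" package
		}
		go func(s obiiter.IBioSequence) {
			for s.Next() {
				batch := s.Get()
				_ = batch // write the batch to its per-class destination
			}
		}(out)
	}
}
```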
# `ExtractTaxonomy` Function — Semantic Description
The `ExtractTaxonomy` method is a core utility in the `obiiter` package, designed to aggregate taxonomic information across biological sequences processed by an iterator.
- **Input**:
- A pointer to `IBioSequence`, representing a sequence iterator over biological data.
- A boolean flag `seqAsTaxa`: if true, each full sequence is treated as a single taxonomic unit; otherwise, individual elements within slices are processed separately.
- **Process**:
- Iterates through all sequences via `iterator.Next()` and retrieves each current slice using `Get().Slice()`.
- For every slice, it calls the underlying `.ExtractTaxonomy()` method (from `obitax`), progressively building or updating a shared `*obitax.Taxonomy` object.
- Stops and returns immediately upon encountering the first error during taxonomy extraction.
- **Output**:
- Returns a fully populated `*obitax.Taxonomy` object (or partial result if early failure occurs).
- Returns `nil` error on success; otherwise, returns the first encountered error.
- **Semantic Role**:
Enables scalable taxonomic profiling of high-throughput sequencing data by delegating per-slice extraction logic to the `obitax` module, while ensuring robust iteration and error handling.
- **Dependencies**:
Relies on `obitax.Taxonomy` for structured taxonomic representation and assumes slices implement the `.ExtractTaxonomy()` interface.
This function exemplifies a *map-reduce*-style pattern: mapping taxonomy extraction over slices, and reducing results into a unified taxonomic summary.
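A sketch of that loop, assuming the slice-level `ExtractTaxonomy(taxonomy, seqAsTaxa)` signature implied above:
```go
// extractTaxonomy merges per-slice taxonomies into one shared object,
// failing fast on the first extraction error.
func extractTaxonomy(iterator *obiiter.IBioSequence, seqAsTaxa bool) (*obitax.Taxonomy, error) {
	var taxonomy *obitax.Taxonomy
	var err error
	for iterator.Next() {
		slice := iterator.Get().Slice()
		taxonomy, err = slice.ExtractTaxonomy(taxonomy, seqAsTaxa) // assumed slice-level API
		if err != nil {
			return taxonomy, err // partial result plus the first error
		}
	}
	return taxonomy, nil
}
```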
# `IFragments` Functionality Overview
The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.
## Core Parameters
- `minsize`: Sequences up to this length are passed through without fragmentation.
- `length`: Desired fragment size (in bases/amino acids).
- `overlap`: Number of overlapping residues between consecutive fragments.
- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).
## Workflow
1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
2. **Parallel Fragmentation**:
- Each worker processes a subset of batches independently using goroutines.
- For each sequence longer than `minsize`, it is split into overlapping fragments of length `length` with step size = `length - overlap`.
- The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces; see the boundary sketch after this list.
3. **Resource Management**:
- Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
- Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
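A sketch of one plausible reading of that boundary rule (helper name hypothetical; assumes `0 <= overlap < length`):
```go
// fragmentBounds emits [from, to) windows every (length - overlap) positions;
// the last window absorbs the remainder rather than leaving a short tail.
func fragmentBounds(seqLen, minsize, length, overlap int) [][2]int {
	if seqLen <= minsize {
		return [][2]int{{0, seqLen}} // short sequences pass through unfragmented
	}
	step := length - overlap
	bounds := [][2]int{}
	from := 0
	for from+step+length <= seqLen {
		bounds = append(bounds, [2]int{from, from + length})
		from += step
	}
	// Fusion mode: the final fragment extends to the end of the sequence.
	return append(bounds, [2]int{from, seqLen})
}
```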
## Key Features
- **Overlap handling**: Ensures contiguous coverage without gaps.
- **Memory efficiency**: Uses recycling and batched output.
- **Scalability**: Leverages Go concurrency via `nworkers`.
- **Error safety**: Panics on subsequence errors (e.g., invalid indices).
## Use Case
Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).
# Memory-Limited Biosequence Iterator
This Go function extends an `IBioSequence` iterator with memory-aware throttling to prevent excessive heap allocation during data processing.
## Core Functionality
- **`LimitMemory(fraction float64)`**
Returns a new iterator that respects an upper bound on heap usage relative to total system memory.
- **Memory Monitoring**
Uses `runtime.ReadMemStats()` and `github.com/pbnjay/memory.TotalMemory()` to compute the current heap fraction (`Alloc / TotalMemory`) dynamically.
- **Backpressure Mechanism**
While the memory fraction exceeds `fraction`, the producer goroutine yields control (`runtime.Gosched()`) until sufficient memory becomes available.
- **Logging**
Warns via `obilog.Warnf` when:
- Memory pressure persists (every ~1000 yields),
- Or wait duration becomes unusually long (>10,000 yielding cycles).
- **Concurrency Model**
- A producer goroutine consumes from the original iterator and pushes items to `newIter`, pausing as needed.
- A dedicated consumer goroutine calls `WaitAndClose()` to ensure graceful termination and resource cleanup.
## Semantic Behavior
- **Non-blocking consumer**: Downstream consumers are not stalled; they read from an internal buffered channel (`newIter`).
- **Adaptive rate control**: The iterator automatically slows down when memory pressure rises, avoiding OOM conditions.
- **Predictable resource use**: Ensures heap usage stays below the specified `fraction` (e.g., 0.5 → ≤ 50% of total RAM).
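A sketch of the producer-side throttle, reduced to the two calls named above (warning cadence omitted):
```go
package sketch

import (
	"runtime"

	"github.com/pbnjay/memory"
)

// waitForHeadroom yields repeatedly while the heap fraction exceeds the bound.
func waitForHeadroom(fraction float64) {
	var m runtime.MemStats
	total := float64(memory.TotalMemory())
	for {
		runtime.ReadMemStats(&m)
		if float64(m.Alloc)/total <= fraction {
			return // enough headroom: resume pushing batches downstream
		}
		runtime.Gosched() // yield so consumers and the GC can make progress
	}
}
```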
# Semantic Description of `IMergeSequenceBatch` and `MergePipe`
This code defines two related functions in the `obiiter` package for batch-wise merging of biological sequences during iteration.
- **`IMergeSequenceBatch(na, statsOn, sizes...)`** (an `IBioSequence → IBioSequence` transformation)
- Consumes an input sequence iterator (`IBioSequence`) and returns a new one.
- Groups incoming sequences into batches (default size: `100`, configurable via variadic argument).
- For each batch:
- Collects up to `batchsize` sequences via the input iterator.
- Applies `.Merge(na, statsOn)` on each sequence group (presumably merging reads based on `na`, e.g., nucleotide alignment or overlap).
- Wraps merged results into a `BioSequenceBatch` with ordering metadata.
- Emits batches asynchronously via goroutines; the output iterator is closed when input finishes.
- **`MergePipe(na, statsOn, sizes...) → Pipeable`**
- A *pipeline combinator* (higher-order function), enabling functional composition.
- Returns a `Pipeable` — i.e., a transformation function compatible with iterator pipelines.
**Semantic Purpose**:
Enables efficient, memory-smoothed merging of biological sequence reads (e.g., paired-end merges) in streaming fashion, with optional statistics tracking (`statsOn`) and configurable batching.
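A composition sketch; the concrete types of `na` and `statsOn` are not spelled out in this summary, so the parameter types below are placeholders:
```go
// mergeStage applies the merge transformation with a batch size of 200
// (overriding the default of 100). A Pipeable is itself a function, so it
// can be applied directly to an iterator.
func mergeStage(input obiiter.IBioSequence, na bool, statsOn []string) obiiter.IBioSequence {
	// na (bool) and statsOn ([]string) are placeholder types, not confirmed.
	return obiiter.MergePipe(na, statsOn, 200)(input)
}
```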
# `NumberSequences` Function — Semantic Description
The `NumberSequences` method assigns a unique sequential identifier (`seq_number`) to each biological sequence in an `IBioSequence` iterator, preserving consistency for paired-end reads.
## Core Functionality
- **Sequential numbering**: Assigns integers incrementally across sequences, starting from `start` (0 by default, or user-defined).
- **Thread-safe**: Uses `sync.Mutex` and `atomic.Int64` to safely manage the global counter during concurrent processing.
- **Paired-read support**: When input is paired (`IsPaired()`), both reads in a pair receive the *same* `seq_number`, ensuring alignment between mates.
## Parallelization Strategy
- **Default mode**: Uses multiple workers (`ParallelWorkers()`) for performance; batches are processed concurrently.
- **Reordering mode**: If `forceReordering` is true:
- Input iterator is batch-sorted (`SortBatches()`).
- Parallelism disabled (1 worker) to ensure deterministic numbering order.
## Implementation Details
- Each goroutine processes its own split of the input iterator.
- A shared `next_first` counter tracks the next available sequence number globally.
- Locking ensures atomic increment and assignment, preventing race conditions.
## Output
Returns a new `IBioSequence` iterator:
- Contains the same sequence batches (possibly reordered if sorted).
- Each `BioSequence` object now carries a `"seq_number"` attribute.
- Paired sequences are co-numbered and marked accordingly.
## Use Cases
- Preparing data for downstream tools requiring unique sequence IDs.
- Maintaining cross-read identity in paired-end workflows (e.g., assembly, mapping).
- Reproducible numbering across pipeline stages or restarts.
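A usage sketch, with the argument order assumed from the description above:
```go
// renumber assigns seq_number deterministically: numbering starts at 1, and
// forceReordering=true trades parallelism for a reproducible order.
func renumber(input obiiter.IBioSequence) obiiter.IBioSequence {
	return input.NumberSequences(1, true)
}
```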
# Paired-End Sequence Handling in `obiiter`
This Go package provides semantic functionality for managing **paired-end biological sequences** within batched iterators.
- `BioSequenceBatch` methods:
- **`IsPaired()`**: Checks whether the batch contains paired reads.
- **`PairedWith()`**: Returns a new batch containing only the mate (partner) of each read in the current batch.
- **`PairTo(*BioSequenceBatch)`**: Synchronizes and pairs reads between two batches *of identical order*; fails if orders differ.
- **`UnPair()`**: Removes pairing metadata, treating reads as unpaired.
- `IBioSequence` (iterator) methods:
- **`MarkAsPaired()`**: Marks the iterator as producing paired-end data.
- **`PairTo(IBioSequence)`**: Combines two iterators into a new paired-end iterator by aligning corresponding batches and calling `PairTo` on each pair.
- **`PairedWith()`**: Generates a new iterator yielding only the mate reads (i.e., second ends) from an existing paired-end stream.
- **`IsPaired()`**: Returns whether the iterator was explicitly marked as paired.
All operations preserve batched processing and concurrency via goroutines, ensuring efficient handling of large NGS datasets while maintaining semantic correctness for paired-end workflows.
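A pairing sketch built from the iterator-level methods above (signatures inferred from this summary):
```go
// pairStreams attaches each reverse read to its forward mate, batch by batch,
// then derives a mate-only stream from the paired result.
func pairStreams(forward, reverse obiiter.IBioSequence) obiiter.IBioSequence {
	paired := forward.PairTo(reverse) // corresponding batches must share the same order
	mates := paired.PairedWith()      // stream yielding only the second ends
	_ = mates
	return paired
}
```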
# Semantic Description of `obiiter` Package Features
This Go package provides functional-style utilities for processing biological sequence data (e.g., FASTQ/FASTA), modeled via the `IBioSequence` interface.
- **`Pipeable`**: A function type representing a unary transformation on an `IBioSequence`.
- **`Pipeline(start, parts...)`**: Composes a sequence of `Pipeable` operations into a single executable pipeline. It applies transformations sequentially: input → start → part₁ → … → output.
- **`(IBioSequence).Pipe(start, parts...)`**: A convenience method enabling fluent chaining of transformations directly on a sequence object.
- **`Teeable`**: A function type for operations that split input into two independent output streams (e.g., filtering + logging).
- **`(IBioSequence).CopyTee()`**: A high-level tee operation that duplicates the input stream into two identical, concurrently readable `IBioSequence` instances.
- Uses goroutines to ensure non-blocking parallel consumption.
- Ensures proper lifecycle management: closing the second stream when the first is closed.
- Preserves paired-end status (`MarkAsPaired`) if applicable.
Together, these features support modular, composable, and concurrent biosequence processing pipelines—ideal for scalable NGS data workflows.
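A composition sketch tying these pieces together; `WorkerPipe` and `SpeedPipe` are described in the worker and `Speed` docs of this same set, and the two-value return of `CopyTee()` is inferred:
```go
// process chains two Pipeable stages, then tees the result into two
// independently consumable streams.
func process(input obiiter.IBioSequence, annotate obiiter.SeqWorker) (obiiter.IBioSequence, obiiter.IBioSequence) {
	out := input.Pipe(
		obiiter.WorkerPipe(annotate, true), // per-sequence stage, abort on error
		obiiter.SpeedPipe(),                // progress-reporting stage
	)
	return out.CopyTee() // assumed to return the two duplicate streams
}
```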
# `MakeSetAttributeWorker` Functionality Overview
The function `MakeSetAttributeWorker(rank string) obiiter.SeqWorker` constructs a reusable sequence-processing worker for taxonomic annotation.
- **Input validation**: It first verifies that the provided `rank` is part of a predefined taxonomic hierarchy (`taxonomy.RankList()`). If invalid, it terminates execution with an informative error.
- **Worker construction**: It returns a closure (`obiiter.SeqWorker`) — essentially a function that transforms biological sequences.
- **Core behavior**: For each input `*obiseq.BioSequence`, it calls `taxonomy.SetTaxonAtRank(sequence, rank)`. This likely assigns or updates the taxonomic label (e.g., species, genus) at the specified rank in the sequence's metadata.
- **Purpose**: Enables modular, pipeline-friendly taxonomic annotation — e.g., in bioinformatics workflows where sequences must be annotated hierarchically (e.g., from phylum down to species).
- **Design pattern**: Follows the *functional factory* and *worker interface* patterns, promoting composability in sequence processing pipelines.
- **Side effects**: Modifies the input `BioSequence` *in-place* (via mutation of its taxonomic metadata), then returns it.
- **Use case example**:
```go
worker := MakeSetAttributeWorker("species")
seq = worker(seq) // annotates `seq` with species-level taxon
```
- **Assumptions**:
- `taxonomy.SetTaxonAtRank` exists and handles rank-specific taxon assignment.
- Taxonomic ranks are ordered, finite, and validated (e.g., `["domain", "phylum", ..., "species"]`).
- Sequences carry mutable taxonomic metadata.
- **Error handling**: Fails fast on invalid rank input, preventing silent misannotation.
# `Speed` Functionality Description
The provided Go code defines a method and helper function to add **real-time progress tracking** to biosequence iterators in the OBITools4 framework.
## Core Features
- **Non-intrusive progress bar**:
The `Speed()` method wraps an existing iterator and displays a visual progress indicator on stderr, using the [`progressbar`](https://github.com/schollz/progressbar) library.
- **Conditional rendering**:
The progress bar is only shown when:
- `--no-progressbar` flag is *not* set (via `obidefault.ProgressBar()`),
- stderr is connected to a terminal (`os.ModeCharDevice`),
- stdout is *not* piped (to avoid interfering with file output).
- **Batch-aware counting**:
Progress is updated per batch (`batch.Len()`), not item-by-item, for efficiency and smoother UI updates (throttled to ≥100ms).
- **Paired-end support**:
If the input iterator is paired (`IsPaired()`), this property is preserved in the returned iterator.
- **Pipeable wrapper**:
`SpeedPipe()` enables integration into functional pipelines (e.g., `.Map(...).Filter(...)`) by returning a `Pipeable` function.
## Implementation Highlights
- Uses goroutines to decouple iteration and progress updates.
- Automatically closes the output iterator when input ends (`WaitAndClose()`).
- Prints a final newline to stderr upon completion.
This utility enhances user experience during long-running sequence processing (e.g., FASTQ parsing, alignment), without affecting correctness or performance in non-interactive contexts.
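A minimal pipeline sketch using the `Pipeable` form:
```go
// withProgress inserts the progress-bar stage into an existing pipeline.
func withProgress(input obiiter.IBioSequence) obiiter.IBioSequence {
	return input.Pipe(obiiter.SpeedPipe())
}
```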
# Semantic Description of `obiiter` Package Functionalities
This Go package (`obiiter`) provides utilities for applying functional transformations to biological sequence iterators, supporting parallel execution and modular piping.
- **`MakeIWorker(worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` (sequence-to-sequence transformation) to each sequence in the iterator. Supports configurable parallelism (`nworkers`) and optional channel buffering via `sizes`. Uses internal conversion to slice-based workers.
- **`MakeIConditionalWorker(predicate, worker, breakOnError bool, sizes ...int)`**:
Applies a `SeqWorker` only to sequences satisfying a given boolean `predicate`. Enables conditional, parallelized processing while preserving iterator semantics.
- **`MakeISliceWorker(worker, breakOnError bool, sizes ...int)`**:
Core method applying a `SeqSliceWorker` (batch-level transformation) across slices of sequences. Implements multi-goroutine parallelism using `nworkers`. Handles errors optionally via fatal logging (`breakOnError`). Preserves paired-end metadata.
- **`WorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Returns a `Pipeable` closure wrapping `MakeIWorker`, enabling composition in pipeline chains (e.g., for CLI or DSL-style workflows).
- **`SliceWorkerPipe(worker, breakOnError bool, sizes ...int)`**:
Similar to `WorkerPipe`, but for slice-level workers (`SeqSliceWorker`). Facilitates modular, reusable pipeline stages.
All methods support optional size arguments to override default parallelism (from `obidefault`). Internally, they rely on Go concurrency primitives (`go`, channels) and structured batch processing via `IBioSequence` interface.
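A worker sketch; the `SeqWorker` signature is inferred from the `MakeSetAttributeWorker` usage example earlier on this page, and `SetAttribute` is an assumed setter:
```go
// annotateAll tags every sequence in the stream, running the worker with the
// default parallelism and stopping on the first error (breakOnError = true).
func annotateAll(input obiiter.IBioSequence) obiiter.IBioSequence {
	tag := func(s *obiseq.BioSequence) *obiseq.BioSequence {
		s.SetAttribute("processed", true) // assumed attribute setter
		return s
	}
	return input.MakeIWorker(tag, true)
}
```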