Mirror of https://github.com/metabarcoding/obitools4.git, synced 2026-04-30 03:50:39 +00:00

⬆️ Version bump to v4.5: update obioptions.Version from "Release 4.4.29"; update version.txt (automated by Makefile)
# `ISequenceChunk` Function — Semantic Description

The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.

- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
- Optional parameters allow fine-tuning:
  - `dereplicate`: enables deduplication of identical sequences.
  - `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
  - `statsOn`: configures which metadata (e.g., description fields) is tracked for statistics.
  - `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.

The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.

This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
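The two-mode dispatch can be sketched in plain Go. Everything below — the `Seq` type and the `sequenceChunk`, `chunkInMemory`, and `chunkOnDisk` helpers — is illustrative, not the actual obitools4 API; the sketch only mirrors the contract that both storage strategies must yield the same grouping.

```go
package main

import "fmt"

// Seq is a stand-in for obiseq.BioSequence in this sketch.
type Seq struct {
	ID       string
	Sequence string
}

// chunkInMemory groups sequences by a classifier key, entirely in RAM.
func chunkInMemory(seqs []Seq, classify func(Seq) string) map[string][]Seq {
	groups := make(map[string][]Seq)
	for _, s := range seqs {
		k := classify(s)
		groups[k] = append(groups[k], s)
	}
	return groups
}

// chunkOnDisk would spill each group to a temporary file in a real
// implementation; its observable grouping contract is identical.
func chunkOnDisk(seqs []Seq, classify func(Seq) string) map[string][]Seq {
	return chunkInMemory(seqs, classify)
}

// sequenceChunk mirrors the dispatch described above: one entry point,
// two storage strategies, identical behavior either way.
func sequenceChunk(seqs []Seq, classify func(Seq) string, onMemory bool) map[string][]Seq {
	if onMemory {
		return chunkInMemory(seqs, classify)
	}
	return chunkOnDisk(seqs, classify)
}

func main() {
	seqs := []Seq{{"r1", "ACGT"}, {"r2", "ACGT"}, {"r3", "TTTT"}}
	bySeq := func(s Seq) string { return s.Sequence }
	fmt.Println(len(sequenceChunk(seqs, bySeq, true)))  // 2
	fmt.Println(len(sequenceChunk(seqs, bySeq, false))) // 2
}
```

The key design point the sketch captures is that `onMemory` changes only the storage strategy, never the result.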
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences

The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.

Key capabilities include:

- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.

Internally, `ISequenceChunkOnDisk` uses:

- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.

Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
# `ISequenceChunkOnMemory` Function — Semantic Description

The function `ISequenceChunkOnMemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.

It consumes an iterator over `BioSequence` objects and distributes them into **heterogeneous batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.

Key features:

- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to the shared `chunks` and `sources` maps.
- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.

Input:

- An iterator (`obiiter.IBioSequence`) of raw sequences.
- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.

Output:

- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.

Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
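The aggregate-then-emit pattern (mutex-guarded shared map, `Wait()` before emission) can be sketched as follows. `groupParallel` is an illustrative stand-in: it works on plain string slices rather than iterators, and for determinism the demo sorts class names and batch contents, whereas the real function orders batches by their `order` index.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// groupParallel sketches the ISequenceChunkOnMemory pattern: goroutines
// classify sequences and append them to shared per-class slices under a
// mutex; batches are emitted only after the WaitGroup completes.
func groupParallel(seqs []string, classify func(string) string) [][]string {
	var (
		mu     sync.Mutex
		chunks = make(map[string][]string)
		wg     sync.WaitGroup
	)
	for _, s := range seqs {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			class := classify(s)
			mu.Lock() // thread-safe aggregation into the shared map
			chunks[class] = append(chunks[class], s)
			mu.Unlock()
		}(s)
	}
	wg.Wait() // lazy emission: nothing leaves before every group is complete

	// Emit batches in a stable order (sorted class names in this sketch).
	classes := make([]string, 0, len(chunks))
	for c := range chunks {
		classes = append(classes, c)
	}
	sort.Strings(classes)
	batches := make([][]string, 0, len(classes))
	for _, c := range classes {
		batch := chunks[c]
		sort.Strings(batch) // within-batch order is arbitrary after parallel appends
		batches = append(batches, batch)
	}
	return batches
}

func main() {
	seqs := []string{"ACGT", "ACGA", "TTTT"}
	byFirstBase := func(s string) string { return s[:1] }
	fmt.Println(groupParallel(seqs, byFirstBase)) // [[ACGA ACGT] [TTTT]]
}
```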
# Semantic Description of `obichunk` Package

The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.

## Core Concepts

- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.

## Supported Functionalities

- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
- **Missing Value Handling**: `OptionNAValue(na string)` customizes the placeholder for missing data (default: `"NA"`).
- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
- **Batch Processing Control**:
  - `OptionBatchCount(number)` sets the number of batches.
  - `OptionsBatchSize(size)` defines how many items go in each batch (default from `obidefault`).
- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures the concurrency level (default from environment).
- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).

## Design Highlights

- Functional options pattern for extensibility and readability.
- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
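The functional options pattern described above can be reconstructed in a few lines. The field names and default values here are illustrative (the real defaults come from `obidefault`); only the shape of the pattern — `MakeOptions` applying a slice of `WithOption` setters to a pointer-wrapped struct — follows the description.

```go
package main

import "fmt"

// __options__ holds the concrete settings, hidden behind Options.pointer.
type __options__ struct {
	naValue    string
	batchSize  int
	categories []string
}

type Options struct{ pointer *__options__ }

// WithOption is a functional setter applied during construction.
type WithOption func(Options)

// MakeOptions builds an Options value with defaults, then applies setters.
func MakeOptions(setters []WithOption) Options {
	opts := Options{pointer: &__options__{
		naValue:   "NA", // default missing-value placeholder
		batchSize: 5000, // illustrative stand-in for the obidefault value
	}}
	for _, set := range setters {
		set(opts)
	}
	return opts
}

// OptionNAValue customizes the placeholder for missing data.
func OptionNAValue(na string) WithOption {
	return func(o Options) { o.pointer.naValue = na }
}

// OptionSubCategory appends category labels to the internal list.
func OptionSubCategory(keys ...string) WithOption {
	return func(o Options) {
		o.pointer.categories = append(o.pointer.categories, keys...)
	}
}

func main() {
	opts := MakeOptions([]WithOption{
		OptionNAValue("?"),
		OptionSubCategory("sample", "marker"),
	})
	fmt.Println(opts.pointer.naValue, opts.pointer.categories) // ? [sample marker]
}
```

Because setters receive the `Options` wrapper rather than the struct, every copy of an `Options` value shares the same underlying `__options__`, which is what makes the pointer-based encapsulation work.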
# Semantic Description of `obichunk.ISequenceSubChunk`

The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.

## Core Functionality

- **Input**:
  - An iterator over `BioSequence` batches (`obiiter.IBioSequence`)
  - A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code
  - A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism

- **Processing**:
  - Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.
  - For each incoming `BioSequenceBatch`:
    - If the batch has more than one sequence, sequences are extracted, classified into a `code`, and sorted *in-place* by class code.
    - Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon a code change.
  - If the batch has at most one sequence, it is passed through as-is, re-stamped with a new order ID.

- **Ordering Mechanism**:
  - Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.
  - Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).

- **Output**:
  - Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.
  - Workers are coordinated via `newIter.Done()` / `Wait()` / `Close()`, ensuring clean termination.

## Semantic Purpose

Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires sorted/class-ordered input — e.g., consensus building, alignment, or read merging per group.
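The per-batch work — sort by class code, cut at every code change, stamp each sub-batch with an atomically increasing order ID — can be sketched as below. `splitByCode` is an illustrative stand-in operating on strings; the real function sorts with a closure-based `sort.Interface`, while this sketch uses `sort.SliceStable` for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// splitByCode sorts a batch by class code, then cuts it at every code
// change, stamping each emitted sub-batch with a global order ID.
func splitByCode(seqs []string, codeOf func(string) int, nextOrder *int32) map[int32][]string {
	sort.SliceStable(seqs, func(i, j int) bool { // stable: input order kept within a class
		return codeOf(seqs[i]) < codeOf(seqs[j])
	})
	out := make(map[int32][]string)
	start := 0
	for i := 1; i <= len(seqs); i++ {
		if i == len(seqs) || codeOf(seqs[i]) != codeOf(seqs[start]) {
			// the atomic counter keeps order IDs unique across worker goroutines
			id := atomic.AddInt32(nextOrder, 1) - 1
			out[id] = seqs[start:i]
			start = i
		}
	}
	return out
}

func main() {
	var nextOrder int32
	codeOf := func(s string) int { return len(s) } // toy classifier: code = length
	batches := splitByCode([]string{"AAA", "C", "GG", "T"}, codeOf, &nextOrder)
	for id := int32(0); id < nextOrder; id++ {
		fmt.Println(id, batches[id])
	}
}
```

With several workers sharing `nextOrder`, `atomic.AddInt32` guarantees each emitted batch gets a unique, strictly increasing ID without a lock.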
# Semantic Description of `IUniqueSequence` Functionality

The `IUniqueSequence` function performs **dereplication** of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.

## Core Workflow

1. **Input Processing**
   Accepts an input sequence iterator and optional configuration via `WithOption`.

2. **Parallelization Strategy**
   Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.

3. **Data Splitting Phase**
   - Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
   - Ensures deterministic chunking for reproducibility.

4. **Storage Choice**
   - *In-memory*: via `ISequenceChunkOnMemory`.
   - *Disk-based*: via `ISequenceSubChunk` + external sorting (requires a single worker).

5. **Uniqueness Classification**
   - Builds a composite classifier combining:
     - Sequence identity (`SequenceClassifier`)
     - Optional annotation categories (e.g., sample, primer), with NA handling.
   - If no annotations are specified, only raw sequence identity is used.

6. **Singleton Filtering**
   Optionally excludes singleton reads (count = 1) via the `NoSingleton()` option.

7. **Parallel Dereplication**
   - Spawns worker goroutines to process chunks.
   - Each worker applies `ISequenceSubChunk` plus deduplication logic per classifier group.

8. **Output Merging**
   - Aggregates results using `IMergeSequenceBatch`, preserving:
     - Sequence counts
     - Statistics (if enabled)
     - NA handling and ordering
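The uniqueness-classification and singleton-filtering steps above can be sketched with a composite-key map. The `Read` type and `dereplicate` function are illustrative, not the obitools4 API; the point is that the dereplication key combines the sequence with its annotation categories, substituting the NA value for missing annotations, and that singletons can be dropped afterwards.

```go
package main

import "fmt"

// Read is a stand-in for an annotated BioSequence in this sketch.
type Read struct {
	Sequence string
	Sample   string // annotation category used in the composite key
	Count    int
}

// dereplicate merges reads sharing the same composite key
// (sequence + category value, with NA for missing annotations),
// aggregating their counts; singletons are optionally dropped.
func dereplicate(reads []Read, na string, noSingleton bool) map[string]int {
	counts := make(map[string]int)
	for _, r := range reads {
		sample := r.Sample
		if sample == "" {
			sample = na // NA handling for a missing annotation
		}
		key := r.Sequence + "|" + sample
		counts[key] += r.Count
	}
	if noSingleton {
		for k, c := range counts {
			if c == 1 { // singleton filtering: total count == 1
				delete(counts, k)
			}
		}
	}
	return counts
}

func main() {
	reads := []Read{
		{"ACGT", "s1", 1}, {"ACGT", "s1", 2}, // merged: same sequence, same sample
		{"ACGT", "s2", 1}, // kept distinct: different sample
		{"TTTT", "", 1},   // missing annotation -> NA
	}
	out := dereplicate(reads, "NA", true)
	fmt.Println(len(out), out["ACGT|s1"]) // 1 3
}
```

When no annotation categories are configured, the key degenerates to the raw sequence, matching the "only raw sequence identity" case in step 5.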
## Key Features

- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- **Configurable**: Via functional options (`Options`).
- **Thread-safe**: Shared state is guarded with `sync.Mutex`, keeping output ordering deterministic.
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).