⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
This commit is contained in:
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
+15
View File
@@ -0,0 +1,15 @@
# `ISequenceChunk` Function — Semantic Description
The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.
- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
- Optional parameters allow fine-tuning:
- `dereplicate`: enables deduplication of identical sequences.
- `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
- `statsOn`: configures what metadata (e.g., description fields) are tracked for statistics.
- `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.
The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.
This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
@@ -0,0 +1,18 @@
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences
The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
Key capabilities include:
- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
Internally, `ISequenceChunkOnDisk` uses:
- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
@@ -0,0 +1,21 @@
# `ISequenceChunkOnMemory` Function — Semantic Description
The function `Isequencechunkonmemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.
It consumes an iterator over `BioSequence` objects and distributes them into **heterogeneous batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.
Key features:
- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to shared `chunks` and `sources` maps.
- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.
Input:
- An iterator (`obiiter.IBioSequence`) of raw sequences.
- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.
Output:
- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.
Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
+26
View File
@@ -0,0 +1,26 @@
# Semantic Description of `obichunk` Package
The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.
## Core Concepts
- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.
## Supported Functionalities
- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
- **Missing Value Handling**: `OptionNAValue(na string)` customizes placeholder for missing data (default: `"NA"`).
- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
- **Batch Processing Control**:
- `OptionBatchCount(number)` sets the number of batches.
- `OptionsBatchSize(size)` defines how many items per batch (default from `obidefault`).
- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures concurrency level (default from environment).
- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).
## Design Highlights
- Functional options pattern for extensibility and readability.
- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
+29
View File
@@ -0,0 +1,29 @@
# Semantic Description of `obichunk.ISequenceSubChunk`
The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.
## Core Functionality
- **Input**:
- An iterator over `BioSequence` batches (`obiiter.IBioSequence`)
- A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code
- A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism
- **Processing**:
- Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.
- For each incoming `BioSequenceBatch`:
- If the batch has >1 sequence: sequences are extracted, classified into `code`, and sorted *in-place* by class code.
- Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon code change.
- If the batch has ≤1 sequence, its passed through unchanged (but reordered with a new order ID).
- **Ordering Mechanism**:
- Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.
- Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).
- **Output**:
- Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.
- Workers are coordinated via `newIter.Done()`/`Wait()/Close()`, ensuring clean termination.
## Semantic Purpose
Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires sorted/class-ordered input — e.g., consensus building, alignment, or read merging per group.
+45
View File
@@ -0,0 +1,45 @@
# Semantic Description of `IUniqueSequence` Functionality
The `IUniqueSequence` function performs **dereplication** of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.
## Core Workflow
1. **Input Processing**
Accepts an input sequence iterator and optional configuration via `WithOption`.
2. **Parallelization Strategy**
Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
3. **Data Splitting Phase**
- Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
- Ensures deterministic chunking for reproducibility.
4. **Storage Choice**
- *In-memory*: via `ISequenceChunkOnMemory`.
- *Disk-based*: via `ISequenceSubChunk` + external sorting (requires single worker).
5. **Uniqueness Classification**
- Builds a composite classifier combining:
- Sequence identity (`SequenceClassifier`)
- Optional annotation categories (e.g., sample, primer), with NA handling.
- If no annotations are specified, only raw sequence identity is used.
6. **Singleton Filtering**
Optionally excludes singleton reads (count = 1) via `NoSingleton()` option.
7. **Parallel Dereplication**
- Spawns worker goroutines to process chunks.
- Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
8. **Output Merging**
- Aggregates results using `IMergeSequenceBatch`, preserving:
- Sequence counts
- Statistics (if enabled)
- NA handling and ordering
## Key Features
- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- **Configurable**: Via functional options (`Options`).
- **Thread-safe**: Uses `sync.Mutex` for deterministic ordering.
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).