Mirror of https://github.com/metabarcoding/obitools4.git, synced 2026-04-30 03:50:39 +00:00

⬆️ Version bump to v4.5: update obioptions.Version from "Release 4.4.29"; update version.txt (automated by Makefile)
# `ISequenceChunk` Function — Semantic Description

The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.

- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
- Optional parameters allow fine-tuning:
  - `dereplicate`: enables deduplication of identical sequences.
  - `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
  - `statsOn`: configures which metadata (e.g., description fields) is tracked for statistics.
  - `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.

The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.

This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
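The two-mode dispatch can be sketched in plain Go. Everything below — the `Seq` type and the `sequenceChunk`, `chunkInMemory`, and `chunkOnDisk` helpers — is illustrative, not the actual obitools4 API; the sketch only mirrors the contract that both storage strategies must yield the same grouping.

```go
package main

import "fmt"

// Seq is a stand-in for obiseq.BioSequence in this sketch.
type Seq struct {
	ID       string
	Sequence string
}

// chunkInMemory groups sequences by a classifier key, entirely in RAM.
func chunkInMemory(seqs []Seq, classify func(Seq) string) map[string][]Seq {
	groups := make(map[string][]Seq)
	for _, s := range seqs {
		k := classify(s)
		groups[k] = append(groups[k], s)
	}
	return groups
}

// chunkOnDisk would spill each group to a temporary file in a real
// implementation; its observable grouping contract is identical.
func chunkOnDisk(seqs []Seq, classify func(Seq) string) map[string][]Seq {
	return chunkInMemory(seqs, classify)
}

// sequenceChunk mirrors the dispatch described above: one entry point,
// two storage strategies, identical behavior either way.
func sequenceChunk(seqs []Seq, classify func(Seq) string, onMemory bool) map[string][]Seq {
	if onMemory {
		return chunkInMemory(seqs, classify)
	}
	return chunkOnDisk(seqs, classify)
}

func main() {
	seqs := []Seq{{"r1", "ACGT"}, {"r2", "ACGT"}, {"r3", "TTTT"}}
	bySeq := func(s Seq) string { return s.Sequence }
	fmt.Println(len(sequenceChunk(seqs, bySeq, true)))  // 2
	fmt.Println(len(sequenceChunk(seqs, bySeq, false))) // 2
}
```

The key design point the sketch captures is that `onMemory` changes only the storage strategy, never the result.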
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences

The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.

Key capabilities include:

- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.

Internally, `ISequenceChunkOnDisk` uses:

- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.

Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
# `ISequenceChunkOnMemory` Function — Semantic Description

The function `ISequenceChunkOnMemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.

It consumes an iterator over `BioSequence` objects and distributes them into **heterogeneous batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.

Key features:

- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to the shared `chunks` and `sources` maps.
- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.

Input:

- An iterator (`obiiter.IBioSequence`) of raw sequences.
- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.

Output:

- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.

Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
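The aggregate-then-emit pattern (mutex-guarded shared map, `Wait()` before emission) can be sketched as follows. `groupParallel` is an illustrative stand-in: it works on plain string slices rather than iterators, and for determinism the demo sorts class names and batch contents, whereas the real function orders batches by their `order` index.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// groupParallel sketches the ISequenceChunkOnMemory pattern: goroutines
// classify sequences and append them to shared per-class slices under a
// mutex; batches are emitted only after the WaitGroup completes.
func groupParallel(seqs []string, classify func(string) string) [][]string {
	var (
		mu     sync.Mutex
		chunks = make(map[string][]string)
		wg     sync.WaitGroup
	)
	for _, s := range seqs {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			class := classify(s)
			mu.Lock() // thread-safe aggregation into the shared map
			chunks[class] = append(chunks[class], s)
			mu.Unlock()
		}(s)
	}
	wg.Wait() // lazy emission: nothing leaves before every group is complete

	// Emit batches in a stable order (sorted class names in this sketch).
	classes := make([]string, 0, len(chunks))
	for c := range chunks {
		classes = append(classes, c)
	}
	sort.Strings(classes)
	batches := make([][]string, 0, len(classes))
	for _, c := range classes {
		batch := chunks[c]
		sort.Strings(batch) // within-batch order is arbitrary after parallel appends
		batches = append(batches, batch)
	}
	return batches
}

func main() {
	seqs := []string{"ACGT", "ACGA", "TTTT"}
	byFirstBase := func(s string) string { return s[:1] }
	fmt.Println(groupParallel(seqs, byFirstBase)) // [[ACGA ACGT] [TTTT]]
}
```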
# Semantic Description of `obichunk` Package

The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.

## Core Concepts

- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.

## Supported Functionalities

- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
- **Missing Value Handling**: `OptionNAValue(na string)` customizes the placeholder for missing data (default: `"NA"`).
- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
- **Batch Processing Control**:
  - `OptionBatchCount(number)` sets the number of batches.
  - `OptionsBatchSize(size)` defines how many items go in each batch (default from `obidefault`).
- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures the concurrency level (default from environment).
- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).

## Design Highlights

- Functional options pattern for extensibility and readability.
- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
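The functional options pattern described above can be reconstructed in a few lines. The field names and default values here are illustrative (the real defaults come from `obidefault`); only the shape of the pattern — `MakeOptions` applying a slice of `WithOption` setters to a pointer-wrapped struct — follows the description.

```go
package main

import "fmt"

// __options__ holds the concrete settings, hidden behind Options.pointer.
type __options__ struct {
	naValue    string
	batchSize  int
	categories []string
}

type Options struct{ pointer *__options__ }

// WithOption is a functional setter applied during construction.
type WithOption func(Options)

// MakeOptions builds an Options value with defaults, then applies setters.
func MakeOptions(setters []WithOption) Options {
	opts := Options{pointer: &__options__{
		naValue:   "NA", // default missing-value placeholder
		batchSize: 5000, // illustrative stand-in for the obidefault value
	}}
	for _, set := range setters {
		set(opts)
	}
	return opts
}

// OptionNAValue customizes the placeholder for missing data.
func OptionNAValue(na string) WithOption {
	return func(o Options) { o.pointer.naValue = na }
}

// OptionSubCategory appends category labels to the internal list.
func OptionSubCategory(keys ...string) WithOption {
	return func(o Options) {
		o.pointer.categories = append(o.pointer.categories, keys...)
	}
}

func main() {
	opts := MakeOptions([]WithOption{
		OptionNAValue("?"),
		OptionSubCategory("sample", "marker"),
	})
	fmt.Println(opts.pointer.naValue, opts.pointer.categories) // ? [sample marker]
}
```

Because setters receive the `Options` wrapper rather than the struct, every copy of an `Options` value shares the same underlying `__options__`, which is what makes the pointer-based encapsulation work.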
# Semantic Description of `obichunk.ISequenceSubChunk`

The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.

## Core Functionality

- **Input**:
  - An iterator over `BioSequence` batches (`obiiter.IBioSequence`)
  - A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code
  - A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism

- **Processing**:
  - Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.
  - For each incoming `BioSequenceBatch`:
    - If the batch has more than one sequence, sequences are extracted, classified into a `code`, and sorted *in-place* by class code.
    - Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon a code change.
  - If the batch has at most one sequence, it is passed through as-is, re-stamped with a new order ID.

- **Ordering Mechanism**:
  - Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.
  - Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).

- **Output**:
  - Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.
  - Workers are coordinated via `newIter.Done()` / `Wait()` / `Close()`, ensuring clean termination.

## Semantic Purpose

Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires sorted/class-ordered input — e.g., consensus building, alignment, or read merging per group.
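The per-batch work — sort by class code, cut at every code change, stamp each sub-batch with an atomically increasing order ID — can be sketched as below. `splitByCode` is an illustrative stand-in operating on strings; the real function sorts with a closure-based `sort.Interface`, while this sketch uses `sort.SliceStable` for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// splitByCode sorts a batch by class code, then cuts it at every code
// change, stamping each emitted sub-batch with a global order ID.
func splitByCode(seqs []string, codeOf func(string) int, nextOrder *int32) map[int32][]string {
	sort.SliceStable(seqs, func(i, j int) bool { // stable: input order kept within a class
		return codeOf(seqs[i]) < codeOf(seqs[j])
	})
	out := make(map[int32][]string)
	start := 0
	for i := 1; i <= len(seqs); i++ {
		if i == len(seqs) || codeOf(seqs[i]) != codeOf(seqs[start]) {
			// the atomic counter keeps order IDs unique across worker goroutines
			id := atomic.AddInt32(nextOrder, 1) - 1
			out[id] = seqs[start:i]
			start = i
		}
	}
	return out
}

func main() {
	var nextOrder int32
	codeOf := func(s string) int { return len(s) } // toy classifier: code = length
	batches := splitByCode([]string{"AAA", "C", "GG", "T"}, codeOf, &nextOrder)
	for id := int32(0); id < nextOrder; id++ {
		fmt.Println(id, batches[id])
	}
}
```

With several workers sharing `nextOrder`, `atomic.AddInt32` guarantees each emitted batch gets a unique, strictly increasing ID without a lock.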
# Semantic Description of `IUniqueSequence` Functionality

The `IUniqueSequence` function performs **dereplication** of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.

## Core Workflow

1. **Input Processing**
   Accepts an input sequence iterator and optional configuration via `WithOption`.

2. **Parallelization Strategy**
   Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.

3. **Data Splitting Phase**
   - Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
   - Ensures deterministic chunking for reproducibility.

4. **Storage Choice**
   - *In-memory*: via `ISequenceChunkOnMemory`.
   - *Disk-based*: via `ISequenceSubChunk` + external sorting (requires a single worker).

5. **Uniqueness Classification**
   - Builds a composite classifier combining:
     - Sequence identity (`SequenceClassifier`)
     - Optional annotation categories (e.g., sample, primer), with NA handling.
   - If no annotations are specified, only raw sequence identity is used.

6. **Singleton Filtering**
   Optionally excludes singleton reads (count = 1) via the `NoSingleton()` option.

7. **Parallel Dereplication**
   - Spawns worker goroutines to process chunks.
   - Each worker applies `ISequenceSubChunk` plus deduplication logic per classifier group.

8. **Output Merging**
   - Aggregates results using `IMergeSequenceBatch`, preserving:
     - Sequence counts
     - Statistics (if enabled)
     - NA handling and ordering
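The uniqueness-classification and singleton-filtering steps above can be sketched with a composite-key map. The `Read` type and `dereplicate` function are illustrative, not the obitools4 API; the point is that the dereplication key combines the sequence with its annotation categories, substituting the NA value for missing annotations, and that singletons can be dropped afterwards.

```go
package main

import "fmt"

// Read is a stand-in for an annotated BioSequence in this sketch.
type Read struct {
	Sequence string
	Sample   string // annotation category used in the composite key
	Count    int
}

// dereplicate merges reads sharing the same composite key
// (sequence + category value, with NA for missing annotations),
// aggregating their counts; singletons are optionally dropped.
func dereplicate(reads []Read, na string, noSingleton bool) map[string]int {
	counts := make(map[string]int)
	for _, r := range reads {
		sample := r.Sample
		if sample == "" {
			sample = na // NA handling for a missing annotation
		}
		key := r.Sequence + "|" + sample
		counts[key] += r.Count
	}
	if noSingleton {
		for k, c := range counts {
			if c == 1 { // singleton filtering: total count == 1
				delete(counts, k)
			}
		}
	}
	return counts
}

func main() {
	reads := []Read{
		{"ACGT", "s1", 1}, {"ACGT", "s1", 2}, // merged: same sequence, same sample
		{"ACGT", "s2", 1}, // kept distinct: different sample
		{"TTTT", "", 1},   // missing annotation -> NA
	}
	out := dereplicate(reads, "NA", true)
	fmt.Println(len(out), out["ACGT|s1"]) // 1 3
}
```

When no annotation categories are configured, the key degenerates to the raw sequence, matching the "only raw sequence identity" case in step 5.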
## Key Features

- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- **Configurable**: Via functional options (`Options`).
- **Thread-safe**: Shared state is guarded with `sync.Mutex`, keeping output ordering deterministic.
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).