mirror of
https://github.com/metabarcoding/obitools4.git
Semantic Description of IUniqueSequence Functionality
The IUniqueSequence function performs dereplication of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an obiiter.IBioSequenceBatch iterator.
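
To make the idea of dereplication concrete, here is a minimal sketch of the strictly identical case: equal sequences are collapsed into one record that carries a count. The `UniqueSeq` type and `Dereplicate` function are hypothetical names chosen for illustration; they are not part of the obitools4 API, and the sketch ignores metadata and near-identical matching.

```go
package main

import "fmt"

// UniqueSeq is a hypothetical record: one distinct sequence and the number
// of input reads that carried it.
type UniqueSeq struct {
	Sequence string
	Count    int
}

// Dereplicate groups strictly identical sequences and counts them,
// keeping the order in which each distinct sequence first appears.
func Dereplicate(reads []string) []UniqueSeq {
	index := make(map[string]int) // sequence -> position in out
	var out []UniqueSeq
	for _, r := range reads {
		if i, ok := index[r]; ok {
			out[i].Count++
			continue
		}
		index[r] = len(out)
		out = append(out, UniqueSeq{Sequence: r, Count: 1})
	}
	return out
}

func main() {
	for _, u := range Dereplicate([]string{"acgt", "ttga", "acgt", "acgt"}) {
		fmt.Printf("%s\t%d\n", u.Sequence, u.Count)
	}
}
```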
Core Workflow
- Input Processing
  Accepts an input sequence iterator and optional configuration via `WithOption`.
- Parallelization Strategy
  Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
- Data Splitting Phase
  - Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
  - Ensures deterministic chunking for reproducibility.
- Storage Choice
  - In-memory: via `ISequenceChunkOnMemory`.
  - Disk-based: via `ISequenceSubChunk` + external sorting (requires single worker).
- Uniqueness Classification
  - Builds a composite classifier combining:
    - Sequence identity (`SequenceClassifier`)
    - Optional annotation categories (e.g., sample, primer), with NA handling.
  - If no annotations are specified, only raw sequence identity is used.
- Singleton Filtering
  Optionally excludes singleton reads (count = 1) via `NoSingleton()` option.
- Parallel Dereplication
  - Spawns worker goroutines to process chunks.
  - Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
- Output Merging
  - Aggregates results using `IMergeSequenceBatch`, preserving:
    - Sequence counts
    - Statistics (if enabled)
    - NA handling and ordering
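
The workflow above can be sketched in a few dozen lines of Go. This is an illustration under stated assumptions, not the obitools4 implementation: `Read`, `Group`, `bucketOf`, and `Uniq` are hypothetical names, FNV-1a stands in for whatever hash the real `HashClassifier` uses, a single `Sample` field stands in for the annotation categories, and everything is held in memory (no disk-based mode).

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

// Read is a hypothetical sequence record with one annotation.
type Read struct {
	Sequence string
	Sample   string // empty means the annotation is missing
}

// Group is one dereplicated class and its read count.
type Group struct {
	Sequence, Sample string
	Count            int
}

// key builds the composite classification key: the sequence plus the
// sample annotation, with the placeholder na used for missing values.
func key(r Read, na string) [2]string {
	if r.Sample == "" {
		return [2]string{r.Sequence, na}
	}
	return [2]string{r.Sequence, r.Sample}
}

// bucketOf hashes a sequence into one of n buckets, so identical
// sequences always land in the same bucket.
func bucketOf(seq string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(seq))
	return int(h.Sum32() % uint32(n))
}

// Uniq partitions reads into buckets, dereplicates each bucket in its own
// goroutine (at most nworkers at a time), and merges the results.
func Uniq(reads []Read, buckets, nworkers int, na string, noSingleton bool) []Group {
	if buckets < 1 {
		buckets = 1
	}
	if nworkers < 1 {
		nworkers = 1
	}
	parts := make([][]Read, buckets)
	for _, r := range reads {
		b := bucketOf(r.Sequence, buckets)
		parts[b] = append(parts[b], r)
	}

	results := make([][]Group, buckets) // each goroutine writes only its own slot
	sem := make(chan struct{}, nworkers)
	var wg sync.WaitGroup
	for b := range parts {
		wg.Add(1)
		sem <- struct{}{} // wait for a free worker slot
		go func(b int) {
			defer wg.Done()
			defer func() { <-sem }()
			counts := make(map[[2]string]int)
			for _, r := range parts[b] {
				counts[key(r, na)]++
			}
			for k, c := range counts {
				if noSingleton && c == 1 {
					continue // drop classes seen only once
				}
				results[b] = append(results[b], Group{k[0], k[1], c})
			}
		}(b)
	}
	wg.Wait()

	var merged []Group
	for _, g := range results {
		merged = append(merged, g...)
	}
	// Sort so the output does not depend on map iteration or scheduling.
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Sequence != merged[j].Sequence {
			return merged[i].Sequence < merged[j].Sequence
		}
		return merged[i].Sample < merged[j].Sample
	})
	return merged
}

func main() {
	reads := []Read{{"acgt", "s1"}, {"acgt", "s1"}, {"acgt", "s2"}, {"ttga", ""}}
	for _, g := range Uniq(reads, 4, 2, "NA", false) {
		fmt.Printf("%s\t%s\t%d\n", g.Sequence, g.Sample, g.Count)
	}
}
```

Because buckets are assigned by hashing the sequence alone, every record that could share a class ends up in the same bucket, which is what allows the workers to run without coordinating with each other.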
Key Features
- Scalable: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- Configurable: Via functional options (`Options`).
- Thread-safe: Uses `sync.Mutex` for deterministic ordering.
- Metadata-aware: Incorporates annotation-based grouping (e.g., sample, primer).
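
For readers unfamiliar with the functional options idiom mentioned above, the following sketch shows one common way to write it in Go. All names here (`Settings`, `Option`, `Workers`, `NewSettings`) and the default values are assumptions made for illustration; the actual obitools4 `Options` and `WithOption` types may be shaped differently. The final check mirrors the rule stated in the workflow, that disk-based sorting runs with a single worker.

```go
package main

import "fmt"

// Settings holds hypothetical configuration values.
type Settings struct {
	Workers     int
	BatchCount  int
	SortOnDisk  bool
	NoSingleton bool
}

// Option is a function that adjusts Settings.
type Option func(*Settings)

// Workers sets the number of parallel workers.
func Workers(n int) Option { return func(s *Settings) { s.Workers = n } }

// BatchCount sets the number of buckets used to split the input.
func BatchCount(n int) Option { return func(s *Settings) { s.BatchCount = n } }

// SortOnDisk enables the disk-based mode.
func SortOnDisk() Option { return func(s *Settings) { s.SortOnDisk = true } }

// NoSingleton enables filtering of sequences seen only once.
func NoSingleton() Option { return func(s *Settings) { s.NoSingleton = true } }

// NewSettings applies the options over assumed defaults, then forces a
// single worker when disk-based sorting is requested.
func NewSettings(opts ...Option) Settings {
	s := Settings{Workers: 4, BatchCount: 100} // assumed defaults
	for _, opt := range opts {
		opt(&s)
	}
	if s.SortOnDisk {
		s.Workers = 1
	}
	return s
}

func main() {
	s := NewSettings(Workers(8), SortOnDisk(), NoSingleton())
	fmt.Printf("%+v\n", s)
}
```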