
# Semantic Description of `IUniqueSequence` Functionality
The `IUniqueSequence` function performs **dereplication** of biological sequence data: it groups identical sequences (optionally split by annotation categories) while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.
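
To make the idea of dereplication concrete, here is a minimal sketch in plain Go. It does not use the obitools4 API: the `Read` type and the `Dereplicate` function are hypothetical names introduced only for illustration, and the sketch assumes that an unset count means a single observation.

```go
package main

import "fmt"

// Read is a hypothetical stand-in for a biological sequence record.
type Read struct {
	ID       string
	Sequence string
	Count    int
}

// Dereplicate collapses reads with identical sequences into one record each,
// summing their counts and keeping the first-seen ID and order.
func Dereplicate(reads []Read) []Read {
	index := make(map[string]int) // sequence -> position in out
	out := make([]Read, 0, len(reads))
	for _, r := range reads {
		n := r.Count
		if n < 1 {
			n = 1 // assumption: an unset count means one observation
		}
		if i, seen := index[r.Sequence]; seen {
			out[i].Count += n
			continue
		}
		index[r.Sequence] = len(out)
		out = append(out, Read{ID: r.ID, Sequence: r.Sequence, Count: n})
	}
	return out
}

func main() {
	reads := []Read{
		{ID: "r1", Sequence: "ACGT"},
		{ID: "r2", Sequence: "TTGA"},
		{ID: "r3", Sequence: "ACGT"},
	}
	for _, r := range Dereplicate(reads) {
		fmt.Printf("%s\t%s\tcount=%d\n", r.ID, r.Sequence, r.Count)
	}
}
```

Here `r1` and `r3` share a sequence, so they collapse into a single record with a count of 2, while `r2` keeps a count of 1.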
## Core Workflow
1. **Input Processing**
   Accepts an input sequence iterator and optional configuration via `WithOption`.
2. **Parallelization Strategy**
   Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
3. **Data Splitting Phase**
   - Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
   - Ensures deterministic chunking for reproducibility.
4. **Storage Choice**
   - *In-memory*: via `ISequenceChunkOnMemory`.
   - *Disk-based*: via `ISequenceSubChunk` + external sorting (requires single worker).
5. **Uniqueness Classification**
   - Builds a composite classifier combining:
     - Sequence identity (`SequenceClassifier`)
     - Optional annotation categories (e.g., sample, primer), with NA handling.
   - If no annotations are specified, only raw sequence identity is used.
6. **Singleton Filtering**
   Optionally excludes singleton reads (count = 1) via the `NoSingleton()` option.
7. **Parallel Dereplication**
   - Spawns worker goroutines to process chunks.
   - Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
8. **Output Merging**
   - Aggregates results using `IMergeSequenceBatch`, preserving:
     - Sequence counts
     - Statistics (if enabled)
     - NA handling and ordering
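
The workflow above can be approximated with the standard library alone. The sketch below is not the obitools4 implementation. `Record`, `groupKey`, `bucketOf`, and `Unique` are hypothetical names, FNV hashing stands in for whatever `HashClassifier` actually does, and everything is kept in memory. It shows hash partitioning, a composite key with an NA fallback, per-bucket dereplication in worker goroutines, optional singleton filtering, and a final merge.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
	"sync"
)

// Record is a hypothetical stand-in for an annotated sequence.
type Record struct {
	Sequence    string
	Annotations map[string]string
	Count       int
}

// groupKey joins the sequence with the selected annotation values;
// a missing annotation is replaced by naValue.
func groupKey(r Record, categories []string, naValue string) string {
	parts := []string{r.Sequence}
	for _, c := range categories {
		v, ok := r.Annotations[c]
		if !ok {
			v = naValue
		}
		parts = append(parts, v)
	}
	return strings.Join(parts, "\x00")
}

// bucketOf hashes a sequence so identical sequences share a bucket.
func bucketOf(seq string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(seq))
	return int(h.Sum32() % uint32(n))
}

// Unique splits records into hash buckets, dereplicates each bucket in a
// worker goroutine, optionally drops singletons, and merges sorted by key.
func Unique(records []Record, categories []string, naValue string, nBuckets, nWorkers int, noSingleton bool) []Record {
	if nBuckets < 1 || nWorkers < 1 {
		panic("nBuckets and nWorkers must be >= 1")
	}
	buckets := make([][]Record, nBuckets)
	for _, r := range records {
		b := bucketOf(r.Sequence, nBuckets)
		buckets[b] = append(buckets[b], r)
	}
	results := make([][]Record, nBuckets) // one slot per bucket: no lock needed
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range jobs {
				groups := make(map[string]Record)
				for _, r := range buckets[b] {
					k := groupKey(r, categories, naValue)
					if g, ok := groups[k]; ok {
						g.Count += r.Count
						r = g // keep the first record seen, with the summed count
					}
					groups[k] = r
				}
				for _, g := range groups {
					if noSingleton && g.Count == 1 {
						continue
					}
					results[b] = append(results[b], g)
				}
			}
		}()
	}
	for b := range buckets {
		jobs <- b
	}
	close(jobs)
	wg.Wait()
	var out []Record
	for _, rs := range results {
		out = append(out, rs...)
	}
	sort.Slice(out, func(i, j int) bool {
		return groupKey(out[i], categories, naValue) < groupKey(out[j], categories, naValue)
	})
	return out
}

func main() {
	a := map[string]string{"sample": "A"}
	recs := []Record{{"ACGT", a, 1}, {"ACGT", a, 1}, {"TTGA", nil, 1}, {"TTGA", nil, 1}, {"GGCC", a, 1}}
	for _, r := range Unique(recs, []string{"sample"}, "NA", 4, 2, true) {
		fmt.Println(r.Sequence, r.Count)
	}
}
```

Because the bucket is chosen from the sequence alone, every record that could share a group key lands in the same bucket, so workers never need to coordinate on a group. In this run the `GGCC` singleton is dropped and the two remaining groups each carry a count of 2. The final sort is a choice made for this sketch so that the output does not depend on goroutine scheduling.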
## Key Features
- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- **Configurable**: Via functional options (`Options`).
- **Thread-safe**: Uses `sync.Mutex` to keep output ordering consistent across concurrent workers.
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).
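
As an illustration of the functional-options and mutex points above, the following sketch shows the general Go patterns involved. All names (`config`, `Option`, `WithWorkers`, `WithSortOnDisk`, `WithNoSingleton`, `orderCounter`) and the default values are assumptions made for this example, not the actual obitools4 identifiers or defaults.

```go
package main

import (
	"fmt"
	"sync"
)

// config holds hypothetical dereplication settings.
type config struct {
	workers     int
	batchCount  int
	sortOnDisk  bool
	noSingleton bool
}

// Option is a functional option that adjusts a config.
type Option func(*config)

func WithWorkers(n int) Option    { return func(c *config) { c.workers = n } }
func WithBatchCount(n int) Option { return func(c *config) { c.batchCount = n } }
func WithSortOnDisk() Option      { return func(c *config) { c.sortOnDisk = true } }
func WithNoSingleton() Option     { return func(c *config) { c.noSingleton = true } }

// newConfig applies options over assumed defaults. Disk sorting forces a
// single worker, mirroring the fallback described in the workflow.
func newConfig(opts ...Option) config {
	c := config{workers: 4, batchCount: 100}
	for _, opt := range opts {
		opt(&c)
	}
	if c.sortOnDisk {
		c.workers = 1
	}
	return c
}

// orderCounter hands out increasing order numbers; the mutex makes Next
// safe to call from several goroutines at once.
type orderCounter struct {
	mu   sync.Mutex
	next int
}

func (o *orderCounter) Next() int {
	o.mu.Lock()
	defer o.mu.Unlock()
	n := o.next
	o.next++
	return n
}

func main() {
	cfg := newConfig(WithWorkers(8), WithSortOnDisk(), WithNoSingleton())
	fmt.Println("workers:", cfg.workers, "noSingleton:", cfg.noSingleton)

	var counter orderCounter
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter.Next()
		}()
	}
	wg.Wait()
	fmt.Println("next order:", counter.Next())
}
```

The mutex guarantees that no two goroutines receive the same order number. Which goroutine receives which number still depends on scheduling, so any stronger ordering guarantee has to come from how those numbers are used downstream.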