Semantic Description of IUniqueSequence Functionality

The IUniqueSequence function dereplicates biological sequence data: it groups identical sequences (optionally split by annotation categories) into single records while preserving their metadata and counts. It operates on an obiiter.IBioSequenceBatch iterator.

Core Workflow

  1. Input Processing
    Accepts an input sequence iterator and optional configuration via WithOption.

  2. Parallelization Strategy
    Supports a configurable number of parallel workers (nworkers). When SortOnDisk() is enabled, processing falls back to a single worker, because disk-based sorting requires it.

  3. Data Splitting Phase

    • Uses HashClassifier to partition input into buckets (controlled by BatchCount).
    • Ensures deterministic chunking for reproducibility.
  4. Storage Choice

    • In-memory: via ISequenceChunkOnMemory.
    • Disk-based: via ISequenceSubChunk + external sorting (requires single worker).
  5. Uniqueness Classification

    • Builds a composite classifier combining:
      • Sequence identity (SequenceClassifier)
      • Optional annotation categories (e.g., sample, primer), with NA handling.
    • If no annotations are specified, only raw sequence identity is used.
  6. Singleton Filtering
    Optionally excludes singletons (sequences observed only once, count = 1) via the NoSingleton() option.

  7. Parallel Dereplication

    • Spawns worker goroutines to process chunks.
    • Each worker applies ISequenceSubChunk + deduplication logic per classifier group.
  8. Output Merging

    • Aggregates results using IMergeSequenceBatch, preserving:
      • Sequence counts
      • Statistics (if enabled)
      • NA handling and ordering
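
The workflow above can be illustrated with a minimal, self-contained sketch. It does not use the obitools4 API: the Seq type and the bucketOf, uniqueKey and Dereplicate functions are hypothetical names introduced for illustration only, and the FNV hash stands in for whatever HashClassifier actually uses. The sketch covers the in-memory path only: hash-based splitting, a composite sequence-plus-annotation key with an NA placeholder, parallel workers, singleton filtering, and a final merge.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
	"sync"
)

// Seq is a hypothetical stand-in for a sequence record (not an obitools4 type).
type Seq struct {
	Sequence string
	Annot    map[string]string
	Count    int
}

// bucketOf deterministically assigns a record to one of n buckets from its sequence.
func bucketOf(s Seq, n int) int {
	h := fnv.New32a()
	h.Write([]byte(s.Sequence))
	return int(h.Sum32() % uint32(n))
}

// uniqueKey joins the sequence with the selected annotation values,
// substituting na when a category is missing from the record.
func uniqueKey(s Seq, categories []string, na string) string {
	parts := []string{s.Sequence}
	for _, c := range categories {
		v, ok := s.Annot[c]
		if !ok {
			v = na
		}
		parts = append(parts, v)
	}
	return strings.Join(parts, "\x00")
}

// Dereplicate splits records into buckets, groups each bucket by composite key
// in parallel workers, optionally drops singletons, and merges the results.
func Dereplicate(seqs []Seq, categories []string, na string, nBuckets, nWorkers int, noSingleton bool) []Seq {
	buckets := make([][]Seq, nBuckets)
	for _, s := range seqs {
		b := bucketOf(s, nBuckets)
		buckets[b] = append(buckets[b], s)
	}

	jobs := make(chan []Seq)
	var (
		mu     sync.Mutex // guards merged, which all workers append to
		merged []Seq
		wg     sync.WaitGroup
	)
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range jobs {
				groups := map[string]*Seq{}
				for _, s := range chunk {
					k := uniqueKey(s, categories, na)
					if g, ok := groups[k]; ok {
						g.Count += s.Count
					} else {
						first := s // keep the first record as the group representative
						groups[k] = &first
					}
				}
				mu.Lock()
				for _, g := range groups {
					if noSingleton && g.Count == 1 {
						continue
					}
					merged = append(merged, *g)
				}
				mu.Unlock()
			}
		}()
	}
	for _, b := range buckets {
		jobs <- b
	}
	close(jobs)
	wg.Wait()

	// Worker completion order is not deterministic, so sort by key at the end.
	sort.Slice(merged, func(i, j int) bool {
		return uniqueKey(merged[i], categories, na) < uniqueKey(merged[j], categories, na)
	})
	return merged
}

func main() {
	seqs := []Seq{
		{Sequence: "ACGT", Annot: map[string]string{"sample": "A"}, Count: 1},
		{Sequence: "ACGT", Annot: map[string]string{"sample": "A"}, Count: 1},
		{Sequence: "ACGT", Annot: map[string]string{"sample": "B"}, Count: 1},
		{Sequence: "TTGA", Count: 1}, // no sample annotation: grouped under "NA"
	}
	for _, s := range Dereplicate(seqs, []string{"sample"}, "NA", 4, 2, false) {
		fmt.Printf("%s sample=%q count=%d\n", s.Sequence, s.Annot["sample"], s.Count)
	}
}
```

Because the bucket is derived from the sequence alone, two records with the same composite key always land in the same bucket, so each worker can group its chunk without coordinating with the others; only the final append needs a lock.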

Key Features

  • Scalable: Supports both memory-efficient (disk) and high-speed (RAM) modes.
  • Configurable: Via functional options (Options).
  • Thread-safe: Uses sync.Mutex to serialize access to shared state, so that output ordering stays consistent across workers.
  • Metadata-aware: Incorporates annotation-based grouping (e.g., sample, primer).
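
The functional-options style of configuration mentioned above can be sketched as follows. This is an illustration of the general Go pattern, not the package's actual code: the options struct, its fields, the default values, and the withSortOnDisk, withNoSingleton, withWorkers and withCategories setters are all assumed names. Only the rule that disk-based sorting forces a single worker is taken from the description.

```go
package main

import "fmt"

// options holds hypothetical settings; field names and defaults are illustrative.
type options struct {
	sortOnDisk  bool
	noSingleton bool
	batchCount  int
	nworkers    int
	categories  []string
}

// WithOption mutates an options value; each setter below returns one.
type WithOption func(*options)

func withSortOnDisk() WithOption  { return func(o *options) { o.sortOnDisk = true } }
func withNoSingleton() WithOption { return func(o *options) { o.noSingleton = true } }
func withWorkers(n int) WithOption {
	return func(o *options) { o.nworkers = n }
}
func withCategories(c ...string) WithOption {
	return func(o *options) { o.categories = append(o.categories, c...) }
}

// makeOptions starts from assumed defaults and applies the setters in order.
func makeOptions(setters ...WithOption) options {
	o := options{batchCount: 100, nworkers: 4}
	for _, set := range setters {
		set(&o)
	}
	return o
}

// effectiveWorkers applies the rule stated in the workflow:
// disk-based sorting runs with a single worker.
func (o options) effectiveWorkers() int {
	if o.sortOnDisk {
		return 1
	}
	return o.nworkers
}

func main() {
	o := makeOptions(withWorkers(8), withSortOnDisk(), withCategories("sample", "primer"))
	fmt.Println(o.effectiveWorkers(), o.noSingleton, o.categories)
}
```

The appeal of this pattern is that callers name only the settings they change, and new options can be added later without breaking existing call sites.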