Files
obitools4/autodoc/docmd/pkg_obichunk.md
T

104 lines
5.2 KiB
Markdown
Raw Normal View History

2026-04-07 08:36:50 +02:00
# `obichunk`: High-Performance Chunking and Dereplication of Biological Sequences
The `obichunk` package provides scalable, configurable infrastructure for preprocessing large-scale biological sequence data (e.g., FASTA/FASTQ). It enables efficient grouping, sorting, deduplication, and batched streaming of sequences—critical for metabarcoding, metagenomics, or any high-throughput NGS workflow.
---
## Core Functionalities
### `ISequenceChunk`
Unified entry point for sequence chunking, supporting both **in-memory** and **on-disk** execution modes.
- Accepts an `obiiter.IBioSequence` iterator and a classifier (`obiseq.BioSequenceClassifier`).
- Mode selection via `onMemory` flag: routes to either `ISequenceChunkOnMemory` or `ISequenceChunkOnDisk`.
- Optional parameters:
- `dereplicate`: deduplicate identical sequences per batch.
- `na`: defines placeholder for missing/ambiguous characters (e.g., `"N"`, `"?"`).
- `statsOn`: enables metadata tracking (e.g., sample IDs, primer names) for statistics.
- `uniqueClassifier`: optional secondary classifier to assign unique labels.
Returns an iterator over processed sequences (`obiiter.IBioSequence`), supporting streaming pipelines and downstream integration.
---
### `ISequenceChunkOnDisk`
Efficiently splits sequences into **temporary on-disk batches** (`.fastx` files), ideal for large datasets.
- Automatically manages a temp directory (`obiseq_chunks_*`) and cleans up post-processing.
- Uses `find` to discover all generated chunk files recursively.
- Asynchronous streaming: batches are yielded via an iterator as theyre written, decoupling production and consumption.
- Optional per-batch dereplication using composite keys (sequence + classification).
- Logs batch count and start events for monitoring.
Internally leverages:
- `obiiter.MakeIBioSequence()` to build output iterator.
- `obiformats.WriterDispatcher` for parallel file writing.
- A dedicated goroutine to read, classify/dereplicate, and emit batches.
---
### `ISequenceChunkOnMemory`
Performs **in-memory parallel chunking** of sequences into classification-based batches.
- Routes each sequence to a bucket (flux) using the classifier.
- Maintains one `BioSequenceSlice` per classification group in memory (thread-safe via mutex).
- Emits batches **only after full input consumption**, preserving deterministic batch order (0, 1, …).
- Parallel processing: each flux handled in its own goroutine.
- Fails fast on internal errors (e.g., channel issues) via `log.Fatalf`.
Ideal for RAM-sufficient workloads requiring low-latency, ordered batch output.
---
### `Options` System
Configurable pipeline behavior via functional options pattern.
- Immutable configuration builder: `MakeOptions([]WithOption)` applies setters to internal struct.
- Key options:
- **Categorization**: `OptionSubCategory(...)` appends sample/marker labels; `PopCategories()` retrieves first.
- **Missing values**: `OptionNAValue(na)` customizes placeholder (default: `"?"`).
- **Statistics**: `OptionStatOn(...)` registers fields for metadata tracking.
- **Batching**:
- `OptionBatchCount(n)` sets number of batches (e.g., for hashing).
- `OptionsBatchSize(size)` defines items per batch.
- **Concurrency**: `OptionsParallelWorkers(n)`.
- **Sorting strategy**:
- `OptionSortOnDisk()` enables disk-backed sorting.
- `OptionSortOnMemory()` (default) uses RAM-based sort.
- **Singleton filtering**:
- `OptionsNoSingleton()` excludes singleton reads (count = 1).
- `OptionsWithSingleton()` allows them.
Defaults drawn from `obidefault`, ensuring reproducibility and ease of use.
---
### `ISequenceSubChunk`
Parallel, class-based sorting and re-batching of sequence batches.
- Input: iterator over `BioSequenceBatch`, classifier, and worker count.
- For each batch:
- If size >1: sequences are sorted *in-place* by classification code (via custom `sort.Interface`).
- Consecutive sequences with same class are regrouped into new batches.
- Uses atomic counters (`nextOrder`) to assign globally increasing order IDs across workers—ensuring deterministic inter-batch ordering.
- Preserves input-order *within* each new batch.
Use case: preparing sorted, class-homogeneous batches for downstream tasks (e.g., consensus calling or alignment).
---
### `IUniqueSequence`
End-to-end **dereplication** pipeline: groups identical sequences, aggregates counts and metadata.
- Input iterator + optional `Options`.
- Parallelization via configurable workers (falls back to single-threaded if disk sorting enabled).
- **Splitting phase**:
- Uses `HashClassifier` to partition input deterministically (controlled by `BatchCount`).
- **Storage selection**:
- In-memory: via `ISequenceChunkOnMemory`.
- On-disk: uses `ISequenceSubChunk` + external sort (single worker required).
- **Uniqueness logic**:
- Composite classifier: sequence identity + optional annotations (sample, primer).
- NA handling for missing annotation fields.
- **Singleton filtering**: optionally excludes reads with count =1 (`NoSingleton()`).
- **Parallel deduplication**:
- Workers process chunks via `ISequenceSubChunk` + per-group aggregation.
- **Merging**:
- Aggregates results via `IMergeSequenceBatch`, preserving counts, stats, and ordering.
Scalable from small datasets to terabyte-scale NGS runs.