Files
obitools4/autodoc/docmd/pkg_obichunk.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

104 lines
5.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# `obichunk`: High-Performance Chunking and Dereplication of Biological Sequences
The `obichunk` package provides scalable, configurable infrastructure for preprocessing large-scale biological sequence data (e.g., FASTA/FASTQ). It enables efficient grouping, sorting, deduplication, and batched streaming of sequences—critical for metabarcoding, metagenomics, or any high-throughput NGS workflow.
---
## Core Functionalities
### `ISequenceChunk`
Unified entry point for sequence chunking, supporting both **in-memory** and **on-disk** execution modes.
- Accepts an `obiiter.IBioSequence` iterator and a classifier (`obiseq.BioSequenceClassifier`).
- Mode selection via `onMemory` flag: routes to either `ISequenceChunkOnMemory` or `ISequenceChunkOnDisk`.
- Optional parameters:
- `dereplicate`: deduplicate identical sequences per batch.
- `na`: defines placeholder for missing/ambiguous characters (e.g., `"N"`, `"?"`).
- `statsOn`: enables metadata tracking (e.g., sample IDs, primer names) for statistics.
- `uniqueClassifier`: optional secondary classifier to assign unique labels.
Returns an iterator over processed sequences (`obiiter.IBioSequence`), supporting streaming pipelines and downstream integration.
---
### `ISequenceChunkOnDisk`
Efficiently splits sequences into **temporary on-disk batches** (`.fastx` files), ideal for large datasets.
- Automatically manages a temp directory (`obiseq_chunks_*`) and cleans up post-processing.
- Uses `find` to discover all generated chunk files recursively.
- Asynchronous streaming: batches are yielded via an iterator as theyre written, decoupling production and consumption.
- Optional per-batch dereplication using composite keys (sequence + classification).
- Logs batch count and start events for monitoring.
Internally leverages:
- `obiiter.MakeIBioSequence()` to build output iterator.
- `obiformats.WriterDispatcher` for parallel file writing.
- A dedicated goroutine to read, classify/dereplicate, and emit batches.
---
### `ISequenceChunkOnMemory`
Performs **in-memory parallel chunking** of sequences into classification-based batches.
- Routes each sequence to a bucket (flux) using the classifier.
- Maintains one `BioSequenceSlice` per classification group in memory (thread-safe via mutex).
- Emits batches **only after full input consumption**, preserving deterministic batch order (0, 1, …).
- Parallel processing: each flux handled in its own goroutine.
- Fails fast on internal errors (e.g., channel issues) via `log.Fatalf`.
Ideal for RAM-sufficient workloads requiring low-latency, ordered batch output.
---
### `Options` System
Configurable pipeline behavior via functional options pattern.
- Immutable configuration builder: `MakeOptions([]WithOption)` applies setters to internal struct.
- Key options:
- **Categorization**: `OptionSubCategory(...)` appends sample/marker labels; `PopCategories()` retrieves first.
- **Missing values**: `OptionNAValue(na)` customizes placeholder (default: `"?"`).
- **Statistics**: `OptionStatOn(...)` registers fields for metadata tracking.
- **Batching**:
- `OptionBatchCount(n)` sets number of batches (e.g., for hashing).
- `OptionsBatchSize(size)` defines items per batch.
- **Concurrency**: `OptionsParallelWorkers(n)`.
- **Sorting strategy**:
- `OptionSortOnDisk()` enables disk-backed sorting.
- `OptionSortOnMemory()` (default) uses RAM-based sort.
- **Singleton filtering**:
- `OptionsNoSingleton()` excludes singleton reads (count = 1).
- `OptionsWithSingleton()` allows them.
Defaults drawn from `obidefault`, ensuring reproducibility and ease of use.
---
### `ISequenceSubChunk`
Parallel, class-based sorting and re-batching of sequence batches.
- Input: iterator over `BioSequenceBatch`, classifier, and worker count.
- For each batch:
- If size >1: sequences are sorted *in-place* by classification code (via custom `sort.Interface`).
- Consecutive sequences with same class are regrouped into new batches.
- Uses atomic counters (`nextOrder`) to assign globally increasing order IDs across workers—ensuring deterministic inter-batch ordering.
- Preserves input-order *within* each new batch.
Use case: preparing sorted, class-homogeneous batches for downstream tasks (e.g., consensus calling or alignment).
---
### `IUniqueSequence`
End-to-end **dereplication** pipeline: groups identical sequences, aggregates counts and metadata.
- Input iterator + optional `Options`.
- Parallelization via configurable workers (falls back to single-threaded if disk sorting enabled).
- **Splitting phase**:
- Uses `HashClassifier` to partition input deterministically (controlled by `BatchCount`).
- **Storage selection**:
- In-memory: via `ISequenceChunkOnMemory`.
- On-disk: uses `ISequenceSubChunk` + external sort (single worker required).
- **Uniqueness logic**:
- Composite classifier: sequence identity + optional annotations (sample, primer).
- NA handling for missing annotation fields.
- **Singleton filtering**: optionally excludes reads with count =1 (`NoSingleton()`).
- **Parallel deduplication**:
- Workers process chunks via `ISequenceSubChunk` + per-group aggregation.
- **Merging**:
- Aggregates results via `IMergeSequenceBatch`, preserving counts, stats, and ordering.
Scalable from small datasets to terabyte-scale NGS runs.