obichunk: High-Performance Chunking and Dereplication of Biological Sequences
The obichunk package provides scalable, configurable infrastructure for preprocessing large-scale biological sequence data (e.g., FASTA/FASTQ). It enables efficient grouping, sorting, deduplication, and batched streaming of sequences—critical for metabarcoding, metagenomics, or any high-throughput NGS workflow.
Core Functionalities
ISequenceChunk
Unified entry point for sequence chunking, supporting both in-memory and on-disk execution modes.
- Accepts an obiiter.IBioSequence iterator and a classifier (obiseq.BioSequenceClassifier).
- Mode selection via the onMemory flag: routes to either ISequenceChunkOnMemory or ISequenceChunkOnDisk.
- Optional parameters:
  - dereplicate: deduplicate identical sequences per batch.
  - na: defines placeholder for missing/ambiguous characters (e.g., "N", "?").
  - statsOn: enables metadata tracking (e.g., sample IDs, primer names) for statistics.
  - uniqueClassifier: optional secondary classifier to assign unique labels.
Returns an iterator over processed sequences (obiiter.IBioSequence), supporting streaming pipelines and downstream integration.
ISequenceChunkOnDisk
Efficiently splits sequences into temporary on-disk batches (.fastx files), ideal for large datasets.
- Automatically manages a temp directory (obiseq_chunks_*) and cleans up post-processing.
- Uses find to discover all generated chunk files recursively.
- Asynchronous streaming: batches are yielded via an iterator as they’re written, decoupling production and consumption.
- Optional per-batch dereplication using composite keys (sequence + classification).
- Logs batch count and start events for monitoring.
Internally leverages:
- obiiter.MakeIBioSequence() to build the output iterator.
- obiformats.WriterDispatcher for parallel file writing.
- A dedicated goroutine to read, classify/dereplicate, and emit batches.
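The on-disk flow described above (temporary directory, one file per class, recursive discovery, cleanup) can be sketched with the standard library alone. This is a minimal illustration, not obichunk's implementation: writeChunks, findChunks, chunkOnDisk, and the one-record-per-line file layout are hypothetical names and choices made for the example, and it runs sequentially rather than streaming batches from a goroutine.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// writeChunks (hypothetical) appends each record to a file named after its
// class code inside dir, one record per line.
func writeChunks(dir string, records []string, classify func(string) int) error {
	for _, r := range records {
		name := filepath.Join(dir, fmt.Sprintf("chunk_%d.fastx", classify(r)))
		f, err := os.OpenFile(name, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
		if err != nil {
			return err
		}
		if _, err := fmt.Fprintln(f, r); err != nil {
			f.Close()
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
	return nil
}

// findChunks walks dir recursively and returns every .fastx file it finds.
// filepath.WalkDir visits entries in lexical order.
func findChunks(dir string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && filepath.Ext(path) == ".fastx" {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

// chunkOnDisk creates a temp directory, writes the chunks, counts the files
// that were produced, and removes the directory before returning.
func chunkOnDisk(records []string, classify func(string) int) (int, error) {
	dir, err := os.MkdirTemp("", "obiseq_chunks_*")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	if err := writeChunks(dir, records, classify); err != nil {
		return 0, err
	}
	files, err := findChunks(dir)
	return len(files), err
}

func main() {
	byLen := func(s string) int { return len(s) % 3 }
	n, err := chunkOnDisk([]string{"ACGT", "AC", "GGTT", "TT"}, byLen)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("chunk files:", n)
}
```

A real pipeline would hand each discovered file to a reader as soon as it is complete instead of only counting files.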
ISequenceChunkOnMemory
Performs in-memory parallel chunking of sequences into classification-based batches.
- Routes each sequence to a bucket (flux) using the classifier.
- Maintains one BioSequenceSlice per classification group in memory (thread-safe via mutex).
- Emits batches only after full input consumption, preserving deterministic batch order (0, 1, …).
- Parallel processing: each flux handled in its own goroutine.
- Fails fast on internal errors (e.g., channel issues) via log.Fatalf.
Ideal for RAM-sufficient workloads requiring low-latency, ordered batch output.
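The bucket-per-class idea can be illustrated with a mutex-protected map that is only turned into batches once the input is exhausted. The sketch below is an assumption-laden stand-in: chunkOnMemory, its string items, and the worker-pool layout are hypothetical and do not reproduce the package's actual goroutine-per-flux design.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// chunkOnMemory (hypothetical) drains input with several workers, routes each
// item to the bucket chosen by classify, and returns the buckets as batches
// ordered by class code once all input has been consumed.
func chunkOnMemory(input <-chan string, classify func(string) int, workers int) [][]string {
	var mu sync.Mutex
	buckets := make(map[int][]string)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range input {
				code := classify(s)
				mu.Lock() // the map is shared by all workers
				buckets[code] = append(buckets[code], s)
				mu.Unlock()
			}
		}()
	}
	wg.Wait() // nothing is emitted before the input is fully consumed

	codes := make([]int, 0, len(buckets))
	for c := range buckets {
		codes = append(codes, c)
	}
	sort.Ints(codes) // fixed batch order, independent of goroutine scheduling

	batches := make([][]string, 0, len(codes))
	for _, c := range codes {
		batches = append(batches, buckets[c])
	}
	return batches
}

func main() {
	in := make(chan string, 5)
	for _, s := range []string{"A", "CC", "G", "TT", "ACG"} {
		in <- s
	}
	close(in)

	byLen := func(s string) int { return len(s) }
	for i, b := range chunkOnMemory(in, byLen, 3) {
		fmt.Println("batch", i, "size", len(b))
	}
}
```

In this sketch the order of batches is fixed by the sorted class codes, while the order of items inside a batch depends on which worker reaches the lock first.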
Options System
Configurable pipeline behavior via functional options pattern.
- Immutable configuration builder: MakeOptions([]WithOption) applies setters to an internal struct.
- Key options:
  - Categorization: OptionSubCategory(...) appends sample/marker labels; PopCategories() retrieves the first.
  - Missing values: OptionNAValue(na) customizes the placeholder (default: "?").
  - Statistics: OptionStatOn(...) registers fields for metadata tracking.
  - Batching: OptionBatchCount(n) sets the number of batches (e.g., for hashing); OptionsBatchSize(size) defines items per batch.
  - Concurrency: OptionsParallelWorkers(n).
  - Sorting strategy: OptionSortOnDisk() enables disk-backed sorting; OptionSortOnMemory() (default) uses RAM-based sort.
  - Singleton filtering: OptionsNoSingleton() excludes singleton reads (count = 1); OptionsWithSingleton() allows them.
Defaults drawn from obidefault, ensuring reproducibility and ease of use.
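For readers unfamiliar with the functional options pattern, here is a minimal standalone sketch. The function names mirror those listed above, but the struct fields, the exact signatures, and the batchCount default of 100 are assumptions made for illustration; only the "?" placeholder default comes from the text.

```go
package main

import "fmt"

// options holds the configuration. Field names are hypothetical.
type options struct {
	naValue     string
	batchCount  int
	noSingleton bool
	categories  []string
}

// WithOption is a setter applied to the configuration under construction.
type WithOption func(*options)

// OptionNAValue sets the placeholder used for missing annotation values.
func OptionNAValue(na string) WithOption {
	return func(o *options) { o.naValue = na }
}

// OptionBatchCount sets the number of batches.
func OptionBatchCount(n int) WithOption {
	return func(o *options) { o.batchCount = n }
}

// OptionsNoSingleton excludes sequences observed only once.
func OptionsNoSingleton() WithOption {
	return func(o *options) { o.noSingleton = true }
}

// OptionSubCategory appends annotation keys used for categorization.
func OptionSubCategory(keys ...string) WithOption {
	return func(o *options) { o.categories = append(o.categories, keys...) }
}

// MakeOptions starts from defaults (assumed values here) and applies each
// setter in order, so later setters override earlier ones.
func MakeOptions(setters []WithOption) options {
	o := options{naValue: "?", batchCount: 100}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	o := MakeOptions([]WithOption{
		OptionNAValue("NA"),
		OptionSubCategory("sample", "primer"),
		OptionsNoSingleton(),
	})
	fmt.Printf("%+v\n", o)
}
```

Returning the struct by value keeps the built configuration independent of the setters that produced it, which is one way to get the immutability mentioned above.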
ISequenceSubChunk
Parallel, class-based sorting and re-batching of sequence batches.
- Input: iterator over BioSequenceBatch, classifier, and worker count.
- For each batch:
  - If size > 1: sequences are sorted in-place by classification code (via custom sort.Interface).
  - Consecutive sequences with the same class are regrouped into new batches.
- Uses atomic counters (nextOrder) to assign globally increasing order IDs across workers, ensuring deterministic inter-batch ordering.
- Preserves input order within each new batch.
Use case: preparing sorted, class-homogeneous batches for downstream tasks (e.g., consensus calling or alignment).
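The sort-then-regroup step can be sketched as follows. The item and batch types, the byCode sorter, and subChunk are hypothetical stand-ins for the package's sequence and batch types. The sketch uses sort.Stable so that items of the same class keep their input order, which is one way to obtain the within-batch ordering described above; the source does not say which sort the package uses.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// item pairs a sequence with its classification code (hypothetical type).
type item struct {
	seq  string
	code int
}

// batch is a class-homogeneous group with a globally unique order ID.
type batch struct {
	order int64
	items []item
}

// byCode implements sort.Interface, ordering items by classification code.
type byCode []item

func (b byCode) Len() int           { return len(b) }
func (b byCode) Less(i, j int) bool { return b[i].code < b[j].code }
func (b byCode) Swap(i, j int)      { b[i], b[j] = b[j], b[i] }

// subChunk sorts in in place by class code, then cuts it into runs of equal
// code. Each run receives the next value of the shared atomic counter.
func subChunk(in []item, nextOrder *int64) []batch {
	if len(in) > 1 {
		sort.Stable(byCode(in)) // stable: input order is kept within a class
	}
	var out []batch
	start := 0
	for i := 1; i <= len(in); i++ {
		if i == len(in) || in[i].code != in[start].code {
			id := atomic.AddInt64(nextOrder, 1) - 1
			out = append(out, batch{order: id, items: in[start:i]})
			start = i
		}
	}
	return out
}

func main() {
	var nextOrder int64
	in := []item{{"a", 2}, {"b", 1}, {"c", 2}, {"d", 1}}
	for _, b := range subChunk(in, &nextOrder) {
		fmt.Println(b.order, b.items)
	}
}
```

Because the counter is updated atomically, several workers could call subChunk concurrently on different batches and still receive distinct order IDs.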
IUniqueSequence
End-to-end dereplication pipeline: groups identical sequences, aggregates counts and metadata.
- Input iterator + optional Options.
- Parallelization via configurable workers (falls back to single-threaded if disk sorting is enabled).
- Splitting phase:
  - Uses HashClassifier to partition input deterministically (controlled by BatchCount).
- Storage selection:
  - In-memory: via ISequenceChunkOnMemory.
  - On-disk: uses ISequenceSubChunk + external sort (single worker required).
- Uniqueness logic:
  - Composite classifier: sequence identity + optional annotations (sample, primer).
  - NA handling for missing annotation fields.
- Singleton filtering: optionally excludes reads with count = 1 (NoSingleton()).
- Parallel deduplication:
  - Workers process chunks via ISequenceSubChunk + per-group aggregation.
- Merging:
  - Aggregates results via IMergeSequenceBatch, preserving counts, stats, and ordering.
Scalable from small datasets to terabyte-scale NGS runs.
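The overall shape of the pipeline (hash partitioning, composite-key counting, NA substitution, singleton filtering, merging) can be sketched in a few functions. Everything below is a simplified assumption: the read and unique types, the FNV hash, and the function names are hypothetical, the partitions are processed sequentially, and only a single sample annotation is tracked.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// read is an input sequence with an optional sample annotation.
type read struct {
	seq    string
	sample string // empty when the annotation is missing
}

// unique is a dereplicated sequence with its observation count.
type unique struct {
	seq, sample string
	count       int
}

// hashPartition splits reads into batchCount groups by hashing the sequence,
// so identical sequences always land in the same group.
func hashPartition(reads []read, batchCount int) [][]read {
	parts := make([][]read, batchCount)
	for _, r := range reads {
		h := fnv.New32a()
		h.Write([]byte(r.seq))
		i := int(h.Sum32() % uint32(batchCount))
		parts[i] = append(parts[i], r)
	}
	return parts
}

// dereplicate counts reads per (sequence, sample) key, substituting na for a
// missing sample, and optionally drops keys seen only once.
func dereplicate(reads []read, na string, noSingleton bool) []unique {
	type key struct{ seq, sample string }
	counts := make(map[key]int)
	var order []key // first-seen order of keys
	for _, r := range reads {
		s := r.sample
		if s == "" {
			s = na
		}
		k := key{r.seq, s}
		if counts[k] == 0 {
			order = append(order, k)
		}
		counts[k]++
	}
	var out []unique
	for _, k := range order {
		if noSingleton && counts[k] == 1 {
			continue
		}
		out = append(out, unique{k.seq, k.sample, counts[k]})
	}
	return out
}

// uniqueSequences partitions, dereplicates each partition, and merges.
func uniqueSequences(reads []read, batchCount int, na string, noSingleton bool) []unique {
	var out []unique
	for _, p := range hashPartition(reads, batchCount) {
		out = append(out, dereplicate(p, na, noSingleton)...)
	}
	return out
}

func main() {
	reads := []read{{"ACGT", "s1"}, {"ACGT", "s1"}, {"ACGT", ""}, {"GG", "s2"}}
	for _, u := range uniqueSequences(reads, 4, "?", false) {
		fmt.Printf("%s sample=%s count=%d\n", u.seq, u.sample, u.count)
	}
}
```

Partitioning on the sequence hash is what makes per-partition counting sufficient: no key can be split across two partitions, so the partitions could be handled by independent workers.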