# `obichunk`: High-Performance Chunking and Dereplication of Biological Sequences The `obichunk` package provides scalable, configurable infrastructure for preprocessing large-scale biological sequence data (e.g., FASTA/FASTQ). It enables efficient grouping, sorting, deduplication, and batched streaming of sequences—critical for metabarcoding, metagenomics, or any high-throughput NGS workflow. --- ## Core Functionalities ### `ISequenceChunk` Unified entry point for sequence chunking, supporting both **in-memory** and **on-disk** execution modes. - Accepts an `obiiter.IBioSequence` iterator and a classifier (`obiseq.BioSequenceClassifier`). - Mode selection via `onMemory` flag: routes to either `ISequenceChunkOnMemory` or `ISequenceChunkOnDisk`. - Optional parameters: - `dereplicate`: deduplicate identical sequences per batch. - `na`: defines placeholder for missing/ambiguous characters (e.g., `"N"`, `"?"`). - `statsOn`: enables metadata tracking (e.g., sample IDs, primer names) for statistics. - `uniqueClassifier`: optional secondary classifier to assign unique labels. Returns an iterator over processed sequences (`obiiter.IBioSequence`), supporting streaming pipelines and downstream integration. --- ### `ISequenceChunkOnDisk` Efficiently splits sequences into **temporary on-disk batches** (`.fastx` files), ideal for large datasets. - Automatically manages a temp directory (`obiseq_chunks_*`) and cleans up post-processing. - Uses `find` to discover all generated chunk files recursively. - Asynchronous streaming: batches are yielded via an iterator as they’re written, decoupling production and consumption. - Optional per-batch dereplication using composite keys (sequence + classification). - Logs batch count and start events for monitoring. Internally leverages: - `obiiter.MakeIBioSequence()` to build output iterator. - `obiformats.WriterDispatcher` for parallel file writing. - A dedicated goroutine to read, classify/dereplicate, and emit batches. --- ### `ISequenceChunkOnMemory` Performs **in-memory parallel chunking** of sequences into classification-based batches. - Routes each sequence to a bucket (flux) using the classifier. - Maintains one `BioSequenceSlice` per classification group in memory (thread-safe via mutex). - Emits batches **only after full input consumption**, preserving deterministic batch order (0, 1, …). - Parallel processing: each flux handled in its own goroutine. - Fails fast on internal errors (e.g., channel issues) via `log.Fatalf`. Ideal for RAM-sufficient workloads requiring low-latency, ordered batch output. --- ### `Options` System Configurable pipeline behavior via functional options pattern. - Immutable configuration builder: `MakeOptions([]WithOption)` applies setters to internal struct. - Key options: - **Categorization**: `OptionSubCategory(...)` appends sample/marker labels; `PopCategories()` retrieves first. - **Missing values**: `OptionNAValue(na)` customizes placeholder (default: `"?"`). - **Statistics**: `OptionStatOn(...)` registers fields for metadata tracking. - **Batching**: - `OptionBatchCount(n)` sets number of batches (e.g., for hashing). - `OptionsBatchSize(size)` defines items per batch. - **Concurrency**: `OptionsParallelWorkers(n)`. - **Sorting strategy**: - `OptionSortOnDisk()` enables disk-backed sorting. - `OptionSortOnMemory()` (default) uses RAM-based sort. - **Singleton filtering**: - `OptionsNoSingleton()` excludes singleton reads (count = 1). - `OptionsWithSingleton()` allows them. Defaults drawn from `obidefault`, ensuring reproducibility and ease of use. --- ### `ISequenceSubChunk` Parallel, class-based sorting and re-batching of sequence batches. - Input: iterator over `BioSequenceBatch`, classifier, and worker count. - For each batch: - If size >1: sequences are sorted *in-place* by classification code (via custom `sort.Interface`). - Consecutive sequences with same class are regrouped into new batches. - Uses atomic counters (`nextOrder`) to assign globally increasing order IDs across workers—ensuring deterministic inter-batch ordering. - Preserves input-order *within* each new batch. Use case: preparing sorted, class-homogeneous batches for downstream tasks (e.g., consensus calling or alignment). --- ### `IUniqueSequence` End-to-end **dereplication** pipeline: groups identical sequences, aggregates counts and metadata. - Input iterator + optional `Options`. - Parallelization via configurable workers (falls back to single-threaded if disk sorting enabled). - **Splitting phase**: - Uses `HashClassifier` to partition input deterministically (controlled by `BatchCount`). - **Storage selection**: - In-memory: via `ISequenceChunkOnMemory`. - On-disk: uses `ISequenceSubChunk` + external sort (single worker required). - **Uniqueness logic**: - Composite classifier: sequence identity + optional annotations (sample, primer). - NA handling for missing annotation fields. - **Singleton filtering**: optionally excludes reads with count =1 (`NoSingleton()`). - **Parallel deduplication**: - Workers process chunks via `ISequenceSubChunk` + per-group aggregation. - **Merging**: - Aggregates results via `IMergeSequenceBatch`, preserving counts, stats, and ordering. Scalable from small datasets to terabyte-scale NGS runs.