Files
obitools4/autodoc/docmd/pkg_obichunk.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

5.2 KiB
Raw Blame History

obichunk: High-Performance Chunking and Dereplication of Biological Sequences

The obichunk package provides scalable, configurable infrastructure for preprocessing large-scale biological sequence data (e.g., FASTA/FASTQ). It enables efficient grouping, sorting, deduplication, and batched streaming of sequences—critical for metabarcoding, metagenomics, or any high-throughput NGS workflow.


Core Functionalities

ISequenceChunk

Unified entry point for sequence chunking, supporting both in-memory and on-disk execution modes.

  • Accepts an obiiter.IBioSequence iterator and a classifier (obiseq.BioSequenceClassifier).
  • Mode selection via onMemory flag: routes to either ISequenceChunkOnMemory or ISequenceChunkOnDisk.
  • Optional parameters:
    • dereplicate: deduplicate identical sequences per batch.
    • na: defines placeholder for missing/ambiguous characters (e.g., "N", "?").
    • statsOn: enables metadata tracking (e.g., sample IDs, primer names) for statistics.
    • uniqueClassifier: optional secondary classifier to assign unique labels.

Returns an iterator over processed sequences (obiiter.IBioSequence), supporting streaming pipelines and downstream integration.


ISequenceChunkOnDisk

Efficiently splits sequences into temporary on-disk batches (.fastx files), ideal for large datasets.

  • Automatically manages a temp directory (obiseq_chunks_*) and cleans up post-processing.
  • Uses find to discover all generated chunk files recursively.
  • Asynchronous streaming: batches are yielded via an iterator as theyre written, decoupling production and consumption.
  • Optional per-batch dereplication using composite keys (sequence + classification).
  • Logs batch count and start events for monitoring.

Internally leverages:

  • obiiter.MakeIBioSequence() to build output iterator.
  • obiformats.WriterDispatcher for parallel file writing.
  • A dedicated goroutine to read, classify/dereplicate, and emit batches.

ISequenceChunkOnMemory

Performs in-memory parallel chunking of sequences into classification-based batches.

  • Routes each sequence to a bucket (flux) using the classifier.
  • Maintains one BioSequenceSlice per classification group in memory (thread-safe via mutex).
  • Emits batches only after full input consumption, preserving deterministic batch order (0, 1, …).
  • Parallel processing: each flux handled in its own goroutine.
  • Fails fast on internal errors (e.g., channel issues) via log.Fatalf.

Ideal for RAM-sufficient workloads requiring low-latency, ordered batch output.


Options System

Configurable pipeline behavior via functional options pattern.

  • Immutable configuration builder: MakeOptions([]WithOption) applies setters to internal struct.
  • Key options:
    • Categorization: OptionSubCategory(...) appends sample/marker labels; PopCategories() retrieves first.
    • Missing values: OptionNAValue(na) customizes placeholder (default: "?").
    • Statistics: OptionStatOn(...) registers fields for metadata tracking.
    • Batching:
      • OptionBatchCount(n) sets number of batches (e.g., for hashing).
      • OptionsBatchSize(size) defines items per batch.
    • Concurrency: OptionsParallelWorkers(n).
    • Sorting strategy:
      • OptionSortOnDisk() enables disk-backed sorting.
      • OptionSortOnMemory() (default) uses RAM-based sort.
    • Singleton filtering:
      • OptionsNoSingleton() excludes singleton reads (count = 1).
      • OptionsWithSingleton() allows them.

Defaults drawn from obidefault, ensuring reproducibility and ease of use.


ISequenceSubChunk

Parallel, class-based sorting and re-batching of sequence batches.

  • Input: iterator over BioSequenceBatch, classifier, and worker count.
  • For each batch:
    • If size >1: sequences are sorted in-place by classification code (via custom sort.Interface).
    • Consecutive sequences with same class are regrouped into new batches.
  • Uses atomic counters (nextOrder) to assign globally increasing order IDs across workers—ensuring deterministic inter-batch ordering.
  • Preserves input-order within each new batch.

Use case: preparing sorted, class-homogeneous batches for downstream tasks (e.g., consensus calling or alignment).


IUniqueSequence

End-to-end dereplication pipeline: groups identical sequences, aggregates counts and metadata.

  • Input iterator + optional Options.
  • Parallelization via configurable workers (falls back to single-threaded if disk sorting enabled).
  • Splitting phase:
    • Uses HashClassifier to partition input deterministically (controlled by BatchCount).
  • Storage selection:
    • In-memory: via ISequenceChunkOnMemory.
    • On-disk: uses ISequenceSubChunk + external sort (single worker required).
  • Uniqueness logic:
    • Composite classifier: sequence identity + optional annotations (sample, primer).
    • NA handling for missing annotation fields.
  • Singleton filtering: optionally excludes reads with count =1 (NoSingleton()).
  • Parallel deduplication:
    • Workers process chunks via ISequenceSubChunk + per-group aggregation.
  • Merging:
    • Aggregates results via IMergeSequenceBatch, preserving counts, stats, and ordering.

Scalable from small datasets to terabyte-scale NGS runs.