
obichunk Package: On-Disk Chunking and Dereplication of Biosequences

The obichunk package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the ISequenceChunkOnDisk function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one batch (e.g., chunk_*.fastx), enabling scalable, parallel-friendly workflows.
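The flow described above can be sketched with the Go standard library alone. This is an illustration under stated assumptions, not the obichunk implementation: `record`, `chunkToDisk`, and the `byLength` classifier are hypothetical names, records are written in a plain FASTA-like form rather than through obiformats, and the classifier key is assumed to be safe to use in a file name.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// record is a hypothetical stand-in for a biological sequence.
type record struct {
	ID  string
	Seq string
}

// chunkToDisk writes each record to the chunk file selected by classify,
// then walks the temporary directory to list the .fastx files it produced.
// The caller is responsible for removing dir when done.
func chunkToDisk(recs []record, classify func(record) string) (dir string, files []string, err error) {
	dir, err = os.MkdirTemp("", "obiseq_chunks_")
	if err != nil {
		return "", nil, err
	}

	// One open file per classifier key; close errors are ignored in this sketch.
	writers := map[string]*os.File{}
	defer func() {
		for _, f := range writers {
			f.Close()
		}
	}()

	for _, r := range recs {
		key := classify(r)
		f, ok := writers[key]
		if !ok {
			f, err = os.Create(filepath.Join(dir, "chunk_"+key+".fastx"))
			if err != nil {
				return dir, nil, err
			}
			writers[key] = f
		}
		if _, err = fmt.Fprintf(f, ">%s\n%s\n", r.ID, r.Seq); err != nil {
			return dir, nil, err
		}
	}

	// Recursive discovery of the chunk files, in lexical order.
	err = filepath.WalkDir(dir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !d.IsDir() && filepath.Ext(path) == ".fastx" {
			files = append(files, path)
		}
		return nil
	})
	return dir, files, err
}

func main() {
	recs := []record{{"s1", "ACGT"}, {"s2", "AC"}, {"s3", "TTGA"}}
	// Hypothetical classifier: group sequences by length.
	byLength := func(r record) string { return fmt.Sprintf("len%d", len(r.Seq)) }

	dir, files, err := chunkToDisk(recs, byLength)
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir) // clean up the temporary directory

	for _, f := range files {
		fmt.Println(filepath.Base(f))
	}
}
```

Because every record goes straight to its chunk file, memory use depends on the number of distinct classifier keys (one open file each), not on the number of sequences.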

Key capabilities include:

  • Temporary Directory Management: Automatically creates and cleans up a system temp directory (obiseq_chunks_*) for intermediate storage.
  • File Discovery: Recursively finds all .fastx files generated during chunking via the internal find helper.
  • Asynchronous Streaming: Returns an iterator (obiiter.IBioSequence) that yields batches asynchronously, decoupling chunk creation from consumption.
  • Optional Dereplication: When enabled (dereplicate = true), sequences are deduplicated per batch using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
  • Logging & Monitoring: Logs total batch count and per-batch processing start events for transparency.
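
The composite-key dereplication mentioned in the list above can be illustrated as follows. This is a minimal sketch, not the package's code: `seqRecord` and `dereplicate` are hypothetical names, a single `Sample` field stands in for the classification categories, and summing `Count` is only one example of an aggregated statistic.

```go
package main

import "fmt"

// seqRecord is a hypothetical stand-in for an annotated sequence.
type seqRecord struct {
	Seq    string
	Sample string // example classification category
	Count  int    // number of reads represented by this record
}

// dereplicate merges records that share the same sequence and category.
// The first occurrence is kept and the counts of later duplicates are added to it.
func dereplicate(batch []seqRecord) []seqRecord {
	index := map[string]int{} // composite key -> position in out
	out := make([]seqRecord, 0, len(batch))
	for _, r := range batch {
		// A NUL separator keeps ("AB", "C") distinct from ("A", "BC").
		key := r.Seq + "\x00" + r.Sample
		if i, seen := index[key]; seen {
			out[i].Count += r.Count
			continue
		}
		index[key] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	batch := []seqRecord{
		{"ACGT", "A", 1},
		{"ACGT", "A", 2},
		{"ACGT", "B", 1},
	}
	for _, r := range dereplicate(batch) {
		fmt.Printf("%s sample=%s count=%d\n", r.Seq, r.Sample, r.Count)
	}
}
```

Here identical sequences from different samples stay separate, because the category is part of the key; only true duplicates within a batch are merged.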

Internally, ISequenceChunkOnDisk uses:

  • obiiter.MakeIBioSequence() to build the output iterator,
  • obiformats.WriterDispatcher for parallel writing of distributed sequences into chunk files,
  • and a second goroutine to read, optionally dereplicate (via BioSequenceClassifier), and push batches back into the output iterator.
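
The producer side of this design can be sketched with a goroutine and a channel. This is a simplified analogy, not the obiiter API: `batch`, `streamBatches`, and the `load` callback are hypothetical, and a plain Go channel stands in for the output iterator.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for an ordered batch of sequences.
type batch struct {
	Order int
	Seqs  []string
}

// streamBatches starts a goroutine that loads one chunk at a time and sends
// it on the returned channel, which is closed after the last chunk.
func streamBatches(chunks []string, load func(name string) []string) <-chan batch {
	out := make(chan batch)
	go func() {
		defer close(out)
		for order, name := range chunks {
			// The send blocks until the consumer is ready, so the
			// producer never runs far ahead of consumption.
			out <- batch{Order: order, Seqs: load(name)}
		}
	}()
	return out
}

func main() {
	// Hypothetical in-memory "files"; a real loader would read from disk.
	store := map[string][]string{
		"chunk_a.fastx": {"ACGT", "ACGT"},
		"chunk_b.fastx": {"TTGA"},
	}
	load := func(name string) []string { return store[name] }

	for b := range streamBatches([]string{"chunk_a.fastx", "chunk_b.fastx"}, load) {
		fmt.Printf("batch %d: %d sequences\n", b.Order, len(b.Seqs))
	}
}
```

The caller gets the channel back immediately and can start consuming while later chunks are still being loaded, which is the decoupling the asynchronous iterator provides.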

Designed for memory efficiency, it avoids loading all sequences into RAM by streaming them through on-disk chunks, which makes it well suited to large-scale NGS data preprocessing.