mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
1.7 KiB
1.7 KiB
obichunk Package: On-Disk Chunking and Dereplication of Biosequences
The obichunk package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the ISequenceChunkOnDisk function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one batch (e.g., chunk_*.fastx), enabling scalable, parallel-friendly workflows.
Key capabilities include:
- Temporary Directory Management: Automatically creates and cleans up a system temp directory (
obiseq_chunks_*) for intermediate storage. - File Discovery: Recursively finds all
.fastxfiles generated during chunking viafind. - Asynchronous Streaming: Returns an iterator (
obiiter.IBioSequence) that yields batches asynchronously, decoupling chunk creation from consumption. - Optional Dereplication: When enabled (
dereplicate = true), sequences are deduplicated per batch using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics. - Logging & Monitoring: Logs total batch count and per-batch processing start events for transparency.
Internally, ISequenceChunkOnDisk uses:
obiiter.MakeIBioSequence()to build the output iterator,obiformats.WriterDispatcherfor parallel writing of distributed sequences into chunk files,- and a second goroutine to read, optionally dereplicate (via
BioSequenceClassifier), and push batches back into the output iterator.
Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.