mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
19 lines
1.7 KiB
Markdown
19 lines
1.7 KiB
Markdown
|
|
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences
|
||
|
|
|
||
|
|
The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
|
||
|
|
|
||
|
|
Key capabilities include:
|
||
|
|
|
||
|
|
- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
|
||
|
|
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
|
||
|
|
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
|
||
|
|
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
|
||
|
|
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
|
||
|
|
|
||
|
|
Internally, `ISequenceChunkOnDisk` uses:
|
||
|
|
- `obiiter.MakeIBioSequence()` to build the output iterator,
|
||
|
|
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
|
||
|
|
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
|
||
|
|
|
||
|
|
Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
|