# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences

The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.

Key capabilities include:

- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.

Internally, `ISequenceChunkOnDisk` uses:
- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.

Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.