# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences

The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
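The distribution step described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the `Seq` type, the `classify` function (which buckets by the first two bases), and the `writeChunks` and `demo` helpers are all hypothetical names, and naming each file after its class key is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Seq is a hypothetical stand-in for a biological sequence record.
type Seq struct {
	ID       string
	Sequence string
}

// classify is a hypothetical classifier: it buckets a sequence by its
// first two bases, or "short" when the sequence has fewer than two.
func classify(s Seq) string {
	if len(s.Sequence) < 2 {
		return "short"
	}
	return s.Sequence[:2]
}

// writeChunks appends each sequence, in FASTA layout, to the chunk file
// of its class and returns the created file paths keyed by class.
func writeChunks(dir string, seqs []Seq) (map[string]string, error) {
	files := map[string]*os.File{}
	paths := map[string]string{}
	defer func() {
		for _, f := range files {
			f.Close()
		}
	}()
	for _, s := range seqs {
		key := classify(s)
		f, ok := files[key]
		if !ok {
			path := filepath.Join(dir, "chunk_"+key+".fastx")
			created, err := os.Create(path)
			if err != nil {
				return nil, err
			}
			f = created
			files[key] = f
			paths[key] = path
		}
		if _, err := fmt.Fprintf(f, ">%s\n%s\n", s.ID, s.Sequence); err != nil {
			return nil, err
		}
	}
	return paths, nil
}

// demo chunks three sequences into a temporary directory, removes the
// directory afterwards, and reports how many chunk files were written.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "obiseq_chunks_")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	paths, err := writeChunks(dir, []Seq{{"a", "ACGT"}, {"b", "ACCC"}, {"c", "GGTA"}})
	if err != nil {
		return 0, err
	}
	return len(paths), nil
}

func main() {
	n, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("chunk files written:", n)
}
```

The sketch keeps one open file per class, which is only reasonable when the classifier yields a modest number of classes.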

Key capabilities include:

- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
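The temporary-directory and file-discovery capabilities can be illustrated as follows. This is a sketch under assumptions: `findByExt` and `demo` are hypothetical names, and the package's own `find` may differ in signature and behavior. Only standard-library calls are used (`os.MkdirTemp`, `filepath.WalkDir`).

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// findByExt walks root recursively and returns every non-directory entry
// whose extension equals ext (for example ".fastx"), in lexical walk order.
func findByExt(root, ext string) ([]string, error) {
	var found []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && filepath.Ext(p) == ext {
			found = append(found, p)
		}
		return nil
	})
	return found, err
}

// demo creates a temporary directory, fills it with sample files,
// discovers the chunk files, and removes the directory on return.
// Paths are returned relative to the temporary directory.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "obiseq_chunks_")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	if err := os.Mkdir(filepath.Join(dir, "nested"), 0o755); err != nil {
		return nil, err
	}
	for _, name := range []string{"chunk_1.fastx", "nested/chunk_2.fastx", "notes.txt"} {
		full := filepath.Join(dir, filepath.FromSlash(name))
		if err := os.WriteFile(full, []byte(">s1\nACGT\n"), 0o644); err != nil {
			return nil, err
		}
	}

	files, err := findByExt(dir, ".fastx")
	if err != nil {
		return nil, err
	}
	rel := make([]string, 0, len(files))
	for _, f := range files {
		r, err := filepath.Rel(dir, f)
		if err != nil {
			return nil, err
		}
		rel = append(rel, filepath.ToSlash(r))
	}
	return rel, nil
}

func main() {
	files, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(files)
}
```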

Internally, `ISequenceChunkOnDisk` uses:
- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
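The reading side of this pipeline might look like the sketch below, which uses a plain channel in place of `obiiter.IBioSequence`. All names here (`Record`, `dereplicate`, `streamBatches`) are hypothetical, and summing a `Count` field is only an assumption about what "aggregated statistics" means. Batches are held in memory for brevity, whereas the real function reads them back from chunk files.

```go
package main

import "fmt"

// Record is a hypothetical sequence record with one classification
// category (for example a sample name) and an abundance count.
type Record struct {
	Sequence string
	Category string
	Count    int
}

// dereplicate merges records that share the composite key
// (sequence, category), summing their counts and keeping the
// order in which each key was first seen.
func dereplicate(batch []Record) []Record {
	index := map[[2]string]int{}
	var out []Record
	for _, r := range batch {
		key := [2]string{r.Sequence, r.Category}
		if i, ok := index[key]; ok {
			out[i].Count += r.Count
			continue
		}
		index[key] = len(out)
		out = append(out, r)
	}
	return out
}

// streamBatches returns a channel immediately and fills it from a
// goroutine, so the consumer can start before all batches are processed.
func streamBatches(chunks [][]Record, derep bool) <-chan []Record {
	out := make(chan []Record)
	go func() {
		defer close(out)
		for _, batch := range chunks {
			if derep {
				batch = dereplicate(batch)
			}
			out <- batch
		}
	}()
	return out
}

func main() {
	chunks := [][]Record{
		{{"ACGT", "s1", 1}, {"ACGT", "s1", 2}, {"ACGT", "s2", 1}},
		{{"GGTA", "s1", 1}},
	}
	for batch := range streamBatches(chunks, true) {
		fmt.Println(batch)
	}
}
```

Because dereplication runs per batch, identical sequences that land in different batches are not merged. In this sketch that is harmless only if the classifier sends every record with the same key to the same batch.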

Designed for memory efficiency, it streams sequences and chunks them on disk rather than loading them all into RAM, which makes it well suited to large-scale NGS data preprocessing.