Files
obitools4/autodoc/docmd/pkg_obitools_obidistribute.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

3.2 KiB

obidistribute Package: Semantic Description of Public Functionalities

The obidistribute package enables flexible, scalable distribution of biological sequence data into multiple output files or directories. It supports annotation-based separation (e.g., sample IDs), batch splitting, and hash-sharded distribution — all while integrating with standard NGS formats (FASTA/FASTQ) and the broader obitools4 ecosystem.

Core Functionalities

1. Sequence Distribution Strategy Selection

  • CLIOutputFormat()
    Specifies output format: "fastq", "fasta", or generic sequence (e.g., FASTA-like). Controls how sequences are serialized to disk.

  • CLISequenceClassifier()
    Selects the annotation key used for classification (e.g., "sample_id", "taxon"). Sequences are grouped by the value of this annotation field.

  • CLIDistributeByBatches() / -n
    Enables round-robin assignment of sequences into N fixed batches, regardless of metadata.

  • CLIDistributeByHash() / -H
    Distributes sequences deterministically into N batches using a hash of the sequence ID or annotation — ensures reproducible sharding.

  • CLIDirectoryMode() / -d
    When used with a classifier, organizes output files into subdirectories named after classification values.

2. Output Naming & File Management

  • CLIFileNamePattern() / -p
    Defines a printf-style filename template (e.g., "sample_%s.fastq"), where %s is replaced by the classifier value or batch index.

  • CLIAppendSequences() / -A
    Enables appending to existing files instead of overwriting them.

  • CLINAValue() / --na-value
    Sets the fallback label (default: "NA") for sequences missing a classifier annotation.

3. Processing Configuration

  • CLIDistributeSequence()
    Main entry point: orchestrates input iteration, classification, batching, and parallelized writing. Accepts an obiiter.IBioSequence iterator.

  • Parallel workers are derived from obidefault.ParallelWorkers() (minimum 2), divided by four.

  • Batch size and compression settings are inherited from obidefault.

4. Header & Format Handling

  • CLIOutputFastHeaderFormat()
    Configures header serialization format: "json" (default) or "obi". Controls metadata inclusion in output headers.

  • Built on top of obiconvert.CLIOutputFastHeaderFormat() and integrates with its header parsing/writing logic.

5. Validation & CLI Integration

  • Uses getoptions for option parsing; enforces mutual exclusivity of distribution modes.
  • Validates filename pattern syntax and required arguments at startup.

Semantic Workflow

  1. User selects a distribution mode (classifier, batches, or hash) and optional directory nesting.
  2. Sequences are read via an iterator; each is classified or assigned to a batch/shard.
  3. Sequences are buffered in batches, compressed if configured, and written to output files using the selected format.
  4. Filenames are generated dynamically per classification value or batch index, respecting append mode and NA fallbacks.

This module is essential for demultiplexing, batch processing, and scalable data management in high-throughput sequencing pipelines — especially metabarcoding workflows.