mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
# Functional Overview of the `obicsv` Package

The `obicsv` package exports biological sequence data (e.g., read from FASTA/FASTQ files) to CSV. It supports selective column inclusion, parallel batch processing, gzip compression, and CLI integration, which makes it suited to high-throughput NGS pipelines.

## Core Capabilities

| **Domain** | **Features** |
|-----------|--------------|
| **Column Selection & Formatting** | Toggle output fields (`CSVId`, `CSVSequence`, `CSVTaxon`, etc.); define custom attributes via `CSVKey`/`CSVKeys`; set separator (`CSVSeparator`) and NA placeholder (`CSVNAValue`). |
| **I/O & File Handling** | Write to stdout or file (append/truncate); support gzip compression (`OptionsCompressed`); configure batch size and full-file batching. |
| **Processing Strategy** | Parallel workers (default: `obidefault.ParallelWorkers()`); unordered iteration (`NoOrder`); progress tracking; skip empty sequences. |
| **Metadata Enrichment** | Auto-detect columns (`CSVAutoColumn`); integrate `obipairing`, taxonomic data, and abundance counts; support Phred+shifted quality scores. |
| **CLI Integration** | Command-line flags (`--ids`, `--sequence`, `--taxon`, etc.); extendable via helper functions (`CLIPrintId()`, `CLIHasToBeKeptAttributes()`). |

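To make the column-selection row of the table more concrete, here is a minimal sketch of how optional columns, custom keys, a separator, and an NA placeholder can combine into CSV rows. It uses only Go's standard `encoding/csv` package. The `record` type and the `toRow` and `render` helpers are hypothetical names chosen for illustration; they are not part of `obicsv`.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// record is a hypothetical stand-in for a sequence and its annotations.
type record struct {
	ID         string
	Sequence   string
	Attributes map[string]string
}

// toRow builds one CSV row: optional id and sequence columns, then one
// column per requested key, substituting naValue when a key is absent.
func toRow(r record, withID, withSeq bool, keys []string, naValue string) []string {
	row := []string{}
	if withID {
		row = append(row, r.ID)
	}
	if withSeq {
		row = append(row, r.Sequence)
	}
	for _, k := range keys {
		v, ok := r.Attributes[k]
		if !ok {
			v = naValue
		}
		row = append(row, v)
	}
	return row
}

// render writes the rows with the given separator and returns the CSV text.
func render(rows [][]string, sep rune) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	w.Comma = sep
	if err := w.WriteAll(rows); err != nil { // WriteAll also flushes
		return "", err
	}
	return sb.String(), nil
}

func main() {
	recs := []record{
		{ID: "seq1", Sequence: "acgt", Attributes: map[string]string{"count": "3"}},
		{ID: "seq2", Sequence: "ttga", Attributes: map[string]string{}},
	}
	rows := [][]string{{"id", "sequence", "count"}}
	for _, r := range recs {
		rows = append(rows, toRow(r, true, true, []string{"count"}, "NA"))
	}
	out, err := render(rows, ';')
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

In this sketch the second record has no `count` attribute, so its last column falls back to the placeholder instead of leaving the row short.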
## Public API Summary

- **`MakeOptions([]WithOption)`**
  Builder-style configuration of export behavior. Supported options include `CSVId`, `CSVTaxon`, `OptionFileName`, and `OptionsAppendFile`.

- **`NewCSVSequenceIterator(IBioSequence, ...WithOption)`**
  Wraps a sequence iterator into an asynchronous CSV record stream. Launches parallel workers, handles batching, and auto-detects attributes when enabled.

- **`CSVSequenceHeader(Options)`**
  Generates a CSV header row based on enabled columns and custom keys.

- **`CSVBatchFromSequences(BioSequenceBatch, Options)`**
  Converts a batch of sequences into `CSVRecord` entries according to the configured options.

- **`WriteCSV(ICSVRecord, io.WriteCloser)`**
  Writes CSV data to any writer, with compression and parallelization support.

- **`WriteCSVToStdout()`, `WriteCSVToFile()`**
  Convenience wrappers for common I/O targets.

- **`FormatCVSBatch(CSVRecordBatch, string)`**
  Renders a batch of records as an in-memory CSV buffer (the header is prepended only for the first chunk).

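The `MakeOptions([]WithOption)` entry describes what is commonly called the functional options pattern. The following is a generic sketch of that pattern, not the package's actual implementation: the `options` struct, its fields, the default values, and the `set…` helpers are all assumptions made for illustration.

```go
package main

import "fmt"

// options is a hypothetical settings struct; the field names are illustrative.
type options struct {
	fileName   string
	appendFile bool
	withID     bool
	separator  string
	naValue    string
}

// withOption mutates an options value; each setter below returns one.
type withOption func(*options)

func setFileName(name string) withOption { return func(o *options) { o.fileName = name } }
func setAppendFile(b bool) withOption    { return func(o *options) { o.appendFile = b } }
func setWithID(b bool) withOption        { return func(o *options) { o.withID = b } }
func setSeparator(s string) withOption   { return func(o *options) { o.separator = s } }

// makeOptions starts from assumed defaults and applies the setters in order,
// so a later setter overrides an earlier one for the same field.
func makeOptions(setters []withOption) options {
	o := options{fileName: "-", withID: true, separator: ",", naValue: "NA"}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	o := makeOptions([]withOption{
		setFileName("results.csv"),
		setWithID(false),
		setSeparator(";"),
		setAppendFile(true),
	})
	fmt.Printf("%+v\n", o)
}
```

One appeal of this design is that new settings can be added as new setter functions without changing the signature of the constructor or of any function that accepts a variadic list of options.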
## Design Principles

- **Streaming & Laziness**: Uses iterator patterns to avoid loading the full dataset into memory.
- **Parallelism**: Producer-consumer model with configurable concurrency (minimum of 2 workers).
- **Resilience**: Missing fields are filled with a configurable NA value.
- **Extensibility**: Supports dynamic attributes (e.g., `obipairing` expands to 8 fields).

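The streaming and parallelism principles can be sketched with plain Go channels. This is an illustrative producer-consumer pipeline under stated assumptions, not the package's code: the `batch` type and the `convert` function are hypothetical, and the floor of two workers simply mirrors the note in the list above.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// batch is a hypothetical unit of work: an order index plus raw sequences.
type batch struct {
	order int
	seqs  []string
}

// convert runs nworkers goroutines that turn batches into CSV-like lines.
// Results arrive in completion order, not input order.
func convert(in <-chan batch, nworkers int) <-chan string {
	if nworkers < 2 {
		nworkers = 2 // assumed floor, mirroring the "minimum of 2 workers" note
	}
	out := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < nworkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range in {
				out <- fmt.Sprintf("%d,%s", b.order, strings.Join(b.seqs, "|"))
			}
		}()
	}
	// Close the output once every worker has drained the input.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	in := make(chan batch)
	// Producer: emits batches one at a time, then closes the channel.
	go func() {
		defer close(in)
		for i := 0; i < 4; i++ {
			in <- batch{order: i, seqs: []string{"acgt", "ttga"}}
		}
	}()
	for line := range convert(in, 1) {
		fmt.Println(line)
	}
}
```

Because the channels are unbuffered, the producer only advances as fast as workers consume, so at most a few batches are in flight at once. The order index carried by each batch is what would let a downstream stage restore input order when that matters.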
## Usage Example

```go
// Collect the export options once so they can be reused.
opts := []WithOption{
	OptionFileName("results.csv"),
	CSVId(true),
	CSVTaxon(false),
	OptionsAppendFile(true),
}

// NewCSVSequenceIterator takes variadic WithOption values (see the API
// summary), so the slice is expanded instead of passing a built Options value.
iter := NewCSVSequenceIterator(sourceIter, opts...)
WriteCSV(iter, os.Stdout) // or file
```