mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,33 @@
# Functional Overview of the `obicsv` Package

The `obicsv` package provides a flexible and configurable interface for processing biological sequence data (e.g., FASTA/FASTQ) with support for CSV export and parallelized batch processing.

## Core Concepts

- **Options Pattern**: Uses functional options: `MakeOptions([]WithOption)` applies a slice of `WithOption` setters to configure behavior.
- **Configurable Processing**: Supports batch size, parallel workers, file I/O mode (append/new), compression handling, and progress tracking.
- **Selective CSV Export**: Fine-grained control over output columns (ID, sequence, quality, taxon, count, definition) and formatting (separator, NA value, custom keys).
- **Default Integration**: Leverages `obidefault` for sensible defaults (e.g., batch size, parallel workers).
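
The options pattern described above can be sketched in a few lines. This is a self-contained stand-in, not the package's source: the `options` struct, its field names, and the default values are assumptions made for illustration; only the names `MakeOptions`, `WithOption`, `OptionFileName`, `CSVId`, and `CSVNAValue` come from the text.

```go
package main

import "fmt"

// options is a stand-in for the package's configuration.
// The field names are assumptions made for this sketch.
type options struct {
	fileName string
	csvID    bool
	naValue  string
}

// WithOption is a setter that mutates the configuration in place.
type WithOption func(*options)

func OptionFileName(name string) WithOption { return func(o *options) { o.fileName = name } }
func CSVId(include bool) WithOption         { return func(o *options) { o.csvID = include } }
func CSVNAValue(na string) WithOption       { return func(o *options) { o.naValue = na } }

// MakeOptions starts from defaults (illustrative values) and applies
// each setter in order, so later setters override earlier ones.
func MakeOptions(setters []WithOption) options {
	o := options{csvID: true, naValue: "NA"}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	opt := MakeOptions([]WithOption{
		OptionFileName("output.csv"),
		CSVId(false),
	})
	fmt.Printf("%+v\n", opt)
}
```

The benefit of this shape is that callers only mention the settings they want to change; everything else keeps its default.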

## Key Functionalities

| Category | Features |
|---------|----------|
| **I/O Control** | File name, append vs. overwrite (`OptionsAppendFile`), file closing (`OptionCloseFile`), compression support (`OptionsCompressed`) |
| **Processing Strategy** | Batch size, full-file batch mode (`FullFileBatch`), parallel workers (`ParallelWorkers`), unordered processing (`NoOrder`) |
| **Data Handling** | Skip empty sequences (`SkipEmptySequence`), progress bar display, source tracking |
| **CSV Output Customization** | Toggle columns (`CSVId`, `CSVSequence`, etc.), custom keys via `CSVKey`/`CSVKeys`, separator (`CSVSeparator`) and NA placeholder (`CSVNAValue`), auto-column detection |

## Usage Example

```go
opt := MakeOptions([]WithOption{
    OptionFileName("output.csv"),
    CSVId(true),
    CSVSequence(true),
    CSVTaxon(false),
    OptionsAppendFile(true),
})
```

This package enables efficient, customizable conversion of biological sequence data to structured CSV format with minimal boilerplate.
@@ -0,0 +1,23 @@
# `obicsv` Package: CSV Export Functionality for Biological Sequences

This Go package provides utilities to serialize biological sequence data (e.g., from NGS pipelines) into CSV format.

## Core Functions

- **`CLIWriteSequenceCSV()`**
  Converts an iterator of `IBioSequence` objects into a CSV-compatible stream. It configures parallelism, batching, and compression using default settings (e.g., `obidefault.ParallelWorkers()`), then applies CLI-driven column mappings via helper functions (`CLIPrintId()`, `CLIPrintSequence()`, etc.). Returns an `ICSVRecord` iterator.

- **`CLICSVWriter()`**
  Writes the CSV data either to a file (if `obiconvert.CLIOutPutFileName()` ≠ `"-"`) or to standard output. Errors are logged as fatal, and the function can optionally consume the iterator itself (terminal action).

## Key Features

- **Flexible column selection**: Controlled by CLI options (e.g., `CSVTaxon`, `CSVKeys`), allowing selective export of metadata, sequences, and quality scores.
- **Compression support**: Output can be gzip-compressed per `obidefault.CompressOutput()`.
- **Parallel processing**: Uses ~¼ of configured workers (min 2) for throughput optimization.
- **CLI integration**: Leverages existing `obiconvert` and CLI abstractions for seamless pipeline usage.
- **Error handling**: Fails fast on I/O errors with descriptive logs.
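
The worker rule mentioned in the list above (about a quarter of the configured workers, never fewer than two) can be written as a small helper. This is a minimal sketch: the function name `csvWorkers` is hypothetical, and the use of integer division for "about a quarter" is an assumption.

```go
package main

import "fmt"

// csvWorkers returns roughly one quarter of the configured workers,
// never fewer than two. The name and the rounding (integer division)
// are assumptions made for this sketch.
func csvWorkers(configured int) int {
	n := configured / 4
	if n < 2 {
		n = 2
	}
	return n
}

func main() {
	for _, w := range []int{1, 8, 16, 64} {
		fmt.Printf("configured=%d csv=%d\n", w, csvWorkers(w))
	}
}
```

The floor of two keeps CSV conversion concurrent even on small machines, while the quarter share leaves most workers for the upstream sequence processing.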

## Design Notes

Functions follow a functional-iterator pattern, enabling lazy evaluation and streaming. The `terminalAction` flag determines whether the iterator is consumed immediately (e.g., for final output) or returned for further processing.
@@ -0,0 +1,28 @@
# CSV Export Functionality Overview

This Go package (`obicsv`) provides command-line interface options and utilities for exporting biological sequence data to CSV format. It integrates with the OBITools4 framework, supporting flexible attribute selection and formatting.

## Core Export Options
- **`--ids/-i`**: Outputs sequence identifiers.
- **`--sequence/-s`**: Includes raw nucleotide/amino acid sequences.
- **`--quality/-q`**: Adds per-base quality scores (e.g., Phred values).
- **`--definition/-d`**: Prints sequence headers or definitions.
- **`--count`**: Includes abundance/observation counts per sequence.

## Taxonomic & Pairing Data
- **`--taxon`**: Exports NCBI taxid and corresponding scientific name.
- **`--obipairing`**: Includes metadata added by `obipairing`, such as alignment mode, score, and mismatch count.

## Attribute Filtering
- **`--keep/-k KEY`**: Restricts output to specified attributes (multiple `-k` allowed).
- **`--auto`**: Inspects first records to auto-detect and suggest relevant attributes.
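
A repeatable option such as `-k` is usually implemented by accumulating each occurrence into a slice. The sketch below shows the idea with Go's standard `flag` package as a stand-in; how the tool itself parses its options is not described here, and the names `keyList` and `parseKeep` are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// keyList collects every occurrence of a repeated flag.
type keyList []string

func (k *keyList) String() string { return strings.Join(*k, ",") }

// Set is called once per occurrence of the flag on the command line.
func (k *keyList) Set(v string) error {
	*k = append(*k, v)
	return nil
}

// parseKeep returns the attribute keys given with -k or --keep.
func parseKeep(args []string) ([]string, error) {
	fs := flag.NewFlagSet("obicsv-sketch", flag.ContinueOnError)
	var keys keyList
	fs.Var(&keys, "k", "attribute to keep (repeatable)")
	fs.Var(&keys, "keep", "attribute to keep (repeatable)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return keys, nil
}

func main() {
	keys, err := parseKeep([]string{"-k", "count", "--keep", "taxid"})
	if err != nil {
		panic(err)
	}
	fmt.Println(keys)
}
```

Registering the same value under both names keeps the short and long forms in one ordered list, which matters if column order should follow the order given on the command line.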

## Configuration
- **`--na-value NAVALUE`**: Sets placeholder string (default `"NA"`) for missing fields.

## Integration
- Extends `obiconvert` input/output and taxonomy-loading options.
- Provides CLI accessor functions (e.g., `CLIPrintSequence()`, `CLIHasToBeKeptAttributes()`).
- Supports soft attribute groups (e.g., `"obipairing"` expands to 8 specific fields).

Designed for high-throughput sequence analysis pipelines, enabling customizable tabular output compatible with downstream tools.
@@ -0,0 +1,27 @@
# CSV Export Functionality in `obicsv` Package

The `obicsv` package provides utilities to convert biological sequence data into structured CSV format. It supports flexible, configurable output through an `Options` interface.

## Core Functions

- **`CSVSequenceHeader(opt Options)`**:
  Constructs a CSV header row based on enabled options (e.g., `id`, `count`, `taxid`, `definition`). Additional user-defined attributes are appended, followed by optional `sequence` and `qualities`.

- **`CSVBatchFromSequences(batch BioSequenceBatch, opt Options)`**:
  Converts a batch of biological sequences into CSV records. Each sequence is processed according to the active options:
  - Sequence ID, count, taxonomic identifier (from `Taxon()` or fallback to raw `taxid`), and definition.
  - Custom attributes retrieved via `GetAttribute(key)`; missing values replaced by a configurable NA value.
  - Nucleotide sequence (as a string) and quality scores (encoded as ASCII characters with the Phred shift applied, or NA if absent).

- **`NewCSVSequenceIterator(iter IBioSequence, options ...WithOption)`**:
  Wraps a sequence iterator (`IBioSequence`) to produce an asynchronous CSV record stream:
  - Optionally auto-detects and includes all sequence attributes (`CSVAutoColumn`).
  - Launches parallel workers to process batches concurrently.
  - Uses a producer-consumer pattern: one goroutine drives iteration, others write CSV records.
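
The header and record construction described above can be illustrated with a reduced model. This is a sketch under stated assumptions: the `seq` type, the `header` and `record` helpers, and the fixed column set (`id`, user keys, `sequence`, `qualities`) are simplifications invented for the example, and the shift of 33 is the common FASTQ offset rather than a value taken from the package.

```go
package main

import (
	"fmt"
	"strings"
)

// seq is a minimal stand-in for a biological sequence record.
type seq struct {
	id        string
	sequence  string
	qualities []byte            // raw Phred scores, nil when absent
	attrs     map[string]string // free-form annotations
}

// header lists the columns in output order: id, user keys,
// then sequence and qualities.
func header(keys []string) []string {
	h := []string{"id"}
	h = append(h, keys...)
	return append(h, "sequence", "qualities")
}

// record builds one row matching header(keys). Missing attributes and
// absent qualities are replaced by na; scores are shifted into ASCII.
func record(s seq, keys []string, na string, shift byte) []string {
	row := []string{s.id}
	for _, k := range keys {
		v, ok := s.attrs[k]
		if !ok {
			v = na
		}
		row = append(row, v)
	}
	row = append(row, s.sequence)
	if len(s.qualities) == 0 {
		return append(row, na)
	}
	ascii := make([]byte, len(s.qualities))
	for i, q := range s.qualities {
		ascii[i] = q + shift
	}
	return append(row, string(ascii))
}

func main() {
	keys := []string{"count", "taxid"}
	s := seq{
		id:        "seq1",
		sequence:  "acgt",
		qualities: []byte{30, 30, 40, 2},
		attrs:     map[string]string{"count": "3"},
	}
	fmt.Println(strings.Join(header(keys), ","))
	fmt.Println(strings.Join(record(s, keys, "NA", 33), ","))
}
```

Building the row from the same key list as the header guarantees that every record has the same number of columns, whatever attributes a given sequence carries.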

## Key Features

- **Configurable output columns** via option flags (e.g., `CSVId()`, `CSVTaxon()`).
- **Support for quality scores** in standard FASTQ ASCII encoding.
- **NA value handling**: missing fields replaced with a user-defined placeholder (e.g., `"."`).
- **Parallelization**: scalable CSV generation using multiple goroutines.
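
The producer-consumer arrangement behind this parallelization can be sketched with plain channels. The code below is an illustration, not the package's implementation: the `batch` type, the `convert` function, and the placeholder row format are assumptions, and batches are re-sorted by their order number at the end because workers may finish in any order.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// batch pairs an order number with its rows so that output can be
// put back in sequence after concurrent processing.
type batch struct {
	order int
	rows  []string
}

// convert fans batches out to nworkers goroutines and collects results.
func convert(in []batch, nworkers int) []batch {
	src := make(chan batch)
	dst := make(chan batch)

	// Producer: a single goroutine drives the iteration.
	go func() {
		for _, b := range in {
			src <- b
		}
		close(src)
	}()

	// Consumers: each worker turns identifiers into CSV-like rows.
	var wg sync.WaitGroup
	for w := 0; w < nworkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range src {
				rows := make([]string, len(b.rows))
				for i, id := range b.rows {
					rows[i] = id + ",NA"
				}
				dst <- batch{order: b.order, rows: rows}
			}
		}()
	}

	// Close the output channel once every worker has finished.
	go func() {
		wg.Wait()
		close(dst)
	}()

	var out []batch
	for b := range dst {
		out = append(out, b)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].order < out[j].order })
	return out
}

func main() {
	in := []batch{{0, []string{"a", "b"}}, {1, []string{"c"}}, {2, []string{"d"}}}
	for _, b := range convert(in, 2) {
		fmt.Println(b.order, b.rows)
	}
}
```

Carrying an order number on each batch is what lets a writer restore the input order, or skip that step entirely when unordered output is acceptable.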
@@ -0,0 +1,21 @@
# CSV Export Functionality in `obicsv` Package

The `obicsv` package provides utilities for efficiently writing structured data (e.g., sequence annotations) to CSV format, supporting parallel processing and streaming.

- **`FormatCVSBatch()`**: Converts a batch of CSV records (`CSVRecordBatch`) into an in-memory buffer, using the provided header and a placeholder for missing values (`navalue`). It prepends the header only once (for batch order 0).

- **`WriteCSV()`**: Writes a CSV-formatted stream from an `ICSVRecord` iterator to any `io.WriteCloser`. It supports:
  - Compression (via `obiutils.CompressStream`)
  - Parallel workers for batch processing (`ParallelWorkers()`)
  - Chunked writing via `obiformats.WriteFileChunk`

- **`WriteCSVToStdout()` / `WriteCSVToFile()`**: Convenience wrappers:
  - `WriteCSVToStdout()` outputs to stdout (`os.Stdout`)
  - `WriteCSVToFile()` writes to a file (with `O_WRONLY`, optional append/truncate)

- **Key design features**:
  - Non-blocking, concurrent processing using goroutines
  - Graceful shutdown via `WaitAndClose()` and channel signaling
  - Robust handling of missing/invalid values (falls back to `navalue`)

- **Dependencies**: Leverages internal packages for iteration (`obiitercsv`), data formats (`obiformats`), and utilities (`obiutils`), plus `logrus` for logging.
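
The batch formatting step described above (header written only for batch order 0, missing values replaced by the NA placeholder) can be sketched with the standard `encoding/csv` writer. The function name `formatBatch` and the use of `map[string]string` for a record are assumptions made for this example; the package's own record and batch types are not shown here.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// formatBatch renders records as CSV into an in-memory buffer.
// The header row is written only for the first batch (order 0);
// columns missing from a record are filled with na.
func formatBatch(order int, header []string, records []map[string]string, na string) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	w := csv.NewWriter(buf)
	if order == 0 {
		if err := w.Write(header); err != nil {
			return nil, err
		}
	}
	for _, rec := range records {
		row := make([]string, len(header))
		for i, col := range header {
			if v, ok := rec[col]; ok {
				row[i] = v
			} else {
				row[i] = na
			}
		}
		if err := w.Write(row); err != nil {
			return nil, err
		}
	}
	w.Flush()
	return buf, w.Error()
}

func main() {
	header := []string{"id", "count"}
	recs := []map[string]string{{"id": "seq1", "count": "3"}, {"id": "seq2"}}
	first, err := formatBatch(0, header, recs, "NA")
	if err != nil {
		panic(err)
	}
	next, err := formatBatch(1, header, recs, "NA")
	if err != nil {
		panic(err)
	}
	fmt.Print(first.String())
	fmt.Print(next.String())
}
```

Because each batch is formatted into its own buffer, batches can be rendered concurrently and then written out in order, with the header appearing exactly once at the top of the file.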