⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "Release 4.4.30"
- Update version.txt from 4.4.29 → 4.4.30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,33 @@
# Functional Overview of the `obicsv` Package
The `obicsv` package provides a flexible and configurable interface for processing biological sequence data (e.g., FASTA/FASTQ) with support for CSV export and parallelized batch processing.
## Core Concepts
- **Options Pattern**: Uses a functional-options API: `MakeOptions([]WithOption)` applies a list of option setters to configure behavior.
- **Configurable Processing**: Supports batch size, parallel workers, file I/O mode (append/new), compression handling, and progress tracking.
- **Selective CSV Export**: Fine-grained control over output columns (ID, sequence, quality, taxon, count, definition) and formatting (separator, NA value, custom keys).
- **Default Integration**: Leverages `obidefault` for sensible defaults (e.g., batch size, parallel workers).
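The options pattern described above can be sketched as follows. This is a minimal, self-contained illustration and not the package's actual source: the names `MakeOptions`, `WithOption`, `OptionFileName`, `CSVId` and `CSVNAValue` come from this document, but the internal `options` struct, its fields, and the default values (`","`, `"NA"`, batch size 1000) are assumptions made for the example.

```go
package main

import "fmt"

// options holds the settings. Field names are hypothetical.
type options struct {
	fileName  string
	separator string
	naValue   string
	csvID     bool
	batchSize int
}

// Options wraps a pointer so that setters can mutate shared state.
type Options struct{ p *options }

// WithOption is a setter applied to an Options value under construction.
type WithOption func(Options)

// MakeOptions starts from defaults (assumed values) and applies each setter in order.
func MakeOptions(setters []WithOption) Options {
	opt := Options{p: &options{separator: ",", naValue: "NA", batchSize: 1000}}
	for _, set := range setters {
		set(opt)
	}
	return opt
}

// OptionFileName sets the output file name.
func OptionFileName(name string) WithOption {
	return func(o Options) { o.p.fileName = name }
}

// CSVId toggles the identifier column.
func CSVId(on bool) WithOption {
	return func(o Options) { o.p.csvID = on }
}

// CSVNAValue sets the placeholder used for missing values.
func CSVNAValue(v string) WithOption {
	return func(o Options) { o.p.naValue = v }
}

func main() {
	opt := MakeOptions([]WithOption{OptionFileName("output.csv"), CSVId(true)})
	fmt.Println(opt.p.fileName, opt.p.csvID, opt.p.naValue)
}
```

Each setter only touches the field it is responsible for, so anything the caller leaves out keeps its default.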
## Key Functionalities
| Category | Features |
|---------|----------|
| **I/O Control** | File name (`OptionFileName`), append vs. overwrite (`OptionsAppendFile`), closing the file after writing (`OptionCloseFile`), compression support (`OptionsCompressed`) |
| **Processing Strategy** | Batch size, full-file batch mode (`FullFileBatch`), parallel workers (`ParallelWorkers`), unordered processing (`NoOrder`) |
| **Data Handling** | Skip empty sequences (`SkipEmptySequence`), progress bar display, source tracking |
| **CSV Output Customization** | Toggle columns (`CSVId`, `CSVSequence`, etc.), custom keys via `CSVKey`/`CSVKeys`, separator (`CSVSeparator`) and NA placeholder (`CSVNAValue`), auto-column detection |
## Usage Example
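```go
// Build an option set: write to output.csv in append mode,
// exporting the id and sequence columns but not the taxon.
opt := MakeOptions([]WithOption{
	OptionFileName("output.csv"),
	CSVId(true),
	CSVSequence(true),
	CSVTaxon(false),
	OptionsAppendFile(true),
})
```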
This package enables efficient, customizable conversion of biological sequence data to structured CSV format with minimal boilerplate.
@@ -0,0 +1,23 @@
# `obicsv` Package: CSV Export Functionality for Biological Sequences
This Go package provides utilities to serialize biological sequence data (e.g., from NGS pipelines) into CSV format.
## Core Functions
- **`CLIWriteSequenceCSV()`**
Converts an iterator of `IBioSequence` objects into a CSV-compatible stream. It configures parallelism, batching, and compression using default settings (e.g., `obidefault.ParallelWorkers()`), then applies CLI-driven column mappings via helper functions (`CLIPrintId()`, `CLIPrintSequence()`, etc.). Returns an `ICSVRecord` iterator.
- **`CLICSVWriter()`**
Writes the CSV data either to a file (if `obiconvert.CLIOutPutFileName()` is not `"-"`) or to standard output. I/O errors are reported with fatal logging. The iterator can optionally be consumed by the writer itself (terminal mode).
## Key Features
- **Flexible column selection**: Controlled by CLI options (e.g., `CSVTaxon`, `CSVKeys`), allowing selective export of metadata, sequences, and quality scores.
- **Compression support**: Output can be gzip-compressed per `obidefault.CompressOutput()`.
- **Parallel processing**: Uses ~¼ of configured workers (min 2) for throughput optimization.
- **CLI integration**: Leverages existing `obiconvert` and CLI abstractions for seamless pipeline usage.
- **Error resilience**: Fails fast on I/O issues with descriptive logs.
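The worker rule in the list above ("~¼ of configured workers, min 2") amounts to a small piece of arithmetic. The helper below is a hypothetical sketch: the function name `csvWorkers` is not from the package, and it assumes integer division is what "~¼" means.

```go
package main

import "fmt"

// csvWorkers returns a quarter of the configured workers (integer division),
// but never fewer than 2. The name and the rounding are assumptions.
func csvWorkers(configured int) int {
	n := configured / 4
	if n < 2 {
		n = 2
	}
	return n
}

func main() {
	for _, c := range []int{1, 8, 16, 64} {
		fmt.Printf("configured=%d -> csv workers=%d\n", c, csvWorkers(c))
	}
}
```

With this rule, machines configured with 8 workers or fewer all end up with 2 CSV workers; only above that does the count scale.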
## Design Notes
Functions follow a functional-iterator pattern, enabling lazy evaluation and streaming. The `terminalAction` flag determines whether the iterator is consumed immediately (e.g., for final output) or returned for further processing.
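The role of the `terminalAction` flag can be illustrated with a channel-based sketch. This is not the package's iterator type: `record`, `source`, and `write` are hypothetical stand-ins, and a plain Go channel replaces `ICSVRecord` to keep the example self-contained.

```go
package main

import "fmt"

// record stands in for one CSV row.
type record []string

// source emits the given rows on a channel and then closes it.
func source(rows ...record) <-chan record {
	ch := make(chan record)
	go func() {
		defer close(ch)
		for _, r := range rows {
			ch <- r
		}
	}()
	return ch
}

// write passes every record to sink and forwards it downstream.
// When terminal is true it drains the stream itself and returns nil;
// otherwise it returns the channel so the caller can extend the pipeline.
func write(in <-chan record, sink func(record), terminal bool) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for r := range in {
			sink(r)
			out <- r
		}
	}()
	if terminal {
		for range out {
			// Discard: the writer is the last stage of the pipeline.
		}
		return nil
	}
	return out
}

func main() {
	n := 0
	write(source(record{"seq1", "acgt"}, record{"seq2", "ttga"}),
		func(record) { n++ }, true)
	fmt.Println("rows written:", n)
}
```

In terminal mode the call blocks until all rows are written, which suits the last stage of a command. In non-terminal mode nothing is written until a later stage pulls from the returned channel.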
@@ -0,0 +1,28 @@
# CSV Export Functionality Overview
This Go package (`obicsv`) provides command-line interface options and utilities for exporting biological sequence data to CSV format. It integrates with the OBITools4 framework, supporting flexible attribute selection and formatting.
## Core Export Options
- **`--ids/-i`**: Outputs sequence identifiers.
- **`--sequence/-s`**: Includes raw nucleotide/amino acid sequences.
- **`--quality/-q`**: Adds per-base quality scores (e.g., Phred values).
- **`--definition/-d`**: Prints sequence headers or definitions.
- **`--count`**: Includes abundance/observation counts per sequence.
## Taxonomic & Pairing Data
- **`--taxon`**: Exports NCBI taxid and corresponding scientific name.
- **`--obipairing`**: Includes metadata added by `obipairing`, such as alignment mode, score, and mismatch count.
## Attribute Filtering
- **`--keep/-k KEY`**: Restricts output to specified attributes (multiple `-k` allowed).
- **`--auto`**: Inspects first records to auto-detect and suggest relevant attributes.
## Configuration
- **`--na-value NAVALUE`**: Sets placeholder string (default `"NA"`) for missing fields.
## Integration
- Extends `obiconvert` input/output and taxonomy-loading options.
- Provides CLI accessor functions (e.g., `CLIPrintSequence()`, `CLIHasToBeKeptAttributes()`).
- Supports soft attribute groups (e.g., `"obipairing"` expands to 8 specific fields).
Designed for high-throughput sequence analysis pipelines, enabling customizable tabular output compatible with downstream tools.
@@ -0,0 +1,27 @@
# CSV Export Functionality in `obicsv` Package
The `obicsv` package provides utilities to convert biological sequence data into structured CSV format. It supports flexible, configurable output through an `Options` interface.
## Core Functions
- **`CSVSequenceHeader(opt Options)`**:
Constructs a CSV header row based on enabled options (e.g., `id`, `count`, `taxid`, `definition`). Additional user-defined attributes are appended, followed by optional `sequence` and `qualities`.
- **`CSVBatchFromSequences(batch BioSequenceBatch, opt Options)`**:
Converts a batch of biological sequences into CSV records. Each sequence is processed according to the active options:
- Sequence ID, count, taxonomic identifier (from `Taxon()` or fallback to raw `taxid`), and definition.
- Custom attributes retrieved via `GetAttribute(key)`; missing values replaced by a configurable NA value.
- Nucleotide sequence (as string) and quality scores (Phred values shifted into printable ASCII characters, or the NA value if absent).
- **`NewCSVSequenceIterator(iter IBioSequence, options ...WithOption)`**:
Wraps a sequence iterator (`IBioSequence`) to produce an asynchronous CSV record stream:
- Optionally auto-detects and includes all sequence attributes (`CSVAutoColumn`).
- Launches parallel workers to process batches concurrently.
- Uses a producer-consumer pattern: one goroutine drives iteration, others write CSV records.
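The header and record construction described above can be sketched with the standard `encoding/csv` package. Everything here is illustrative: the `seq` struct, the `header` and `row` helpers, and the column subset are hypothetical, and the quality shift of 33 (the common FASTQ offset) is an assumption, since the text only says the scores are shifted.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
)

// seq is a stand-in for a biological sequence record.
type seq struct {
	id    string
	bases string
	qual  []byte            // raw Phred scores; nil when absent
	attrs map[string]string // free-form annotations
}

// header lists id first, then the custom keys, then sequence and qualities.
func header(keys []string) []string {
	h := append([]string{"id"}, keys...)
	return append(h, "sequence", "qualities")
}

// row renders one sequence; missing attributes or qualities become na.
func row(s seq, keys []string, na string, shift byte) []string {
	r := []string{s.id}
	for _, k := range keys {
		v, ok := s.attrs[k]
		if !ok {
			v = na
		}
		r = append(r, v)
	}
	q := na
	if len(s.qual) > 0 {
		ascii := make([]byte, len(s.qual))
		for i, p := range s.qual {
			ascii[i] = p + shift // shift the Phred score into printable ASCII
		}
		q = string(ascii)
	}
	return append(r, s.bases, q)
}

func main() {
	keys := []string{"count", "taxid"}
	w := csv.NewWriter(os.Stdout)
	w.Write(header(keys))
	w.Write(row(seq{id: "seq1", bases: "acgt", qual: []byte{40, 40, 30, 20},
		attrs: map[string]string{"count": "3"}}, keys, "NA", 33))
	w.Write(row(seq{id: "seq2", bases: "ttga"}, keys, "NA", 33))
	w.Flush()
	if err := w.Error(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Building the header and each row from the same `keys` slice keeps columns aligned even when a record lacks some attributes.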
## Key Features
- **Configurable output columns** via option flags (e.g., `CSVId()`, `CSVTaxon()`).
- **Support for quality scores** in standard FASTQ ASCII encoding.
- **NA value handling**: missing fields replaced with a user-defined placeholder (e.g., `"."`).
- **Parallelization**: scalable CSV generation using multiple goroutines.
@@ -0,0 +1,21 @@
# CSV Export Functionality in `obicsv` Package
The `obicsv` package provides utilities for efficiently writing structured data (e.g., sequence annotations) to CSV format, supporting parallel processing and streaming.
- **`FormatCVSBatch()`**: Converts a batch of CSV records (`CSVRecordBatch`) into an in-memory buffer, using the provided header and a placeholder for missing values (`navalue`). It prepends the header only once (for batch order 0).
- **`WriteCSV()`**: Writes a CSV-formatted stream from an `ICSVRecord` iterator to any `io.WriteCloser`. It supports:
- Compression (via `obiutils.CompressStream`)
- Parallel workers for batch processing (`ParallelWorkers()`)
- Chunked writing via `obiformats.WriteFileChunk`
- **`WriteCSVToStdout()` / `WriteCSVToFile()`**: Convenience wrappers:
- `WriteCSVToStdout()` writes to stdout (`os.Stdout`)
- `WriteCSVToFile()` writes to a file (opened with `O_WRONLY`, with optional append or truncate)
- **Key design features**:
- Non-blocking, concurrent processing using goroutines
- Graceful shutdown via `WaitAndClose()` and channel signaling
- Robust handling of missing/invalid values (falls back to `navalue`)
- **Dependencies**: Leverages internal packages for iteration (`obiitercsv`), data formats (`obiformats`), and utilities (`obiutils`), plus `logrus` for logging.
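The batch formatting rule described for `FormatCVSBatch()` (header only for batch order 0, `navalue` for missing fields) can be sketched as below. This is a hedged illustration and not the package's code: `formatBatch` is a hypothetical name, and records are modelled as `map[string]string` instead of the real `CSVRecordBatch` type.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// formatBatch renders records as CSV text in an in-memory buffer.
// The header row is emitted only for the batch with order 0, and any
// column missing from a record is filled with navalue.
func formatBatch(order int, header []string, records []map[string]string, navalue string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if order == 0 {
		if err := w.Write(header); err != nil {
			return nil, err
		}
	}
	for _, rec := range records {
		line := make([]string, len(header))
		for i, key := range header {
			if v, ok := rec[key]; ok {
				line[i] = v
			} else {
				line[i] = navalue
			}
		}
		if err := w.Write(line); err != nil {
			return nil, err
		}
	}
	w.Flush()
	return &buf, w.Error()
}

func main() {
	header := []string{"id", "count"}
	first, _ := formatBatch(0, header, []map[string]string{{"id": "seq1", "count": "3"}}, "NA")
	second, _ := formatBatch(1, header, []map[string]string{{"id": "seq2"}}, "NA")
	fmt.Print(first.String(), second.String())
}
```

Because only batch 0 carries the header, buffers formatted by different workers can be concatenated in batch order to form one valid CSV file.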