autodoc/docmd/pkg/obiformats/csv_read.md

# CSV Import Module for Biological Sequences (`obiformats`)

This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.

## Core Features

- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
- **Metadata Handling**:
  - Special handling for taxonomic IDs (`taxid`, `*_taxid`).
  - Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
- **Multiple Entry Points**:
  - `ReadCSV`: From any `io.Reader`.
  - `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
  - `ReadCSVFromStdin`: Reads from standard input.
- **Error & Edge Handling**:
  - Gracefully handles empty files/streams via `ReadEmptyFile`.
  - Uses structured logging (Logrus) for fatal and informational messages.

## Integration

Designed to integrate with OBItools4’s core types:
- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.

## Use Case

Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
-											⬆️ version bump to v4.5
										
										
											2026-04-07 08:36:50 +02:00
+								# CSV Import Module for Biological Sequences (`obiformats`)
 								This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.
 								## Core Features
 								- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
 								- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
 								- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
 								- **Metadata Handling**:
 								  - Special handling for taxonomic IDs (`taxid`, `*_taxid`).
 								  - Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
 								- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
 								- **Multiple Entry Points**:
 								  - `ReadCSV`: From any `io.Reader`.
 								  - `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
 								  - `ReadCSVFromStdin`: Reads from standard input.
 								- **Error & Edge Handling**:
 								  - Gracefully handles empty files/streams via `ReadEmptyFile`.
 								  - Uses structured logging (Logrus) for fatal and informational messages.
 								## Integration
 								Designed to integrate with OBItools4’s core types:
 								- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
 								- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
 								## Use Case
 								Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.