⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,30 @@
+# CSV Import Module for Biological Sequences (`obiformats`)
+
+This Go package parses biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.
+
+## Core Features
+
+- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
+- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
+- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
+- **Metadata Handling**:
+  - Special handling for taxonomic IDs (`taxid`, `*_taxid`).
+  - Generic attributes are parsed as JSON when possible, falling back to the raw string otherwise.
+- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
+- **Multiple Entry Points**:
+  - `ReadCSV`: From any `io.Reader`.
+  - `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
+  - `ReadCSVFromStdin`: Reads from standard input.
+- **Error & Edge Handling**:
+  - Gracefully handles empty files/streams via `ReadEmptyFile`.
+  - Uses structured logging (Logrus) for fatal and informational messages.
+
+## Integration
+
+Designed to integrate with OBItools4’s core types:
+- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
+- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
+
+## Use Case
+
+Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.