mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
31 lines
1.7 KiB
Markdown
31 lines
1.7 KiB
Markdown
|
|
# CSV Import Module for Biological Sequences (`obiformats`)
|
|||
|
|
|
|||
|
|
This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.
|
|||
|
|
|
|||
|
|
## Core Features
|
|||
|
|
|
|||
|
|
- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
|
|||
|
|
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
|
|||
|
|
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
|
|||
|
|
- **Metadata Handling**:
|
|||
|
|
- Special handling for taxonomic IDs (`taxid`, `*_taxid`).
|
|||
|
|
- Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
|
|||
|
|
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
|
|||
|
|
- **Multiple Entry Points**:
|
|||
|
|
- `ReadCSV`: From any `io.Reader`.
|
|||
|
|
- `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
|
|||
|
|
- `ReadCSVFromStdin`: Reads from standard input.
|
|||
|
|
- **Error & Edge Handling**:
|
|||
|
|
- Gracefully handles empty files/streams via `ReadEmptyFile`.
|
|||
|
|
- Uses structured logging (Logrus) for fatal and informational messages.
|
|||
|
|
|
|||
|
|
## Integration
|
|||
|
|
|
|||
|
|
Designed to integrate with OBItools4’s core types:
|
|||
|
|
- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
|
|||
|
|
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
|
|||
|
|
|
|||
|
|
## Use Case
|
|||
|
|
|
|||
|
|
Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
|