obitools4/autodoc/docmd/pkg/obiformats/ecopcr_read.md
# EcoPCR File Parser for Biological Sequences
This Go package (`obiformats`) parses EcoPCR output files: pipe-delimited (`|`), CSV-like text files containing amplified sequence data generated by the *EcoPCR* tool, which is used in metabarcoding pipelines. The parser supports two versions of the format (`v1` and `v2`) and extracts biological metadata alongside the sequences.
## Key Features
- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
  - Name (with deduplication support)
  - Nucleotide/protein sequence
  - Comment field
- **Structured Annotation**: Populates rich annotations including:
  - Taxonomic hierarchy (taxid, rank, species/genus/family names)
  - Primer matching info (`forward_match`, `reverse_mismatch`)
  - Melting temperatures (if present in v2)
  - Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
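
The header handling described above (version detection and reading comment lines) could be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the `ecoPCRHeader` type, `readHeader`, and `headerFromString` are hypothetical names, and only the `#@ecopcr-v2` marker is taken from the description. The layout of the primer and mode comment lines is not specified here, so the sketch only collects the comment lines rather than guessing at their structure.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// ecoPCRHeader holds what this sketch extracts from the comment header.
// The type and its fields are hypothetical, not taken from obiformats.
type ecoPCRHeader struct {
	Version  int      // 2 if the "#@ecopcr-v2" marker is present, otherwise 1
	Comments []string // other header comment lines, leading "#" stripped
}

// readHeader consumes the leading "#" lines and stops at the first data
// line, which it returns so the caller can still parse it.
func readHeader(r *bufio.Reader) (ecoPCRHeader, string, error) {
	h := ecoPCRHeader{Version: 1}
	for {
		line, err := r.ReadString('\n')
		if err != nil && err != io.EOF {
			return h, "", err
		}
		trimmed := strings.TrimRight(line, "\r\n")
		switch {
		case strings.HasPrefix(trimmed, "#@ecopcr-v2"):
			h.Version = 2
		case strings.HasPrefix(trimmed, "#"):
			comment := strings.TrimSpace(strings.TrimPrefix(trimmed, "#"))
			h.Comments = append(h.Comments, comment)
		case trimmed != "":
			return h, trimmed, nil // first data line
		}
		if err == io.EOF {
			return h, "", nil // input ended before any data line
		}
	}
}

// headerFromString is a small convenience wrapper for in-memory input.
func headerFromString(s string) (ecoPCRHeader, string, error) {
	return readHeader(bufio.NewReader(strings.NewReader(s)))
}

func main() {
	// Made-up input: only the "#@ecopcr-v2" marker comes from the text above.
	input := "#@ecopcr-v2\n# example comment line\nfield1 | field2\n"
	h, first, err := headerFromString(input)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Version, h.Comments, first)
}
```

Returning the first data line keeps the sketch simple. A reader that peeks at the next byte before consuming a line would avoid handing that line back to the caller.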
## Implementation Highlights
- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent goroutine-based streaming to decouple I/O and processing.
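
The three ideas above (a `|`-delimited CSV reader with `#` comments, name deduplication with a running count, and a goroutine that decouples reading from processing) can be combined in a short sketch built on Go's standard `encoding/csv` package. Everything specific here is an assumption made for illustration: the `record` type stands in for `obiseq.BioSequence`, the two-column layout does not reflect the real EcoPCR columns, and the `_N` suffix format is a guess at what "running count suffix" looks like.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// record is a hypothetical stand-in for obiseq.BioSequence.
type record struct {
	Name     string
	Sequence string
}

// uniqueName appends a running count to names that were already seen.
// The "_N" suffix format is an assumption.
func uniqueName(seen map[string]int, name string) string {
	n := seen[name]
	seen[name] = n + 1
	if n == 0 {
		return name
	}
	return fmt.Sprintf("%s_%d", name, n)
}

// streamRecords parses pipe-delimited rows in a goroutine and sends them
// in batches of at most batchSize on the returned channel.
func streamRecords(r io.Reader, batchSize int) <-chan []record {
	if batchSize < 1 {
		batchSize = 1
	}
	out := make(chan []record)
	go func() {
		defer close(out)
		cr := csv.NewReader(r)
		cr.Comma = '|'          // field separator
		cr.Comment = '#'        // skip lines starting with '#'
		cr.FieldsPerRecord = -1 // do not enforce a fixed column count
		cr.LazyQuotes = true    // tolerate stray quotes in free-text fields
		seen := make(map[string]int)
		batch := make([]record, 0, batchSize)
		for {
			fields, err := cr.Read()
			if err != nil {
				// io.EOF or a parse error ends the stream; a real
				// implementation should report parse errors.
				break
			}
			if len(fields) < 2 {
				continue // hypothetical layout: name, then sequence
			}
			name := uniqueName(seen, strings.TrimSpace(fields[0]))
			batch = append(batch, record{name, strings.TrimSpace(fields[1])})
			if len(batch) == batchSize {
				out <- batch
				batch = make([]record, 0, batchSize)
			}
		}
		if len(batch) > 0 {
			out <- batch // flush the final, possibly short, batch
		}
	}()
	return out
}

// collectBatches drains the stream into a slice, mainly for demonstration.
func collectBatches(input string, batchSize int) [][]record {
	var all [][]record
	for batch := range streamRecords(strings.NewReader(input), batchSize) {
		all = append(all, batch)
	}
	return all
}

func main() {
	input := "# comment line\nseqA | ACGT\nseqA | TTGA\nseqB | GGCC\n"
	for _, batch := range collectBatches(input, 2) {
		fmt.Println(batch)
	}
}
```

Because the channel is unbuffered, the reading goroutine blocks until the consumer takes each batch, which bounds memory use to roughly one batch at a time. A buffered channel would let I/O run further ahead of processing.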

This module integrates with the broader *OBItools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.