Mirror of https://github.com/metabarcoding/obitools4.git (synced 2026-04-30 12:00:39 +00:00)
# EcoPCR File Parser for Biological Sequences
This Go package (`obiformats`) parses EcoPCR output files: pipe-delimited (`|`), CSV-like files containing the amplified sequences produced by the *EcoPCR* tool, which is used in metabarcoding pipelines. The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside each sequence.
## Key Features
- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
  - Name (with deduplication support)
  - Nucleotide/protein sequence
  - Comment field
- **Structured Annotation**: Populates rich annotations including:
  - Taxonomic hierarchy (taxid, rank, species/genus/family names)
  - Primer matching info (`forward_match`, `reverse_mismatch`)
  - Melting temperatures (if present in v2)
  - Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
## Implementation Highlights
- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent goroutine-based streaming to decouple I/O and processing.
This module integrates with the broader *OBItools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.