EcoPCR File Parser for Biological Sequences

This Go package (obiformats) provides functionality to parse EcoPCR output files: pipe-delimited (|), CSV-like text files containing amplified sequence data generated by the EcoPCR tool (used in metabarcoding pipelines). The parser supports two versions of the format (v1 and v2) and extracts biological metadata alongside each sequence.

Key Features

  • Version Detection: Automatically detects EcoPCR file version via the #@ecopcr-v2 header.
  • Primer Extraction: Reads forward and reverse primer sequences from comment lines in the file header.
  • Mode Inference: Identifies amplification mode (e.g., direct, inverted) from header metadata.
  • Sequence Parsing: Reads each record as a biological sequence (obiseq.BioSequence) with:
    • Name (with deduplication support)
    • Nucleotide sequence of the amplified region
    • Comment field
  • Structured Annotation: Populates rich annotations including:
    • Taxonomic hierarchy (taxid, rank, species/genus/family names)
    • Primer matching info (forward_match, reverse_mismatch)
    • Melting temperatures (if present in v2)
    • Amplicon length and strand orientation
  • Streaming & Batching: Returns an iterator (obiiter.IBioSequence) for memory-efficient, batched processing of large files.
  • File Handling: Provides both ReadEcoPCR (from any io.Reader) and ReadEcoPCRFromFile convenience functions.
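
The header handling described in the list above (version detection and primer extraction) can be sketched roughly as follows. This is an illustrative sketch, not the obiformats code: the ecoPCRHeader type and the parseHeader and primerAfter helpers are hypothetical names, and the exact layout of the primer comment lines ("oligo1 :" / "oligo2 :") is an assumption about typical EcoPCR headers.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ecoPCRHeader is a hypothetical container used only in this sketch.
type ecoPCRHeader struct {
	Version int    // 1 unless the v2 tag is found
	Forward string // forward primer sequence
	Reverse string // reverse primer sequence
}

// primerAfter returns the first whitespace-separated token that
// follows marker in line, or "" if the marker is absent.
func primerAfter(line, marker string) string {
	_, rest, found := strings.Cut(line, marker)
	if !found {
		return ""
	}
	fields := strings.Fields(rest)
	if len(fields) == 0 {
		return ""
	}
	return fields[0]
}

// parseHeader scans the leading '#' lines of an EcoPCR-like text and
// stops at the first data record.
func parseHeader(text string) ecoPCRHeader {
	h := ecoPCRHeader{Version: 1}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "#") {
			break // first data record: the header is over
		}
		switch {
		case strings.HasPrefix(line, "#@ecopcr-v2"):
			h.Version = 2
		case strings.Contains(line, "oligo1 :"): // assumed header layout
			h.Forward = primerAfter(line, "oligo1 :")
		case strings.Contains(line, "oligo2 :"): // assumed header layout
			h.Reverse = primerAfter(line, "oligo2 :")
		}
	}
	return h
}

func main() {
	sample := "#@ecopcr-v2\n" +
		"# direct  strand oligo1 : GGGCAATCCTGAGCCAA ; oligo2c : GGGTATCTAGTCCTCGT\n" +
		"# reverse strand oligo2 : ACGAGGACTAGATACCC ; oligo1c : TTGGCTCAGGATTGCCC\n" +
		"AB000001 | 120 | 9606\n"
	h := parseHeader(sample)
	fmt.Printf("v%d forward=%s reverse=%s\n", h.Version, h.Forward, h.Reverse)
}
```

Parsing from an in-memory string keeps the sketch short; a reader-based parser would also need to hand the first data line back to the record parser rather than discard it.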

Implementation Highlights

  • Custom line reader (__readline__) for robust header parsing.
  • CSV parser configured with | as the field delimiter and # as the comment marker, so header lines are skipped during record parsing.
  • Deduplication of sequence names using a running count suffix.
  • Concurrent goroutine-based streaming to decouple I/O and processing.

This module integrates with the broader OBItools4 ecosystem for high-throughput sequence analysis in environmental DNA studies.