Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# EMBL Format Parser for OBITools4
This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:
- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
- **Two Parsing Modes**:
  - `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
  - `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
- **Configurable Options**:
  - `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
  - `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
- **Sequence Handling**: Parses multi-line EMBL sequences (10-bases-per-group, with position numbers), skipping digits and whitespace.
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
- **Integration**: Outputs are compatible with OBITools4's iterator framework (`obiiter.IBioSequence`) and the sequence type `obiseq.BioSequence`.
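
The entry boundary detection described above can be illustrated with a minimal sketch. It is not the package's `EndOfLastFlatFileEntry()`: the helper name `lastEntryEnd` is hypothetical, and the sketch assumes the terminator is a line holding only `//`, followed by LF or CRLF. It scans a buffer backwards and returns the offset just past the last complete entry, so that the remainder can be carried over to the next chunk.

```go
package main

import (
	"bytes"
	"fmt"
)

// lastEntryEnd returns the offset just past the last complete EMBL entry in
// buf, or -1 if buf holds no complete entry.
// Hypothetical helper for illustration only; it assumes an entry ends with a
// line containing exactly "//", terminated by LF or CRLF.
func lastEntryEnd(buf []byte) int {
	end := len(buf)
	for end > 0 {
		// Find the last line break before the current scan position.
		nl := bytes.LastIndexByte(buf[:end], '\n')
		if nl < 0 {
			return -1
		}
		// Locate the start of the line that this break terminates.
		start := bytes.LastIndexByte(buf[:nl], '\n') + 1
		line := bytes.TrimRight(buf[start:nl], "\r")
		if bytes.Equal(line, []byte("//")) {
			return nl + 1
		}
		// Not a terminator line: keep scanning towards the start.
		end = nl
	}
	return -1
}

func main() {
	chunk := []byte("ID   A\nSQ\n     acgt 4\n//\nID   B\nSQ\n")
	cut := lastEntryEnd(chunk)
	if cut < 0 {
		fmt.Println("no complete entry in this chunk")
		return
	}
	fmt.Printf("complete:  %q\n", chunk[:cut])
	fmt.Printf("remainder: %q\n", chunk[cut:])
}
```

Scanning from the end keeps the cost proportional to the size of the trailing, incomplete entry rather than to the whole chunk, which is the property a chunked reader needs.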
Because entries are read chunk by chunk and parsed in parallel, the module handles large EMBL files efficiently, which suits metagenomic and biodiversity data pipelines.
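
As an illustration of the sequence handling and the `UtoT` option listed above, the following sketch processes one EMBL sequence line at a time. The function name `appendSequenceLine` and its signature are assumptions made for this example and are not part of `obiformats`. It drops whitespace and the position counter digits, and optionally rewrites uracil as thymine while preserving case.

```go
package main

import "fmt"

// appendSequenceLine appends the residues found on one EMBL sequence line to
// seq, skipping whitespace and the digits of the position counter. When uToT
// is true, 'u' becomes 't' and 'U' becomes 'T'.
// Hypothetical helper for illustration; not the package's own parser.
func appendSequenceLine(seq []byte, line string, uToT bool) []byte {
	for i := 0; i < len(line); i++ {
		c := line[i]
		switch {
		case c == ' ' || c == '\t' || c == '\r' || c == '\n':
			continue // group separators and line endings
		case c >= '0' && c <= '9':
			continue // position counter at the end of the line
		case uToT && c == 'u':
			c = 't'
		case uToT && c == 'U':
			c = 'T'
		}
		seq = append(seq, c)
	}
	return seq
}

func main() {
	lines := []string{
		"     acguacguac gunnacguac        20",
		"     ACGU                         24",
	}
	var seq []byte
	for _, l := range lines {
		seq = appendSequenceLine(seq, l, true)
	}
	fmt.Println(string(seq))
}
```

Appending into a caller-owned slice lets one buffer grow across all lines of an entry instead of allocating a new string per line.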