mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

`obiformats` Package — Semantic Overview

The obiformats package provides a unified, extensible framework for parsing and writing biological sequence data in standard bioinformatics formats (FASTA/FASTQ, EMBL, GenBank, CSV, EcoPCR), while supporting streaming, batching, parallelism, and format-agnostic workflows.

Core Objectives

Format-Agnostic Input: Automatically detect and parse diverse sequence formats via MIME-type inference.
Streaming & Scalability: Enable memory-efficient ingestion of large NGS datasets through chunked, concurrent parsing.
Structured Output: Support flexible export to FASTA/FASTQ, JSON, CSV, Newick, and taxonomy-aware formats.
Interoperability: Integrate seamlessly with OBITools4 abstractions (obiseq.BioSequence, obiiter.IBioSequence, obitax.Taxon).
Extensibility: Allow new readers/writers to be plugged in via functional interfaces and options.

Public Functionalities (Grouped by Domain)

📥 Sequence Reading & Parsing

| Function | Format(s) Supported |
| --- | --- |
| `ReadSequencesFromFile`, `ReadSequencesFromStdin` | Auto-detected (FASTA/FASTQ/EMBL/GenBank/EcoPCR/CSV) |
| `ReadFasta`, `ReadFastq` | FASTA, FASTQ (with rope/buffered variants) |
| `ReadEMBL`, `ReadGenbank` | EMBL, GenBank (rope-aware for large files) |
| `ReadCSV`, `ReadEcoPCR` | Tabular/amplicon outputs (e.g., EcoPCR v1/v2) |
| `LoadCSVTaxonomy`, `LoadNCBITaxDump` | Taxonomic data (CSV, NCBI dump dir/tar) |

Concurrent Parsing: Configurable worker count (OptionsParallelWorkers) with ordered batch output.
Rope-Based Parsing: Zero-copy parsing for large files (FastaChunkParserRope, EmblChunkParserRope).
Header Parsing: JSON (ParseFastSeqJsonHeader) and legacy OBI-style (ParseOBIFeatures).
Quality Handling: Phred offset adjustment, optional U→T conversion.
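
To make the quality-handling bullet concrete, here is a minimal, self-contained sketch of Phred offset adjustment and U→T conversion. The function names `decodeQualities` and `uToT` are hypothetical and are not part of obiformats; the sketch only assumes the usual FASTQ convention that a quality character encodes the score plus an offset (33 or 64).

```go
package main

import (
	"bytes"
	"fmt"
)

// decodeQualities (hypothetical name) converts ASCII-encoded FASTQ quality
// characters into numeric Phred scores by subtracting the encoding offset
// (33 for Sanger/Illumina 1.8+, 64 for older Illumina files).
func decodeQualities(ascii []byte, offset byte) ([]byte, error) {
	scores := make([]byte, len(ascii))
	for i, c := range ascii {
		if c < offset {
			return nil, fmt.Errorf("quality character %q is below offset %d", c, offset)
		}
		scores[i] = c - offset
	}
	return scores, nil
}

// uToT (hypothetical name) rewrites RNA-style U/u bases as T/t and leaves
// every other symbol untouched.
func uToT(seq []byte) []byte {
	out := bytes.ReplaceAll(seq, []byte("U"), []byte("T"))
	return bytes.ReplaceAll(out, []byte("u"), []byte("t"))
}

func main() {
	scores, err := decodeQualities([]byte("II5!"), 33)
	if err != nil {
		panic(err)
	}
	fmt.Println(scores)                           // numeric Phred scores
	fmt.Println(string(uToT([]byte("ACGUacgu")))) // RNA letters rewritten as DNA
}
```

Rejecting characters below the offset is one possible policy; a real parser might instead try another offset or clamp the value.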

📤 Sequence Writing & Formatting

| Function | Format(s) Supported |
| --- | --- |
| `WriteFasta`, `FormatFastq` | FASTA/FASTQ (single/batch, parallel I/O) |
| `WriteJSON` | Structured JSON with annotations (batched + ordered writes) |
| `FormatFastaBatch`, `WriteFastqToFile` | Optimized batch formatting with compression |
| `CSVTaxaIterator`, `CSVSequenceRecord` | Taxonomic/sequence CSV export (configurable columns) |
| `WriteNewick`, `Tree.Newick` | Taxonomy → Newick tree (with optional annotations) |

Compression Support: Automatic gzip/bgzip via obiutils.CompressStream.
Paired-End Handling: Split forward/reverse reads to separate files.
Ordered Output: Preserves sequence order across parallel writes (WriteFileChunk).
Format-Aware Dispatching: WriteSequence() auto-selects FASTQ/FASTA based on quality presence.
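
The format-aware dispatching idea can be sketched as follows. This is an illustration only: `Record`, `formatFasta`, `formatFastq`, and `formatAuto` are hypothetical names, not the obiseq or obiformats API, and the rule used here (FASTQ only when a quality string of matching length is present) is an assumption about what "quality presence" means.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for a sequence record; it is not the
// obiseq.BioSequence type.
type Record struct {
	ID        string
	Sequence  string
	Qualities string // ASCII-encoded qualities; empty when none are available
}

// formatFasta renders a record as a two-line FASTA entry.
func formatFasta(r Record) string {
	return fmt.Sprintf(">%s\n%s\n", r.ID, r.Sequence)
}

// formatFastq renders a record as a four-line FASTQ entry.
func formatFastq(r Record) string {
	return fmt.Sprintf("@%s\n%s\n+\n%s\n", r.ID, r.Sequence, r.Qualities)
}

// formatAuto picks FASTQ when the record carries one quality character per
// base, and falls back to FASTA otherwise.
func formatAuto(r Record) string {
	if len(r.Qualities) > 0 && len(r.Qualities) == len(r.Sequence) {
		return formatFastq(r)
	}
	return formatFasta(r)
}

func main() {
	fmt.Print(formatAuto(Record{ID: "s1", Sequence: "ACGT", Qualities: "IIII"}))
	fmt.Print(formatAuto(Record{ID: "s2", Sequence: "ACGT"}))
}
```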

🧬 Taxonomy & Metadata Handling

| Function | Purpose |
| --- | --- |
| `LoadCSVTaxonomy`, `LoadNCBITarTaxDump` | Load taxonomies from CSV/NCBI dumps |
| `DetectTaxonomyFormat`, `LoadTaxonomy` | Auto-detect and load taxonomy from diverse sources |
| `CSVTaxaIterator`, `WriteNewick` | Export taxonomies to CSV or Newick |
| Taxon annotation extraction (e.g., `taxid`, path, rank) | via structured metadata fields |

Root Enforcement: Ensures presence of NCBI root (taxid=1) during loading.
Alias Resolution: Merged taxids mapped to current IDs (AddAlias).
Flexible Output Fields: CSV/Newick support configurable metadata (scientific name, taxid, rank, path).
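
As an illustration of taxonomy-to-Newick export, the sketch below serializes a small tree recursively. The `node` type and the `toNewick` function are hypothetical and much simpler than obitax.Taxon and `WriteNewick`; it emits names only, replacing spaces with underscores as unquoted Newick labels require.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, minimal taxonomy node (not obitax.Taxon).
type node struct {
	name     string
	children []*node
}

// writeNewick appends the subtree rooted at n: children first, in
// parentheses and comma-separated, followed by the node's own label.
func writeNewick(b *strings.Builder, n *node) {
	if len(n.children) > 0 {
		b.WriteByte('(')
		for i, c := range n.children {
			if i > 0 {
				b.WriteByte(',')
			}
			writeNewick(b, c)
		}
		b.WriteByte(')')
	}
	// Unquoted Newick labels cannot contain blanks; underscores stand in for them.
	b.WriteString(strings.ReplaceAll(n.name, " ", "_"))
}

// toNewick returns the whole tree as a single Newick statement ending in ';'.
func toNewick(root *node) string {
	var b strings.Builder
	writeNewick(&b, root)
	b.WriteByte(';')
	return b.String()
}

func main() {
	root := &node{name: "Hominidae", children: []*node{
		{name: "Homo", children: []*node{{name: "Homo sapiens"}}},
		{name: "Pan", children: []*node{{name: "Pan troglodytes"}, {name: "Pan paniscus"}}},
	}}
	fmt.Println(toNewick(root))
}
```

Extra fields such as taxid or rank could be appended to the label in the same place where the name is written; how obiformats encodes them is not shown here.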

⚙️ Configuration & Options

Options encapsulates all runtime settings, which are configured through functional setters of type WithOption (e.g., BatchSize(1024), OptionsCompressed(true)).
Key options include:
- I/O: file append/truncate, compression (OptionsCompressed)
- Parsing: header parser toggle, quality read flag
- Export: CSV columns (CSVId, CSVTaxid), NA value, separator
- Taxonomy: include path/root/rank (OptionWithoutRootPath, WithTaxid)
- Performance: parallel workers, buffer size
Defaults ensure safe behavior; options are composable and immutable.
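
The functional-setter pattern described above can be sketched generically. The names below (`settings`, `Option`, `WithBatchSize`, `WithWorkers`, `WithCompression`, `newSettings`) and the default values are hypothetical, chosen for illustration; they are not the obiformats definitions.

```go
package main

import "fmt"

// settings is a hypothetical configuration struct (not obiformats.Options).
type settings struct {
	batchSize  int
	workers    int
	compressed bool
}

// Option is a functional setter: it edits a settings value under construction.
type Option func(*settings)

// WithBatchSize sets the number of sequences per batch.
func WithBatchSize(n int) Option { return func(s *settings) { s.batchSize = n } }

// WithWorkers sets the number of parallel workers.
func WithWorkers(n int) Option { return func(s *settings) { s.workers = n } }

// WithCompression toggles compressed output.
func WithCompression(on bool) Option { return func(s *settings) { s.compressed = on } }

// newSettings starts from defaults (assumed values) and applies the setters
// in order, so a later setter overrides an earlier one.
func newSettings(opts ...Option) settings {
	s := settings{batchSize: 1000, workers: 1, compressed: false}
	for _, opt := range opts {
		opt(&s)
	}
	return s
}

func main() {
	s := newSettings(WithBatchSize(1024), WithCompression(true))
	fmt.Printf("%+v\n", s)
}
```

Because the caller only ever receives the finished value, setters compose freely and a function that takes `...Option` can gain new settings without changing its signature.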

🧵 Streaming & Chunking Primitives

| Type/Function | Purpose |
| --- | --- |
| `PieceOfChunk`, `FileChunk` | Rope-based buffers for zero-copy streaming |
| `ReadFileChunk()` | Chunk file by record boundaries (not fixed size) |
| `EndOfLastFastaEntry`, `EndOfLastFastqEntry` | Find last complete record in buffer (for safe splitting) |
| `ropeScanner`, `_readline__` | Line-by-line scanner over ropes (no full materialization) |
| `WriteFileChunk()` | Ordered, thread-safe chunk reassembly |

Designed for large files: a file is never loaded in full, and chunks are split only at valid record boundaries.
Integrates with obiiter for push-style streaming iterators.
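
Splitting only at record boundaries can be illustrated for FASTA with a small sketch. `lastFastaRecordStart` is a hypothetical helper, not the `EndOfLastFastaEntry` implementation: it looks for the last line that starts with `>`, so everything before it consists of complete records and the remainder is carried over to the next chunk.

```go
package main

import (
	"bytes"
	"fmt"
)

// lastFastaRecordStart (hypothetical name) returns the offset of the last
// line starting with '>'. buf[:offset] then holds only complete records and
// buf[offset:] is the possibly truncated last record. It returns -1 when the
// buffer contains no header line at all.
func lastFastaRecordStart(buf []byte) int {
	if i := bytes.LastIndex(buf, []byte("\n>")); i >= 0 {
		return i + 1 // skip the newline, point at '>'
	}
	if len(buf) > 0 && buf[0] == '>' {
		return 0 // a single record starting at the top of the buffer
	}
	return -1
}

func main() {
	chunk := []byte(">s1\nACGT\n>s2\nGGCC\n>s3\nTT")
	cut := lastFastaRecordStart(chunk)
	fmt.Printf("complete:   %q\n", chunk[:cut])
	fmt.Printf("carry over: %q\n", chunk[cut:])
}
```

The same trick is less direct for FASTQ, because `@` is also a legal quality character, so a line starting with `@` is not necessarily a header.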

🔍 Format Detection & Discovery

| Function | Role |
| --- | --- |
| `OBIMimeTypeGuesser`, `NGSFilterCsvDetector` | Content-based MIME detection (e.g., FASTA via `>`, EcoPCR via `#@ecopcr-v2`) |
| `DetectTaxonomyFormat` | Detects NCBI dump, CSV, FASTA/FASTQ as taxonomy sources |
| `OBIMimeNGSFilterTypeGuesser` | Distinguishes legacy vs. CSV NGS filter configs |

Uses github.com/gabriel-vasile/mimetype for robust format sniffing.
Preserves unread bytes to allow downstream parsers.
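
The idea of sniffing a format while preserving unread bytes can be sketched with the standard library alone, using `bufio.Reader.Peek` instead of the mimetype package. The functions `guessFormat` and `sniff` are hypothetical; the `>` and `#@ecopcr-v2` signatures come from the table above, while the FASTQ, EMBL, and GenBank prefixes are assumptions based on how those formats conventionally begin.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// guessFormat (hypothetical name) inspects the first bytes of the stream
// with Peek, which does not advance the reader, and returns a format label.
func guessFormat(r *bufio.Reader) string {
	head, _ := r.Peek(64) // on a short stream Peek still returns what is available
	switch {
	case bytes.HasPrefix(head, []byte("#@ecopcr-v2")):
		return "ecopcr"
	case bytes.HasPrefix(head, []byte(">")):
		return "fasta"
	case bytes.HasPrefix(head, []byte("@")): // assumed FASTQ signature
		return "fastq"
	case bytes.HasPrefix(head, []byte("ID   ")): // assumed EMBL signature
		return "embl"
	case bytes.HasPrefix(head, []byte("LOCUS ")): // assumed GenBank signature
		return "genbank"
	default:
		return "unknown"
	}
}

// sniff is a demo helper: it guesses the format of data, then reads the
// whole stream back to show that detection consumed nothing.
func sniff(data string) (format string, rest string) {
	r := bufio.NewReader(strings.NewReader(data))
	format = guessFormat(r)
	all, _ := io.ReadAll(r)
	return format, string(all)
}

func main() {
	format, rest := sniff(">s1\nACGT\n")
	fmt.Printf("%s %q\n", format, rest)
}
```

After detection, the same `*bufio.Reader` can be handed to the matching parser, which sees the stream from its first byte.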

📋 Specialized Parsers & Writers

ReadCSVFromStdin, _ParseFastqFile: Convenience wrappers for stdin/file I/O.
JSONRecord(), FormatFastaBatch(): Optimized serialization with minimal allocations.
_parse_json_* helpers: High-performance JSON parsing using jsonparser.
WriteFastaToFile, _UnescapeUnicodeCharactersInJSON(): Robust output handling.

Design Principles

Streaming First: All parsers return obiiter.IBioSequence — lazy, batched iterators.
Functional Abstraction: Format handling via IBatchReader, FormatHeader — decoupled from core logic.
Extensibility: New formats added via ReadSequencesFromFile() extension points and MIME registration.
Fail-Safe Defaults: Empty files → empty iterator; missing root taxon → fatal error.
Ordered Semantics: Despite parallelism, batches preserve global order via atomic counters (nextCounter).
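
Order preservation under parallelism can be sketched as below. Note that this sketch uses a different mechanism from the atomic counter mentioned above: each batch carries its input index, and a single goroutine buffers early arrivals in a map until the next expected index shows up. The names `batch`, `reorder`, and `processInOrder` are hypothetical, and upper-casing a string stands in for real parsing work.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// batch is a hypothetical unit of work tagged with its position in the input.
type batch struct {
	order int
	data  string
}

// reorder receives batches in arbitrary order and forwards them in
// increasing order, holding back any batch that arrives too early.
func reorder(in <-chan batch, out chan<- batch) {
	pending := make(map[int]batch)
	next := 0
	for b := range in {
		pending[b.order] = b
		for {
			nb, ok := pending[next]
			if !ok {
				break
			}
			delete(pending, next)
			out <- nb
			next++
		}
	}
	close(out)
}

// processInOrder runs the work on several goroutines (workers must be at
// least 1) and returns the results in input order.
func processInOrder(inputs []string, workers int) []string {
	jobs := make(chan batch)
	done := make(chan batch)
	ordered := make(chan batch)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				done <- batch{order: j.order, data: strings.ToUpper(j.data)}
			}
		}()
	}
	go func() { // feed the workers
		for i, s := range inputs {
			jobs <- batch{order: i, data: s}
		}
		close(jobs)
	}()
	go func() { // close the result channel once every worker has finished
		wg.Wait()
		close(done)
	}()
	go reorder(done, ordered)

	results := make([]string, 0, len(inputs))
	for b := range ordered {
		results = append(results, b.data)
	}
	return results
}

func main() {
	fmt.Println(processInOrder([]string{"a", "b", "c", "d", "e"}, 3))
}
```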

Integration Highlights

Dependencies: Uses obiseq, obiiter, and obitax for core data models, plus utility packages (obiutils, obidefault).
Logging: Structured logs via logrus (format detection, errors, progress).
Error Handling: Panics on unrecoverable issues; graceful fallbacks (e.g., ReadEmptyFile).
Performance: Rope-based parsing, zero-copy string conversion where possible (unsafe.String), and buffered writes.

✅ obiformats enables scalable, reproducible NGS data processing — from raw ingestion to structured export.
