Mirror of https://github.com/metabarcoding/obitools4.git (synced 2026-04-30 03:50:39 +00:00)
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
# Semantic Description of `ReadSequencesBatchFromFiles`

This function implements **concurrent, batched streaming** of biological sequences from multiple input files.

## Core Functionality

- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.

## Concurrency Model

- Launches `concurrent_readers` goroutines to process files in parallel.
- Files are distributed via a shared channel (`filenameChan`), ensuring fair load balancing.

## Streaming Interface

- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.

## Error Handling & Logging

- Panics on file-open failure (via `log.Panicf`).
- Logs the start and end of reading for each file using structured logging (`log.Printf`, `log.Println`).

## Resource Management

- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.

## Design Intent

- Enables scalable, memory-efficient ingestion of large NGS datasets.
- Decouples *reading logic* (via `IBatchReader`) from orchestration, supporting pluggable formats.
- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.

## Key Abstractions

| Type/Interface | Role |
|----------------|------|
| `IBatchReader` | Reader factory: `(filename, options...) → SequenceIterator` |
| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
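The concurrency model above (a shared filename channel, N reader goroutines, an atomic counter numbering batches, and a finalizer that closes the stream) can be sketched with stdlib primitives alone. All names here (`readBatches`, `batch`, the `read` callback) are illustrative stand-ins, not the actual obitools4 code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// batch pairs a globally ordered ID with a slice of records.
type batch struct {
	id   int64
	recs []string
}

// readBatches fans filenames out to nReaders goroutines over a shared
// channel, numbers every emitted batch with an atomic counter, and
// closes the output once all readers are done (barrier pattern).
func readBatches(files []string, nReaders int, read func(string) []string) <-chan batch {
	filenameChan := make(chan string)
	out := make(chan batch)
	var next int64 = -1
	var wg sync.WaitGroup

	for i := 0; i < nReaders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range filenameChan {
				out <- batch{id: atomic.AddInt64(&next, 1), recs: read(name)}
			}
		}()
	}
	go func() { // feed the shared channel: fair, pull-based load balancing
		for _, f := range files {
			filenameChan <- f
		}
		close(filenameChan)
	}()
	go func() { // finalizer: wait for all readers, then close the stream
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	fake := func(name string) []string { return []string{name + ":seq1", name + ":seq2"} }
	total := 0
	for b := range readBatches([]string{"a.fasta", "b.fasta", "c.fasta"}, 2, fake) {
		total += len(b.recs)
		_ = b.id
	}
	fmt.Println(total)
}
```

The atomic counter guarantees batch IDs are unique and sequential even though arrival order is nondeterministic, which is exactly what a downstream reordering step relies on.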
# `obiformats` Package — Semantic Overview

The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.

## Core Abstraction

- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:

  ```go
  func(string, ...WithOption) (obiiter.IBioSequence, error)
  ```

- It accepts:
  - A file path (`string`)
  - Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
- It returns:
  - An iterator over biological sequences (`obiiter.IBioSequence`)
  - Or an error if the file cannot be opened/parsed

## Semantic Intent

- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, SAM/BAM, etc., via a single entry point.
- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
- **Enables lazy, streaming access**: Sequences are yielded on demand via the iterator, keeping memory use low for large datasets.

## Typical Usage Pattern

1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
2. Call it with a file path and optional options.
3. Iterate over the returned `IBioSequence` to process sequences one by one.

## Design Principles

- **Functional, minimal API**: Single responsibility: reading and iteration.
- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.

> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
# CSV Import Module for Biological Sequences (`obiformats`)

This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBITools4 framework.

## Core Features

- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to the corresponding biological sequence fields.
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
- **Metadata Handling**:
  - Special handling for taxonomic IDs (`taxid`, `*_taxid`).
  - Generic attributes are parsed as JSON when possible, falling back to the raw string otherwise.
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
- **Multiple Entry Points**:
  - `ReadCSV`: From any `io.Reader`.
  - `ReadCSVFromFile`: Loads from a file (with source naming derived from the filename).
  - `ReadCSVFromStdin`: Reads from standard input.
- **Error & Edge Handling**:
  - Gracefully handles empty files/streams via `ReadEmptyFile`.
  - Uses structured logging (Logrus) for fatal and informational messages.

## Integration

Designed to integrate with OBITools4’s core types:

- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.

## Use Case

Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
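The parsing behavior described above (comment lines, flexible field counts, leading-space trimming, header-driven column mapping) maps directly onto Go's `encoding/csv` configuration. This is a minimal stdlib-only sketch; `rec` and `parseSequenceCSV` are hypothetical stand-ins for the package's real types:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// rec is a minimal stand-in for obiseq.BioSequence.
type rec struct {
	ID, Sequence string
	Attrs        map[string]string
}

// parseSequenceCSV configures encoding/csv the way the module
// describes: '#' comment lines, variable field counts, and
// leading-space trimming. Columns named "id" and "sequence" map to
// dedicated fields; everything else lands in the attribute map.
func parseSequenceCSV(data string) ([]rec, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.Comment = '#'
	r.FieldsPerRecord = -1 // tolerate variable field counts
	r.TrimLeadingSpace = true

	rows, err := r.ReadAll()
	if err != nil || len(rows) == 0 {
		return nil, err
	}
	header := rows[0]
	out := make([]rec, 0, len(rows)-1)
	for _, row := range rows[1:] {
		s := rec{Attrs: map[string]string{}}
		for i, v := range row {
			if i >= len(header) {
				break
			}
			switch header[i] {
			case "id":
				s.ID = v
			case "sequence":
				s.Sequence = v
			default:
				s.Attrs[header[i]] = v
			}
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	csvText := "# comment line\nid,sequence,count\nseq1,acgt,3\n"
	recs, _ := parseSequenceCSV(csvText)
	fmt.Println(recs[0].ID, recs[0].Sequence, recs[0].Attrs["count"])
}
```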
# CSVSequenceRecord Function Description

The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.

## Core Features

- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
- **Taxonomic Information**: Adds both the NCBI taxid and the scientific name (retrieved from attributes, with `opt.CSVNAValue()` as fallback).
- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends the corresponding attribute values (or NA if missing).
- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
- **Quality Scores**: Converts Phred quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.

## Design Highlights

- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
- Handles missing data consistently via `opt.CSVNAValue()`.
- Supports both standard and user-defined metadata fields.
- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.

This function enables interoperable, configurable export of sequence data to tabular formats.
# `CSVTaxaIterator` Function — Semantic Description

The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.

### Core Functionality:

- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.

### Configurable Output Fields (via options):

- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
- `taxid`: Either the raw node ID (e.g., string pointer) or the formatted taxon path (`WithRawTaxid` toggle).
- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
- `scientific_name`: Full scientific name of the taxon.
- Custom metadata fields: Specified via `WithMetadata`, extracted from the taxon metadata store.
- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).

### Implementation Highlights:

- Uses **goroutines** for non-blocking pushes of batches and clean shutdown (`WaitAndClose`, `Done`).
- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
- Dynamically builds CSV headers based on the selected options before processing begins.

### Use Case:

Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
## CSV Taxonomy Loader for OBITools4

This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.

### Key Features:

- **Robust CSV Parsing**: Uses Go’s `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
- **Column Mapping**: Dynamically identifies the required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
- **Error Handling**: Validates the presence of all required columns; fails early with descriptive errors.
- **Taxonomy Construction**:
  - Builds a hierarchical taxonomy using `obitax.Taxon` objects.
  - Ensures the existence of a root node; returns an error otherwise.
- **Metadata Extraction**:
  - Derives the taxonomy name and short code (e.g., the prefix before `:` in the first taxid).
  - Logs key metadata for traceability.
- **Scalable Design**:
  - Processes records line by line (memory-efficient).
  - Supports large datasets via streaming CSV reading.

### Input Format:

The CSV must contain exactly four columns (case-sensitive headers):

- `taxid`: Unique taxon identifier.
- `parent`: Parent taxonomic node ID (empty for root).
- `scientific_name`: Binomial or descriptive name.
- `taxonomic_rank`: e.g., *species*, *genus*.

### Output:

Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
# Semantic Description of `obiformats.WriterDispatcher`

The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.

- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations such as sample + region), parsing JSON-encoded classifier values when needed.
- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.

In summary, `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
# EcoPCR File Parser for Biological Sequences

This Go package (`obiformats`) provides functionality to parse EcoPCR output files: `|`-delimited, CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.

## Key Features

- **Version Detection**: Automatically detects the EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies the amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
  - Name (with deduplication support)
  - Nucleotide/protein sequence
  - Comment field
- **Structured Annotation**: Populates rich annotations including:
  - Taxonomic hierarchy (taxid, rank, species/genus/family names)
  - Primer matching info (`forward_match`, `reverse_mismatch`)
  - Melting temperatures (if present in v2)
  - Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and a `ReadEcoPCRFromFile` convenience function.

## Implementation Highlights

- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with the `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent, goroutine-based streaming to decouple I/O and processing.

This module integrates with the broader *OBITools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
# EMBL Format Parser for OBITools4

This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:

- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
- **Two Parsing Modes**:
  - `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
  - `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
- **Configurable Options**:
  - `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
  - `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and the taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
- **Sequence Handling**: Parses multi-line EMBL sequences (10 bases per group, with position numbers), skipping digits and whitespace.
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
- **Integration**: Outputs are compatible with OBITools4’s iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.

Designed for scalability, the module handles large EMBL files efficiently, making it well suited to metagenomic and biodiversity data pipelines.
## `ReadEmptyFile` Function — Semantic Description

- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
- **Signature**:
  `func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
- **Behavior**:
  - Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
  - Immediately closes the stream using `.Close()`, indicating that no data will be yielded.
- **Output**:
  - Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
  - The error return is always `nil`, since no I/O occurs and the operation is deterministic.

### Semantic Role & Use Cases

- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.

### Design Notes

- Reflects a *fail-safe* pattern: no side effects, deterministic behavior.
- Aligns with the iterator-based I/O design principles of OBITools4 (lazy, composable streams).
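The create-then-close idea is the same trick as returning a closed channel in plain Go: consumers can range over it safely and get zero elements. A minimal sketch, with `seqIter` standing in for the real iterator type:

```go
package main

import "fmt"

// seqIter is a minimal stand-in for obiiter.IBioSequence: a channel of
// sequence batches that consumers range over.
type seqIter struct{ ch chan []string }

// emptyIter mirrors the ReadEmptyFile pattern: build a well-formed
// iterator and close it immediately, so callers can iterate without
// nil checks, and the error is always nil.
func emptyIter() (seqIter, error) {
	it := seqIter{ch: make(chan []string)}
	close(it.ch) // terminal: ranging yields zero batches
	return it, nil
}

func main() {
	it, err := emptyIter()
	n := 0
	for range it.ch { // loop body never runs
		n++
	}
	fmt.Println(n, err == nil)
}
```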
# FASTA Parser Module (`obiformats`)

This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.

## Core Functionalities

- **`FastaChunkParser(UtoT bool)`**
  Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.

- **`FastaChunkParserRope(...)`**
  Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.

- **`ReadFasta(reader io.Reader, ...)`**
  High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.

- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
  Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.

- **`EndOfLastFastaEntry(...)`**
  Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.

## Key Features

- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and use only the allowed sequence characters (`a-z`, `- . [ ]`).
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
- **Whitespace handling**: Ignores spaces/tabs in sequences; preserves line breaks only for parsing structure.
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.

## Design Highlights

- Minimal allocations via rope-based parsing (`extractFastaSeq`).
- Graceful error reporting with context (source, identifier, invalid character position).
- Extensible via the `WithOption` pattern for header parsing and batching behavior.
# FASTQ Parsing Module (`obiformats`)

This Go package provides robust, streaming-capable parsing of FASTQ files, a standard format for storing nucleotide sequences along with quality scores.

## Core Functionalities

- **`EndOfLastFastqEntry(buffer []byte) int`**
  Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry is found.

- **`FastqChunkParser(...)`**
  Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
  - Header parsing (`@id [definition]`)
  - Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
  - Quality score shifting (`quality_shift`)
  - Strict validation (e.g., the `+` line, matching sequence/quality lengths)

- **`FastqChunkParserRope(...)`**
  Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.

- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
  Enables concurrent, chunked parsing of large files:
  - Splits input into chunks using `ReadFileChunk`
  - Uses configurable parallel workers (`nworker`)
  - Pushes parsed batches to an iterator interface

- **Convenience I/O Wrappers**
  - `ReadFastqFromFile(filename, ...)`: Parses a file by name.
  - `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.

## Key Options & Features

- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
- **Uracil-to-thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation

## Design Highlights

- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid characters, length mismatches)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
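The validation rules listed for the chunk parser (an `@` header, a `+` separator, equal sequence/quality lengths, case normalization, and a quality shift) can be shown on a single four-line record. This is an illustrative sketch, not the obitools4 parser:

```go
package main

import (
	"fmt"
	"strings"
)

// fastqRecord holds one parsed entry; Quals are numeric Phred scores.
type fastqRecord struct {
	ID, Seq string
	Quals   []int
}

// parseFastqRecord validates one 4-line FASTQ record: '@' header,
// '+' separator, equal sequence and quality lengths, and applies a
// configurable quality shift (33 for Sanger encoding).
func parseFastqRecord(lines []string, shift int) (fastqRecord, error) {
	var r fastqRecord
	if len(lines) != 4 || !strings.HasPrefix(lines[0], "@") {
		return r, fmt.Errorf("malformed record header")
	}
	if !strings.HasPrefix(lines[2], "+") {
		return r, fmt.Errorf("missing '+' separator line")
	}
	if len(lines[1]) != len(lines[3]) {
		return r, fmt.Errorf("sequence/quality length mismatch")
	}
	fields := strings.Fields(lines[0][1:]) // "@id [definition]"
	if len(fields) == 0 {
		return r, fmt.Errorf("empty identifier")
	}
	r.ID = fields[0]
	r.Seq = strings.ToLower(lines[1]) // case normalization
	for _, c := range []byte(lines[3]) {
		r.Quals = append(r.Quals, int(c)-shift)
	}
	return r, nil
}

func main() {
	rec, err := parseFastqRecord([]string{"@seq1 sample=A", "ACGT", "+", "IIII"}, 33)
	fmt.Println(rec.ID, rec.Seq, rec.Quals, err)
}
```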
## Semantic Description of `obiformats` Package

The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:

- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.
- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.

Two main constructor functions enable flexible formatting:

- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.
- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.

The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors; unsupported formats, for example, trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, which makes it well suited to pipeline integration in NGS data-processing tools.
# Semantic Description of `obiformats` Package

The `obiformats` package provides utilities for parsing sequence headers in the OBITools4 framework, supporting two distinct formats:

- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
- **Legacy OBI format** (plain text, e.g., `>seq1 description`): Used when no JSON prefix is present.

## Core Functions

- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**
  Dynamically routes header parsing based on the first character of the sequence definition:
  - Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
  - Otherwise invokes `ParseFastSeqOBIHeader`.

- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**
  Applies header parsing to a *batch* of sequences:
  - Takes an iterator over `BioSequence`s.
  - Uses optional configuration (e.g., parallelism, parsing behavior).
  - Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.

## Design Principles

- **Format agnosticism**: Automatically detects the header type.
- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
- **Extensibility**: The options pattern (`WithOption`) supports runtime customization.

This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
# `FormatHeader` Function Type in `obiformats`

The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.

- **Package**: `obiformats`
  Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).

- **Type Definition**:

  ```go
  type FormatHeader func(sequence *obiseq.BioSequence) string
  ```

  A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.

- **Semantic Role**:
  Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.
  Decouples header formatting from the core data structures (`BioSequence`), enabling modular and reusable format adapters.

- **Usage Context**:
  - Used by writers/formatters to produce standardized headers when exporting sequences.
  - Allows custom header generation (e.g., for MIxS-compliant metadata or user-defined tags).
  - Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.

- **Dependencies**:
  - Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).

- **Design Intent**:
  Promotes a clean separation of concerns: data (sequence) ↔ formatting logic.
  Facilitates extensibility to new output formats without modifying core types.
This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.

- **JSON Parsing Helpers**:
  It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.

- **Header Interpretation**:
  `_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
  - Core fields (`id`, `definition`, `count`)
  - Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
  - Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.

- **Sequence Annotation Enrichment**:
  `ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing the non-JSON text as the new definition.

- **Serialization Support**:
  `WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON, appending them to a buffer or returning them as a string, enabling round-trip compatibility for annotated sequences.

- **Error Handling**:
  Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.

In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
@@ -0,0 +1,31 @@
|
||||
# OBIFormats Package: Semantic Description
|
||||
|
||||
The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
|
||||
|
||||
- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
- Four core parsing helpers detect keys and value types:
  - `__match__key__`: Identifies assignment patterns (`Key = ...`).
  - `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).
  - `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).
  - `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
- Boolean detection (`__is_true__`/`__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`**, iteratively extracts key-value pairs from a header string and populates an `Annotation` map.
  - Numeric values are stored as integers if they have no fractional part.
  - Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:
    - `*_count` → `map[string]int`,
    - `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`),
    - `*_status`/`*_mutation` → `map[string]string`.
- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence’s definition line, moving annotations into its metadata map and preserving leftover text.
- **`WriteFastSeqOBIHeader(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:
  - Strings and booleans use `key=value;`.
  - Maps/dicts are JSON-encoded, then single-quoted for compatibility.
  - Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
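The `key=value;` extraction described above can be sketched with a single simplified regex standing in for the package's separate key, numeric, string, and boolean patterns. All names here are illustrative, not the package's API, and nested dictionaries are omitted:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// keyValRe is a simplified stand-in for the package's key/value patterns:
// it captures `key = value;` pairs (real OBI headers also allow nested dicts).
var keyValRe = regexp.MustCompile(`([A-Za-z_][A-Za-z0-9_]*)\s*=\s*([^;]+);`)

// parseOBIHeaderLite converts matched values to bool, int, float64 or string,
// mirroring the boolean/numeric/string detection described above.
func parseOBIHeaderLite(text string) map[string]interface{} {
	annotations := map[string]interface{}{}
	for _, m := range keyValRe.FindAllStringSubmatch(text, -1) {
		key, raw := m[1], strings.TrimSpace(m[2])
		switch {
		case strings.EqualFold(raw, "true"):
			annotations[key] = true
		case strings.EqualFold(raw, "false"):
			annotations[key] = false
		default:
			if i, err := strconv.Atoi(raw); err == nil {
				annotations[key] = i // integral numbers stay integers
			} else if f, err := strconv.ParseFloat(raw, 64); err == nil {
				annotations[key] = f
			} else {
				annotations[key] = strings.Trim(raw, "'\"")
			}
		}
	}
	return annotations
}

func main() {
	ann := parseOBIHeaderLite("count=3; score=1.5; sample='A01'; merged=True;")
	fmt.Println(ann["count"], ann["score"], ann["sample"], ann["merged"])
}
```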
@@ -0,0 +1,26 @@
# FastSeq Reader Module — Semantic Description

This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.

## Core Features

- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
- **Stdin & File I/O**: Two entry points:
  - `ReadFastSeqFromFile(filename, ...)` for regular files.
  - `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.

## Integration

Built on top of `obitools4`’s core abstractions:

- `obiiter.IBioSequence`: Iterator interface for biological sequences.
- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
- `obiutils`, `obidefault`: Utilities for path handling and defaults.

Designed for scalability in high-throughput metabarcoding pipelines.
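The batched-iteration model above can be sketched as a goroutine that groups sequences into fixed-size batches and streams them on a channel; the `seq` type and `batchSeqs` helper are illustrative stand-ins for `obiseq.BioSequence` and the cgo-backed reader:

```go
package main

import "fmt"

// seq is a minimal stand-in for obiseq.BioSequence (id + sequence only).
type seq struct {
	id  string
	dna string
}

// batchSeqs groups incoming sequences into slices of batchSize and streams
// them on a channel, decoupling reading from consumption as described above.
func batchSeqs(in []seq, batchSize int) <-chan []seq {
	out := make(chan []seq)
	go func() {
		defer close(out)
		for start := 0; start < len(in); start += batchSize {
			end := start + batchSize
			if end > len(in) {
				end = len(in)
			}
			out <- in[start:end]
		}
	}()
	return out
}

func main() {
	seqs := []seq{{"s1", "ACGT"}, {"s2", "GGTA"}, {"s3", "TTAA"}}
	for batch := range batchSeqs(seqs, 2) {
		fmt.Println(len(batch))
	}
}
```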
@@ -0,0 +1,35 @@
# `obiformats` Package Overview

The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.

## Core Formatting Functions

- **`FormatFasta(seq, formater)`**
  Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.

- **`FormatFastaBatch(batch, formater, skipEmpty)`**
  Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.

## File Writing Functions

- **`WriteFasta(iterator, file, options...)`**
  Writes a stream of sequences to any `io.WriteCloser`. Supports:
  - Parallel workers (`ParallelWorkers`)
  - Chunked writing via `WriteFileChunk`
  - Optional compression (e.g., gzip)

  Returns a new iterator mirroring the input for pipeline chaining.

- **`WriteFastaToStdout(iterator, options...)`**
  Convenience wrapper to output FASTA directly to `stdout`, with file-closing behavior configurable.

- **`WriteFastaToFile(iterator, filename, options...)`**
  Writes to a named file with:
  - Truncation or append mode (`AppendFile`)
  - Automatic paired-end output if `HaveToSavePaired()` is enabled (writes reverse reads to a secondary file specified via `PairedFileName`)

## Key Design Highlights

- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
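A minimal sketch of the `FormatFasta` behavior described above (header line plus sequence wrapped at 60 characters); `formatFastaLite` is a hypothetical name, not the package's function:

```go
package main

import (
	"bytes"
	"fmt"
)

// formatFastaLite renders a `>id description` header followed by the
// sequence wrapped at 60 characters per line, growing the buffer once
// up front as the package does with bytes.Buffer.Grow().
func formatFastaLite(id, description, sequence string) string {
	var buf bytes.Buffer
	buf.Grow(len(sequence) + len(sequence)/60 + len(id) + len(description) + 3)
	buf.WriteByte('>')
	buf.WriteString(id)
	if description != "" {
		buf.WriteByte(' ')
		buf.WriteString(description)
	}
	for i := 0; i < len(sequence); i += 60 {
		end := i + 60
		if end > len(sequence) {
			end = len(sequence)
		}
		buf.WriteByte('\n')
		buf.WriteString(sequence[i:end])
	}
	buf.WriteByte('\n')
	return buf.String()
}

func main() {
	fmt.Print(formatFastaLite("seq1", "demo", "ACGTACGTACGT"))
}
```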
@@ -0,0 +1,35 @@
# FASTQ Output Module (`obiformats`)

This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.

## Core Functionality

- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into a FASTQ string.
- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.

## Header Customization

- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.

## Writing to Streams/Files

- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
  - Append/truncate modes
  - Paired-end output (splits iterator and writes to two files)
  - Automatic compression via `obiutils.CompressStream`

## Parallelization & Robustness

- Uses goroutines to parallelize formatting/writing across multiple workers.
- Handles empty sequences gracefully: logs a warning or raises a fatal error depending on the `skipEmpty` option.
- Ensures ordered output via batch tracking (`Order()`) and chunked writing.

## Integration

Designed to work seamlessly with the `obitools4` ecosystem:

- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
- Extensible through functional options (`WithOption`) for configuration.

> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
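The four-line FASTQ record produced by `FormatFastq` can be sketched as follows; the helper name and the fixed Phred+33 shift are assumptions (the real code takes the shift from `obidefault.ReadQualitiesShift()`):

```go
package main

import "fmt"

// formatFastqLite renders a four-line FASTQ record: @id, sequence, "+",
// and qualities encoded as Phred+33 ASCII (an assumed fixed shift).
func formatFastqLite(id, sequence string, qualities []byte) string {
	encoded := make([]byte, len(qualities))
	for i, q := range qualities {
		encoded[i] = q + 33 // Phred score -> printable ASCII
	}
	return fmt.Sprintf("@%s\n%s\n+\n%s\n", id, sequence, encoded)
}

func main() {
	fmt.Print(formatFastqLite("r1", "ACGT", []byte{40, 40, 30, 20}))
}
```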
@@ -0,0 +1,19 @@
# `obiformats` Package Overview

The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:

- **Format Abstraction**: Defines common interfaces and base classes for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.

The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
@@ -0,0 +1,25 @@
# Semantic Description of `obiformats` Package Functionalities

The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.

- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after a full FASTQ entry), ensuring chunks split only at valid boundaries.
- **`ReadFileChunk()`**: Core function that:
  - Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
  - Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
  - Extends chunks incrementally (e.g., +1 MB) until a full record boundary is found via `splitter`;
  - Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
  - Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
- **Key semantics**:
  - *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.
  - *Lazy evaluation*: only reads ahead when needed to find record boundaries.
  - *Streaming-first design* — supports large files without full loading into memory.

This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
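A `LastSeqRecord`-style callback, as described above, can be sketched for FASTA: it reports where the last (possibly incomplete) record begins so the chunk can be cut at a valid boundary. The helper name is illustrative:

```go
package main

import (
	"bytes"
	"fmt"
)

// lastFastaRecord is a sketch of a LastSeqRecord callback: it returns the
// index where the last FASTA record begins, i.e. the position just after
// the final "\n>" marker, or -1 if no boundary exists in this buffer.
func lastFastaRecord(buf []byte) int {
	i := bytes.LastIndex(buf, []byte("\n>"))
	if i < 0 {
		return -1
	}
	return i + 1 // keep the '>' with the trailing partial record
}

func main() {
	chunk := []byte(">s1\nACGT\n>s2\nGG")
	cut := lastFastaRecord(chunk)
	// Everything before cut holds complete records; the tail is carried
	// over into the next chunk so no record is split.
	fmt.Printf("%q %q\n", chunk[:cut], chunk[cut:])
}
```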
@@ -0,0 +1,26 @@
# `WriteFileChunk` Function — Semantic Description

The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.

- **Input**:
  - `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.
  - `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.

- **Core Behavior**:
  - Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).
  - Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).
  - If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.
  - Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.

- **Buffer Management**:
  - After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., if order 2 arrives, it triggers writing of buffered orders 3, 4, ... as available).

- **Error Handling**:
  - Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.

- **Cleanup & Lifecycle**:
  - Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.
  - Returns the input channel, enabling external producers to stream `FileChunk` structs.

- **Use Case**:
  Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
@@ -0,0 +1,34 @@
# GenBank Parser Module (`obiformats`)

This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.

## Core Functionalities

- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
- **Parallel streaming I/O**:
  - `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
  - Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.

## Key Design Decisions

- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.

## Output

Returns a batched iterator of `BioSequence` objects, each containing:

- Identifier (`id`)
- Compact nucleotide sequence
- Definition line (as description)
- Source file origin
- Optional feature table bytes
- Annotations: `scientific_name`, `taxid`

Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
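The taxid capture mentioned above, from `/db_xref="taxon:..."` feature lines, can be sketched with a single regex; the helper is illustrative, not the parser's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// taxonRe matches GenBank feature lines such as `/db_xref="taxon:9606"`
// and captures the numeric taxon ID.
var taxonRe = regexp.MustCompile(`/db_xref="taxon:(\d+)"`)

// extractTaxid returns the first taxon ID found, or "" when absent.
func extractTaxid(featureLine string) string {
	if m := taxonRe.FindStringSubmatch(featureLine); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(extractTaxid(`                     /db_xref="taxon:9606"`))
}
```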
@@ -0,0 +1,27 @@
# JSON Output Module for Biological Sequences (`obiformats`)

This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.

- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
  - `"id"`: Sequence identifier.
  - `"sequence"` (optional): Nucleotide/protein sequence string if present.
  - `"qualities"` (optional): Quality scores as a string if available.
  - `"annotations"` (optional): Metadata annotations map.
- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
  - Parallel workers (configurable via options).
  - Automatic compression (`gzip`/`bgzip`) if enabled.
  - Proper JSON array wrapping: `[`, chunked batches, and final `]`.
  - Atomic ordering to preserve sequence integrity across parallel writes.
- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
  - Outputs to stdout or a file (with append/truncate control).
  - Supports paired-end data: writes both forward and reverse reads to separate files when configured.
- **Internal helpers**:
  - `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
  - Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.

Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
@@ -0,0 +1,17 @@
# NCBI Taxonomy Loader Module (`obiformats`)

This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:

- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.

Key features:

- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
- Efficient buffered reading (`bufio.Reader`) for large files.
- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).

The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
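The `|`-delimited, whitespace-trimmed parsing described above can be sketched as follows (NCBI dump rows separate fields with `\t|\t`); `parseDmpLine` is an illustrative name:

```go
package main

import (
	"fmt"
	"strings"
)

// parseDmpLine splits one dump-file row on '|' and trims the surrounding
// tabs/spaces from each field, as the custom CSV parsing described above does.
func parseDmpLine(line string) []string {
	raw := strings.Split(line, "|")
	fields := make([]string, 0, len(raw))
	for _, f := range raw {
		fields = append(fields, strings.TrimSpace(f))
	}
	return fields
}

func main() {
	// A nodes.dmp-style row: taxid | parent taxid | rank |
	fields := parseDmpLine("9606\t|\t9605\t|\tspecies\t|")
	fmt.Println(fields[0], fields[1], fields[2])
}
```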
@@ -0,0 +1,31 @@
## NCBI Taxonomy Archive Support in `obiformats`

This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.

### Core Functionalities

1. **Archive Validation (`IsNCBITarTaxDump`)**
   - Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
   - Returns a boolean indicating if the archive is a complete NCBI tax dump.

2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
   - Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
   - Steps include:
     - **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
     - **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
     - **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
   - Sets the root taxon to NCBI’s default (`taxid = 1`, i.e., *root*).

3. **Integration with Other Modules**
   - Uses `obiutils.Ropen` and `TarFileReader` for robust file handling.
   - Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.

### Key Parameters

- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
- `seqAsTaxa`: Reserved for future use; currently unused.

### Logging & Error Handling

- Uses `logrus` to log loading progress and counts.
- Returns descriptive errors if required files or the root taxon are missing.

> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
@@ -0,0 +1,31 @@
# Newick Format Export Functionality in `obiformats`

This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.

## Core Components

- `Tree`: A struct modeling a node in a Newick tree, containing:
  - `Children`: list of child nodes (nested trees),
  - `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
  - `Length`: optional branch length (evolutionary distance).

- **`Newick()` methods**:
  - `Tree.Newick(...)`: Recursively generates a Newick string for the subtree. Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
  - Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.

- **Writing Functions**:
  - `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
    - Accepts an iterator over taxa (`*obitax.ITaxon`).
    - Validates single-taxonomy input.
    - Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
  - `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
  - `WriteNewickToStdout(...)`: Outputs the Newick tree to standard output.

## Configuration Options

Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).

## Semantic Summary

The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
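The recursive `(child,child)label` rendering behind `Tree.Newick` can be sketched as follows, leaving out the optional taxid, rank, and branch-length annotations; the `node` type is an illustrative stand-in for `Tree`:

```go
package main

import (
	"fmt"
	"strings"
)

// node is a minimal stand-in for the Tree struct described above.
type node struct {
	name     string
	children []*node
}

// newick recursively renders "(child,child)label" notation; the real
// method additionally appends taxids, ranks and branch lengths.
func newick(n *node) string {
	if len(n.children) == 0 {
		return n.name
	}
	parts := make([]string, len(n.children))
	for i, c := range n.children {
		parts[i] = newick(c)
	}
	return "(" + strings.Join(parts, ",") + ")" + n.name
}

func main() {
	root := &node{name: "root", children: []*node{
		{name: "A"},
		{name: "B", children: []*node{{name: "C"}, {name: "D"}}},
	}}
	fmt.Println(newick(root) + ";")
}
```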
@@ -0,0 +1,47 @@
# NGSFilter Configuration Parser — Semantic Overview

This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats: a legacy line-based text format (`ReadOldNGSFilter`) and modern CSV-based configuration files with parameter headers.

## Core Functionality

- **Format Detection**:
  `OBIMimeNGSFilterTypeGuesser` detects the MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text. A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).

- **Dual Input Parsing**:
  - `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
    - Primer pairs (`forward`, `reverse`)
    - Tag pairs (with optional `-` for untagged direction)
    - Experiment/sample metadata
    - OBIFeatures annotations (via `ParseOBIFeatures`)
  - `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:
    `"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`.
    Additional columns are stored as annotations.

- **Parameter Configuration**:
  A rich set of `@param` lines (in CSV or legacy format) configures global/primer-specific settings:
  - `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
  - `tag_delimiter` / directional variants: Symbol separating tags in sequences
  - `matching`: Tag matching algorithm (e.g., exact, fuzzy)
  - Error tolerance: `primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches); `tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
  - Indel handling: `indels` / directional variants (`true/false`) to enable/disable indels in primer matching

- **Validation & Integrity Checks**:
  - `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
  - Duplicate tag-pair detection per marker (error on reuse).
  - Strict column/field validation with informative error messages.

- **Logging & Observability**:
  Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).

## Design Highlights

- **Extensibility**: New parameters can be added via the `library_parameter` map.
- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.

> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
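Parsing one legacy line of the shape quoted above (`EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r`) can be sketched with `strings.Cut`. This follows only that illustrative example, not the full grammar handled by `ReadOldNGSFilter`; all names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// ngsEntry holds the fields pulled from one legacy line, following the
// illustrative "EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r" shape.
type ngsEntry struct {
	experiment, sample           string
	forwardTag, reverseTag       string
	forwardPrimer, reversePrimer string
}

func parseLegacyLine(line string) (ngsEntry, error) {
	var e ngsEntry
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return e, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	head, tags, ok := strings.Cut(fields[0], ":")
	if !ok {
		return e, fmt.Errorf("missing ':' tag separator")
	}
	exp, sample, ok := strings.Cut(head, "@")
	if !ok {
		return e, fmt.Errorf("missing '@' between experiment and sample")
	}
	fwd, rev, ok := strings.Cut(tags, "-")
	if !ok {
		rev = fwd // simplification: no '-' means one tag for both directions
	}
	e = ngsEntry{exp, sample, fwd, rev, fields[1], fields[2]}
	return e, nil
}

func main() {
	e, err := parseLegacyLine("EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r")
	fmt.Println(e, err)
}
```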
@@ -0,0 +1,14 @@
# Semantic Description of `obiformats` Package Functionalities

The Go package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).

Key capabilities include:

- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).
- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).
- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).
- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).
- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion (RNA→DNA normalization).
- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.
- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).

All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
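The functional-setter pattern described above can be sketched as follows; field and setter names are illustrative, not the package's actual `Options` layout:

```go
package main

import "fmt"

// options mirrors the configuration pattern described above: a private
// struct mutated only through functional setters.
type options struct {
	batchSize       int
	parallelWorkers int
	compressed      bool
}

type withOption func(*options)

func optionsBatchSize(n int) withOption {
	return func(o *options) { o.batchSize = n }
}

func optionsCompressed(c bool) withOption {
	return func(o *options) { o.compressed = c }
}

// makeOptions applies defaults first, then each setter in order.
func makeOptions(setters []withOption) *options {
	o := &options{batchSize: 5000, parallelWorkers: 4} // assumed defaults
	for _, set := range setters {
		set(o)
	}
	return o
}

func main() {
	o := makeOptions([]withOption{optionsBatchSize(100), optionsCompressed(true)})
	fmt.Println(o.batchSize, o.parallelWorkers, o.compressed)
}
```

Callers only name the settings they change; everything else keeps its default, which is what makes the pattern composable across pipelines.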
@@ -0,0 +1,27 @@
# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure

The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *Rope* (a tree-based immutable string representation, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.

## Core Functionality

- **`newRopeScanner(rope *PieceOfChunk)`**
  Constructs a new scanner starting at the root of the rope.

- **`ReadLine() []byte`**
  Returns the next line (without the trailing `\n` or `\r\n`) as a byte slice.
  - Returns `nil` when the end of the rope is reached.
  - Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
  - The returned slice aliases rope data and is only valid until the next call.

- **`skipToNewline()`**
  Advances the internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.

## Implementation Highlights

- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope’s underlying data.

## Use Case

Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
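The carry-over logic of `ReadLine` can be sketched over a plain slice of chunks standing in for rope nodes; `chunkScanner` is illustrative, not the package's `ropeScanner`:

```go
package main

import (
	"bytes"
	"fmt"
)

// chunkScanner yields lines from a sequence of byte chunks, carrying
// partial lines across chunk boundaries in `carry` as described above.
type chunkScanner struct {
	chunks [][]byte
	carry  []byte
}

// readLine returns the next line without the trailing '\n' (or "\r\n"),
// or nil at end of input.
func (s *chunkScanner) readLine() []byte {
	for {
		if len(s.chunks) == 0 {
			if len(s.carry) > 0 {
				line := s.carry
				s.carry = nil
				return line
			}
			return nil
		}
		cur := s.chunks[0]
		if i := bytes.IndexByte(cur, '\n'); i >= 0 {
			line := append(s.carry, cur[:i]...)
			s.carry = nil
			s.chunks[0] = cur[i+1:]
			return bytes.TrimSuffix(line, []byte("\r"))
		}
		// Line spans into the next chunk: accumulate and continue.
		s.carry = append(s.carry, cur...)
		s.chunks = s.chunks[1:]
	}
}

func main() {
	s := &chunkScanner{chunks: [][]byte{[]byte("AC"), []byte("GT\nTT\n")}}
	for line := s.readLine(); line != nil; line = s.readLine() {
		fmt.Println(string(line))
	}
}
```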
@@ -0,0 +1,34 @@
# Taxonomy Loading Module (`obiformats`)

This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.

## Core Features

1. **Format Detection**
   - `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME type), filename patterns, or structure.
   - Supports:
     - NCBI Taxdump (both directory and `.tar` archive)
     - CSV files (`text/csv`)
     - FASTA/FASTQ sequences (via `mimetype` detection)

2. **Modular Loaders**
   - Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).
   - Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).

3. **Sequence-Based Taxonomy Extraction**
   - For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.

4. **Integration with OBITools Ecosystem**
   - Leverages `obitax.Taxonomy` as the canonical output structure.
   - Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.

5. **Error Handling & Logging**
   - Graceful failure with descriptive errors; informative logging via `logrus`.

## Usage Flow

```go
tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn=true, seqAsTaxa=false
```

The module enables interoperability across taxonomic data sources in metabarcoding workflows.
@@ -0,0 +1,26 @@
# OBIFORMATS Package: Semantic Description

The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.

It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:

- **FASTA** (`text/fasta`): identified by lines starting with `>`.
- **FASTQ** (`text/fastq`): detected via leading `@` characters.
- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
- **EMBL** (`text/embl`): detected by lines starting with `ID `.
- **GenBank** (`text/genbank`): identified by either `LOCUS ` or legacy `"Genetic Sequence Data Bank"` headers.
- **CSV** (`text/csv`): generic tabular support.

Core functionality is exposed through:

- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer the MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
- `ReadSequencesFromFile()`: reads sequences from a file path, infers the format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as the filename and auto-closing the stream.

Internally leverages:

- `obiutils.Ropen()` for unified file opening (including stdin handling).
- Path extension stripping and source tagging via `OptionsSource()`.
- Logging (`logrus`) for format diagnostics.
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.

The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.

Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
@@ -0,0 +1,29 @@
# `obiformats` Package: Sequence Writing Utilities

This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.

## Core Functionality

- **`WriteSequence()`**:
  Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities are present) or FASTA.
  - Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.
  - Preserves iterator state via `PushBack()` to allow chaining.

- **`WriteSequencesToStdout()`**:
  Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.

- **`WriteSequencesToFile()`**:
  Writes sequences to a specified file. Supports:
  - File creation/truncation or append mode (`OptionAppendFile()`).
  - Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.

## Design Highlights

- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on the presence of quality scores (`HasQualities()`).
- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
- **Configurable Options**: Extensible via the `WithOption` pattern (e.g., append mode, paired-end handling).

## Integration

Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
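The quality-based dispatch rule can be sketched in a few lines; `record` and `chooseFormat` are illustrative stand-ins for `obiseq.BioSequence` and the `WriteSequence` dispatcher:

```go
package main

import "fmt"

// record is a minimal stand-in for obiseq.BioSequence.
type record struct {
	id        string
	sequence  string
	qualities []byte
}

func (r record) hasQualities() bool { return len(r.qualities) > 0 }

// chooseFormat mirrors the dispatch described above: FASTQ when quality
// scores are present, FASTA otherwise.
func chooseFormat(r record) string {
	if r.hasQualities() {
		return "fastq"
	}
	return "fasta"
}

func main() {
	fmt.Println(chooseFormat(record{id: "a", sequence: "ACGT", qualities: []byte{40}}))
	fmt.Println(chooseFormat(record{id: "b", sequence: "ACGT"}))
}
```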