⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,43 @@
# Semantic Description of `ReadSequencesBatchFromFiles`
This function implements **concurrent, batched streaming** of biological sequences from multiple input files.
## Core Functionality
- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.
## Concurrency Model
- Launches `concurrent_readers` goroutines to process files in parallel.
- Files are distributed via a shared channel (`filenameChan`) — ensuring fair load balancing.
## Streaming Interface
- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.
## Error Handling & Logging
- Panics on file-open failure (via `log.Panicf`).
- Logs start/end of reading per file using structured logging (`log.Printf`, `log.Println`).
## Resource Management
- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.
## Design Intent
- Enables scalable, memory-efficient ingestion of large NGS datasets.
- Decouples *reading logic* (via `IBatchReader`) from orchestration — supporting pluggable formats.
- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.
## Key Abstractions
| Type/Interface | Role |
|----------------|------|
| `IBatchReader` | Reader factory: `(filename, options...) → SequenceIterator` |
| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
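The concurrency model above can be sketched in plain Go. This is an illustrative stand-in, not the library's code: `runReaders` plays the role of the orchestrator, the `readFile` callback stands in for `ReadSequencesFromFile`, and the atomic counter mirrors `nextCounter`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runReaders sketches the pattern: concurrentReaders goroutines pull
// file names from a shared channel (natural load balancing) and an
// atomic counter hands out globally unique, ordered batch IDs.
func runReaders(files []string, concurrentReaders int,
	readFile func(string) string) map[int64]string {

	// Files are distributed to workers through a shared channel.
	filenameChan := make(chan string)
	go func() {
		for _, f := range files {
			filenameChan <- f
		}
		close(filenameChan)
	}()

	var nextCounter atomic.Int64 // nextCounter analogue
	var wg sync.WaitGroup
	var mu sync.Mutex
	batches := make(map[int64]string)

	for i := 0; i < concurrentReaders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // barrier: each reader signals completion
			for name := range filenameChan {
				id := nextCounter.Add(1) - 1 // unique batch ID
				mu.Lock()
				batches[id] = readFile(name)
				mu.Unlock()
			}
		}()
	}
	wg.Wait() // finalizer waits for all readers (WaitAndClose analogue)
	return batches
}

func main() {
	b := runReaders([]string{"a.fasta", "b.fasta", "c.fasta"}, 2,
		func(n string) string { return "batch:" + n })
	fmt.Println(len(b), "batches read")
}
```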
@@ -0,0 +1,36 @@
# `obiformats` Package — Semantic Overview
The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.
## Core Abstraction
- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:
```go
func(string, ...WithOption) (obiiter.IBioSequence, error)
```
- It accepts:
- A file path (`string`)
- Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
- Returns:
- An iterator over biological sequences (`obiiter.IBioSequence`)
- Or an error if the file cannot be opened/parsed
## Semantic Intent
- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, SAM/BAM, etc., via a single entry point.
- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
- **Enables lazy, streaming access**: Sequences are yielded on-demand via the iterator—memory-efficient for large datasets.
## Typical Usage Pattern
1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
2. Call it with a file path and optional options.
3. Iterate over the returned `IBioSequence` to process sequences one-by-one.
## Design Principles
- **Functional, minimal API**: Single responsibility—reading and iteration.
- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.
> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
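The contract can be illustrated with local analogues. Everything here is a sketch: `Option`, `WithSkipEmpty`, and `fastaReader` are hypothetical stand-ins for `WithOption` and a format-specific reader, and a plain `[]string` replaces `obiiter.IBioSequence`.

```go
package main

import "fmt"

// Local analogues of the obiformats types (illustrative only).
type Option func(*config)
type config struct{ skipEmpty bool }

// WithSkipEmpty is a hypothetical variadic option.
func WithSkipEmpty() Option { return func(c *config) { c.skipEmpty = true } }

// BatchReader mirrors the IBatchReader contract:
// (filename, options...) -> iterator (here, a slice) or error.
type BatchReader func(string, ...Option) ([]string, error)

// fastaReader is a stub conforming to the contract; a real FASTQ or
// SAM reader would plug in the same way without changing callers.
func fastaReader(name string, opts ...Option) ([]string, error) {
	cfg := config{}
	for _, o := range opts {
		o(&cfg)
	}
	return []string{"seq1-from-" + name, "seq2-from-" + name}, nil
}

func main() {
	var reader BatchReader = fastaReader // clients see only the contract
	seqs, err := reader("sample.fasta", WithSkipEmpty())
	if err != nil {
		panic(err)
	}
	fmt.Println(seqs)
}
```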
+30
View File
@@ -0,0 +1,30 @@
# CSV Import Module for Biological Sequences (`obiformats`)
This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.
## Core Features
- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
- **Metadata Handling**:
- Special handling for taxonomic IDs (`taxid`, `*_taxid`).
- Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
- **Multiple Entry Points**:
- `ReadCSV`: From any `io.Reader`.
- `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
- `ReadCSVFromStdin`: Reads from standard input.
- **Error & Edge Handling**:
- Gracefully handles empty files/streams via `ReadEmptyFile`.
- Uses structured logging (Logrus) for fatal and informational messages.
## Integration
Designed to integrate with OBItools4's core types:
- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
## Use Case
Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
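The CSV reader configuration described above maps directly onto Go's standard `encoding/csv`. A minimal, self-contained sketch (column discovery reduced to `id`/`sequence`; the real module handles qualities, taxids, and JSON attributes):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseSequenceCSV sketches the described setup: '#' comment lines,
// flexible field counts, leading-space trimming, and header-based
// discovery of the "id" and "sequence" columns.
func parseSequenceCSV(data string) (map[string]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.Comment = '#'        // skip comment lines
	r.FieldsPerRecord = -1 // flexible field counts
	r.TrimLeadingSpace = true

	rows, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	idCol, seqCol := -1, -1
	for i, name := range rows[0] { // header row
		switch name {
		case "id":
			idCol = i
		case "sequence":
			seqCol = i
		}
	}
	seqs := make(map[string]string)
	for _, row := range rows[1:] {
		seqs[row[idCol]] = row[seqCol]
	}
	return seqs, nil
}

func main() {
	seqs, err := parseSequenceCSV("# demo input\nid,sequence\ns1,acgt\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(seqs["s1"])
}
```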
@@ -0,0 +1,22 @@
# CSVSequenceRecord Function Description
The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.
## Core Features
- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
- **Taxonomic Information**: Adds both NCBI taxid and scientific name (retrieved from attributes, falling back to `opt.CSVNAValue()` when absent).
- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends corresponding attribute values (or NA if missing).
- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
- **Quality Scores**: Converts Phred-quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.
## Design Highlights
- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
- Handles missing data consistently via `opt.CSVNAValue()`.
- Supports both standard and user-defined metadata fields.
- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.
This function enables interoperable, configurable export of sequence data to tabular formats.
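The quality-score step can be shown in isolation. A minimal sketch of the Phred-to-ASCII conversion, with the shift parameter playing the role of `obidefault.WriteQualitiesShift()` (33 for Sanger/Illumina 1.8+):

```go
package main

import "fmt"

// qualitiesToString converts numeric Phred scores to an ASCII string
// by adding a configurable shift to each score.
func qualitiesToString(quals []byte, shift byte) string {
	out := make([]byte, len(quals))
	for i, q := range quals {
		out[i] = q + shift
	}
	return string(out)
}

func main() {
	// Phred 0, 10, 40 with shift 33 encode as '!', '+', 'I'.
	fmt.Println(qualitiesToString([]byte{0, 10, 40}, 33))
}
```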
@@ -0,0 +1,24 @@
# `CSVTaxaIterator` Function — Semantic Description
The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.
### Core Functionality:
- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.
### Configurable Output Fields (via options):
- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
- `taxid`: Either raw node ID (e.g., string pointer) or formatted taxon path (`WithRawTaxid` toggle).
- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
- `scientific_name`: Full scientific name of the taxon.
- Custom metadata fields: Specified via `WithMetadata`, extracted from taxon metadata store.
- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).
### Implementation Highlights:
- Uses **goroutines** for non-blocking push of batches and clean shutdown (`WaitAndClose`, `Done`).
- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
- Dynamically builds CSV headers based on selected options before processing begins.
### Use Case:
Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
@@ -0,0 +1,27 @@
## CSV Taxonomy Loader for OBITools4
This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.
### Key Features:
- **Robust CSV Parsing**: Uses Go's `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
- **Column Mapping**: Dynamically identifies required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
- **Error Handling**: Validates presence of all required columns; fails early with descriptive errors.
- **Taxonomy Construction**:
- Builds a hierarchical taxonomy using `obitax.Taxon` objects.
- Ensures existence of a root node; returns error otherwise.
- **Metadata Extraction**:
- Derives taxonomy name and short code (e.g., prefix before `:` in first taxid).
- Logs key metadata for traceability.
- **Scalable Design**:
- Processes records line-by-line (memory-efficient).
- Supports large datasets via streaming CSV reading.
### Input Format:
The CSV must contain the following four columns (case-sensitive headers):
- `taxid`: Unique taxon identifier.
- `parent`: Parent taxonomic node ID (empty for root).
- `scientific_name`: Binomial or descriptive name.
- `taxonomic_rank`: e.g., *species*, *genus*.
### Output:
Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
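The fail-early column validation can be sketched on its own (illustrative, not the actual implementation):

```go
package main

import "fmt"

// mapColumns locates the four required headers and fails early with a
// descriptive error when one is missing, as described above.
func mapColumns(header []string) (map[string]int, error) {
	required := []string{"taxid", "parent", "scientific_name", "taxonomic_rank"}
	cols := make(map[string]int)
	for i, name := range header {
		cols[name] = i
	}
	for _, name := range required {
		if _, ok := cols[name]; !ok {
			return nil, fmt.Errorf("required column %q not found in CSV header", name)
		}
	}
	return cols, nil
}

func main() {
	cols, err := mapColumns([]string{"taxid", "parent", "scientific_name", "taxonomic_rank"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cols)
}
```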
@@ -0,0 +1,14 @@
# Semantic Description of `obiformats.WriterDispatcher`
The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.
- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations like sample + region), parsing JSON-encoded classifier values when needed.
- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.
In summary: `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
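The file-naming rule can be sketched as follows. The `%s` placeholder convention is an assumption for illustration; the actual prototype substitution syntax may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// outputPath sketches the naming rule: the classification key is
// substituted into the prototype name ("%s" placeholder assumed here)
// and ".gz" is appended when compression is enabled.
func outputPath(prototype, key string, compressed bool) string {
	name := strings.ReplaceAll(prototype, "%s", key)
	if compressed {
		name += ".gz"
	}
	return name
}

func main() {
	fmt.Println(outputPath("reads_%s.fasta", "sampleA", true))
}
```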
@@ -0,0 +1,29 @@
# EcoPCR File Parser for Biological Sequences
This Go package (`obiformats`) provides functionality to parse EcoPCR output files—pipe-delimited CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.
## Key Features
- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
- Name (with deduplication support)
- Nucleotide/protein sequence
- Comment field
- **Structured Annotation**: Populates rich annotations including:
- Taxonomic hierarchy (taxid, rank, species/genus/family names)
- Primer matching info (`forward_match`, `reverse_mismatch`)
- Melting temperatures (if present in v2)
- Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
## Implementation Highlights
- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent goroutine-based streaming to decouple I/O and processing.
This module integrates with the broader *OBItools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
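The name-deduplication scheme can be shown in isolation. A minimal sketch assuming an underscore-plus-count suffix; the real suffix format may differ.

```go
package main

import "fmt"

// newDeduper returns a closure implementing the running-count scheme:
// the first occurrence of a name is kept as-is, later occurrences get
// "_1", "_2", ... appended.
func newDeduper() func(string) string {
	seen := make(map[string]int)
	return func(name string) string {
		n := seen[name]
		seen[name]++
		if n == 0 {
			return name
		}
		return fmt.Sprintf("%s_%d", name, n)
	}
}

func main() {
	dedup := newDeduper()
	for _, n := range []string{"AB12", "AB12", "CD34", "AB12"} {
		fmt.Println(dedup(n))
	}
}
```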
@@ -0,0 +1,17 @@
# EMBL Format Parser for OBITools4
This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:
- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
- **Two Parsing Modes**:
- `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
- `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
- **Configurable Options**:
- `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
- `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
- **Sequence Handling**: Parses multi-line EMBL sequences (10-bases-per-group, with position numbers), skipping digits and whitespace.
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
- **Integration**: Outputs are compatible with OBITools4's iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.
Designed for scalability, the module handles large EMBL files efficiently—ideal for metagenomic or biodiversity data pipelines.
@@ -0,0 +1,22 @@
## `ReadEmptyFile` Function — Semantic Description
- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
- **Signature**:
`func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
- **Behavior**:
- Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
- Immediately closes the stream using `.Close()` — indicating no data will be yielded.
- **Output**:
- Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
- Error return is always `nil`, since no I/O occurs and the operation is deterministic.
### Semantic Role & Use Cases
- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.
### Design Notes
- Reflects a *pure-functional* and *fail-safe* pattern: no side effects, deterministic behavior.
- Aligns with iterator-based I/O design principles in OBITools4 (lazy, composable streams).
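The described behavior is small enough to sketch end-to-end with a local channel-based iterator standing in for `obiiter.IBioSequence` (illustrative types only):

```go
package main

import "fmt"

// iterator is a tiny stand-in for obiiter.IBioSequence: a channel of
// batches, closed as soon as it is created.
type iterator struct{ ch chan []string }

// readEmpty mirrors the described behavior: build the iterator, close
// it immediately, and always return a nil error.
func readEmpty() (iterator, error) {
	it := iterator{ch: make(chan []string)}
	close(it.ch) // terminal: no batch will ever be yielded
	return it, nil
}

func main() {
	it, err := readEmpty()
	if err != nil {
		panic(err)
	}
	count := 0
	for range it.ch { // safe to range over: body never runs
		count++
	}
	fmt.Println("batches:", count)
}
```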
@@ -0,0 +1,34 @@
# FASTA Parser Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
## Core Functionalities
- **`FastaChunkParser(UtoT bool)`**
Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
- **`FastaChunkParserRope(...)`**
Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
- **`ReadFasta(reader io.Reader, ...)`**
High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
- **`EndOfLastFastaEntry(...)`**
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
## Key Features
- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
## Design Highlights
- Minimal allocations via rope-based parsing (`extractFastaSeq`).
- Graceful error reporting with context (source, identifier, invalid char position).
- Extensible via `WithOption` pattern for header parsing and batching behavior.
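The chunk-boundary search behind `EndOfLastFastaEntry` can be sketched in a few lines. This simplified version only looks for the last `>` that starts a line; the real helper also validates the surrounding entry.

```go
package main

import (
	"bytes"
	"fmt"
)

// endOfLastFastaEntry returns the offset of the '>' opening the last
// entry in the buffer, so a chunked reader can cut there without
// splitting a record. Returns -1 when no entry boundary is found.
func endOfLastFastaEntry(buf []byte) int {
	if i := bytes.LastIndex(buf, []byte("\n>")); i >= 0 {
		return i + 1 // position of the '>' itself
	}
	if len(buf) > 0 && buf[0] == '>' {
		return 0 // buffer starts with its only entry
	}
	return -1
}

func main() {
	fmt.Println(endOfLastFastaEntry([]byte(">a\nacgt\n>b\nac")))
}
```

Everything before the returned offset is a run of complete entries; the tail is carried over into the next chunk.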
@@ -0,0 +1,41 @@
# FASTQ Parsing Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTQ files — a standard format for storing nucleotide sequences along with quality scores.
## Core Functionalities
- **`EndOfLastFastqEntry(buffer []byte) int`**
Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry found.
- **`FastqChunkParser(...)`**
Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
- Header parsing (`@id [definition]`)
- Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
- Quality score shifting (`quality_shift`)
- Strict validation (e.g., `+` line, matching sequence/length)
- **`FastqChunkParserRope(...)`**
Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.
- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
Enables concurrent, chunked parsing of large files:
- Splits input into chunks using `ReadFileChunk`
- Uses configurable parallel workers (`nworker`)
- Pushes parsed batches to an iterator interface
- **Convenience I/O Wrappers**
- `ReadFastqFromFile(filename, ...)`: Parses a file by name.
- `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.
## Key Options & Features
- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation
## Design Highlights
- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid chars, length mismatch)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
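The normalization step (case folding plus optional `U→T`) is simple enough to sketch on its own (illustrative, not the parser's actual code):

```go
package main

import "fmt"

// normalize lowers uppercase letters and, when utoT is set, converts
// uracil to thymine for RNA-to-DNA normalization.
func normalize(seq string, utoT bool) string {
	out := []byte(seq)
	for i, c := range out {
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A'
		}
		if utoT && c == 'u' {
			c = 't'
		}
		out[i] = c
	}
	return string(out)
}

func main() {
	fmt.Println(normalize("AUGC", true)) // RNA read stored as lowercase DNA
}
```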
@@ -0,0 +1,11 @@
## Semantic Description of `obiformats` Package
The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:
- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.
- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.
Two main constructor functions enable flexible formatting:
- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.
- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.
The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors—e.g., unsupported formats trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, making it ideal for pipeline integration in NGS data processing tools.
@@ -0,0 +1,27 @@
# Semantic Description of `obiformats` Package
The `obiformats` package provides utilities for parsing sequence headers in the OBItools4 framework, supporting two distinct formats:
- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
- **Legacy OBI format** (plain text, e.g., `>seq1 description`): Used when no JSON prefix is present.
## Core Functions
- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**
Dynamically routes header parsing based on the first character of the sequence definition:
- Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
- Otherwise invokes `ParseFastSeqOBIHeader`.
- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**
Applies header parsing to a *batch* of sequences:
- Takes an iterator over `BioSequence`s.
- Uses optional configuration (e.g., parallelism, parsing behavior).
- Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.
## Design Principles
- **Format agnosticism**: Automatically detects header type.
- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
- **Extensibility**: Options pattern (`WithOption`) supports runtime customization.
This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
@@ -0,0 +1,28 @@
# `FormatHeader` Function Type in `obiformats`
The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.
- **Package**: `obiformats`
Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).
- **Type Definition**:
```go
type FormatHeader func(sequence *obiseq.BioSequence) string
```
- A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.
- **Semantic Role**:
Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.
Decouples header formatting from core data structures (`BioSequence`), enabling modular and reusable format adapters.
- **Usage Context**:
- Used by writers/formatters to produce standardized headers when exporting sequences.
- Allows custom header generation (e.g., for MIxS-compliant metadata, user-defined tags).
- Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.
- **Dependencies**:
- Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).
- **Design Intent**:
Promotes clean separation of concerns: data (sequence) ↔ formatting logic.
Facilitates extensibility for new output formats without modifying core types.
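The polymorphism this type enables can be sketched with local analogues (a minimal `BioSequence` stands in for `obiseq.BioSequence`; the two header policies are hypothetical):

```go
package main

import "fmt"

// Local analogues, illustrative only.
type BioSequence struct{ ID, Definition string }
type FormatHeader func(seq *BioSequence) string

// Two interchangeable header policies, swappable per output format.
func plainHeader(s *BioSequence) string  { return s.ID + " " + s.Definition }
func idOnlyHeader(s *BioSequence) string { return s.ID }

// writeFasta depends only on the function type, not on a concrete
// formatting policy.
func writeFasta(s *BioSequence, header FormatHeader) string {
	return ">" + header(s) + "\n"
}

func main() {
	s := &BioSequence{ID: "seq1", Definition: "test read"}
	fmt.Print(writeFasta(s, plainHeader))
	fmt.Print(writeFasta(s, idOnlyHeader))
}
```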
@@ -0,0 +1,21 @@
This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.
- **JSON Parsing Helpers**:
It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.
- **Header Interpretation**:
`_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
- Core fields (`id`, `definition`, `count`)
- Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
- Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.
- **Sequence Annotation Enrichment**:
`ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing non-JSON text as the new definition.
- **Serialization Support**:
`WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON format, appending them to a buffer or returning as string — enabling round-trip compatibility for annotated sequences.
- **Error Handling**:
Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.
In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
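The split between embedded JSON metadata and trailing definition text can be sketched with the standard library. The real code uses the streaming `jsonparser` API; `encoding/json`'s `Decoder.InputOffset` is used here to find where the JSON object ends.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// splitJSONHeader decodes a leading JSON object into an annotation
// map and returns the remaining free text as the new definition.
func splitJSONHeader(def string) (map[string]any, string, error) {
	dec := json.NewDecoder(strings.NewReader(def))
	var annotations map[string]any
	if err := dec.Decode(&annotations); err != nil {
		return nil, "", err
	}
	// InputOffset points just past the decoded JSON value.
	rest := strings.TrimSpace(def[dec.InputOffset():])
	return annotations, rest, nil
}

func main() {
	ann, rest, err := splitJSONHeader(`{"count":3,"taxid":"9606"} human mitochondrial read`)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann["taxid"], "|", rest)
}
```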
@@ -0,0 +1,31 @@
# OBIFormats Package: Semantic Description
The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
- Several helper patterns detect keys and value types:
- `__match__key__`: Identifies assignment patterns (`Key = ...`).
- `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).
- `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).
- `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
- Boolean detection (`__is_true__/__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`,**
iteratively extracts key-value pairs from a header string and populates an `Annotation` map.
- Numeric values are stored as integers if they have no fractional part.
- Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:
- `*_count` → `map[string]int`,
- `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`).
- `*_status`/`*_mutation` → `map[string]string`.
- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence's definition line, moving annotations into its metadata map and preserving leftover text.
- **`WriteFastSeqOBIHeader(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:
- Strings and booleans use `key=value;`.
- Maps/dicts are JSON-encoded, then single-quoted for compatibility.
- Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
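The `key=value;` extraction with typed values can be sketched with a regular expression. The pattern and the int/float/string fallback are illustrative, not the package's actual regexps.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var pairRe = regexp.MustCompile(`(\w+)=([^;]+);`)

// parseOBIPairs extracts key=value; pairs: values with no fractional
// part become ints, other numerics become floats, the rest stay strings.
func parseOBIPairs(header string) map[string]any {
	ann := make(map[string]any)
	for _, m := range pairRe.FindAllStringSubmatch(header, -1) {
		key, val := m[1], m[2]
		if n, err := strconv.Atoi(val); err == nil {
			ann[key] = n
		} else if f, err := strconv.ParseFloat(val, 64); err == nil {
			ann[key] = f
		} else {
			ann[key] = val
		}
	}
	return ann
}

func main() {
	ann := parseOBIPairs("count=12; sample=litter_1; identity=0.95;")
	fmt.Println(ann["count"], ann["sample"], ann["identity"])
}
```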
@@ -0,0 +1,26 @@
# FastSeq Reader Module — Semantic Description
This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.
## Core Features
- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
- **Stdin & File I/O**: Two entry points:
- `ReadFastSeqFromFile(filename, ...)` for regular files.
- `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.
## Integration
Built on top of OBITools4's core abstractions:
- `obiiter.IBioSequence`: Iterator interface for biological sequences.
- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
- `obiutils`, `obidefault`: Utilities for path handling and defaults.
Designed for scalability in high-throughput metabarcoding pipelines.
@@ -0,0 +1,35 @@
# `obiformats` Package Overview
The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.
## Core Formatting Functions
- **`FormatFasta(seq, formater)`**
Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.
- **`FormatFastaBatch(batch, formater, skipEmpty)`**
Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.
## File Writing Functions
- **`WriteFasta(iterator, file, options...)`**
Writes a stream of sequences to any `io.WriteCloser`. Supports:
- Parallel workers (`ParallelWorkers`)
- Chunked writing via `WriteFileChunk`
- Optional compression (e.g., gzip)
Returns a new iterator mirroring the input for pipeline chaining.
- **`WriteFastaToStdout(iterator, options...)`**
Convenience wrapper to output FASTA directly to `stdout`, with file-closing behavior configurable.
- **`WriteFastaToFile(iterator, filename, options...)`**
Writes to a named file with:
- Truncation or append mode (`AppendFile`)
- Automatic paired-end output if `HaveToSavePaired()` is enabled
(writes reverse reads to a secondary file specified via `PairedFileName`)
## Key Design Highlights
- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
@@ -0,0 +1,35 @@
# FASTQ Output Module (`obiformats`)
This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.
## Core Functionality
- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into a FASTQ string.
- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.
## Header Customization
- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.
## Writing to Streams/Files
- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
- Append/truncate modes
- Paired-end output (splits iterator and writes to two files)
- Automatic compression via `obiutils.CompressStream`
## Parallelization & Robustness
- Uses goroutines to parallelize formatting/writing across multiple workers.
- Handles empty sequences gracefully: logs warning or fatal error based on `skipEmpty` option.
- Ensures ordered output via batch tracking (`Order()`) and chunked writing.
## Integration
Designed to work seamlessly with the `obitools4` ecosystem:
- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
- Extensible through functional options (`WithOption`) for configuration.
> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
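For reference, the four-line FASTQ record layout produced by a formatter like `FormatFastq` looks like this. The sketch below is illustrative (the function name and Phred+33 quality encoding are standard FASTQ conventions, not copied from the package source):

```go
package main

import "fmt"

// formatFastq renders the four-line FASTQ record: header, sequence,
// separator, and quality scores encoded as Phred+33 printable ASCII.
func formatFastq(id, seq string, quals []byte) string {
	ascii := make([]byte, len(quals))
	for i, q := range quals {
		ascii[i] = q + 33 // Phred score → printable ASCII (Sanger offset)
	}
	return fmt.Sprintf("@%s\n%s\n+\n%s\n", id, seq, ascii)
}

func main() {
	fmt.Print(formatFastq("read1", "ACGT", []byte{30, 30, 20, 10}))
}
```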
@@ -0,0 +1,19 @@
# `obiformats` Package Overview
The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:
- **Format Abstraction**: Defines common interfaces and base classes for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.
The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
@@ -0,0 +1,25 @@
# Semantic Description of `obiformats` Package Functionalities
The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.
- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries.
- **`ReadFileChunk()`**: Core function that:
- Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
- Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
- Extends chunks incrementally (e.g., +1MB) until a full record boundary is found via `splitter`;
- Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
- Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
- **Key semantics**:
- *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.
- *Lazy evaluation*: only reads ahead when needed to find record boundaries.
- *Streaming-first design* — supports large files without full loading into memory.
This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
@@ -0,0 +1,26 @@
# `WriteFileChunk` Function — Semantic Description
The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.
- **Input**:
- `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.
- `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.
- **Core Behavior**:
- Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).
- Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).
- If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.
- Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.
- **Buffer Management**:
  - After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., once order 2 is written, buffered orders 3, 4, … are flushed as available).
- **Error Handling**:
- Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.
- **Cleanup & Lifecycle**:
- Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.
- Returns the input channel, enabling external producers to stream `FileChunk` structs.
- **Use Case**:
Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
@@ -0,0 +1,34 @@
# GenBank Parser Module (`obiformats`)
This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.
## Core Functionalities
- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
- **Parallel streaming I/O**:
- `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
- Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.
## Key Design Decisions
- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.
## Output
Returns a batched iterator of `BioSequence` objects, each containing:
- Identifier (`id`)
- Compact nucleotide sequence
- Definition line (as description)
- Source file origin
- Optional feature table bytes
- Annotations: `scientific_name`, `taxid`
Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
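The state-machine approach can be illustrated with a heavily trimmed sketch: `LOCUS` opens an entry, `ORIGIN` opens the sequence section, `//` closes the record, and sequence lines are compacted with the optional `u`→`t` normalization. This is a toy model of the parser, not the actual implementation (the real one handles many more states and sections):

```go
package main

import (
	"fmt"
	"strings"
)

type state int

const (
	inHeader state = iota
	inEntry
	inSequence
)

// parseGenbank walks GenBank lines through a minimal state machine,
// extracting the LOCUS identifier and the compacted ORIGIN sequence.
func parseGenbank(text string) (id, seq string) {
	st := inHeader
	var b strings.Builder
	for _, line := range strings.Split(text, "\n") {
		switch {
		case strings.HasPrefix(line, "LOCUS"):
			id = strings.Fields(line)[1]
			st = inEntry
		case strings.HasPrefix(line, "ORIGIN"):
			st = inSequence
		case line == "//":
			st = inHeader
		case st == inSequence:
			for _, c := range line {
				if c >= 'a' && c <= 'z' { // skip digits and spaces
					if c == 'u' {
						c = 't' // optional RNA→DNA normalization
					}
					b.WriteRune(c)
				}
			}
		}
	}
	return id, b.String()
}

func main() {
	record := "LOCUS       AB000001   8 bp\nDEFINITION  demo.\nORIGIN\n        1 acgu acgu\n//"
	id, seq := parseGenbank(record)
	fmt.Println(id, seq) // AB000001 acgtacgt
}
```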
@@ -0,0 +1,27 @@
# JSON Output Module for Biological Sequences (`obiformats`)
This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.
- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
- `"id"`: Sequence identifier.
- `"sequence"` (optional): Nucleotide/protein sequence string if present.
- `"qualities"` (optional): Quality scores as a string if available.
- `"annotations"` (optional): Metadata annotations map.
- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
- Parallel workers (configurable via options).
- Automatic compression (`gzip`/`bgzip`) if enabled.
- Proper JSON array wrapping: `[`, chunked batches, and final `]`.
- Atomic ordering to preserve sequence integrity across parallel writes.
- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
- Outputs to stdout or a file (with append/truncate control).
- Supports paired-end data: writes both forward and reverse reads to separate files when configured.
- **Internal helpers**:
  - `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
- Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.
Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
@@ -0,0 +1,17 @@
# NCBI Taxonomy Loader Module (`obiformats`)
This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:
- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.
Key features:
- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
- Efficient buffered reading (`bufio.Reader`) for large files.
- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).
The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
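The `|`-delimited dump-record layout can be sketched as below. The helper is illustrative (the package's actual parser is a customized CSV reader, as noted above), but the field layout shown for `nodes.dmp` matches the description:

```go
package main

import (
	"fmt"
	"strings"
)

// parseDmpLine splits an NCBI dump record on '|' and trims the surrounding
// tabs/spaces. Records end with a trailing '|', producing one empty final
// field that is dropped.
func parseDmpLine(line string) []string {
	fields := strings.Split(line, "|")
	if len(fields) > 0 && strings.TrimSpace(fields[len(fields)-1]) == "" {
		fields = fields[:len(fields)-1]
	}
	for i, f := range fields {
		fields[i] = strings.TrimSpace(f)
	}
	return fields
}

func main() {
	// a typical nodes.dmp record: taxid | parent taxid | rank | ...
	fmt.Println(parseDmpLine("9606\t|\t9605\t|\tspecies\t|"))
}
```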
@@ -0,0 +1,31 @@
## NCBI Taxonomy Archive Support in `obiformats`
This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.
### Core Functionalities
1. **Archive Validation (`IsNCBITarTaxDump`)**
- Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
- Returns a boolean indicating if the archive is a complete NCBI tax dump.
2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
- Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
- Steps include:
- **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
- **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
- **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
   - Sets the root taxon to NCBI's default (`taxid = 1`, i.e., *root*).
3. **Integration with Other Modules**
- Uses `obiutils.Ropen`, `TarFileReader` for robust file handling.
- Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.
### Key Parameters
- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
- `seqAsTaxa`: Reserved for future use; currently unused.
### Logging & Error Handling
- Uses `logrus` to log loading progress and counts.
- Returns descriptive errors if required files or the root taxon are missing.
> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
@@ -0,0 +1,31 @@
# Newick Format Export Functionality in `obiformats`
This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.
## Core Components
- `Tree`: A struct modeling a node in a Newick tree, containing:
- `Children`: list of child nodes (nested trees),
- `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
- `Length`: optional branch length (evolutionary distance).
- **`Newick()` methods**:
- `Tree.Newick(...)`: Recursively generates a Newick string for the subtree.
Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
- Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.
- **Writing Functions**:
- `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
- Accepts an iterator over taxa (`*obitax.ITaxon`).
- Validates single-taxonomy input.
- Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
- `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
- `WriteNewickToStdout(...)`: Outputs Newick tree to standard output.
## Configuration Options
Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).
## Semantic Summary
The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
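The recursive rendering performed by `Tree.Newick(...)` can be sketched as follows. This toy version uses a plain label in place of the `TaxNode` reference and omits the annotation options; the `(child,child)label:length` shape is standard Newick:

```go
package main

import (
	"fmt"
	"strings"
)

// tree mirrors the Tree struct described above: children, a label standing
// in for the TaxNode reference, and an optional branch length.
type tree struct {
	label    string
	length   float64
	children []*tree
}

// newick recursively renders the subtree rooted at t.
func (t *tree) newick() string {
	var b strings.Builder
	if len(t.children) > 0 {
		parts := make([]string, len(t.children))
		for i, c := range t.children {
			parts[i] = c.newick()
		}
		b.WriteString("(" + strings.Join(parts, ",") + ")")
	}
	b.WriteString(t.label)
	if t.length > 0 {
		fmt.Fprintf(&b, ":%g", t.length)
	}
	return b.String()
}

func main() {
	root := &tree{label: "root", children: []*tree{
		{label: "Homo", length: 0.1},
		{label: "Pan", length: 0.2},
	}}
	fmt.Println(root.newick() + ";") // (Homo:0.1,Pan:0.2)root;
}
```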
@@ -0,0 +1,47 @@
# NGSFilter Configuration Parser — Semantic Overview
This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats: a legacy line-based text format (`ReadOldNGSFilter`) and a modern CSV-based configuration format with parameter headers.
## Core Functionality
- **Format Detection**:
`OBIMimeNGSFilterTypeGuesser` detects MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text.
A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).
- **Dual Input Parsing**:
- `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
- Primer pairs (`forward`, `reverse`)
- Tag pairs (with optional `-` for untagged direction)
- Experiment/sample metadata
- OBIFeatures annotations (via `ParseOBIFeatures`)
- `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:
`"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`
Additional columns are stored as annotations.
- **Parameter Configuration**:
A rich set of `@param` lines (in CSV or legacy format) configures global/primers-specific settings:
- `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
- `tag_delimiter` / directional variants: Symbol separating tags in sequences
- `matching`: Tag matching algorithm (e.g., exact, fuzzy)
- Error tolerance:
`primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches)
`tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
- Indel handling:
`indels` / directional variants (`true/false`) to enable/disable indels in primer matching
- **Validation & Integrity Checks**:
- `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
- Duplicate tag-pair detection per marker (error on reuse).
- Strict column/field validation with informative error messages.
- **Logging & Observability**:
Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).
## Design Highlights
- **Extensibility**: New parameters can be added via `library_parameter` map.
- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.
> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
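The parameter-line mechanism described above can be sketched as a simple scan that collects `@`-prefixed configuration lines into a map before the data rows are parsed. The `@param name value` syntax used here is illustrative only, not the exact obiformats grammar:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseParams collects "@param <name> <value>" lines into a map and
// ignores everything else (data rows, comments).
func parseParams(lines []string) map[string]string {
	out := map[string]string{}
	for _, l := range lines {
		if !strings.HasPrefix(l, "@param") {
			continue
		}
		fields := strings.Fields(l)
		if len(fields) >= 3 {
			out[fields[1]] = fields[2]
		}
	}
	return out
}

func main() {
	cfg := parseParams([]string{
		"@param spacer 2",
		"@param matching strict",
		"EXP1\tSAMPLE1\tdata row, ignored here",
	})
	spacer, _ := strconv.Atoi(cfg["spacer"])
	fmt.Println(spacer, cfg["matching"]) // 2 strict
}
```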
@@ -0,0 +1,14 @@
# Semantic Description of `obiformats` Package Functionalities
The Go package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).
Key capabilities include:
- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).
- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).
- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).
- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).
- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion for ambiguous bases.
- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.
- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).
All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
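The functional-setter pattern behind `MakeOptions` can be sketched as below. Field names and defaults here are illustrative; only the overall shape (defaults first, then setters applied in order) follows the description:

```go
package main

import "fmt"

// options holds a few illustrative configuration fields.
type options struct {
	batchSize       int
	parallelWorkers int
	compressed      bool
}

// withOption is a functional setter mutating an options value.
type withOption func(*options)

func OptionsBatchSize(n int) withOption {
	return func(o *options) { o.batchSize = n }
}

func OptionsCompressed(c bool) withOption {
	return func(o *options) { o.compressed = c }
}

// makeOptions applies setters over sensible defaults, mirroring MakeOptions.
func makeOptions(setters []withOption) *options {
	o := &options{batchSize: 5000, parallelWorkers: 4} // defaults
	for _, set := range setters {
		set(o)
	}
	return o
}

func main() {
	o := makeOptions([]withOption{OptionsBatchSize(100), OptionsCompressed(true)})
	fmt.Println(o.batchSize, o.parallelWorkers, o.compressed) // 100 4 true
}
```

Unspecified fields keep their defaults, which is what makes the configuration declarative and composable.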
@@ -0,0 +1,27 @@
# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure
The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *Rope* (a tree-based immutable string representation, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.
## Core Functionality
- **`newRopeScanner(rope *PieceOfChunk)`**
Constructs a new scanner starting at the root of the rope.
- **`ReadLine() []byte`**
  Returns the next line (without the trailing `\n` or `\r\n`) as a byte slice.
- Returns `nil` when the end of the rope is reached.
- Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
- The returned slice aliases rope data and is only valid until the next call.
- **`skipToNewline()`**
Advances internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.
## Implementation Highlights
- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope's underlying data.
## Use Case
Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
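The carry-over mechanism can be modeled with a slice of byte chunks standing in for the rope nodes. This is a simplified sketch of the `ReadLine` logic, not the actual `ropeScanner` (which walks `PieceOfChunk` links and reuses its carry buffer):

```go
package main

import (
	"bytes"
	"fmt"
)

// chunkScanner yields lines that may span several chunks, assembling split
// lines in a carry buffer and stripping "\r\n" endings.
type chunkScanner struct {
	chunks [][]byte // stands in for the PieceOfChunk linked nodes
	pos    int
	carry  []byte
}

func (s *chunkScanner) ReadLine() []byte {
	for s.pos < len(s.chunks) {
		chunk := s.chunks[s.pos]
		if i := bytes.IndexByte(chunk, '\n'); i >= 0 {
			line := append(s.carry, chunk[:i]...)
			s.chunks[s.pos] = chunk[i+1:]
			s.carry = nil
			return bytes.TrimSuffix(line, []byte("\r"))
		}
		// line continues into the next chunk: carry it over
		s.carry = append(s.carry, chunk...)
		s.pos++
	}
	if len(s.carry) > 0 { // final line without trailing newline
		line := s.carry
		s.carry = nil
		return line
	}
	return nil // end of rope
}

func main() {
	sc := &chunkScanner{chunks: [][]byte{[]byte("hel"), []byte("lo\r\nwor"), []byte("ld\n")}}
	for line := sc.ReadLine(); line != nil; line = sc.ReadLine() {
		fmt.Printf("%s\n", line)
	}
}
```

Both `hello` and `world` span two chunks, yet each comes back as a single clean line with its `\r\n` or `\n` removed.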
@@ -0,0 +1,34 @@
# Taxonomy Loading Module (`obiformats`)
This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.
## Core Features
1. **Format Detection**
- `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME-type), filename patterns, or structure.
- Supports:
• NCBI Taxdump (both directory and `.tar` archive)
• CSV files (`text/csv`)
• FASTA/FASTQ sequences (via `mimetype` detection)
2. **Modular Loaders**
- Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).
- Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).
3. **Sequence-Based Taxonomy Extraction**
- For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.
4. **Integration with OBITools Ecosystem**
- Leverages `obitax.Taxonomy` as the canonical output structure.
- Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.
5. **Error Handling & Logging**
- Graceful failure with descriptive errors; informative logging via `logrus`.
## Usage Flow
```go
tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn=true, seqAsTaxa=false
```
The module enables interoperability across taxonomic data sources in metabarcoding workflows.
@@ -0,0 +1,26 @@
# OBIFORMATS Package: Semantic Description
The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.
It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:
- **FASTA** (`text/fasta`): identified by lines starting with `>`.
- **FASTQ** (`text/fastq`): detected via leading `@` characters.
- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
- **EMBL** (`text/embl`): detected by lines starting with `ID `.
- **GenBank** (`text/genbank`): identified by either `LOCUS ` or legacy `"Genetic Sequence Data Bank"` headers.
- **CSV** (`text/csv`): generic tabular support.
Core functionality is exposed through:
- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
- `ReadSequencesFromFile()`: reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as filename and auto-closing the stream.
Internally leverages:
- `obiutils.Ropen()` for unified file opening (including stdin handling).
- Path extension stripping and source tagging via `OptionsSource()`.
- Logging (`logrus`) for format diagnostics.
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.
The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.
Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
@@ -0,0 +1,29 @@
# `obiformats` Package: Sequence Writing Utilities
This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.
## Core Functionality
- **`WriteSequence()`**:
Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities present) or FASTA.
- Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.
- Preserves iterator state via `PushBack()` to allow chaining.
- **`WriteSequencesToStdout()`**:
Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.
- **`WriteSequencesToFile()`**:
Writes sequences to a specified file. Supports:
- File creation/truncation or append mode (`OptionAppendFile()`).
- Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.
## Design Highlights
- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on presence of quality scores (`HasQualities()`).
- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
- **Configurable Options**: Extensible via `WithOption` pattern (e.g., append mode, paired-end handling).
## Integration
Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
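The format-aware dispatch reduces to a single predicate on the sequence. A minimal sketch (the `seq` type and `writeSequence` are stand-ins; only the `HasQualities()` decision comes from the description above):

```go
package main

import "fmt"

// seq stands in for obiseq.BioSequence with just the fields needed here.
type seq struct {
	id        string
	qualities []byte
}

func (s seq) HasQualities() bool { return len(s.qualities) > 0 }

// writeSequence sketches WriteSequence's dispatch: FASTQ when quality
// scores are present, FASTA otherwise.
func writeSequence(s seq) string {
	if s.HasQualities() {
		return "fastq"
	}
	return "fasta"
}

func main() {
	fmt.Println(writeSequence(seq{id: "a", qualities: []byte{30}})) // fastq
	fmt.Println(writeSequence(seq{id: "b"}))                        // fasta
}
```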