⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,22 @@
# BioSequence Attribute Management API
This Go package (`obiseq`) provides a rich set of methods for managing metadata and structural attributes associated with biological sequences (`BioSequence`). Below is a semantic overview of the core functionalities:
- **Key Discovery & Existence Checks**:
- `Keys()` and `AttributeKeys()` return all attribute names (optionally excluding container/statistics fields or the `"definition"` key).
- `HasAttribute(key)` verifies presence of a given attribute (including standard fields: `"id"`, `"sequence"`, `"qualities"`).
- **Generic Attribute Access**:
- `GetAttribute(key)` retrieves any attribute value (as `interface{}`), with thread-safe locking.
- `SetAttribute(key, value)` assigns values to attributes (including automatic conversion for `"id"`, `"sequence"` and `"qualities"`).
- **Typed Attribute Retrieval**:
- Type-specific getters (`GetIntAttribute`, `GetFloatAttribute`, `GetStringAttribute`, etc.) ensure safe conversion and *auto-upgrade* of stored values (e.g., string `"42"` → integer `42`).
- Supports maps (`GetIntMap`, `GetStringMap`) and slices (`GetIntSlice`).
- **Convenience & Domain-Specific Helpers**:
- `Count()` / `SetCount()`: manage observation frequency (default = 1).
- OBITag indexing: `OBITagRefIndex()` / `SetOBITagRefIndex()`, and geometry variants (`geomref`). Supports flexible input map types with dynamic conversion.
- Coordinate & landmark support: `GetCoordinate()` / `SetCoordinate()`, and `landmark_id`-based operations (`IsALandmark()`, `GetLandmarkID()`).
All methods are designed for robustness: they handle type conversions gracefully, use locking to ensure concurrency safety, and provide fallbacks (e.g., default count = 1). The API abstracts internal storage (`annotations` map) while exposing a clean, consistent interface for sequence annotation manipulation.
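The auto-upgrade and default-count behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: `record`, `getInt`, and `count` are hypothetical stand-ins for `BioSequence`, `GetIntAttribute`, and `Count`.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// record is a hypothetical stand-in for BioSequence: an annotation map
// guarded by a mutex.
type record struct {
	mu          sync.Mutex
	annotations map[string]interface{}
}

// getInt returns the attribute as an int. A string that parses as an
// integer is converted and written back, so later reads skip the parsing.
func (r *record) getInt(key string) (int, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()

	switch v := r.annotations[key].(type) {
	case int:
		return v, true
	case float64:
		return int(v), true
	case string:
		n, err := strconv.Atoi(v)
		if err != nil {
			return 0, false
		}
		r.annotations[key] = n // upgrade the stored value in place
		return n, true
	default:
		return 0, false
	}
}

// count mirrors the documented fallback: a missing count means 1.
func (r *record) count() int {
	if n, ok := r.getInt("count"); ok {
		return n
	}
	return 1
}

func main() {
	r := &record{annotations: map[string]interface{}{"size": "42"}}
	n, ok := r.getInt("size")
	fmt.Println(n, ok, r.count()) // prints: 42 true 1
}
```

Writing the converted value back means the parsing cost is paid once per attribute rather than on every read.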
@@ -0,0 +1,41 @@
# BioSequence: A High-Performance Biological Sequence Representation
The `obiseq` package defines the `BioSequence` struct, a memory-efficient and thread-safe container for biological DNA sequences. Beyond raw sequence data (`[]byte`), it supports rich metadata and operations essential for NGS pipelines.
## Core Features
- **Metadata Fields**:
- `id`: Unique sequence identifier.
- `source`: Filename (without path/extension) of origin.
- `definition`: Optional descriptive text, stored in annotations.
- **Sequence & Quality Support**:
- Stores sequence as lowercase `[]byte` (normalized via in-place lowercasing).
- Quality scores (`Quality = []uint8`) with fallback to default Phred+40 values when missing.
- Methods for incremental writing (`Write`, `WriteByte`) and clearing.
- **Annotations & Features**:
- Generic `Annotation` map (`map[string]interface{}`) for flexible metadata.
- Thread-safe access via `annot_lock` mutex (explicit locking/unlocking methods).
- Raw feature table storage (`[]byte`, e.g., EMBL/GenBank features).
- **Biological Relationships**:
- `paired`: Pointer to mate/read-pair sequence.
- `revcomp`: Pointer to reverse-complement variant (lazy or precomputed).
- **Introspection & Utility**:
- `Len()`, `HasSequence()`, `Composition()` (nucleotide counts: a,c,g,t,o).
- MD5 checksums (`MD5()` and `MD5String()`) for deduplication.
- Memory footprint estimation (`MemorySize()`), critical for streaming/batching.
- **Efficiency Optimizations**:
- `NewBioSequenceOwning`/`TakeQualities`: Zero-copy slice adoption (caller must not reuse input).
- `Recycle()`: Reuses slices via pool-aware functions (`RecycleSlice`, etc.).
- Global counters track creation/destruction/in-memory sequences for diagnostics.
- **Safety & Compatibility**:
- Copy semantics via `Copy()` (deep copy of slices + annotations).
- Validation: `HasValidSequence` enforces allowed characters (`a-z`, `-`, `.`, `[`, `]`).
- Uses unsafe string conversion for quality ASCII output (Phred shift configurable via `obidefault`).
Designed for scalability in large-scale metabarcoding workflows (e.g., OBITools4), balancing performance, correctness, and extensibility.
@@ -0,0 +1,35 @@
# `obiseq` Package: Semantic Overview
The `obiseq` package provides a robust, thread-safe implementation of biological sequence objects in Go. It defines the core `BioSequence` type and associated utilities for handling nucleotide sequences (DNA/RNA), quality scores, annotations, features, memory management, and metadata operations.
### Core Functionalities
- **Construction & Initialization**
- `NewEmptyBioSequence(cap)` creates an empty sequence with optional preallocated capacity.
- `NewBioSequence(id, seq, def)` builds a basic sequence with ID (case-normalized), byte-level sequence (`[]byte`), and definition.
- `NewBioSequenceWithQualities(...)` extends the above with per-base quality scores (`[]byte` or `Quality`).
- **Accessors & Properties**
- `Id()`, `Definition()` return metadata fields.
- `Sequence()` returns the normalized (lowercase) sequence as a copy of internal bytes.
- `Len()` returns the length (number of bases).
- `String()` provides a human-readable sequence string.
- **Quality & Feature Support**
- `HasQualities()` checks if quality scores are present.
- `Qualities()`, `SetQualities(...)` manage per-base quality data (with fallback to default values).
- `Features()` retrieves optional feature annotations as a string.
- **Annotation System**
- `Annotations()`, `HasAnnotation()` allow inspection of arbitrary metadata (key-value map).
- Thread-safe via internal `sync.Mutex`, exposed through `AnnotationsLock()`.
- **Utility & Safety**
- `Recycle()` safely resets internal slices and annotations (enables object pooling). Handles nil receivers gracefully.
- `Copy()` performs a deep copy of all fields, including annotations; the copy receives its own new mutex rather than sharing the original's lock.
- `MD5()` computes the MD5 hash of the sequence bytes.
- **Analysis Methods**
- `Composition()` returns a nucleotide count map (`a`, `c`, `g`, `t`, and `'o'` for others), case-insensitive.
All operations are designed with performance, safety (nil-safety, copy semantics), and extensibility in mind—ideal for bioinformatics pipelines requiring immutable or pooled sequence handling.
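As an illustration of the `Composition()` description, here is a small sketch that counts `a`, `c`, `g`, `t` case-insensitively and puts everything else under `'o'`. The function name `composition` and its `map[byte]int` return type are assumptions made for this example; the real method's return type may differ.

```go
package main

import "fmt"

// composition counts a, c, g and t without regard to case; every other
// byte (ambiguity codes, gaps, ...) is counted under 'o'.
func composition(seq []byte) map[byte]int {
	counts := map[byte]int{'a': 0, 'c': 0, 'g': 0, 't': 0, 'o': 0}
	for _, b := range seq {
		switch lower := b | 0x20; lower { // fold ASCII letters to lowercase
		case 'a', 'c', 'g', 't':
			counts[lower]++
		default:
			counts['o']++
		}
	}
	return counts
}

func main() {
	c := composition([]byte("ACgtNn-"))
	fmt.Println(c['a'], c['c'], c['g'], c['t'], c['o']) // prints: 1 1 1 1 3
}
```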
@@ -0,0 +1,37 @@
# `obiseq` Package: BioSequence Collection Management
The `obiseq` package provides a high-performance, memory-efficient implementation for managing collections of biological sequences (`BioSequence`) in Go. Its core type is `BioSequenceSlice`, a slice of pointers to `BioSequence` objects, optimized for batch processing in metagenomic pipelines.
### Key Functionalities
- **Memory Pooling & Allocation Control**:
`NewBioSequenceSlice` and `MakeBioSequenceSlice` allow creating slices with optional capacity hints.
`EnsureCapacity(capacity)` dynamically grows the underlying slice while logging warnings or panicking on persistent allocation failures.
- **Efficient Element Management**:
- `Push(sequence)`: Appends a sequence to the end.
- `Pop()`: Removes and returns the last element (nil-safe).
- `Pop0()`: Efficiently removes and returns the first element.
- **Collection Metadata Queries**:
- `Len()`: Returns number of sequences in the slice.
- `Size()`: Computes total sequence length (summing all `.Len()`).
- `NotEmpty()`: Boolean check for non-empty collections.
- **Attribute Aggregation**:
`AttributeKeys(skip_map, skip_definition)` aggregates all attribute keys across sequences into a set—useful for schema inference or validation.
- **Sorting Capabilities**:
- `SortOnCount(reverse)`: Sorts by read count (descending/ascending).
- `SortOnLength(reverse)`: Sorts by sequence length.
- **Taxonomy Integration**:
`ExtractTaxonomy(taxonomy, seqAsTaxa)` builds or extends a taxonomic tree from sequence paths.
When `seqAsTaxa=true`, it injects pseudo-taxonomic labels for individual sequences (e.g., `OTU:SEQ0000012345 [seqID]@sequence`), enabling unified taxonomic/rarefaction workflows.
### Design Highlights
- Minimal allocations via manual slice management and `slices.Grow`.
- Explicit niling of popped elements to aid garbage collection.
- Integrated logging (via `logrus`) for allocation issues—critical in large-scale NGS data processing.
- Designed to support `BioSequenceBatch`, a higher-level abstraction for streaming or parallelizable sequence batches.
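The "explicit niling of popped elements" mentioned above can be sketched like this. The types `seq` and `seqSlice` are hypothetical simplifications of `BioSequence` and `BioSequenceSlice`; only the pattern is the point.

```go
package main

import "fmt"

type seq struct{ id string }

// seqSlice is a simplified stand-in for BioSequenceSlice.
type seqSlice []*seq

func (s *seqSlice) push(x *seq) { *s = append(*s, x) }

// pop removes and returns the last element, or nil when there is none.
func (s *seqSlice) pop() *seq {
	if s == nil || len(*s) == 0 {
		return nil
	}
	last := len(*s) - 1
	x := (*s)[last]
	(*s)[last] = nil // drop the reference so the backing array does not pin it
	*s = (*s)[:last]
	return x
}

// pop0 removes and returns the first element by re-slicing, without copying.
func (s *seqSlice) pop0() *seq {
	if s == nil || len(*s) == 0 {
		return nil
	}
	x := (*s)[0]
	(*s)[0] = nil
	*s = (*s)[1:]
	return x
}

func main() {
	var batch seqSlice
	batch.push(&seq{"s1"})
	batch.push(&seq{"s2"})
	batch.push(&seq{"s3"})
	first, last := batch.pop0(), batch.pop()
	fmt.Println(first.id, last.id, len(batch)) // prints: s1 s3 1
}
```

Without the nil assignment, the slot in the backing array would keep the popped sequence reachable until the array itself is released.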
@@ -0,0 +1,32 @@
# BioSequence Classifier Module Overview
This Go package (`obiseq`) provides a flexible and thread-safe framework for classifying biological sequences using different strategies. Each classifier implements four core methods:
- `Code(sequence) int`: assigns an integer class to a sequence.
- `Value(k) string`: retrieves the original value (or representation) for class index *k*.
- `Reset()`: clears internal state.
- `Clone() *BioSequenceClassifier`: creates a fresh copy of the classifier.
## Supported Classifier Types
1. **`AnnotationClassifier(key, na)`**
Classifies sequences based on a single annotation field. Missing annotations default to `na`. Internally maps string values → integer codes via a thread-safe dictionary.
2. **`DualAnnotationClassifier(key1, key2, na)`**
Uses *two* annotation fields. Combines them (as JSON array) to form unique class identifiers, enabling multi-dimensional classification.
3. **`PredicateClassifier(predicate)`**
Binary classifier: returns `1` if the provided predicate function evaluates to true, else `0`. Useful for rule-based grouping (e.g., length > 200).
4. **`HashClassifier(size)`**
Assigns sequences to one of `size` buckets via CRC32 hash of the raw sequence. Deterministic and memory-efficient, but may cause collisions.
5. **`SequenceClassifier()`**
Unique class per *exact* sequence string (case-sensitive). Uses a lock-protected map to deduplicate and index sequences.
6. **`RotateClassifier(size)`**
Cyclic assignment: sequence *i* → class `i mod size`. No memoization; state resets only manually.
7. **`CompositeClassifier(...)`**
Combines multiple classifiers: concatenates their integer outputs (e.g., `"3:17:0"`) to form a composite class key. Enables layered or hierarchical classification.
Classifier state is internal and synchronized, so a single classifier can be used concurrently in pipelines; call `Reset()` to clear that state or `Clone()` to obtain an independent copy.
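Two of the strategies above lend themselves to a short sketch: a lock-protected dictionary that maps string values to integer codes (the idea behind the annotation-based classifiers) and CRC32 bucketing (the idea behind `HashClassifier`). The names `codeBook` and `hashBucket` are hypothetical, and the choice of the IEEE CRC32 polynomial is an assumption.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sync"
)

// codeBook assigns a stable integer code to each distinct string value.
type codeBook struct {
	mu     sync.Mutex
	codes  map[string]int
	values []string
}

func newCodeBook() *codeBook { return &codeBook{codes: map[string]int{}} }

// code returns the existing code for value, or allocates the next one.
func (c *codeBook) code(value string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if k, ok := c.codes[value]; ok {
		return k
	}
	k := len(c.values)
	c.codes[value] = k
	c.values = append(c.values, value)
	return k
}

// value returns the string behind code k, or "" for an unknown code.
func (c *codeBook) value(k int) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if k < 0 || k >= len(c.values) {
		return ""
	}
	return c.values[k]
}

// hashBucket maps a sequence to one of size buckets; collisions are expected.
func hashBucket(seq []byte, size int) int {
	return int(crc32.ChecksumIEEE(seq) % uint32(size))
}

func main() {
	cb := newCodeBook()
	fmt.Println(cb.code("sampleA"), cb.code("sampleB"), cb.code("sampleA")) // prints: 0 1 0
	fmt.Println(cb.value(1))                                                // prints: sampleB
	fmt.Println(hashBucket([]byte("acgt"), 8) == hashBucket([]byte("acgt"), 8)) // prints: true
}
```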
@@ -0,0 +1,20 @@
# Semantic Description of `obiseq` Comparison Functions
The `obiseq` package provides utility functions for comparing biological sequence records (`*BioSequence`) based on different fields. These comparators are designed to support sorting, deduplication, or grouping operations in bioinformatics workflows.
- **`CompareSequence(a, b *BioSequence) int`**
Compares the raw nucleotide or amino acid sequences (`a.sequence`) lexicographically using `bytes.Compare`. Returns:
- `<0` if `a < b`,
- `0` if equal,
- `>0` if `a > b`.
- **`CompareQuality(a, b *BioSequence) int`**
Compares the base quality scores (`a.qualities`) lexicographically (as byte strings), following same semantics as above. Useful for sorting reads by quality profiles.
- **Commented-out `CompareAttributeBuilder(key string)`**
A planned higher-order function to generate custom comparators based on sequence attributes (e.g., `RG`, `NM`). It would:
- Extract attribute values using `.GetAttribute(key)`.
- Handle missing attributes (treat absent as "less than" present).
- Eventually support typed comparisons for ordered types.
These functions assume `BioSequence` implements a consistent internal structure with `.sequence []byte` and `.qualities []byte`. They enable flexible, field-based ordering in collections of sequencing records.
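A comparator of this shape plugs directly into the standard sorting routines. The sketch below uses a hypothetical `read` struct with the two fields the text mentions; it is not the package's `BioSequence` type.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// read is a hypothetical stand-in carrying only the compared fields.
type read struct {
	sequence  []byte
	qualities []byte
}

// compareSequence orders reads lexicographically by their sequence bytes.
func compareSequence(a, b *read) int {
	return bytes.Compare(a.sequence, b.sequence)
}

func main() {
	reads := []*read{
		{sequence: []byte("ggt")},
		{sequence: []byte("acg")},
		{sequence: []byte("ac")},
	}
	sort.Slice(reads, func(i, j int) bool {
		return compareSequence(reads[i], reads[j]) < 0
	})
	for _, r := range reads {
		fmt.Println(string(r.sequence)) // prints ac, acg, ggt on separate lines
	}
}
```

Note that a prefix sorts before the longer sequence it starts (`ac` before `acg`), which is the usual `bytes.Compare` behaviour.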
@@ -0,0 +1,28 @@
# Semantic Description of `obiseq` Expression-Based Workers
This module provides **expression-driven transformation workers** for biological sequence objects (`BioSequence`). It leverages a custom expression language (via `OBILang`) to dynamically compute values based on sequence metadata and content.
## Core Components
- **`Expression(expression string)`**:
Returns a function that evaluates the given expression in context. The evaluation scope includes:
- `annotations`: sequence annotations (metadata).
- `sequence`: the full `BioSequence` object itself.
- **`EditIdWorker(expression string)`**:
A sequence worker that updates the *ID* of a `BioSequence` by evaluating the expression.
- On success: sets the sequence ID to the string representation of the result.
- On failure: logs and returns an error with context.
- **`EditAttributeWorker(key string, expression string)`**:
A sequence worker that sets a *custom attribute* (identified by `key`) on the sequence, using evaluated expression result.
- Supports arbitrary metadata enrichment.
- Errors are reported with sequence ID and failed expression.
## Use Cases
- Generate new IDs from annotation fields (e.g., `"gene_" + annotations["locus_tag"]`).
- Compute and store derived attributes (e.g., GC content, ORF length) as sequence metadata.
- Apply conditional logic or transformations across large sets of sequences in pipelines.
All workers conform to the `SeqWorker` interface, enabling composition and chaining.
@@ -0,0 +1,27 @@
# Semantic Description of `obiseq` Package
The `obiseq` package provides utilities for handling **IUPAC nucleotide ambiguity codes** in biological sequences.
## Core Components
- `_iupac`: A lookup table mapping lowercase ASCII letters (`a`–`z`) to numeric IUPAC nucleotide codes:
- `A=1`, `C=2`, `G=4`, `T/U=8` (standard bases)
- Ambiguous codes are bitwise OR combinations:
e.g., `R = A|G = 1+4=5`, `Y = C|T = 2+8=10`, etc.
- Invalid or non-nucleotide characters map to `0`.
## Key Functionality
### `SameIUPACNuc(a, b byte) bool`
Performs **case-insensitive comparison** of two nucleotide symbols using IUPAC ambiguity rules.
- Converts uppercase letters to lowercase via bitwise OR (`|= 32`).
- For valid nucleotides, checks if their IUPAC codes have **non-zero bitwise AND**:
- Returns `true` only if the symbols share at least one possible base.
*Example*: `'R' & 'A' → (5 & 1) = 1 > 0 ⇒ true`
`'Y' & 'G' → (10 & 4) = 0 ⇒ false`
- For non-IUPAC or invalid characters, falls back to exact equality (`a == b`).
## Use Case
Enables robust comparison of DNA/RNA sequences where ambiguity codes (e.g., `N`, `R`, `W`) are used—critical for alignment, variant calling, or primer design tools.
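The bit-mask comparison described under Key Functionality can be sketched as below. The table contents follow the standard IUPAC nucleotide codes; the names `iupac` and `sameIUPACNuc` are local to this example, and the exact fallback rules of the real function may differ.

```go
package main

import "fmt"

// iupac maps 'a'..'z' to a 4-bit set: a=1, c=2, g=4, t/u=8.
// Letters that are not IUPAC nucleotide codes keep the value 0.
var iupac = func() [26]byte {
	var t [26]byte
	for letter, code := range map[byte]byte{
		'a': 1, 'c': 2, 'g': 4, 't': 8, 'u': 8,
		'r': 5, 'y': 10, 's': 6, 'w': 9, 'k': 12, 'm': 3,
		'b': 14, 'd': 13, 'h': 11, 'v': 7, 'n': 15,
	} {
		t[letter-'a'] = code
	}
	return t
}()

// sameIUPACNuc reports whether two symbols can denote a common base.
func sameIUPACNuc(a, b byte) bool {
	if a >= 'A' && a <= 'Z' {
		a |= 32 // fold to lowercase
	}
	if b >= 'A' && b <= 'Z' {
		b |= 32
	}
	if a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z' {
		ca, cb := iupac[a-'a'], iupac[b-'a']
		if ca != 0 && cb != 0 {
			return ca&cb != 0
		}
	}
	return a == b // non-IUPAC symbols: exact match only
}

func main() {
	fmt.Println(sameIUPACNuc('R', 'A')) // prints: true  (5 & 1 = 1)
	fmt.Println(sameIUPACNuc('Y', 'G')) // prints: false (10 & 4 = 0)
	fmt.Println(sameIUPACNuc('-', '-')) // prints: true  (exact-match fallback)
}
```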
@@ -0,0 +1,35 @@
# `obiseq` Package: Sequence Concatenation via `.Join()`
The `BioSequence.Join()` method enables semantic concatenation of two biological sequences (e.g., DNA, RNA, or protein strings).
- **Signature**:
```go
func (sequence *BioSequence) Join(seq2 *BioSequence, inplace bool) *BioSequence
```
- **Purpose**:
Combines the current sequence (`sequence`) with a second one (`seq2`), returning a new or modified `BioSequence`.
- **Parameters**:
- `seq2`: The sequence to append. Must be a valid `*BioSequence`.
- `inplace`: Boolean flag: if `true`, modifies the receiver in-place; otherwise, operates on a copy.
- **Semantics**:
- If `inplace == false`, the method first creates a deep copy of the original sequence to avoid side effects.
- It then appends `seq2.Sequence()` (the underlying string/byte representation) to the target sequence using an internal `.Write()` method.
- The final concatenated result is returned as a `*BioSequence`.
- **Behavioral Guarantees**:
- *Pure operation*: When `inplace = false`, the original sequences remain unaltered.
- *Chaining-friendly*: Returns a pointer, enabling method chaining (e.g., `seq.Join(a, false).Join(b, true)`).
- **Use Cases**:
- Building multi-domain proteins or gene fusions.
- Merging fragments from sequencing reads.
- Constructing synthetic constructs in silico.
- **Assumptions**:
- `BioSequence.Sequence()` returns a valid string/byte slice.
- `.Write(...)` handles appending correctly (e.g., no validation of biological compatibility — e.g., frame shifts are not checked).
This method supports flexible, functional-style sequence manipulation while preserving memory safety via optional in-place mutation.
@@ -0,0 +1,20 @@
## BioSequence.Kmers(k int) — Semantic Description
The `Kmers` method is a generator function that yields all contiguous *k*-length subsequences (called **k-mers**) from a biological sequence (`BioSequence`).
- It operates on `[]byte` data, assuming the underlying sequence is stored as a byte slice (e.g., DNA bases `A`, `C`, `G`, `T`).
- Uses Go's iterator protocol (`iter.Seq[[]byte]`, available since Go 1.23) for memory-efficient, lazy evaluation.
- Validates input: returns an empty iterator if `k ≤ 0` or `k` exceeds the sequence length.
- Iterates linearly from index `i = 0` to `len(seq) - k`, extracting slices of length *k*.
- Each yielded value is a **non-copying slice view** (efficient, but mutable if original data changes).
- Supports early termination: the consumer can stop iteration by returning `false` from the yield callback.
- Designed for downstream tasks like sequence analysis, motif discovery, or hashing (e.g., in k-mer counting).
- Does *not* handle reverse-complement or ambiguous bases—assumes raw sequence input.
Usage example:
```go
for kmer := range seq.Kmers(3) {
	fmt.Printf("%s\n", string(kmer))
}
```
This yields all 3-mers (e.g., `"ACG"`, `"CGT"`...) in order.
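For readers curious about how such an iterator can be built, here is a minimal sketch. It is an assumption about the shape of the implementation, not the package's code: the free function `kmers` returns a plain `func(yield func([]byte) bool)`, which is the underlying type of `iter.Seq[[]byte]`, and it is called directly so the example also compiles on Go versions without range-over-func.

```go
package main

import "fmt"

// kmers returns a push-style iterator over all k-length windows of seq.
// Each window is a view into seq, not a copy.
func kmers(seq []byte, k int) func(yield func([]byte) bool) {
	return func(yield func([]byte) bool) {
		if k <= 0 || k > len(seq) {
			return // invalid k: nothing to yield
		}
		for i := 0; i+k <= len(seq); i++ {
			if !yield(seq[i : i+k]) {
				return // the consumer asked to stop early
			}
		}
	}
}

func main() {
	kmers([]byte("acgta"), 3)(func(kmer []byte) bool {
		fmt.Println(string(kmer)) // prints acg, cgt, gta on separate lines
		return true
	})
}
```

Because the yielded slices alias the input, a consumer that stores k-mers beyond one iteration step should copy them first.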
@@ -0,0 +1,41 @@
# Semantic Description of `obiseq` Language Extensions
The `package obiseq` extends the [Gval](https://github.com/PaesslerAG/gval) expression language with domain-specific functions tailored for bioinformatics and data processing. It integrates utility helpers from `obiutils` to provide type-flexible, robust operations over sequences and collections.
## Core Functionalities
- **Data Inspection**:
`len`, `ismap`, `isvector` — retrieve size and type information.
- **Aggregation & Comparison**:
`min`, `max` — compute extremal values in slices/maps (via `obiutils.Min/Max`).
*(Note: commented-out helper functions suggest prior attempts at manual implementations.)*
- **Type Conversion**:
`int`, `numeric` (→ float64), `bool`, `string` — safely coerce arbitrary inputs to target types; fail with fatal logs on invalid data.
- **String Manipulation**:
`sprintf`, `subspc` (replace spaces with underscores), `replace` (regex-based substitution), and `substr` — support formatting, normalization, and slicing.
- **Sequence Analysis (Bioinformatics)**:
`gc`, `gcskew`, and `composition` — compute nucleotide composition metrics for DNA/RNA sequences (`BioSequence`).
- `gc`: GC content ratio (excluding ambiguous bases `'o'`)
- `gcskew`: `(G-C)/(G+C)` asymmetry measure
- `composition`: returns a map of base counts (e.g., `"a":20.0`, `"g":15.0`)
- **Element Access**:
`elementof(seq, idx)` — retrieves item at index/key for slices (`[]interface{}`), maps (`map[string]interface{}`), or strings (by byte position).
- **Control Flow**:
`ifelse(cond, then_val, else_val)` — conditional branching within expressions.
- **Quality Support**:
`qualities(seq)` — extracts per-base quality scores as a float slice from sequencing reads.
## Design Principles
- **Dynamic Typing**: Accepts `...interface{}` arguments for flexibility.
- **Error Handling**: Uses fatal logging (`log.Fatalf`) on conversion failures; returns typed errors for runtime issues.
- **Extensibility**: Built atop `gval.Language`, enabling custom expression evaluation in pipelines (e.g., filtering reads via GC thresholds).
This package serves as a bridge between high-level scripting and low-level biosequence computation, ideal for rule-based filtering or annotation in NGS workflows.
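The two composition metrics can be written out explicitly. The sketch below assumes GC content is `(G+C)/(A+C+G+T)` with other symbols excluded, and GC skew is `(G-C)/(G+C)`; returning `0` for an empty denominator is a choice made for this example only.

```go
package main

import "fmt"

// baseCounts tallies unambiguous bases, ignoring case and skipping the rest.
func baseCounts(seq []byte) (a, c, g, t int) {
	for _, b := range seq {
		switch b | 0x20 {
		case 'a':
			a++
		case 'c':
			c++
		case 'g':
			g++
		case 't':
			t++
		}
	}
	return
}

// gcContent is (G+C)/(A+C+G+T); ambiguous symbols are left out entirely.
func gcContent(seq []byte) float64 {
	a, c, g, t := baseCounts(seq)
	total := a + c + g + t
	if total == 0 {
		return 0 // assumption: undefined ratio reported as 0
	}
	return float64(g+c) / float64(total)
}

// gcSkew is (G-C)/(G+C).
func gcSkew(seq []byte) float64 {
	_, c, g, _ := baseCounts(seq)
	if g+c == 0 {
		return 0 // assumption: undefined ratio reported as 0
	}
	return float64(g-c) / float64(g+c)
}

func main() {
	seq := []byte("gggcatatnn")
	fmt.Println(gcContent(seq), gcSkew(seq)) // prints: 0.5 0.5
}
```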
@@ -0,0 +1,39 @@
# Semantic Description of `obiseq` Statistics and Merging Features
This package provides infrastructure for **tracking, aggregating, and merging statistical occurrences** of sequence attributes across biological sequences (`BioSequence`). It supports both **count-based and weighted statistics**, with thread-safe operations.
## Core Components
- `StatsOnValues`: A concurrent map (`map[string]int`) with R/W locking to store occurrence counts per attribute value (e.g., taxon, primer, quality bin).
- `StatsOnDescription`: Defines *how* to extract and weight statistics from a sequence (e.g., count per read, or sum of quality scores).
- `StatsOnSlotName(key)`: Generates internal annotation keys (e.g., `"merged_taxon"`) to store precomputed statistics.
## Key Functionalities
1. **Per-Sequence Statistics Initialization & Update**
- `StatsOn(desc, na)`: Ensures a statistics slot exists for attribute `desc.Key`, initializes if needed.
- `StatsPlusOne(...)`: Adds contribution of a *single* sequence to the statistics (e.g., increment count for its taxon).
2. **Thread-Safe Aggregation**
- `Merge(*StatsOnValues)`: Safely merges counts from another `StatsOnValues`, used to combine per-sequence stats.
3. **Sequence Merging with Stat Propagation**
- `BioSequence.Merge(...)`:
- Combines two sequences (e.g., consensus/overlap).
- Updates statistics for specified attributes (`statsOn`), preserving or aggregating counts.
- Resolves conflicting annotations by deleting non-merged fields if mismatched.
4. **Bulk Merging**
- `BioSequenceSlice.Merge(...)`: Efficiently merges *N* sequences into one, recycling inputs and updating statistics incrementally.
## Use Cases
- Tracking taxonomic assignments across merged reads.
- Aggregating primer or barcode counts in amplicon merging.
- Summarizing quality scores, abundance weights, or custom metadata during consensus building.
## Design Notes
- Uses `sync.RWMutex` for safe concurrent access.
- Supports only JSON-marshalable, serializable statistics (via `MarshalJSON`).
- Enforces type safety: only strings/integers/booleans allowed for attribute values.
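The thread-safe aggregation step can be illustrated with a small count map guarded by a `sync.RWMutex`. This is a sketch under assumptions: `statsOnValues` here is a simplified local type, and taking a snapshot of the other map before locking the receiver is one possible way to avoid holding two locks at once, not necessarily what the package does.

```go
package main

import (
	"fmt"
	"sync"
)

// statsOnValues counts occurrences per attribute value.
type statsOnValues struct {
	mu     sync.RWMutex
	counts map[string]int
}

func newStats() *statsOnValues { return &statsOnValues{counts: map[string]int{}} }

// add records n more occurrences of value.
func (s *statsOnValues) add(value string, n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[value] += n
}

// get returns the current count for value.
func (s *statsOnValues) get(value string) int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.counts[value]
}

// merge adds every count of other into s.
func (s *statsOnValues) merge(other *statsOnValues) {
	// Copy under other's read lock first, so only one lock is held at a time.
	other.mu.RLock()
	snapshot := make(map[string]int, len(other.counts))
	for k, v := range other.counts {
		snapshot[k] = v
	}
	other.mu.RUnlock()

	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range snapshot {
		s.counts[k] += v
	}
}

func main() {
	a, b := newStats(), newStats()
	a.add("sampleA", 2)
	b.add("sampleA", 1)
	b.add("sampleB", 3)
	a.merge(b)
	fmt.Println(a.get("sampleA"), a.get("sampleB")) // prints: 3 3
}
```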
@@ -0,0 +1,19 @@
# BioSequence Pairing Functionality
This package provides semantic tools for managing biological sequence pairings—typically used in genomics (e.g., paired-end reads). Key features:
- **Single-sequence pairing**:
- `IsPaired()` checks if a sequence is currently paired.
- `PairedWith()` returns the linked partner, or `nil`.
- `PairTo(p)` establishes a bidirectional link between two sequences.
- `UnPair()` safely severs the pairing on both ends.
- **Batch (slice) handling**:
- `IsPaired()` and `UnPair()` operate uniformly across all sequences in a slice.
- `PairedWith()` returns the corresponding paired slice (element-wise).
- `PairTo(p)` enforces length compatibility and pairs sequences index-by-index.
- **Error handling**:
- Mismatched slice lengths during `PairTo` trigger a fatal log (via Logrus), preventing inconsistent pairings.
Semantically, the API supports both *atomic* and *bulk* pairing operations while preserving consistency through bidirectional references—ideal for processing paired-end sequencing data.
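The bidirectional link can be pictured with a pair of mutual pointers. The `read` type and its lowercase methods below are hypothetical and only mirror the single-sequence operations listed above.

```go
package main

import "fmt"

// read is a hypothetical stand-in for BioSequence with a mate pointer.
type read struct {
	id     string
	paired *read
}

func (r *read) isPaired() bool    { return r.paired != nil }
func (r *read) pairedWith() *read { return r.paired }

// pairTo links r and p in both directions.
func (r *read) pairTo(p *read) {
	r.paired = p
	if p != nil {
		p.paired = r
	}
}

// unPair clears the link on both ends; calling it on an unpaired read is a no-op.
func (r *read) unPair() {
	if r.paired != nil {
		r.paired.paired = nil
		r.paired = nil
	}
}

func main() {
	fwd, rev := &read{id: "r1/1"}, &read{id: "r1/2"}
	fwd.pairTo(rev)
	fmt.Println(rev.pairedWith().id, fwd.isPaired()) // prints: r1/1 true
	rev.unPair()
	fmt.Println(fwd.isPaired(), rev.isPaired()) // prints: false false
}
```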
@@ -0,0 +1,34 @@
# Semantic Overview of `obiseq` Package Functionalities
This Go package (`obiseq`) provides memory-efficient utilities for managing slices and annotations—key data structures in biosequence processing.
## Slice Management
- **`GetSlice(capacity int) []byte`**
Retrieves a reusable `[]byte` with ≥ requested capacity. For capacities ≤1024 bytes, it pulls from a `sync.Pool` (`_BioSequenceByteSlicePool`). Larger slices are freshly allocated.
- **`RecycleSlice(s *[]byte)`**
Clears and recycles small slices (≤1024 bytes) back to the pool. For large slices (≥100 KB), it nils them and triggers explicit `runtime.GC()` every ~256 MB of discarded memory to prevent heap bloat.
- **`CopySlice(src []byte) []byte`**
Efficiently copies a source slice into a pooled or newly allocated destination, preserving semantics without unnecessary allocations.
## Annotation Management
- **`BioSequenceAnnotationPool`**
A `sync.Pool` of reusable annotation maps (the `Annotation` type, a `map[string]interface{}`), each initialized with capacity 1.
- **`GetAnnotation(values ...Annotation) Annotation`**
Fetches an annotation map from the pool, optionally pre-populated via shallow copy of input annotations using `obiutils.MustFillMap`.
- **`RecycleAnnotation(a *Annotation)`**
Clears all keys from an annotation map and returns it to the pool for reuse.
## Design Rationale
The package prioritizes low-latency, high-throughput scenarios (e.g., NGS data pipelines) by minimizing GC pressure via:
- Tiered pooling strategy (`small` vs `large`)
- Explicit garbage collection triggers for large-object churn
- Safe reuse patterns avoiding aliasing or stale references
All operations are thread-safe via `sync.Pool` and atomic counters.
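The small-slice tier of this strategy can be sketched with a plain `sync.Pool`. The 1024-byte threshold comes from the text; the names `getSlice` and `recycleSlice`, the capacity check on reuse, and the omission of the large-slice `runtime.GC()` bookkeeping are simplifications made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

const smallCap = 1024 // pooling threshold taken from the description above

var pool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, smallCap)
		return &b
	},
}

// getSlice returns an empty slice with at least the requested capacity.
func getSlice(capacity int) []byte {
	if capacity <= smallCap {
		p := pool.Get().(*[]byte)
		if cap(*p) >= capacity {
			return (*p)[:0]
		}
		// A recycled slice that is too small is simply dropped.
	}
	return make([]byte, 0, capacity)
}

// recycleSlice hands small slices back to the pool; larger ones are left
// to the garbage collector in this sketch.
func recycleSlice(s *[]byte) {
	if s == nil || cap(*s) == 0 || cap(*s) > smallCap {
		return
	}
	*s = (*s)[:0] // reset the length, keep the backing array
	pool.Put(s)
}

func main() {
	buf := getSlice(100)
	buf = append(buf, "acgt"...)
	fmt.Println(len(buf), cap(buf) >= 100) // prints: 4 true
	recycleSlice(&buf)
}
```

After `recycleSlice`, the caller must stop using the slice, since the pool may hand its backing array to another goroutine.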
@@ -0,0 +1,33 @@
# Sequence Predicate Framework in `obiseq`
This Go package provides a flexible and composable predicate system for filtering biological sequences (`BioSequence`) based on diverse criteria.
## Core Concepts
- **`SequencePredicate`**: A function type `func(*BioSequence) bool`, enabling conditional logic on sequences.
- **Predicate Composition**: Supports logical operations (`And`, `Or`, `Xor`, `Not`) and chaining.
- **Paired-end Support**: Predicates can be adapted to consider read pairs via `PredicateOnPaired` and `PairedPredicat`, with modes:
- `ForwardOnly`: Only the forward read is evaluated.
- `ReverseOnly`, `And`, `Or`, `AndNot`, `Xor`: Combine forward and reverse evaluations.
## Built-in Predicates
| Predicate | Description |
|-----------|-------------|
| `HasAttribute(name)` | Checks if a sequence has an annotation with the given name. |
| `IsAttributeMatch(name, pattern)` | Tests if a named annotation matches the provided regex (case-sensitive). |
| `IsMoreAbundantOrEqualTo(count)` / `IsLessAbundantOrEqualTo(count)` | Filters by sequence abundance (count field). |
| `IsLongerOrEqualTo(length)` / `IsShorterOrEqualTo(length)` | Filters by sequence length. |
| `OccurInAtleast(sample, n)` | Checks if the sequence appears in at least *n* samples (via description stats). |
| `IsSequenceMatch(pattern)` | Matches the raw sequence against a regex (case-insensitive). |
| `IsDefinitionMatch(pattern)` | Matches the definition/description line against a regex. |
| `IsIdMatch(pattern)` / `IsIdIn(ids...)` | Filters by sequence ID using regex or explicit set. |
| `ExpressionPredicat(expression)` | Evaluates a custom boolean expression (via OBILang) using annotations and sequence metadata. |
## Design Highlights
- **Null-safe**: `nil` predicates are handled gracefully in compositions.
- **Extensible**: Custom predicates can be defined and combined seamlessly.
- **Logging & Safety**: Invalid regex patterns or expression syntax trigger fatal errors; runtime evaluation issues emit warnings.
This framework enables powerful, declarative filtering pipelines for high-throughput sequencing data analysis.
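Predicate composition with nil handling can be sketched in a few lines. The `read` type, the lowercase method names, and the rule that a `nil` operand is simply ignored are assumptions for this illustration; the package's own combinators may treat `nil` differently.

```go
package main

import "fmt"

// read is a hypothetical stand-in for BioSequence.
type read struct {
	seq   []byte
	count int
}

type predicate func(*read) bool

// and combines two predicates; a nil operand is ignored.
func (p predicate) and(q predicate) predicate {
	switch {
	case p == nil:
		return q
	case q == nil:
		return p
	}
	return func(r *read) bool { return p(r) && q(r) }
}

// or combines two predicates; a nil operand is ignored.
func (p predicate) or(q predicate) predicate {
	switch {
	case p == nil:
		return q
	case q == nil:
		return p
	}
	return func(r *read) bool { return p(r) || q(r) }
}

// not negates a predicate; negating nil stays nil.
func (p predicate) not() predicate {
	if p == nil {
		return nil
	}
	return func(r *read) bool { return !p(r) }
}

func isLongerOrEqualTo(n int) predicate {
	return func(r *read) bool { return len(r.seq) >= n }
}

func isMoreAbundantOrEqualTo(n int) predicate {
	return func(r *read) bool { return r.count >= n }
}

func main() {
	keep := isLongerOrEqualTo(4).and(isMoreAbundantOrEqualTo(2)).and(nil)
	fmt.Println(keep(&read{seq: []byte("acgta"), count: 3})) // prints: true
	fmt.Println(keep(&read{seq: []byte("acg"), count: 3}))   // prints: false
}
```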
@@ -0,0 +1,35 @@
# BioSequence Reverse Complement Functionality
This Go package (`obiseq`) provides utilities for computing the reverse complement of biological sequences (e.g., DNA), including support for quality scores and structured metadata.
## Core Functions
- **`nucComplement(n byte) byte`**
Returns the nucleotide complement using a lookup table (`_revcmpDNA`). Handles special cases:
- `.` / `-` → unchanged (gaps)
- `[`, `]` → swapped (`[` ↔ `]`)
- A–Z letters → complemented (case-insensitive via bitwise masking)
- Unknown characters → `'n'`
- **`BioSequence.ReverseComplement(inplace bool) *BioSequence`**
Performs reverse complement on the sequence and (if present) its quality string:
- If `inplace = false`, a copy is made; original preserved.
- Reverses indices and complements each base using `nucComplement`.
- Also reverses the quality array symmetrically.
- Caches result in `sequence.revcomp` for reuse.
- **`BioSequence._revcmpMutation() *BioSequence`**
Adjusts mutation metadata (e.g., `"pairing_mismatches"`) to reflect the reversed-complement orientation:
- Reverses and complements symbolic mutation strings (e.g., `"A>T"` → `"T>A"`).
- Updates positional indices to match reversed sequence coordinates.
- **`ReverseComplementWorker(inplace bool) SeqWorker`**
Returns a reusable `SeqWorker` function for batch processing: applies reverse complement to each sequence in a stream.
## Design Notes
- Uses ASCII bitwise tricks (`&31`, `|0x20`) for case-insensitive indexing and lowercase output.
- Supports non-standard symbols (e.g., IUPAC ambiguity codes via lookup table).
- Integrates quality scores and structured attributes seamlessly.
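The core of the operation, complementing each base while reversing the order and reversing the qualities alongside, can be sketched as follows. The map-based `complement` table and the free function `reverseComplement` are simplifications assumed for this example; caching and mutation-metadata updates are not modelled.

```go
package main

import "fmt"

// complementOf covers the bases and IUPAC ambiguity codes, in lowercase.
var complementOf = map[byte]byte{
	'a': 't', 'c': 'g', 'g': 'c', 't': 'a', 'u': 'a',
	'r': 'y', 'y': 'r', 'k': 'm', 'm': 'k', 's': 's', 'w': 'w',
	'b': 'v', 'v': 'b', 'd': 'h', 'h': 'd', 'n': 'n',
}

// complement returns the lowercase complement of one symbol.
func complement(n byte) byte {
	switch n {
	case '.', '-':
		return n // gaps are kept as they are
	case '[':
		return ']'
	case ']':
		return '['
	}
	if c, ok := complementOf[n|0x20]; ok { // |0x20 folds letters to lowercase
		return c
	}
	return 'n' // unknown symbol
}

// reverseComplement returns new slices; qual may be nil.
func reverseComplement(seq, qual []byte) ([]byte, []byte) {
	n := len(seq)
	rc := make([]byte, n)
	for i, b := range seq {
		rc[n-1-i] = complement(b)
	}
	var rq []byte
	if qual != nil {
		rq = make([]byte, len(qual))
		for i, q := range qual {
			rq[len(qual)-1-i] = q
		}
	}
	return rc, rq
}

func main() {
	rc, rq := reverseComplement([]byte("AACG"), []byte{10, 20, 30, 40})
	fmt.Println(string(rc), rq) // prints: cgtt [40 30 20 10]
}
```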
> Ideal for NGS preprocessing pipelines where orientation matters (e.g., paired-end alignment, variant calling).
@@ -0,0 +1,19 @@
## Semantic Description of `obiseq` Package Functionality
The `obiseq` package provides core bioinformatics utilities for nucleic acid sequence manipulation in Go. It centers around two key operations:
- **Nucleotide Complementation (`nucComplement`)**
Implements standard Watson-Crick base pairing rules: `A↔T`, `C↔G`. It also handles ambiguous or symbolic characters (e.g., `'n' → 'n'`, `'[' ↔ ']'`), preserving non-standard symbols like gaps (`'-'`) and missing data (`'.'`). This function serves as the atomic building block for reverse-complement logic.
- **Reverse Complementation (`BioSequence.ReverseComplement`)**
A method on the `BioSequence` type that returns a new (or in-place modified) sequence representing:
- The *reverse* of the original nucleotide string, followed by
- Each base replaced with its complement (via `nucComplement`).
The method supports two modes:
- **Non-destructive (`inplace=false`)**: Returns a new `BioSequence`, leaving the original unchanged.
- **In-place (`inplace=true`)**: Modifies and returns the same object for memory efficiency.
Crucially, it preserves associated quality scores (e.g., Phred-scaled sequencing qualities), reversing their order to match the reversed sequence—ensuring correctness in downstream analyses like alignment or variant calling.
Tests validate both functions across edge cases: degenerate bases, ambiguous symbols, and quality-aware sequences—confirming robustness for typical NGS (Next-Generation Sequencing) workflows.
@@ -0,0 +1,13 @@
# `obiseq.Subsequence` Functionality Overview
The `Subsequence()` method extracts a contiguous segment from a biological sequence (`BioSequence`), supporting both linear and circular topologies.
- **Input validation**: Checks ensure `from < to` (unless circular), positions are non-negative, and bounds respect sequence length.
- **Circular handling**: Positions exceeding the sequence length wrap around using modular arithmetic; debug logs record corrections.
- **Linear extraction**: When `from < to`, it slices the underlying nucleotide/peptide sequence and, if present, its quality scores.
- **Circular extraction**: When `from > to`, it concatenates two linear segments: from `from` → end, and start → `to`.
- **Metadata preservation**: Quality scores (if available) and annotations are copied to the new subsequence.
- **ID formatting**: The resulting sequence ID is suffixed with `[from..to]` (1-based indexing).
- **Mutation tracking**: A private `_subseqMutation()` adjusts stored pairing mismatch positions by subtracting the extraction shift, ensuring coordinate consistency post-extraction.
This enables robust subsequence generation for genomic analysis workflows involving circular genomes (e.g., plasmids) or fragmented reads.
@@ -0,0 +1,29 @@
# `obiseq` Package: Subsequence Extraction Functionality
The `Subsequence()` method enables extraction of a contiguous segment from biological sequence data (`BioSequence`). It supports both linear and circular (wrapped) slicing.
- **Input Parameters**:
- `from`, `to`: 0-based indices defining the slice range; `from` is inclusive and `to` is exclusive, as the circular example below shows.
- `circular`: boolean flag enabling wrap-around when `from > to`.
- **Behavior**:
- For linear (`circular = false`), `from ≤ to`, and indices within bounds `[0, len(seq))`.
- For circular (`circular = true`), allows wrap-around (e.g., `from=3, to=2` on a 4-mer yields indices `[3,0,1]`).
- Validates inputs: returns descriptive errors for:
- `from > to` (non-circular),
- out-of-bounds indices (`< 0` or `≥ length`),
- invalid ranges.
- **Quality Support**:
- When the sequence includes base quality scores (`BioSequenceWithQualities`), the method preserves the corresponding sub-slice of `Quality[]`.
- **Return Value**:
- Returns a new `BioSequence` (or subclass) instance containing the extracted subsequence and its optional qualities.
- **Use Case**:
- Ideal for region-of-interest extraction (e.g., primer binding sites, domain segments), especially in circular genomes or plasmids.
- **Testing**:
- Unit tests (`TestSubsequence`) cover valid/invalid inputs, circular/non-circular modes, and quality consistency.
This functionality provides robust, semantics-aware slicing for biosequence manipulation in Go.
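The wrap-around example in the behavior list (`from=3, to=2` on a 4-mer) can be checked with a few lines of modular arithmetic. The helper below is hypothetical and only enumerates the visited positions; it assumes `to` is exclusive and that both indices are already within `[0, n)`.

```go
package main

import "fmt"

// wrapIndices lists the positions visited when reading from `from` up to,
// but not including, `to` on a circular sequence of length n.
// It assumes 0 <= from < n and 0 <= to < n. In this sketch, from == to
// yields no positions.
func wrapIndices(from, to, n int) []int {
	if n <= 0 {
		return nil
	}
	// Number of positions to read, counted modulo n so that it stays
	// non-negative when the range wraps past the end.
	count := ((to-from)%n + n) % n
	idx := make([]int, 0, count)
	for i := 0; i < count; i++ {
		idx = append(idx, (from+i)%n)
	}
	return idx
}

func main() {
	fmt.Println(wrapIndices(3, 2, 4)) // wrapped range: [3 0 1]
	fmt.Println(wrapIndices(1, 3, 4)) // plain linear range: [1 2]
}
```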
@@ -0,0 +1,26 @@
# Taxonomic Classification via `TaxonomyClassifier`
The `obiseq` package provides a taxonomic classification mechanism through the `TaxonomyClassifier` function.
- **Purpose**: Constructs a reusable classifier for biological sequences based on taxonomic hierarchy.
- **Inputs**:
- `taxonomicRank`: Target rank (e.g., `"species"`, `"genus"`).
- `taxonomy`: Reference taxonomy (`*obitax.Taxonomy`), with fallback via `.OrDefault(true)`.
- `abortOnMissing`: Boolean flag to enforce strict taxon resolution.
- **Core Logic**:
- For each sequence, retrieves its `Taxon`, then drills down to the requested rank using `.TaxonAtRank()`.
- If `abortOnMissing` is true, exits on failure to resolve the taxon or rank.
- Internally maps `*TaxNode`s to integer codes for efficient storage/comparison.
- **Returned Object (`BioSequenceClassifier`)**:
- `Code(sequence) int`: Assigns a unique integer code to the taxonomic assignment of a sequence.
- `Value(code) string`: Returns the scientific name corresponding to a code.
- `Reset()`: Reinitializes internal mappings (useful for batch processing).
- `Clone() *BioSequenceClassifier`: Creates a fresh, identical classifier instance.
- **Design Rationale**:
- Uses integer codes to avoid repeated string operations and enable fast indexing (e.g., for counting).
- Supports both strict (`abortOnMissing=true`) and lenient classification modes.
This design enables scalable, efficient taxonomic profiling of sequencing datasets.
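To illustrate the integer-coding idea behind the returned classifier, here is a minimal sketch. It is not the `BioSequenceClassifier` type itself: `Classifier` and `NewClassifier` are hypothetical, taxa are keyed by scientific name instead of `*TaxNode`, and no locking is shown.

```go
package main

import "fmt"

// Classifier is a hypothetical, simplified analogue of the classifier
// described above: it maps each distinct taxon name to a small integer.
type Classifier struct {
	codes map[string]int // taxon name -> code
	names []string       // code -> taxon name
}

func NewClassifier() *Classifier {
	return &Classifier{codes: map[string]int{}}
}

// Code returns the code of a taxon, allocating the next free one on first sight.
func (c *Classifier) Code(taxon string) int {
	if code, ok := c.codes[taxon]; ok {
		return code
	}
	code := len(c.names)
	c.codes[taxon] = code
	c.names = append(c.names, taxon)
	return code
}

// Value returns the taxon name for a code, or "NA" for an unknown code.
func (c *Classifier) Value(code int) string {
	if code < 0 || code >= len(c.names) {
		return "NA"
	}
	return c.names[code]
}

// Reset forgets all mappings so that codes start again from zero.
func (c *Classifier) Reset() {
	c.codes = map[string]int{}
	c.names = nil
}

func main() {
	c := NewClassifier()
	reads := []string{"Homo sapiens", "Pan troglodytes", "Homo sapiens"}
	counts := map[int]int{}
	for _, taxon := range reads {
		counts[c.Code(taxon)]++ // counting by integer code, not by string
	}
	for code, n := range counts {
		fmt.Println(c.Value(code), n)
	}
}
```

Because codes are dense and start at zero, a plain slice indexed by code could replace the `counts` map when the number of taxa is known.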
@@ -0,0 +1,22 @@
# Taxonomic Analysis Functions in `obiseq` Package
This module provides tools for assigning taxonomic labels to biological sequences using a reference taxonomy.
- **`TaxonomicDistribution(taxonomy)`**:
Returns a map from taxonomic nodes to read counts, based on `taxid` annotations in the sequence metadata. It validates taxids against the taxonomy and enforces strict handling of aliases.
- **`LCA(taxonomy, threshold)`**:
Computes the *Lowest Common Ancestor* (LCA) of all taxonomic assignments for a sequence, weighted by their abundances.
- Iteratively traverses upward from each taxon's path in the taxonomy tree.
- At each level, computes the relative weight (`rmax`) of the most frequent taxon.
- Stops when `rmax < threshold`, returning:
• the LCA taxon,
• its confidence score (`rans`), and
• total read count used.
- **`AddLCAWorker(...)`**:
Creates a `SeqWorker` function to annotate sequences with LCA results:
- Sets attributes like `<slot>_taxid`, `<slot>_name`, and `<slot>_error` (rounded to 3 decimals).
- Automatically appends `_taxid` if missing in `slot_name`.
All functions integrate with the OBITools4 ecosystem, supporting robust taxonomic inference for metabarcoding workflows.
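The threshold rule described for `LCA` can be sketched level by level. The code below is an assumption-laden illustration, not the package's algorithm: `Assignment` and `weightedLCA` are hypothetical, lineages are plain string slices listed root first with unique node names, and the walk descends from the root while the heaviest node still holds at least `threshold` of the total weight.

```go
package main

import "fmt"

// Assignment is a hypothetical pairing of a lineage with a read count.
type Assignment struct {
	Path   []string // lineage, root first; node names are assumed unique
	Weight int      // read count, assumed non-negative
}

// weightedLCA returns the deepest node whose share of the total weight is
// still >= threshold, that share (rans), and the total weight.
func weightedLCA(assignments []Assignment, threshold float64) (string, float64, int) {
	total := 0
	for _, a := range assignments {
		total += a.Weight
	}
	if total == 0 {
		return "", 0, 0
	}
	answer, rans := "", 1.0
	for depth := 0; ; depth++ {
		// Sum the weights per node at this depth, following only the
		// lineages that pass through the node selected one level up.
		level := map[string]int{}
		for _, a := range assignments {
			if depth >= len(a.Path) {
				continue
			}
			if depth > 0 && a.Path[depth-1] != answer {
				continue
			}
			level[a.Path[depth]] += a.Weight
		}
		// Pick the heaviest node; ties are broken by name to stay deterministic.
		best, bestW := "", 0
		for node, w := range level {
			if w > bestW || (w == bestW && node < best) {
				best, bestW = node, w
			}
		}
		rmax := float64(bestW) / float64(total)
		if bestW == 0 || rmax < threshold {
			return answer, rans, total
		}
		answer, rans = best, rmax
	}
}

func main() {
	reads := []Assignment{
		{Path: []string{"root", "Mammalia", "Homo", "Homo sapiens"}, Weight: 6},
		{Path: []string{"root", "Mammalia", "Homo", "Homo neanderthalensis"}, Weight: 3},
		{Path: []string{"root", "Mammalia", "Pan", "Pan troglodytes"}, Weight: 1},
	}
	taxon, rans, total := weightedLCA(reads, 0.8)
	fmt.Println(taxon, rans, total)
}
```

With these weights and a threshold of 0.8, the genus level still carries 9 of 10 reads, while no single species reaches the threshold, so the walk stops at the genus.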
@@ -0,0 +1,41 @@
# Taxonomic Annotation Features in `obiseq` Package
This package provides semantic taxonomic annotation capabilities for biological sequences (`BioSequence`). It integrates with a taxonomy database to assign, retrieve, and manage taxonomic identifiers (taxids) and related metadata.
## Core Functions
- **`Taxid()`**: Retrieves the taxonomic ID as a string (e.g., `"12345"`), converting from any of the supported internal representations (`string`, `int`, `float64`). Returns `"NA"` if no taxid is set.
- **`Taxon(taxonomy)`**: Returns the corresponding `*obitax.Taxon` object, or `nil` if taxid is `"NA"`.
- **`SetTaxid(taxid, rank...)`**: Assigns a taxonomic ID to the sequence. Validates against default taxonomy; handles aliases and errors based on configuration flags (`FailOnTaxonomy`, `UpdateTaxid`). Optionally stores taxid under a custom rank (e.g., `"genus_taxid"`).
- **`SetTaxon(taxon, rank...)`**: Assigns a `*obitax.Taxon` object directly; stores its string representation as taxid.
## Rank-Specific Annotation
- **`SetTaxonAtRank(taxonomy, rank)`**: Annotates the sequence with taxid and scientific name at a specified Linnaean rank (e.g., `"species"`, `"genus"`). Sets two attributes: `rank_taxid` and `rank_name`. Returns the taxon at that rank (or `nil`).
- **Convenience wrappers**:
- `SetSpecies(...)`
- `SetGenus(...)`
- `SetFamily(...)`
All delegate to `SetTaxonAtRank`.
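As a rough illustration of rank-specific annotation, the sketch below climbs a lineage until it meets the requested rank and then writes the two attributes. `Node`, `taxonAtRank`, and `setTaxonAtRank` are hypothetical stand-ins for the `obitax` types and the method above, and the behavior when the rank is absent (attributes left untouched) is an assumption of this sketch.

```go
package main

import "fmt"

// Node is a hypothetical taxonomy node with a link to its parent.
type Node struct {
	Taxid, Name, Rank string
	Parent            *Node
}

// taxonAtRank climbs from n toward the root and returns the first node
// whose rank matches, or nil when the lineage has no such rank.
func taxonAtRank(n *Node, rank string) *Node {
	for ; n != nil; n = n.Parent {
		if n.Rank == rank {
			return n
		}
	}
	return nil
}

// setTaxonAtRank stores "<rank>_taxid" and "<rank>_name" in attrs and
// returns the node found. When the rank is absent it leaves attrs
// untouched (an assumption of this sketch).
func setTaxonAtRank(attrs map[string]string, taxon *Node, rank string) *Node {
	found := taxonAtRank(taxon, rank)
	if found != nil {
		attrs[rank+"_taxid"] = found.Taxid
		attrs[rank+"_name"] = found.Name
	}
	return found
}

func main() {
	family := &Node{Taxid: "9604", Name: "Hominidae", Rank: "family"}
	genus := &Node{Taxid: "9605", Name: "Homo", Rank: "genus", Parent: family}
	species := &Node{Taxid: "9606", Name: "Homo sapiens", Rank: "species", Parent: genus}

	attrs := map[string]string{}
	setTaxonAtRank(attrs, species, "genus")
	fmt.Println(attrs["genus_taxid"], attrs["genus_name"])
}
```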
## Taxonomic Path & Metadata
- **`SetPath(taxonomy)`**: Computes and stores the full taxonomic lineage (from root to species) as a string slice under attribute `"taxonomic_path"`.
- **`Path()`**: Retrieves the stored taxonomic path; recomputes it if missing and a default taxonomy exists.
- **`SetScientificName(taxonomy)`**: Stores the sequence's species-level scientific name under `"scientific_name"`.
- **`SetTaxonomicRank(taxonomy)`**: Stores the taxon's rank (e.g., `"species"`, `"genus"`) under `"taxonomic_rank"`.
## Error Handling & Configuration
- Uses `logrus` and custom logging (`obilog`) for warnings/errors.
- Behavior on taxonomy mismatches (e.g., unknown taxid, alias) is configurable via `obidefault` settings.
- Ensures type consistency: taxid must be a `string`, `int`, or `float64`; invalid types trigger fatal errors.
All methods are designed for seamless integration into bioinformatics pipelines, enabling robust taxonomic profiling of sequencing data.
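The way `Taxid()` is described as normalizing several stored representations to a string can be sketched with a type switch. `taxidString` is a hypothetical helper, and it returns an error for unsupported types where the text above says the package exits fatally.

```go
package main

import (
	"fmt"
	"strconv"
)

// taxidString converts a stored taxid value to its string form.
// A missing value (nil) becomes "NA". Unsupported types are reported as
// an error in this sketch instead of a fatal exit.
func taxidString(raw interface{}) (string, error) {
	switch v := raw.(type) {
	case nil:
		return "NA", nil
	case string:
		return v, nil
	case int:
		return strconv.Itoa(v), nil
	case float64:
		// Values decoded from JSON often arrive as float64; the
		// fractional part is dropped.
		return strconv.Itoa(int(v)), nil
	default:
		return "", fmt.Errorf("taxid has unsupported type %T", raw)
	}
}

func main() {
	for _, raw := range []interface{}{nil, "9606", 9606, 9606.0, []int{1}} {
		s, err := taxidString(raw)
		fmt.Printf("%v -> %q (err: %v)\n", raw, s, err)
	}
}
```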
@@ -0,0 +1,20 @@
# Semantic Description of `obiseq` Package Functionalities
This Go package provides **sequence filtering predicates** for biological sequences, integrated with taxonomic validation and hierarchy analysis.
- `IsAValidTaxon(taxonomy, ...bool) SequencePredicate`:
Returns a predicate that checks whether a sequence has an associated valid taxon in the given taxonomy.
Optionally supports *auto-correction* of outdated/incorrect `taxid` values to match the current taxonomy node.
- `IsSubCladeOf(taxonomy, parent) SequencePredicate`:
Filters sequences whose taxonomic assignment is a descendant (sub-clade) of the specified `parent` taxon.
- `IsSubCladeOfSlot(taxonomy, key) SequencePredicate`:
Enables filtering based on a *sequence attribute* (e.g., `"taxon"` or `"classification"`) that holds a taxonomic label.
Validates the label against the taxonomy, then checks if the sequence's assigned taxon falls under it.
- `HasRequiredRank(taxonomy, rank) SequencePredicate`:
Ensures the sequence's taxon is assigned at or below a specified rank (e.g., `"species"`, `"genus"`).
Validates the requested `rank` against the taxonomy's rank list; exits on invalid input.
All predicates follow a functional, composable design pattern (`SequencePredicate = func(*BioSequence) bool`), enabling flexible pipeline construction (e.g., filtering, classification validation).
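A small sketch of how such predicates compose. Only the shape `func(*BioSequence) bool` comes from the text above; everything else is assumed for illustration: the `Seq` type, a taxonomy reduced to a child-to-parent map, the `both` combinator, and the choice to count a taxon as a sub-clade of itself.

```go
package main

import "fmt"

// Seq is a hypothetical sequence carrying only a taxon label.
type Seq struct{ Taxid string }

// SequencePredicate mirrors the predicate shape described above.
type SequencePredicate func(*Seq) bool

// Taxonomy is a toy taxonomy: node -> parent, with "" as the root's parent.
type Taxonomy map[string]string

// IsAValidTaxon accepts sequences whose label exists in the taxonomy.
func IsAValidTaxon(tax Taxonomy) SequencePredicate {
	return func(s *Seq) bool {
		_, ok := tax[s.Taxid]
		return ok
	}
}

// IsSubCladeOf accepts sequences whose lineage passes through parent.
// In this sketch a taxon counts as a sub-clade of itself.
func IsSubCladeOf(tax Taxonomy, parent string) SequencePredicate {
	return func(s *Seq) bool {
		cur := s.Taxid
		// The step bound guards against cycles in a malformed map.
		for steps := 0; cur != "" && steps <= len(tax); steps++ {
			if cur == parent {
				return true
			}
			cur = tax[cur]
		}
		return false
	}
}

// both is a hypothetical combinator: true only when p and q both hold.
func both(p, q SequencePredicate) SequencePredicate {
	return func(s *Seq) bool { return p(s) && q(s) }
}

func main() {
	tax := Taxonomy{
		"root": "", "Mammalia": "root", "Aves": "root",
		"Homo": "Mammalia", "Homo sapiens": "Homo",
	}
	isMammal := both(IsAValidTaxon(tax), IsSubCladeOf(tax, "Mammalia"))
	for _, s := range []*Seq{{Taxid: "Homo sapiens"}, {Taxid: "Aves"}, {Taxid: "unknown"}} {
		fmt.Println(s.Taxid, isMammal(s))
	}
}
```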
@@ -0,0 +1,22 @@
# Taxonomic Annotation Workers in `obiseq`
This Go package provides functional workers for annotating biological sequences with taxonomic information using a hierarchical taxonomy (e.g., from NCBI or UNITE). Each worker is implemented as a `SeqWorker`—a function that processes one sequence and returns an updated slice of sequences.
- **`MakeSetTaxonAtRankWorker(taxonomy, rank)`**:
Assigns a taxonomic label at *a specific rank* (e.g., `"genus"`, `"family"`). Validates that the requested `rank` exists in the taxonomy before proceeding.
- **`MakeSetSpeciesWorker(taxonomy)`**:
Annotates each sequence with its inferred species name using the provided taxonomy.
- **`MakeSetGenusWorker(taxonomy)`**:
Adds genus-level taxonomic assignment to sequences.
- **`MakeSetFamilyWorker(taxonomy)`**:
Adds family-level taxonomic assignment.
- **`MakeSetPathWorker(taxonomy)`**:
Populates the full taxonomic path (e.g., `"Eukaryota;Metazoa;Chordata;..."`) for each sequence.
All workers rely on methods of `BioSequence` (e.g., `.SetSpecies()`, `.SetPath()`), which internally use the `obitax.Taxonomy` object to resolve taxonomic IDs or names. Errors are logged via `logrus`; invalid ranks cause a fatal exit.
These utilities support modular, pipeline-friendly taxonomic annotation—ideal for high-throughput metabarcoding workflows.
@@ -0,0 +1,18 @@
# Semantic Description of `obiseq` Package Functionalities
The `obiseq` package provides composable, higher-order worker functions for processing biological sequence data in Go. It defines three core functional types:
- `SeqAnnotator`: In-place annotation of a single sequence (e.g., adding metadata).
- `SeqWorker`: Processes one sequence and returns zero or more output sequences (1→N transformation, where N may be 0).
- `SeqSliceWorker`: Processes a slice of sequences and returns another slice (bulk pipeline stage).
Key utilities include:
- **`NilSeqWorker`**: Identity worker—returns the input sequence unchanged.
- **`AnnotatorToSeqWorker`**: Converts an in-place annotator into a `SeqWorker`, preserving compatibility with pipeline interfaces.
- **`SeqToSliceWorker`**: Lifts a `SeqWorker` to operate on slices, with configurable error handling (`breakOnError`). Supports dynamic slice growth and logging via `obilog`.
- **`SeqToSliceFilterOnWorker`**: Filters sequences in a slice using a `SequencePredicate`, preserving order and avoiding unnecessary allocations.
- **`SeqToSliceConditionalWorker`**: Applies a `SeqWorker` only to sequences satisfying a predicate; others pass through unchanged.
- **`.ChainWorkers()`**: Method on `SeqWorker` to compose two workers sequentially (pipeline chaining), enabling modular, reusable workflows.
All functions emphasize safety: errors are either propagated (`breakOnError = true`) or logged with warnings, ensuring robustness in large-scale sequence processing pipelines.
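The relationship between the three functional types can be sketched as below. The type and function names follow the list above, but the signatures, the `Seq` type, and the `rejectEmptyID` worker are assumptions made for illustration; the real helpers operate on `BioSequence` values and log through `obilog`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Seq is a hypothetical stand-in for BioSequence.
type Seq struct {
	ID    string
	Attrs map[string]string
}

type SeqAnnotator func(*Seq)
type SeqWorker func(*Seq) ([]*Seq, error)
type SeqSliceWorker func([]*Seq) ([]*Seq, error)

// AnnotatorToSeqWorker wraps an in-place annotator as a one-to-one worker.
func AnnotatorToSeqWorker(a SeqAnnotator) SeqWorker {
	return func(s *Seq) ([]*Seq, error) {
		a(s)
		return []*Seq{s}, nil
	}
}

// ChainWorkers feeds every output of w into next and gathers the results.
func (w SeqWorker) ChainWorkers(next SeqWorker) SeqWorker {
	return func(s *Seq) ([]*Seq, error) {
		mid, err := w(s)
		if err != nil {
			return nil, err
		}
		var out []*Seq
		for _, m := range mid {
			res, err := next(m)
			if err != nil {
				return nil, err
			}
			out = append(out, res...)
		}
		return out, nil
	}
}

// SeqToSliceWorker lifts w to slices. On error it either stops
// (breakOnError) or logs a warning and skips the offending sequence.
func SeqToSliceWorker(w SeqWorker, breakOnError bool) SeqSliceWorker {
	return func(in []*Seq) ([]*Seq, error) {
		out := make([]*Seq, 0, len(in))
		for _, s := range in {
			res, err := w(s)
			if err != nil {
				if breakOnError {
					return nil, err
				}
				log.Printf("warning: skipping %q: %v", s.ID, err)
				continue
			}
			out = append(out, res...)
		}
		return out, nil
	}
}

// rejectEmptyID is a hypothetical worker that fails on sequences without an ID.
func rejectEmptyID(s *Seq) ([]*Seq, error) {
	if s.ID == "" {
		return nil, errors.New("empty id")
	}
	return []*Seq{s}, nil
}

func main() {
	tag := AnnotatorToSeqWorker(func(s *Seq) { s.Attrs["seen"] = "true" })
	pipeline := SeqToSliceWorker(SeqWorker(rejectEmptyID).ChainWorkers(tag), false)
	out, err := pipeline([]*Seq{
		{ID: "a", Attrs: map[string]string{}},
		{ID: "", Attrs: map[string]string{}},
	})
	fmt.Println(len(out), err)
}
```

Returning a slice from every worker is what lets a single stage drop a sequence (empty slice) or split it into several without changing the pipeline's shape.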