⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,22 @@
+# BioSequence Attribute Management API
+
+This Go package (`obiseq`) provides a rich set of methods for managing metadata and structural attributes associated with biological sequences (`BioSequence`). Below is a semantic overview of the core functionalities:
+
+- **Key Discovery & Existence Checks**:  
+  - `Keys()` and `AttributeKeys()` return all attribute names (optionally excluding container/statistics fields or the `"definition"` key).  
+  - `HasAttribute(key)` verifies presence of a given attribute (including standard fields: `"id"`, `"sequence"`, `"qualities"`).
+
+- **Generic Attribute Access**:  
+  - `GetAttribute(key)` retrieves any attribute value (as `interface{}`), with thread-safe locking.  
+  - `SetAttribute(key, value)` assigns values to attributes (including automatic conversion for `"id"`, `"sequence"` and `"qualities"`).
+
+- **Typed Attribute Retrieval**:  
+  - Type-specific getters (`GetIntAttribute`, `GetFloatAttribute`, `GetStringAttribute`, etc.) ensure safe conversion and *auto-upgrade* of stored values (e.g., string `"42"` → integer `42`).  
+  - Supports maps (`GetIntMap`, `GetStringMap`) and slices (`GetIntSlice`).
+
+- **Convenience & Domain-Specific Helpers**:  
+  - `Count()` / `SetCount()`: manage observation frequency (default = 1).  
+  - OBITag indexing: `OBITagRefIndex()` / `SetOBITagRefIndex()`, and geometry variants (`geomref`). Supports flexible input map types with dynamic conversion.  
+  - Coordinate & landmark support: `GetCoordinate()` / `SetCoordinate()`, and `landmark_id`-based operations (`IsALandmark()`, `GetLandmarkID()`).
+
+All methods are designed for robustness: they handle type conversions gracefully, use locking to ensure concurrency safety, and provide fallbacks (e.g., default count = 1). The API abstracts internal storage (`annotations` map) while exposing a clean, consistent interface for sequence annotation manipulation.
@@ -0,0 +1,41 @@
+# BioSequence: A High-Performance Biological Sequence Representation
+
+The `obiseq` package defines the `BioSequence` struct, a memory-efficient and thread-safe container for biological DNA sequences. Beyond raw sequence data (`[]byte`), it supports rich metadata and operations essential for NGS pipelines.
+
+## Core Features
+
+- **Metadata Fields**:  
+  - `id`: Unique sequence identifier.  
+  - `source`: Filename (without path/extension) of origin.  
+  - `definition`: Optional descriptive text, stored in annotations.
+
+- **Sequence & Quality Support**:  
+  - Stores sequence as lowercase `[]byte` (normalized via in-place lowercasing).  
+  - Quality scores (`Quality = []uint8`) with fallback to default Phred+40 values when missing.  
+  - Methods for incremental writing (`Write`, `WriteByte`) and clearing.
+
+- **Annotations & Features**:  
+  - Generic `Annotation` map (`map[string]interface{}`) for flexible metadata.  
+  - Thread-safe access via `annot_lock` mutex (explicit locking/unlocking methods).  
+  - Raw feature table storage (`[]byte`, e.g., EMBL/GenBank features).
+
+- **Biological Relationships**:  
+  - `paired`: Pointer to mate/read-pair sequence.  
+  - `revcomp`: Pointer to reverse-complement variant (lazy or precomputed).
+
+- **Introspection & Utility**:  
+  - `Len()`, `HasSequence()`, `Composition()` (nucleotide counts: a,c,g,t,o).  
+  - MD5 checksums (`MD5()` and `MD5String()`) for deduplication.  
+  - Memory footprint estimation (`MemorySize()`), critical for streaming/batching.
+
+- **Efficiency Optimizations**:  
+  - `NewBioSequenceOwning`/`TakeQualities`: Zero-copy slice adoption (caller must not reuse input).  
+  - `Recycle()`: Reuses slices via pool-aware functions (`RecycleSlice`, etc.).  
+  - Global counters track creation/destruction/in-memory sequences for diagnostics.
+
+- **Safety & Compatibility**:  
+  - Copy semantics via `Copy()` (deep copy of slices + annotations).  
+  - Validation: `HasValidSequence` enforces allowed characters (`a-z`, `-`, `.`, `[`, `]`).  
+  - Uses unsafe string conversion for quality ASCII output (Phred shift configurable via `obidefault`).
+
+Designed for scalability in large-scale metabarcoding workflows (e.g., OBITools4), balancing performance, correctness, and extensibility.
@@ -0,0 +1,35 @@
+# `obiseq` Package: Semantic Overview
+
+The `obiseq` package provides a robust, thread-safe implementation of biological sequence objects in Go. It defines the core `BioSequence` type and associated utilities for handling nucleotide sequences (DNA/RNA), quality scores, annotations, features, memory management, and metadata operations.
+
+### Core Functionalities
+
+- **Construction & Initialization**  
+  - `NewEmptyBioSequence(cap)` creates an empty sequence with optional preallocated capacity.  
+  - `NewBioSequence(id, seq, def)` builds a basic sequence with ID (case-normalized), byte-level sequence (`[]byte`), and definition.  
+  - `NewBioSequenceWithQualities(...)` extends the above with per-base quality scores (`[]byte` or `Quality`).  
+
+- **Accessors & Properties**  
+  - `Id()`, `Definition()` return metadata fields.  
+  - `Sequence()` returns the normalized (lowercase) sequence as a copy of internal bytes.  
+  - `Len()` returns the length (number of bases).  
+  - `String()` provides a human-readable sequence string.  
+
+- **Quality & Feature Support**  
+  - `HasQualities()` checks if quality scores are present.  
+  - `Qualities()`, `SetQualities(...)` manage per-base quality data (with fallback to default values).  
+  - `Features()` retrieves optional feature annotations as a string.  
+
+- **Annotation System**  
+  - `Annotations()`, `HasAnnotation()` allow inspection of arbitrary metadata (key-value map).  
+  - Thread-safe via internal `sync.Mutex`, exposed through `AnnotationsLock()`.  
+
+- **Utility & Safety**  
+  - `Recycle()` safely resets internal slices and annotations (enables object pooling). Handles nil receivers gracefully.  
+  - `Copy()` performs deep copy of all fields, including annotations and locks (new mutex).  
+  - `MD5()` computes the MD5 hash of the sequence bytes.  
+
+- **Analysis Methods**  
+  - `Composition()` returns a nucleotide count map (`a`, `c`, `g`, `t`, and `'o'` for others), case-insensitive.
+
+All operations are designed with performance, safety (nil-safety, copy semantics), and extensibility in mind—ideal for bioinformatics pipelines requiring immutable or pooled sequence handling.
@@ -0,0 +1,37 @@
+# `obiseq` Package: BioSequence Collection Management
+
+The `obiseq` package provides a high-performance, memory-efficient implementation for managing collections of biological sequences (`BioSequence`) in Go. Its core type is `BioSequenceSlice`, a slice of pointers to `BioSequence` objects, optimized for batch processing in metagenomic pipelines.
+
+### Key Functionalities
+
+- **Memory Pooling & Allocation Control**:  
+  `NewBioSequenceSlice` and `MakeBioSequenceSlice` allow creating slices with optional capacity hints.  
+  `EnsureCapacity(capacity)` dynamically grows the underlying slice while logging warnings or panicking on persistent allocation failures.
+
+- **Efficient Element Management**:  
+  - `Push(sequence)`: Appends a sequence to the end.  
+  - `Pop()`: Removes and returns the last element (nil-safe).  
+  - `Pop0()`: Efficiently removes and returns the first element.  
+
+- **Collection Metadata Queries**:  
+  - `Len()`: Returns number of sequences in the slice.  
+  - `Size()`: Computes total sequence length (summing all `.Len()`).  
+  - `NotEmpty()`: Boolean check for non-empty collections.  
+
+- **Attribute Aggregation**:  
+  `AttributeKeys(skip_map, skip_definition)` aggregates all attribute keys across sequences into a set—useful for schema inference or validation.
+
+- **Sorting Capabilities**:  
+  - `SortOnCount(reverse)`: Sorts by read count (descending/ascending).  
+  - `SortOnLength(reverse)`: Sorts by sequence length.
+
+- **Taxonomy Integration**:  
+  `ExtractTaxonomy(taxonomy, seqAsTaxa)` builds or extends a taxonomic tree from sequence paths.  
+  When `seqAsTaxa=true`, it injects pseudo-taxonomic labels for individual sequences (e.g., `OTU:SEQ0000012345 [seqID]@sequence`), enabling unified taxonomic/rarefaction workflows.
+
+### Design Highlights
+
+- Minimal allocations via manual slice management and `slices.Grow`.  
+- Explicit niling of popped elements to aid garbage collection.  
+- Integrated logging (via `logrus`) for allocation issues—critical in large-scale NGS data processing.  
+- Designed to support `BioSequenceBatch`, a higher-level abstraction for streaming or parallelizable sequence batches.
@@ -0,0 +1,32 @@
+# BioSequence Classifier Module Overview  
+
+This Go package (`obiseq`) provides a flexible and thread-safe framework for classifying biological sequences using different strategies. Each classifier implements four core methods:  
+- `Code(sequence) int`: assigns an integer class to a sequence.  
+- `Value(k) string`: retrieves the original value (or representation) for class index *k*.  
+- `Reset()`: clears internal state.  
+- `Clone() *BioSequenceClassifier`: creates a fresh copy of the classifier.
+
+## Supported Classifier Types  
+
+1. **`AnnotationClassifier(key, na)`**  
+   Classifies sequences based on a single annotation field. Missing annotations default to `na`. Internally maps string values → integer codes via a thread-safe dictionary.
+
+2. **`DualAnnotationClassifier(key1, key2, na)`**  
+   Uses *two* annotation fields. Combines them (as JSON array) to form unique class identifiers, enabling multi-dimensional classification.
+
+3. **`PredicateClassifier(predicate)`**  
+   Binary classifier: returns `1` if the provided predicate function evaluates to true, else `0`. Useful for rule-based grouping (e.g., length > 200).
+
+4. **`HashClassifier(size)`**  
+   Assigns sequences to one of `size` buckets via CRC32 hash of the raw sequence. Deterministic and memory-efficient, but may cause collisions.
+
+5. **`SequenceClassifier()`**  
+   Unique class per *exact* sequence string (case-sensitive). Uses a lock-protected map to deduplicate and index sequences.
+
+6. **`RotateClassifier(size)`**  
+   Cyclic assignment: sequence *i* → class `i mod size`. No memoization; state resets only manually.
+
+7. **`CompositeClassifier(...)`**  
+   Combines multiple classifiers: concatenates their integer outputs (e.g., `"3:17:0"`) to form a composite class key. Enables layered or hierarchical classification.
+
+All classifiers keep their state internal and synchronized, so a single instance can be shared by concurrent pipeline stages; use `Reset()` to clear that state or `Clone()` to obtain a fresh instance.
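The bucket-style classifiers described above can be illustrated with a minimal, standalone sketch. It does not use the `obiseq` types: `hashBucket` and `rotator` are hypothetical names, the IEEE CRC32 polynomial is an assumption (the description only says "CRC32"), and the rotator is left unsynchronized for brevity.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// hashBucket is a hypothetical helper: it maps a raw sequence to one of
// `size` buckets using the CRC32 (IEEE, assumed) checksum of its bytes.
func hashBucket(seq []byte, size int) int {
	return int(crc32.ChecksumIEEE(seq) % uint32(size))
}

// rotator is a hypothetical cyclic classifier: the i-th call returns i mod size.
// It is not safe for concurrent use; a real implementation would need a lock.
type rotator struct {
	size int
	next int
}

// code returns the next class and advances the cycle.
func (r *rotator) code() int {
	c := r.next
	r.next = (r.next + 1) % r.size
	return c
}

// reset restarts the cycle at class 0.
func (r *rotator) reset() { r.next = 0 }

func main() {
	// The same sequence always lands in the same bucket.
	fmt.Println(hashBucket([]byte("acgt"), 8) == hashBucket([]byte("acgt"), 8))

	r := &rotator{size: 3}
	for i := 0; i < 5; i++ {
		fmt.Print(r.code(), " ")
	}
	fmt.Println()
}
```

Distinct sequences can share a bucket, which is the collision caveat noted for `HashClassifier`.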
@@ -0,0 +1,20 @@
+# Semantic Description of `obiseq` Comparison Functions
+
+The `obiseq` package provides utility functions for comparing biological sequence records (`*BioSequence`) based on different fields. These comparators are designed to support sorting, deduplication, or grouping operations in bioinformatics workflows.
+
+- **`CompareSequence(a, b *BioSequence) int`**  
+  Compares the raw nucleotide or amino acid sequences (`a.sequence`) lexicographically using `bytes.Compare`. Returns:
+  - `<0` if `a < b`,  
+  - `0` if equal,  
+  - `>0` if `a > b`.
+
+- **`CompareQuality(a, b *BioSequence) int`**  
+  Compares the base quality scores (`a.qualities`) lexicographically (as byte strings), following same semantics as above. Useful for sorting reads by quality profiles.
+
+- **Commented-out `CompareAttributeBuilder(key string)`**  
+  A planned higher-order function to generate custom comparators based on sequence attributes (e.g., `RG`, `NM`). It would:
+  - Extract attribute values using `.GetAttribute(key)`.
+  - Handle missing attributes (treat absent as "less than" present).
+  - Eventually support typed comparisons for ordered types.
+
+These functions assume `BioSequence` implements a consistent internal structure with `.sequence []byte` and `.qualities []byte`. They enable flexible, field-based ordering in collections of sequencing records.
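As a rough illustration of how such a comparator plugs into sorting, the sketch below uses a hypothetical `record` type in place of `BioSequence` and the standard `slices.SortFunc`; only the use of `bytes.Compare` comes from the description above.

```go
package main

import (
	"bytes"
	"fmt"
	"slices"
)

// record is a hypothetical stand-in for BioSequence.
type record struct {
	id       string
	sequence []byte
}

// compareSequence orders records lexicographically by their sequence bytes.
func compareSequence(a, b *record) int {
	return bytes.Compare(a.sequence, b.sequence)
}

// sortedIDs sorts the records in place and returns their IDs in sorted order.
func sortedIDs(recs []*record) []string {
	slices.SortFunc(recs, compareSequence)
	ids := make([]string, len(recs))
	for i, r := range recs {
		ids[i] = r.id
	}
	return ids
}

func main() {
	recs := []*record{
		{id: "s1", sequence: []byte("ttg")},
		{id: "s2", sequence: []byte("acg")},
		{id: "s3", sequence: []byte("acgt")},
	}
	fmt.Println(sortedIDs(recs))
}
```

A prefix sorts before any longer sequence that extends it, so `acg` precedes `acgt`.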
@@ -0,0 +1,28 @@
+# Semantic Description of `obiseq` Expression-Based Workers
+
+This module provides **expression-driven transformation workers** for biological sequence objects (`BioSequence`). It leverages a custom expression language (via `OBILang`) to dynamically compute values based on sequence metadata and content.
+
+## Core Components
+
+- **`Expression(expression string)`**:  
+  Returns a function that evaluates the given expression in context. The evaluation scope includes:
+  - `annotations`: sequence annotations (metadata).
+  - `sequence`: the full `BioSequence` object itself.
+
+- **`EditIdWorker(expression string)`**:  
+  A sequence worker that updates the *ID* of a `BioSequence` by evaluating the expression.  
+  - On success: sets the sequence ID to the string representation of the result.
+  - On failure: logs and returns an error with context.
+
+- **`EditAttributeWorker(key string, expression string)`**:  
+  A sequence worker that sets a *custom attribute* (identified by `key`) on the sequence, using evaluated expression result.  
+  - Supports arbitrary metadata enrichment.
+  - Errors are reported with sequence ID and failed expression.
+
+## Use Cases
+
+- Generate new IDs from annotation fields (e.g., `"gene_" + annotations["locus_tag"]`).
+- Compute and store derived attributes (e.g., GC content, ORF length) as sequence metadata.
+- Apply conditional logic or transformations across large sets of sequences in pipelines.
+
+All workers conform to the `SeqWorker` interface, enabling composition and chaining.
@@ -0,0 +1,27 @@
+# Semantic Description of `obiseq` Package
+
+The `obiseq` package provides utilities for handling **IUPAC nucleotide ambiguity codes** in biological sequences.
+
+## Core Components
+
+- `_iupac`: A lookup table mapping lowercase ASCII letters (`a`–`z`) to numeric IUPAC nucleotide codes:
+  - `A=1`, `C=2`, `G=4`, `T/U=8` (standard bases)
+  - Ambiguous codes are bitwise OR combinations:  
+    e.g., `R = A|G = 1+4=5`, `Y = C|T = 2+8=10`, etc.
+- Invalid or non-nucleotide characters map to `0`.
+
+## Key Functionality
+
+### `SameIUPACNuc(a, b byte) bool`
+Performs **case-insensitive comparison** of two nucleotide symbols using IUPAC ambiguity rules.
+
+- Converts uppercase letters to lowercase via bitwise OR (`|= 32`).
+- For valid nucleotides, checks if their IUPAC codes have **non-zero bitwise AND**:
+  - Returns `true` only if the symbols share at least one possible base.
+    *Example*: `'R' & 'A' → (5 & 1) = 1 > 0 ⇒ true`  
+    `'Y' & 'G' → (10 & 4) = 0 ⇒ false`
+- For non-IUPAC or invalid characters, falls back to exact equality (`a == b`).
+
+## Use Case
+
+Enables robust comparison of DNA/RNA sequences where ambiguity codes (e.g., `N`, `R`, `W`) are used—critical for alignment, variant calling, or primer design tools.
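The bitmask rule described for `SameIUPACNuc` can be sketched as follows. This is a standalone approximation, not the package source: the table name `iupac` and the function name `sameIUPACNuc` are illustrative, and the codes beyond those quoted above follow the standard IUPAC nucleotide table.

```go
package main

import "fmt"

// iupac maps a lowercase letter (index = letter - 'a') to a bit set of the
// bases it may stand for: a=1, c=2, g=4, t/u=8. Zero means "not a nucleotide".
var iupac = [26]byte{
	'a' - 'a': 1, 'c' - 'a': 2, 'g' - 'a': 4, 't' - 'a': 8, 'u' - 'a': 8,
	'r' - 'a': 5, 'y' - 'a': 10, 's' - 'a': 6, 'w' - 'a': 9,
	'k' - 'a': 12, 'm' - 'a': 3,
	'b' - 'a': 14, 'd' - 'a': 13, 'h' - 'a': 11, 'v' - 'a': 7,
	'n' - 'a': 15,
}

// sameIUPACNuc reports whether two symbols share at least one possible base.
// Symbols without an IUPAC code fall back to equality of the folded bytes.
func sameIUPACNuc(a, b byte) bool {
	la, lb := a|32, b|32 // fold ASCII upper case to lower case
	if la >= 'a' && la <= 'z' && lb >= 'a' && lb <= 'z' {
		ca, cb := iupac[la-'a'], iupac[lb-'a']
		if ca != 0 && cb != 0 {
			return ca&cb != 0
		}
	}
	return la == lb
}

func main() {
	fmt.Println(sameIUPACNuc('R', 'A')) // 5 & 1 != 0
	fmt.Println(sameIUPACNuc('Y', 'G')) // 10 & 4 == 0
	fmt.Println(sameIUPACNuc('n', 't')) // n matches every base
}
```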
@@ -0,0 +1,35 @@
+# `obiseq` Package: Sequence Concatenation via `.Join()`
+
+The `BioSequence.Join()` method enables semantic concatenation of two biological sequences (e.g., DNA, RNA, or protein strings).  
+
+- **Signature**:  
+  ```go
+  func (sequence *BioSequence) Join(seq2 *BioSequence, inplace bool) *BioSequence
+  ```
+
+- **Purpose**:  
+  Combines the current sequence (`sequence`) with a second one (`seq2`), returning a new or modified `BioSequence`.
+
+- **Parameters**:  
+  - `seq2`: The sequence to append. Must be a valid `*BioSequence`.  
+  - `inplace`: Boolean flag: if `true`, modifies the receiver in-place; otherwise, operates on a copy.
+
+- **Semantics**:  
+  - If `inplace == false`, the method first creates a deep copy of the original sequence to avoid side effects.  
+  - It then appends `seq2.Sequence()` (the underlying string/byte representation) to the target sequence using an internal `.Write()` method.  
+  - The final concatenated result is returned as a `*BioSequence`.
+
+- **Behavioral Guarantees**:  
+  - *Pure operation*: When `inplace = false`, the original sequences remain unaltered.  
+  - *Chaining-friendly*: Returns a pointer, enabling method chaining (e.g., `seq.Join(a, false).Join(b, true)`).
+
+- **Use Cases**:  
+  - Building multi-domain proteins or gene fusions.  
+  - Merging fragments from sequencing reads.  
+  - Constructing synthetic constructs in silico.
+
+- **Assumptions**:  
+  - `BioSequence.Sequence()` returns a valid string/byte slice.  
+  - `.Write(...)` handles appending correctly (e.g., no validation of biological compatibility — e.g., frame shifts are not checked).  
+
+This method supports flexible, functional-style sequence manipulation: copying by default keeps the inputs untouched, while optional in-place mutation avoids the copy when the receiver is no longer needed.
@@ -0,0 +1,20 @@
+## BioSequence.Kmers(k int) — Semantic Description
+
+The `Kmers` method is a generator function that yields all contiguous *k*-length subsequences (called **k-mers**) from a biological sequence (`BioSequence`).  
+
+- It operates on `[]byte` data, assuming the underlying sequence is stored as a byte slice (e.g., DNA bases `A`, `C`, `G`, `T`).  
+- Uses Go's range-over-function iterator protocol (`iter.Seq[[]byte]`, available since Go 1.23) for memory-efficient, lazy evaluation.  
+- Validates input: returns an empty iterator if `k ≤ 0` or exceeds sequence length.  
+- Iterates linearly from index `i = 0` to `len(seq) - k`, extracting slices of length *k*.  
+- Each yielded value is a **non-copying slice view** (efficient, but mutable if original data changes).  
+- Supports early termination: the consumer can stop iteration by returning `false` from the yield callback.  
+- Designed for downstream tasks like sequence analysis, motif discovery, or hashing (e.g., in k-mer counting).  
+- Does *not* handle reverse-complement or ambiguous bases—assumes raw sequence input.  
+
+Usage example:  
+```go
+for kmer := range seq.Kmers(3) {
+    fmt.Printf("%s\n", string(kmer))
+}
+```  
+This yields all 3-mers (e.g., `"ACG"`, `"CGT"`...) in order.
@@ -0,0 +1,41 @@
+# Semantic Description of `obiseq` Language Extensions
+
+The `package obiseq` extends the [Gval](https://github.com/PaesslerAG/gval) expression language with domain-specific functions tailored for bioinformatics and data processing. It integrates utility helpers from `obiutils` to provide type-flexible, robust operations over sequences and collections.
+
+## Core Functionalities
+
+- **Data Inspection**:  
+  `len`, `ismap`, `isvector` — retrieve size and type information.
+
+- **Aggregation & Comparison**:  
+  `min`, `max` — compute extremal values in slices/maps (via `obiutils.Min/Max`).  
+  *(Note: commented-out helper functions suggest prior attempts at manual implementations.)*
+
+- **Type Conversion**:  
+  `int`, `numeric` (→ float64), `bool`, `string` — safely coerce arbitrary inputs to target types; fail with fatal logs on invalid data.
+
+- **String Manipulation**:  
+  `sprintf`, `subspc` (replace spaces with underscores), `replace` (regex-based substitution), and `substr` — support formatting, normalization, and slicing.
+
+- **Sequence Analysis (Bioinformatics)**:  
+  `gc`, `gcskew`, and `composition` — compute nucleotide composition metrics for DNA/RNA sequences (`BioSequence`).  
+  - `gc`: GC content ratio (excluding ambiguous bases `'o'`)  
+  - `gcskew`: `(G−C)/(G+C)` asymmetry measure  
+  - `composition`: returns a map of base counts (e.g., `"a":20.0`, `"g":15.0`)
+
+- **Element Access**:  
+  `elementof(seq, idx)` — retrieves item at index/key for slices (`[]interface{}`), maps (`map[string]interface{}`), or strings (by byte position).
+
+- **Control Flow**:  
+  `ifelse(cond, then_val, else_val)` — conditional branching within expressions.
+
+- **Quality Support**:  
+  `qualities(seq)` — extracts per-base quality scores as a float slice from sequencing reads.
+
+## Design Principles
+
+- **Dynamic Typing**: Accepts `...interface{}` arguments for flexibility.
+- **Error Handling**: Uses fatal logging (`log.Fatalf`) on conversion failures; returns typed errors for runtime issues.
+- **Extensibility**: Built atop `gval.Language`, enabling custom expression evaluation in pipelines (e.g., filtering reads via GC thresholds).
+
+This package serves as a bridge between high-level scripting and low-level biosequence computation, ideal for rule-based filtering or annotation in NGS workflows.
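The two composition metrics can be written out without any expression engine. The sketch below is independent of Gval and `obiseq`; the helper names are hypothetical, and returning `0` when the denominator is zero is an assumption, since the description does not say how that case is handled.

```go
package main

import "fmt"

// baseCounts tallies a, c, g, t (case-insensitively) and lumps every other
// symbol into `other`, mirroring the 'o' bucket described above.
func baseCounts(seq []byte) (a, c, g, t, other float64) {
	for _, b := range seq {
		switch b | 32 {
		case 'a':
			a++
		case 'c':
			c++
		case 'g':
			g++
		case 't':
			t++
		default:
			other++
		}
	}
	return
}

// gc returns (G+C)/(A+C+G+T), ignoring ambiguous symbols.
func gc(seq []byte) float64 {
	a, c, g, t, _ := baseCounts(seq)
	if total := a + c + g + t; total > 0 {
		return (g + c) / total
	}
	return 0 // assumption: no unambiguous base means a ratio of 0
}

// gcSkew returns (G-C)/(G+C).
func gcSkew(seq []byte) float64 {
	_, c, g, _, _ := baseCounts(seq)
	if g+c > 0 {
		return (g - c) / (g + c)
	}
	return 0 // assumption: no G or C means a skew of 0
}

func main() {
	seq := []byte("acggtn")
	fmt.Printf("gc=%.3f skew=%.3f\n", gc(seq), gcSkew(seq))
}
```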
@@ -0,0 +1,39 @@
+# Semantic Description of `obiseq` Statistics and Merging Features
+
+This package provides infrastructure for **tracking, aggregating, and merging statistical occurrences** of sequence attributes across biological sequences (`BioSequence`). It supports both **count-based and weighted statistics**, with thread-safe operations.
+
+## Core Components
+
+- `StatsOnValues`: A concurrent map (`map[string]int`) with R/W locking to store occurrence counts per attribute value (e.g., taxon, primer, quality bin).
+- `StatsOnDescription`: Defines *how* to extract and weight statistics from a sequence (e.g., count per read, or sum of quality scores).
+- `StatsOnSlotName(key)`: Generates internal annotation keys (e.g., `"merged_taxon"`) to store precomputed statistics.
+
+## Key Functionalities
+
+1. **Per-Sequence Statistics Initialization & Update**
+   - `StatsOn(desc, na)`: Ensures a statistics slot exists for attribute `desc.Key`, initializes if needed.
+   - `StatsPlusOne(...)`: Adds contribution of a *single* sequence to the statistics (e.g., increment count for its taxon).
+
+2. **Thread-Safe Aggregation**
+   - `Merge(*StatsOnValues)`: Safely merges counts from another `StatsOnValues`, used to combine per-sequence stats.
+
+3. **Sequence Merging with Stat Propagation**
+   - `BioSequence.Merge(...)`: 
+     - Combines two sequences (e.g., consensus/overlap).
+     - Updates statistics for specified attributes (`statsOn`), preserving or aggregating counts.
+     - Resolves conflicting annotations by deleting non-merged fields if mismatched.
+
+4. **Bulk Merging**
+   - `BioSequenceSlice.Merge(...)`: Efficiently merges *N* sequences into one, recycling inputs and updating statistics incrementally.
+
+## Use Cases
+
+- Tracking taxonomic assignments across merged reads.
+- Aggregating primer or barcode counts in amplicon merging.
+- Summarizing quality scores, abundance weights, or custom metadata during consensus building.
+
+## Design Notes
+
+- Uses `sync.RWMutex` for safe concurrent access.
+- Supports only JSON-marshalable, serializable statistics (via `MarshalJSON`).
+- Enforces type safety: only strings/integers/booleans allowed for attribute values.
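
A lock-protected count map with a merge operation, in the spirit of `StatsOnValues`, could be sketched like this. The type and method names are illustrative, not the package's own; the snapshot-then-lock order in `merge` is a design choice of this sketch, made to avoid holding two locks at once.

```go
package main

import (
	"fmt"
	"sync"
)

// statsOnValues is a hypothetical counter map guarded by a read/write lock.
type statsOnValues struct {
	mu     sync.RWMutex
	counts map[string]int
}

func newStats() *statsOnValues {
	return &statsOnValues{counts: make(map[string]int)}
}

// add increases the count of value by n.
func (s *statsOnValues) add(value string, n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[value] += n
}

// get returns the current count of value.
func (s *statsOnValues) get(value string) int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.counts[value]
}

// merge adds every count of other into s. It snapshots other first so that
// the two locks are never held at the same time.
func (s *statsOnValues) merge(other *statsOnValues) {
	other.mu.RLock()
	snapshot := make(map[string]int, len(other.counts))
	for k, v := range other.counts {
		snapshot[k] = v
	}
	other.mu.RUnlock()

	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range snapshot {
		s.counts[k] += v
	}
}

func main() {
	a, b := newStats(), newStats()
	a.add("sample1", 2)
	b.add("sample1", 3)
	b.add("sample2", 4)
	a.merge(b)
	fmt.Println(a.get("sample1"), a.get("sample2"))
}
```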
@@ -0,0 +1,19 @@
+# BioSequence Pairing Functionality
+
+This package provides semantic tools for managing biological sequence pairings—typically used in genomics (e.g., paired-end reads). Key features:
+
+- **Single-sequence pairing**:  
+  - `IsPaired()` checks if a sequence is currently paired.  
+  - `PairedWith()` returns the linked partner, or `nil`.  
+  - `PairTo(p)` establishes a bidirectional link between two sequences.  
+  - `UnPair()` safely severs the pairing on both ends.
+
+- **Batch (slice) handling**:  
+  - `IsPaired()` and `UnPair()` operate uniformly across all sequences in a slice.  
+  - `PairedWith()` returns the corresponding paired slice (element-wise).  
+  - `PairTo(p)` enforces length compatibility and pairs sequences index-by-index.  
+
+- **Error handling**:  
+  - Mismatched slice lengths during `PairTo` trigger a fatal log (via Logrus), preventing inconsistent pairings.
+
+Semantically, the API supports both *atomic* and *bulk* pairing operations while preserving consistency through bidirectional references—ideal for processing paired-end sequencing data.
@@ -0,0 +1,34 @@
+# Semantic Overview of `obiseq` Package Functionalities
+
+This Go package (`obiseq`) provides memory-efficient utilities for managing slices and annotations—key data structures in biosequence processing.
+
+## Slice Management
+
+- **`GetSlice(capacity int) []byte`**  
+  Retrieves a reusable `[]byte` with ≥ requested capacity. For capacities ≤1024 bytes, it pulls from a `sync.Pool` (`_BioSequenceByteSlicePool`). Larger slices are freshly allocated.
+
+- **`RecycleSlice(s *[]byte)`**  
+  Clears and recycles small slices (≤1024 bytes) back to the pool. For large slices (≥100 KB), it nils them and triggers explicit `runtime.GC()` every ~256 MB of discarded memory to prevent heap bloat.
+
+- **`CopySlice(src []byte) []byte`**  
+  Efficiently copies a source slice into a pooled or newly allocated destination, preserving semantics without unnecessary allocations.
+
+## Annotation Management
+
+- **`BioSequenceAnnotationPool`**  
+  A `sync.Pool` for reusable map-based annotations (the `Annotation` type, described elsewhere in these docs as `map[string]interface{}`), initialized with capacity 1.
+
+- **`GetAnnotation(values ...Annotation) Annotation`**  
+  Fetches an annotation map from the pool, optionally pre-populated via shallow copy of input annotations using `obiutils.MustFillMap`.
+
+- **`RecycleAnnotation(a *Annotation)`**  
+  Clears all keys from an annotation map and returns it to the pool for reuse.
+
+## Design Rationale
+
+The package prioritizes low-latency, high-throughput scenarios (e.g., NGS data pipelines) by minimizing GC pressure via:
+- Tiered pooling strategy (`small` vs `large`)
+- Explicit garbage collection triggers for large-object churn
+- Safe reuse patterns avoiding aliasing or stale references
+
+All operations are thread-safe via `sync.Pool` and atomic counters.
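The small-versus-large split can be sketched with a plain `sync.Pool`. This is a simplified illustration with hypothetical names: it keeps the 1024-byte threshold from the description but leaves out the counters and the explicit `runtime.GC()` trigger for large slices.

```go
package main

import (
	"fmt"
	"sync"
)

const smallCap = 1024 // pooling threshold taken from the description above

// slicePool holds pointers to small byte slices so Put does not allocate.
var slicePool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, smallCap)
		return &b
	},
}

// getSlice returns an empty slice with at least the requested capacity.
func getSlice(capacity int) []byte {
	if capacity <= smallCap {
		p := slicePool.Get().(*[]byte)
		if cap(*p) >= capacity {
			return (*p)[:0]
		}
	}
	return make([]byte, 0, capacity)
}

// recycleSlice returns small slices to the pool and always nils the caller's
// reference, so the recycled buffer cannot be reused by accident.
func recycleSlice(s *[]byte) {
	if s == nil || *s == nil {
		return
	}
	if cap(*s) <= smallCap {
		b := (*s)[:0]
		slicePool.Put(&b)
	}
	*s = nil
}

func main() {
	s := getSlice(100)
	s = append(s, "acgt"...)
	fmt.Println(len(s), cap(s) >= 100)
	recycleSlice(&s)
	fmt.Println(s == nil)
}
```

Niling the caller's slice on recycle is what prevents the stale-reference problem mentioned in the design rationale.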
@@ -0,0 +1,33 @@
+# Sequence Predicate Framework in `obiseq`
+
+This Go package provides a flexible and composable predicate system for filtering biological sequences (`BioSequence`) based on diverse criteria.
+
+## Core Concepts
+
+- **`SequencePredicate`**: A function type `func(*BioSequence) bool`, enabling conditional logic on sequences.
+- **Predicate Composition**: Supports logical operations (`And`, `Or`, `Xor`, `Not`) and chaining.
+- **Paired-end Support**: Predicates can be adapted to consider read pairs via `PredicateOnPaired` and `PairedPredicat`, with modes:  
+  - `ForwardOnly`: Only the forward read is evaluated.  
+  - `ReverseOnly`, `And`, `Or`, `AndNot`, `Xor`: Combine forward and reverse evaluations.
+
+## Built-in Predicates
+
+| Predicate | Description |
+|-----------|-------------|
+| `HasAttribute(name)` | Checks if a sequence has an annotation with the given name. |
+| `IsAttributeMatch(name, pattern)` | Tests if a named annotation matches the provided regex (case-sensitive). |
+| `IsMoreAbundantOrEqualTo(count)` / `IsLessAbundantOrEqualTo(count)` | Filters by sequence abundance (count field). |
+| `IsLongerOrEqualTo(length)` / `IsShorterOrEqualTo(length)` | Filters by sequence length. |
+| `OccurInAtleast(sample, n)` | Checks if the sequence appears in at least *n* samples (via description stats). |
+| `IsSequenceMatch(pattern)` | Matches the raw sequence against a regex (case-insensitive). |
+| `IsDefinitionMatch(pattern)` | Matches the definition/description line against a regex. |
+| `IsIdMatch(pattern)` / `IsIdIn(ids...)` | Filters by sequence ID using regex or explicit set. |
+| `ExpressionPredicat(expression)` | Evaluates a custom boolean expression (via OBILang) using annotations and sequence metadata. |
+
+## Design Highlights
+
+- **Null-safe**: `nil` predicates are handled gracefully in compositions.
+- **Extensible**: Custom predicates can be defined and combined seamlessly.
+- **Logging & Safety**: Invalid regex patterns or expression syntax trigger fatal errors; runtime evaluation issues emit warnings.
+
+This framework enables powerful, declarative filtering pipelines for high-throughput sequencing data analysis.
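Predicate composition of this kind can be sketched in a few lines. The types and names below are hypothetical stand-ins, and treating a `nil` predicate as "no constraint" is an assumption about what "handled gracefully" means.

```go
package main

import "fmt"

// seqRecord is a hypothetical stand-in for BioSequence.
type seqRecord struct {
	seq   []byte
	count int
}

// predicate mirrors the SequencePredicate shape: func(*T) bool.
type predicate func(*seqRecord) bool

// and combines two predicates; a nil operand is treated as absent (assumption).
func (p predicate) and(q predicate) predicate {
	if p == nil {
		return q
	}
	if q == nil {
		return p
	}
	return func(s *seqRecord) bool { return p(s) && q(s) }
}

// not negates a predicate; negating nil stays nil (assumption).
func (p predicate) not() predicate {
	if p == nil {
		return nil
	}
	return func(s *seqRecord) bool { return !p(s) }
}

// isLongerOrEqualTo accepts sequences of at least n bases.
func isLongerOrEqualTo(n int) predicate {
	return func(s *seqRecord) bool { return len(s.seq) >= n }
}

// isMoreAbundantOrEqualTo accepts sequences observed at least n times.
func isMoreAbundantOrEqualTo(n int) predicate {
	return func(s *seqRecord) bool { return s.count >= n }
}

func main() {
	keep := isLongerOrEqualTo(4).and(isMoreAbundantOrEqualTo(2))
	fmt.Println(keep(&seqRecord{seq: []byte("acgt"), count: 3}))
	fmt.Println(keep(&seqRecord{seq: []byte("acgt"), count: 1}))
}
```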
@@ -0,0 +1,35 @@
+# BioSequence Reverse Complement Functionality
+
+This Go package (`obiseq`) provides utilities for computing the reverse complement of biological sequences (e.g., DNA), including support for quality scores and structured metadata.
+
+## Core Functions
+
+- **`nucComplement(n byte) byte`**  
+  Returns the nucleotide complement using a lookup table (`_revcmpDNA`). Handles special cases:  
+  - `.` / `-` → unchanged (gaps)  
+  - `[`, `]` → swapped (`[` ↔ `]`)  
+  - A–Z letters → complemented (case-insensitive via bitwise masking)  
+  - Unknown characters → `'n'`
+
+- **`BioSequence.ReverseComplement(inplace bool) *BioSequence`**  
+  Performs reverse complement on the sequence and (if present) its quality string:  
+  - If `inplace = false`, a copy is made; original preserved.  
+  - Reverses indices and complements each base using `nucComplement`.  
+  - Also reverses the quality array symmetrically.  
+  - Caches result in `sequence.revcomp` for reuse.
+
+- **`BioSequence._revcmpMutation() *BioSequence`**  
+  Adjusts mutation metadata (e.g., `"pairing_mismatches"`) to reflect the reversed-complement orientation:  
+  - Reverses and complements symbolic mutation strings (e.g., `"A>T"` → `"T>A"`).  
+  - Updates positional indices to match reversed sequence coordinates.
+
+- **`ReverseComplementWorker(inplace bool) SeqWorker`**  
+  Returns a reusable `SeqWorker` function for batch processing: applies reverse complement to each sequence in a stream.
+
+## Design Notes
+
+- Uses ASCII bitwise tricks (`&31`, `|0x20`) for case-insensitive indexing and lowercase output.  
+- Supports non-standard symbols (e.g., IUPAC ambiguity codes via lookup table).  
+- Integrates quality scores and structured attributes seamlessly.  
+
+> Ideal for NGS preprocessing pipelines where orientation matters (e.g., paired-end alignment, variant calling).
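A compact sketch of the reverse-complement logic is given below. It operates on plain byte slices rather than `BioSequence`, always returns new slices (no in-place mode, caching, or mutation bookkeeping), and the table contents follow the standard IUPAC complement pairs; the variable and function names other than `nucComplement` are illustrative.

```go
package main

import "fmt"

// comp is indexed by (letter & 31), so 'a' and 'A' share slot 1.
var comp [32]byte

func init() {
	for i := range comp {
		comp[i] = 'n' // unknown letters complement to 'n'
	}
	from := "acgturyswkmbdhvn"
	to := "tgcaayrswmkvhdbn"
	for i := 0; i < len(from); i++ {
		comp[from[i]&31] = to[i]
	}
}

// nucComplement returns the lowercase complement of a nucleotide symbol.
func nucComplement(n byte) byte {
	switch {
	case n == '.' || n == '-':
		return n // gaps are kept as they are
	case n == '[':
		return ']'
	case n == ']':
		return '['
	case (n|0x20) >= 'a' && (n|0x20) <= 'z':
		return comp[n&31]
	}
	return 'n'
}

// reverseComplement returns the reverse complement of seq together with the
// reversed quality slice (nil when qual is nil).
func reverseComplement(seq, qual []byte) ([]byte, []byte) {
	rs := make([]byte, len(seq))
	for i, b := range seq {
		rs[len(seq)-1-i] = nucComplement(b)
	}
	var rq []byte
	if qual != nil {
		rq = make([]byte, len(qual))
		for i, q := range qual {
			rq[len(qual)-1-i] = q
		}
	}
	return rs, rq
}

func main() {
	rs, rq := reverseComplement([]byte("aacgt"), []byte{10, 20, 30, 40, 50})
	fmt.Println(string(rs), rq)
}
```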
@@ -0,0 +1,19 @@
+## Semantic Description of `obiseq` Package Functionality
+
+The `obiseq` package provides core bioinformatics utilities for nucleic acid sequence manipulation in Go. It centers around two key operations:
+
+- **Nucleotide Complementation (`nucComplement`)**  
+  Implements standard Watson-Crick base pairing rules: `A↔T`, `C↔G`. It also handles ambiguous or symbolic characters (e.g., `'n' → 'n'`, `'[' ↔ ']'`), preserving non-standard symbols like gaps (`'-'`) and missing data (`'.'`). This function serves as the atomic building block for reverse-complement logic.
+
+- **Reverse Complementation (`BioSequence.ReverseComplement`)**  
+  A method on the `BioSequence` type that returns a new (or in-place modified) sequence representing:
+  - The *reverse* of the original nucleotide string, followed by  
+  - Each base replaced with its complement (via `nucComplement`).  
+
+  The method supports two modes:
+  - **Non-destructive (`inplace=false`)**: Returns a new `BioSequence`, leaving the original unchanged.
+  - **In-place (`inplace=true`)**: Modifies and returns the same object for memory efficiency.
+
+  Crucially, it preserves associated quality scores (e.g., Phred-scaled sequencing qualities), reversing their order to match the reversed sequence—ensuring correctness in downstream analyses like alignment or variant calling.
+
+Tests validate both functions across edge cases: degenerate bases, ambiguous symbols, and quality-aware sequences—confirming robustness for typical NGS (Next-Generation Sequencing) workflows.
@@ -0,0 +1,13 @@
+# `obiseq.Subsequence` Functionality Overview
+
+The `Subsequence()` method extracts a contiguous segment from a biological sequence (`BioSequence`), supporting both linear and circular topologies.
+
+- **Input validation**: Checks ensure `from < to` (unless circular), positions are non-negative, and bounds respect sequence length.
+- **Circular handling**: Positions exceeding the sequence length wrap around using modular arithmetic; debug logs record corrections.
+- **Linear extraction**: When `from < to`, it slices the underlying nucleotide/peptide sequence and, if present, its quality scores.
+- **Circular extraction**: When `from > to`, it concatenates two linear segments: from `from` → end, and start → `to`.
+- **Metadata preservation**: Quality scores (if available) and annotations are copied to the new subsequence.
+- **ID formatting**: The resulting sequence ID is suffixed with `[from..to]` (1-based indexing).
+- **Mutation tracking**: A private `_subseqMutation()` adjusts stored pairing mismatch positions by subtracting the extraction shift, ensuring coordinate consistency post-extraction.
+
+This enables robust subsequence generation for genomic analysis workflows involving circular genomes (e.g., plasmids) or fragmented reads.
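+
+The linear and circular extraction rules can be sketched as below. This is an illustration under stated assumptions, not the package's code: `subsequence` is a hypothetical free function on byte slices, `from` is taken as 0-based inclusive and `to` as exclusive, and the identifier suffix is rendered as `from+1..to` to obtain 1-based coordinates.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// subsequence is a hypothetical stand-in for BioSequence.Subsequence.
+// Assumption: from is 0-based inclusive, to is exclusive.
+func subsequence(id string, seq []byte, from, to int, circular bool) (string, []byte, error) {
+	if from < 0 || from >= len(seq) || to < 0 || to > len(seq) {
+		return "", nil, errors.New("position out of bounds")
+	}
+	if from >= to && !circular {
+		return "", nil, errors.New("from must be lower than to on a linear sequence")
+	}
+	var out []byte
+	if from < to {
+		// Linear case: copy so the result does not alias the source.
+		out = append(out, seq[from:to]...)
+	} else {
+		// Circular case: from -> end, then start -> to.
+		out = append(out, seq[from:]...)
+		out = append(out, seq[:to]...)
+	}
+	// 1-based, inclusive coordinates in the identifier suffix.
+	return fmt.Sprintf("%s[%d..%d]", id, from+1, to), out, nil
+}
+
+func main() {
+	id, sub, err := subsequence("plasmid", []byte("acgt"), 3, 2, true)
+	fmt.Println(id, string(sub), err) // prints: plasmid[4..2] tac <nil>
+}
+```
+
+The same call with `circular` set to `false` returns an error, because `from > to` is only meaningful on a circular topology.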
@@ -0,0 +1,29 @@
+# `obiseq` Package: Subsequence Extraction Functionality
+
+The `Subsequence()` method enables extraction of a contiguous segment from biological sequence data (`BioSequence`). It supports both linear and circular (wrapped) slicing.
+
+- **Input Parameters**:
+  - `from`, `to`: 0-based indices defining the slice range; `from` is inclusive and, judging by the wrap-around example under **Behavior**, `to` is exclusive.
+  - `circular`: boolean flag enabling wrap-around when `from > to`.
+
+- **Behavior**:
+  - Linear mode (`circular = false`) requires `from ≤ to` and both indices within `[0, len(seq))`.
+  - Circular mode (`circular = true`) allows wrap-around (e.g., `from=3, to=2` on a 4-mer yields indices `[3,0,1]`).
+  - Validates inputs: returns descriptive errors for:
+    - `from > to` (non-circular),
+    - out-of-bounds indices (`< 0` or `≥ length`),
+    - invalid ranges.
+
+- **Quality Support**:
+  - When the sequence carries base quality scores (`BioSequenceWithQualities`), the method preserves the corresponding sub-slice of `Quality[]`.
+
+- **Return Value**:
+  - Returns a new `BioSequence` instance containing the extracted subsequence and, when present, its qualities.
+
+- **Use Case**:
+  - Ideal for region-of-interest extraction (e.g., primer binding sites, domain segments), especially in circular genomes or plasmids.
+
+- **Testing**:
+  - Unit tests (`TestSubsequence`) cover valid/invalid inputs, circular/non-circular modes, and quality consistency.
+
+This functionality provides robust, semantics-aware slicing for biosequence manipulation in Go.
@@ -0,0 +1,26 @@
+# Taxonomic Classification via `TaxonomyClassifier`
+
+The `obiseq` package provides a taxonomic classification mechanism through the `TaxonomyClassifier` function.
+
+- **Purpose**: Constructs a reusable classifier for biological sequences based on taxonomic hierarchy.
+- **Inputs**:
+  - `taxonomicRank`: Target rank (e.g., `"species"`, `"genus"`).
+  - `taxonomy`: Reference taxonomy (`*obitax.Taxonomy`), with fallback via `.OrDefault(true)`.
+  - `abortOnMissing`: Boolean flag to enforce strict taxon resolution.
+
+- **Core Logic**:
+  - For each sequence, retrieves its `Taxon`, then climbs its lineage to the requested rank using `.TaxonAtRank()`.
+  - If `abortOnMissing` is true, exits on failure to resolve the taxon or rank.
+  - Internally maps `*TaxNode`s to integer codes for efficient storage/comparison.
+
+- **Returned Object (`BioSequenceClassifier`)**:
+  - `Code(sequence) int`: Assigns a unique integer code to the taxonomic assignment of a sequence.
+  - `Value(code) string`: Returns the scientific name corresponding to a code.
+  - `Reset()`: Reinitializes internal mappings (useful for batch processing).
+  - `Clone() *BioSequenceClassifier`: Creates a fresh, identical classifier instance.
+
+- **Design Rationale**:
+  - Uses integer codes to avoid repeated string operations and enable fast indexing (e.g., for counting).
+  - Supports both strict (`abortOnMissing=true`) and lenient classification modes.
+
+This design enables scalable, efficient taxonomic profiling of sequencing datasets.
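+
+The integer-coding idea can be illustrated with a small sketch. It is not the `BioSequenceClassifier` type itself: `taxonClassifier` and its constructor are hypothetical, and the sketch codes plain taxon names instead of `*TaxNode` values resolved from a sequence.
+
+```go
+package main
+
+import "fmt"
+
+// taxonClassifier is a hypothetical, simplified classifier that maps
+// taxon names to dense integer codes and back.
+type taxonClassifier struct {
+	codes map[string]int // taxon name -> code
+	names []string       // code -> taxon name
+}
+
+func newTaxonClassifier() *taxonClassifier {
+	return &taxonClassifier{codes: map[string]int{}}
+}
+
+// Code returns the code of name, allocating the next free integer the
+// first time a name is seen.
+func (c *taxonClassifier) Code(name string) int {
+	if code, ok := c.codes[name]; ok {
+		return code
+	}
+	code := len(c.names)
+	c.codes[name] = code
+	c.names = append(c.names, name)
+	return code
+}
+
+// Value returns the name behind a code, or "NA" for an unknown code.
+func (c *taxonClassifier) Value(code int) string {
+	if code < 0 || code >= len(c.names) {
+		return "NA"
+	}
+	return c.names[code]
+}
+
+// Reset forgets every mapping so the classifier can start a new batch.
+func (c *taxonClassifier) Reset() {
+	c.codes = map[string]int{}
+	c.names = c.names[:0]
+}
+
+func main() {
+	c := newTaxonClassifier()
+	counts := map[int]int{}
+	for _, genus := range []string{"Homo", "Pan", "Homo"} {
+		counts[c.Code(genus)]++ // counting is done on integers only
+	}
+	for code := 0; code < len(counts); code++ {
+		fmt.Println(c.Value(code), counts[code])
+	}
+}
+```
+
+Strings are touched once per distinct taxon; every later comparison or count works on the integer code.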
@@ -0,0 +1,22 @@
+# Taxonomic Analysis Functions in `obiseq` Package
+
+This module provides tools for assigning taxonomic labels to biological sequences using a reference taxonomy.
+
+- **`TaxonomicDistribution(taxonomy)`**:  
+  Returns a map from taxonomic nodes to read counts, based on `taxid` annotations in the sequence metadata. It validates taxids against the taxonomy and enforces strict handling of aliases.
+
+- **`LCA(taxonomy, threshold)`**:  
+  Computes the *Lowest Common Ancestor* (LCA) of all taxonomic assignments for a sequence, weighted by their abundances.  
+  - Iteratively traverses upward from each taxon’s path in the taxonomy tree.  
+  - At each level, computes the relative weight (`rmax`) of the most frequent taxon.  
+  - Stops when `rmax < threshold`, returning:  
+    - the LCA taxon,  
+    - its confidence score (`rans`), and  
+    - the total read count used.
+
+- **`AddLCAWorker(...)`**:  
+  Creates a `SeqWorker` function to annotate sequences with LCA results:  
+    - Sets attributes like `<slot>_taxid`, `<slot>_name`, and `<slot>_error` (rounded to 3 decimals).  
+    - Automatically appends `_taxid` if missing in `slot_name`.  
+
+All functions integrate with the OBITools4 ecosystem, supporting robust taxonomic inference for metabarcoding workflows.
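+
+To make the threshold rule concrete, here is a rough sketch of a weighted consensus over lineages. It is an assumption-laden illustration rather than the package's algorithm: `assignment` and `weightedLCA` are hypothetical names, lineages are given root-first as plain strings, and the walk proceeds level by level from the root, keeping the deepest taxon whose share of the total weight is still at least `threshold`.
+
+```go
+package main
+
+import "fmt"
+
+// assignment is one taxonomic assignment: a root-first lineage and the
+// number of reads supporting it (hypothetical type).
+type assignment struct {
+	path   []string
+	weight int
+}
+
+// weightedLCA returns the deepest taxon supported by at least threshold
+// of the total weight, that share (rans), and the total weight.
+func weightedLCA(assignments []assignment, threshold float64) (taxon string, rans float64, total int) {
+	for _, a := range assignments {
+		total += a.weight
+	}
+	if total == 0 {
+		return "", 0, 0
+	}
+	remaining := assignments
+	for level := 0; ; level++ {
+		// Sum the weight carried by each taxon found at this level.
+		weights := map[string]int{}
+		for _, a := range remaining {
+			if level < len(a.path) {
+				weights[a.path[level]] += a.weight
+			}
+		}
+		best, bestWeight := "", 0
+		for name, w := range weights {
+			// Ties are broken by name to keep the result deterministic.
+			if w > bestWeight || (w == bestWeight && name < best) {
+				best, bestWeight = name, w
+			}
+		}
+		rmax := float64(bestWeight) / float64(total)
+		if bestWeight == 0 || rmax < threshold {
+			return taxon, rans, total
+		}
+		taxon, rans = best, rmax
+		// Only lineages passing through the accepted taxon go deeper.
+		var next []assignment
+		for _, a := range remaining {
+			if level < len(a.path) && a.path[level] == best {
+				next = append(next, a)
+			}
+		}
+		remaining = next
+	}
+}
+
+func main() {
+	taxon, rans, total := weightedLCA([]assignment{
+		{[]string{"Eukaryota", "Chordata", "Homo"}, 6},
+		{[]string{"Eukaryota", "Chordata", "Pan"}, 3},
+		{[]string{"Eukaryota", "Arthropoda", "Apis"}, 1},
+	}, 0.8)
+	fmt.Println(taxon, rans, total) // prints: Chordata 0.9 10
+}
+```
+
+In the example, `Homo` only carries 60% of the reads, so the consensus stops one level higher at `Chordata`, which carries 90%.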
@@ -0,0 +1,41 @@
+# Taxonomic Annotation Features in `obiseq` Package
+
+This package provides semantic taxonomic annotation capabilities for biological sequences (`BioSequence`). It integrates with a taxonomy database to assign, retrieve, and manage taxonomic identifiers (taxids) and related metadata.
+
+## Core Functions
+
+- **`Taxid()`**: Retrieves the taxonomic ID as a string (e.g., `"12345"`), supporting multiple internal representations (`string`, `int`, `float64`). Returns `"NA"` if no taxid is set.
+
+- **`Taxon(taxonomy)`**: Returns the corresponding `*obitax.Taxon` object, or `nil` if taxid is `"NA"`.
+
+- **`SetTaxid(taxid, rank...)`**: Assigns a taxonomic ID to the sequence. Validates against default taxonomy; handles aliases and errors based on configuration flags (`FailOnTaxonomy`, `UpdateTaxid`). Optionally stores taxid under a custom rank (e.g., `"genus_taxid"`).
+
+- **`SetTaxon(taxon, rank...)`**: Assigns a `*obitax.Taxon` object directly; stores its string representation as taxid.
+
+## Rank-Specific Annotation
+
+- **`SetTaxonAtRank(taxonomy, rank)`**: Annotates the sequence with taxid and scientific name at a specified Linnaean rank (e.g., `"species"`, `"genus"`). Sets two attributes named after the rank: `<rank>_taxid` and `<rank>_name` (e.g., `genus_taxid` and `genus_name`). Returns the taxon at that rank (or `nil`).
+
+- **Convenience wrappers**:
+  - `SetSpecies(...)`
+  - `SetGenus(...)`
+  - `SetFamily(...)`  
+    All delegate to `SetTaxonAtRank`.
+
+## Taxonomic Path & Metadata
+
+- **`SetPath(taxonomy)`**: Computes and stores the full taxonomic lineage (from root to species) as a string slice under attribute `"taxonomic_path"`.
+
+- **`Path()`**: Retrieves the stored taxonomic path; recomputes it if missing and a default taxonomy exists.
+
+- **`SetScientificName(taxonomy)`**: Stores the sequence’s species-level scientific name under `"scientific_name"`.
+
+- **`SetTaxonomicRank(taxonomy)`**: Stores the taxon’s rank (e.g., `"species"`, `"genus"`) under `"taxonomic_rank"`.
+
+## Error Handling & Configuration
+
+- Uses `logrus` and custom logging (`obilog`) for warnings/errors.
+- Behavior on taxonomy mismatches (e.g., unknown taxid, alias) is configurable via `obidefault` settings.
+- Ensures type consistency: a stored taxid must be a `string`, `int`, or `float64`; any other type triggers a fatal error.
+
+All methods are designed for seamless integration into bioinformatics pipelines, enabling robust taxonomic profiling of sequencing data.
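+
+The type handling described for `Taxid()` can be sketched as a type switch over the stored annotation. This is a simplified illustration: `taxidString` is a hypothetical helper working on a bare annotation map, and it returns an error for unsupported types where the package is described as exiting fatally.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strconv"
+)
+
+// taxidString normalises the "taxid" annotation to its string form.
+// A missing annotation yields "NA".
+func taxidString(annotations map[string]interface{}) (string, error) {
+	value, ok := annotations["taxid"]
+	if !ok {
+		return "NA", nil
+	}
+	switch v := value.(type) {
+	case string:
+		return v, nil
+	case int:
+		return strconv.Itoa(v), nil
+	case float64:
+		// Numbers decoded from JSON arrive as float64.
+		return strconv.Itoa(int(v)), nil
+	default:
+		return "", fmt.Errorf("taxid: unsupported type %T", value)
+	}
+}
+
+func main() {
+	for _, ann := range []map[string]interface{}{
+		{"taxid": "9606"}, {"taxid": 9606}, {"taxid": 9606.0}, {},
+	} {
+		taxid, err := taxidString(ann)
+		fmt.Println(taxid, err)
+	}
+}
+```
+
+The first three annotations all normalise to `9606`, and the empty map yields `NA`.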
@@ -0,0 +1,20 @@
+# Semantic Description of `obiseq` Package Functionalities
+
+This Go package provides **sequence filtering predicates** for biological sequences, integrated with taxonomic validation and hierarchy analysis.
+
+- `IsAValidTaxon(taxonomy, ...bool) SequencePredicate`:  
+  Returns a predicate that checks whether a sequence has an associated valid taxon in the given taxonomy.  
+  Optionally supports *auto-correction* of outdated/incorrect `taxid` values to match the current taxonomy node.
+
+- `IsSubCladeOf(taxonomy, parent) SequencePredicate`:  
+  Filters sequences whose taxonomic assignment is a descendant (sub-clade) of the specified `parent` taxon.
+
+- `IsSubCladeOfSlot(taxonomy, key) SequencePredicate`:  
+  Enables filtering based on a *sequence attribute* (e.g., `"taxon"` or `"classification"`) that holds a taxonomic label.  
+  Validates the label against the taxonomy, then checks if the sequence’s assigned taxon falls under it.
+
+- `HasRequiredRank(taxonomy, rank) SequencePredicate`:  
+  Ensures the sequence’s taxon is assigned at or below a specified rank (e.g., `"species"`, `"genus"`).  
+  Validates the requested `rank` against the taxonomy's rank list; exits on invalid input.
+
+All predicates follow a functional, composable design pattern (`SequencePredicate = func(*BioSequence) bool`), enabling flexible pipeline construction (e.g., filtering, classification validation).
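+
+The predicate pattern can be illustrated with a toy sub-clade filter. Everything here is a simplified assumption: `sequence`, `sequencePredicate`, `isSubCladeOf`, and `filter` are hypothetical names, and the taxonomy is reduced to a child-to-parent table of strings instead of an `obitax` taxonomy.
+
+```go
+package main
+
+import "fmt"
+
+// sequence is a hypothetical stand-in for BioSequence.
+type sequence struct{ id, taxon string }
+
+// sequencePredicate mirrors the func(*BioSequence) bool shape.
+type sequencePredicate func(*sequence) bool
+
+// isSubCladeOf builds a predicate that is true when the sequence's
+// taxon is ancestor itself or one of its descendants.
+func isSubCladeOf(parents map[string]string, ancestor string) sequencePredicate {
+	return func(s *sequence) bool {
+		taxon := s.taxon
+		// The step bound protects against cycles in the parent table.
+		for steps := 0; steps <= len(parents); steps++ {
+			if taxon == ancestor {
+				return true
+			}
+			parent, ok := parents[taxon]
+			if !ok {
+				return false
+			}
+			taxon = parent
+		}
+		return false
+	}
+}
+
+// filter keeps, in order, the sequences accepted by the predicate.
+func filter(seqs []*sequence, keep sequencePredicate) []*sequence {
+	var out []*sequence
+	for _, s := range seqs {
+		if keep(s) {
+			out = append(out, s)
+		}
+	}
+	return out
+}
+
+func main() {
+	parents := map[string]string{
+		"Homo": "Hominidae", "Pan": "Hominidae",
+		"Hominidae": "Primates", "Apis": "Apidae",
+	}
+	seqs := []*sequence{{"seq1", "Homo"}, {"seq2", "Apis"}, {"seq3", "Pan"}}
+	for _, s := range filter(seqs, isSubCladeOf(parents, "Primates")) {
+		fmt.Println(s.id) // prints seq1, then seq3
+	}
+}
+```
+
+Because the predicate is an ordinary function value, the taxonomy lookup is configured once and the result can be handed to any filtering stage.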
@@ -0,0 +1,22 @@
+# Taxonomic Annotation Workers in `obiseq`
+
+This Go package provides functional workers for annotating biological sequences with taxonomic information using a hierarchical taxonomy (e.g., from NCBI or UNITE). Each worker is implemented as a `SeqWorker`—a function that processes one sequence and returns an updated slice of sequences.
+
+- **`MakeSetTaxonAtRankWorker(taxonomy, rank)`**:  
+  Assigns a taxonomic label at *a specific rank* (e.g., `"genus"`, `"family"`). Validates that the requested `rank` exists in the taxonomy before proceeding.
+
+- **`MakeSetSpeciesWorker(taxonomy)`**:  
+  Annotates each sequence with its inferred species name using the provided taxonomy.
+
+- **`MakeSetGenusWorker(taxonomy)`**:  
+  Adds genus-level taxonomic assignment to sequences.
+
+- **`MakeSetFamilyWorker(taxonomy)`**:  
+  Adds family-level taxonomic assignment.
+
+- **`MakeSetPathWorker(taxonomy)`**:  
+  Populates the full taxonomic path (e.g., `"Eukaryota;Metazoa;Chordata;..."`) for each sequence.
+
+All workers rely on methods of `BioSequence` (e.g., `.SetSpecies()`, `.SetPath()`), which internally use the `obitax.Taxonomy` object to resolve taxonomic IDs or names. Errors are logged via `logrus`; invalid ranks cause a fatal exit.
+
+These utilities support modular, pipeline-friendly taxonomic annotation—ideal for high-throughput metabarcoding workflows.
@@ -0,0 +1,18 @@
+# Semantic Description of `obiseq` Package Functionalities
+
+The `obiseq` package provides composable, higher-order worker functions for processing biological sequence data in Go. It defines three core functional types:
+
+- `SeqAnnotator`: In-place annotation of a single sequence (e.g., adding metadata).
+- `SeqWorker`: Processes one sequence and returns zero or more output sequences (1→N transformation).
+- `SeqSliceWorker`: Processes a slice of sequences and returns another slice (bulk pipeline stage).
+
+Key utilities include:
+
+- **`NilSeqWorker`**: Identity worker—returns the input sequence unchanged.
+- **`AnnotatorToSeqWorker`**: Converts an in-place annotator into a `SeqWorker`, preserving compatibility with pipeline interfaces.
+- **`SeqToSliceWorker`**: Lifts a `SeqWorker` to operate on slices, with configurable error handling (`breakOnError`). Supports dynamic slice growth and logging via `obilog`.
+- **`SeqToSliceFilterOnWorker`**: Filters sequences in a slice using a `SequencePredicate`, preserving order and avoiding unnecessary allocations.
+- **`SeqToSliceConditionalWorker`**: Applies a `SeqWorker` only to sequences satisfying a predicate; others pass through unchanged.
+- **`.ChainWorkers()`**: Method on `SeqWorker` to compose two workers sequentially (pipeline chaining), enabling modular, reusable workflows.
+
+All functions emphasize safety: errors are either propagated (`breakOnError = true`) or logged with warnings, ensuring robustness in large-scale sequence processing pipelines.
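+
+A compact sketch of how such workers compose is given below. The types and helpers (`sequence`, `seqWorker`, `seqSliceWorker`, `chain`, `toSliceWorker`) are simplified, hypothetical counterparts of the ones listed above, and the standard `log` package stands in for `obilog`.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"log"
+	"strings"
+)
+
+// Simplified, hypothetical counterparts of the package's types.
+type sequence struct{ id, bases string }
+type seqWorker func(*sequence) ([]*sequence, error)
+type seqSliceWorker func([]*sequence) ([]*sequence, error)
+
+// chain composes two workers: every output of w is passed to next.
+func (w seqWorker) chain(next seqWorker) seqWorker {
+	return func(s *sequence) ([]*sequence, error) {
+		mid, err := w(s)
+		if err != nil {
+			return nil, err
+		}
+		var out []*sequence
+		for _, m := range mid {
+			res, err := next(m)
+			if err != nil {
+				return nil, err
+			}
+			out = append(out, res...)
+		}
+		return out, nil
+	}
+}
+
+// toSliceWorker lifts w to slices. With breakOnError the first error
+// aborts the batch; otherwise it is logged and the sequence is skipped.
+func toSliceWorker(w seqWorker, breakOnError bool) seqSliceWorker {
+	return func(in []*sequence) ([]*sequence, error) {
+		out := make([]*sequence, 0, len(in))
+		for _, s := range in {
+			res, err := w(s)
+			if err != nil {
+				if breakOnError {
+					return nil, err
+				}
+				log.Printf("warning: %v", err)
+				continue
+			}
+			out = append(out, res...)
+		}
+		return out, nil
+	}
+}
+
+// upper returns an upper-cased copy of the sequence.
+func upper(s *sequence) ([]*sequence, error) {
+	return []*sequence{{id: s.id, bases: strings.ToUpper(s.bases)}}, nil
+}
+
+// rejectEmpty fails on sequences without bases.
+func rejectEmpty(s *sequence) ([]*sequence, error) {
+	if s.bases == "" {
+		return nil, errors.New("empty sequence " + s.id)
+	}
+	return []*sequence{s}, nil
+}
+
+func main() {
+	pipeline := toSliceWorker(seqWorker(rejectEmpty).chain(upper), false)
+	out, err := pipeline([]*sequence{{"a", "acgt"}, {"b", ""}})
+	fmt.Println(len(out), out[0].bases, err) // prints: 1 ACGT <nil>
+}
+```
+
+With `breakOnError` set to `true`, the same input returns the error for sequence `b` and no output slice.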