⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,30 @@
# ObiTax: Default Taxonomy Management
This Go package (`obitax`) provides utilities for managing a **default taxonomy instance**, enabling centralized configuration and safe fallback behavior.
## Core Features
- **Singleton-style default taxonomy**: A single global `Taxonomy` instance can be designated as *the* default via `.SetAsDefault()`.
- **Thread-safe access**: Guards the package-level default with a `sync.Mutex`, so concurrent calls that set the default do not race.
- **Graceful fallback with `.OrDefault()`**:
  - If a `Taxonomy` receiver is `nil`, the method automatically substitutes it with the default taxonomy.
  - Supports optional panic on failure (`panicOnNil`) if no default is defined.
- **Utility checks**:
  - `HasDefaultTaxonomyDefined()` → returns whether a default is currently set.
  - `DefaultTaxonomy()` → retrieves the current global instance (if any).
## Design Intent
- Promotes **configuration reuse** and reduces boilerplate in client code.
- Supports robustness: avoids nil dereferences by allowing fallback to a globally configured taxonomy.
## Usage Pattern
```go
tax := NewTaxonomy("my-tax")
tax.SetAsDefault() // Now all `nil` receivers will resolve to this instance
result := someNilTax.OrDefault(true) // Uses default; panics only if none exists
```
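The nil-receiver fallback can be illustrated with a minimal, self-contained sketch. The `Taxonomy` struct, the `defaultMu`/`defaultTax` variables, and the method bodies below are hypothetical stand-ins written for illustration; they are not the package's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Taxonomy is a hypothetical stand-in for the package's taxonomy type.
type Taxonomy struct{ name string }

// Package-level default, guarded by a lock (assumed layout).
var (
	defaultMu  sync.RWMutex
	defaultTax *Taxonomy
)

// SetAsDefault registers t as the process-wide default taxonomy.
func (t *Taxonomy) SetAsDefault() {
	defaultMu.Lock()
	defer defaultMu.Unlock()
	defaultTax = t
}

// OrDefault returns t when it is non-nil, otherwise the registered default.
// With no default registered it panics if panicOnNil is true, else returns nil.
func (t *Taxonomy) OrDefault(panicOnNil bool) *Taxonomy {
	if t != nil {
		return t
	}
	defaultMu.RLock()
	d := defaultTax
	defaultMu.RUnlock()
	if d == nil && panicOnNil {
		panic("no default taxonomy defined")
	}
	return d
}

func main() {
	var missing *Taxonomy // nil receiver: method calls are still legal in Go
	fmt.Println(missing.OrDefault(false) == nil)

	(&Taxonomy{name: "my-tax"}).SetAsDefault()
	fmt.Println(missing.OrDefault(true).name)
}
```

Calling a method on a nil pointer is legal in Go as long as the method does not dereference the receiver, which is what makes this pattern possible.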
@@ -0,0 +1,28 @@
# Semantic Description of `IFilterOnName` Functionality in the `obitax` Package
The `IFilterOnName` method enables filtering taxonomic data (`Taxon`) instances by name, supporting both **exact** and **pattern-based matching**, with optional case-insensitive comparison.
- The method is defined on two receiver types (Go has no overloading):
  - On `*Taxonomy`: delegates to its iterator.
  - On `*ITaxon`: performs the actual filtering logic.
- **Parameters**:
  - `name` (`string`): search term or regex pattern.
  - `strict` (`bool`): if true, performs exact name equality; otherwise treats `name` as a regex.
  - `ignoreCase` (`bool`): when true, performs case-insensitive matching (applies to both modes).
- **Core behavior**:
  - Uses a `map` (`sentTaxa`) to avoid emitting duplicate taxa (keyed on the internal node ID).
  - For `strict = true`: compares names using a dedicated equality method (`IsNameEqual`).
  - For `strict = false`: compiles the pattern with `regexp.MustCompile`, prepending `(?i)` for case-insensitive matching. Note that `regexp.MustCompile` panics on an invalid pattern.
  - Filtering runs in a **goroutine**, streaming results into a new `ITaxon` iterator.
  - The source channel is closed once iteration completes.
- **Return value**: a new `*ITaxon` iterator containing only matching taxa, which leaves the input untouched and enables chaining.
- **Use cases**:
  - Find exact species names (e.g., *Homo sapiens*).
  - Search using partial or regex patterns (e.g., `^Pan.*` matches *Pan*, *Panthera*, and any other name starting with "Pan").
  - Case-insensitive lookups (e.g., "homo sapiens", "HOMO SAPIENS").
The design emphasizes **efficiency**, **correctness** (deduplication), and **flexibility** in taxonomic querying.
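The behavior described above can be sketched outside the package as follows. The `taxon` struct and the `filterOnName` and `matchNames` functions are hypothetical simplifications (a plain channel replaces the `ITaxon` iterator, and a single name replaces the scientific and alternate names); only the standard-library calls are real.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// taxon is a hypothetical, simplified record: an ID and one name.
type taxon struct {
	id   string
	name string
}

// filterOnName streams the taxa from in whose name matches, emitting each ID once.
func filterOnName(in <-chan taxon, name string, strict, ignoreCase bool) <-chan taxon {
	var pattern *regexp.Regexp
	if !strict {
		expr := name
		if ignoreCase {
			expr = "(?i)" + expr
		}
		pattern = regexp.MustCompile(expr) // panics on an invalid pattern
	}
	out := make(chan taxon)
	go func() {
		defer close(out)
		sent := make(map[string]bool) // IDs already emitted
		for t := range in {
			if sent[t.id] {
				continue
			}
			var ok bool
			switch {
			case !strict:
				ok = pattern.MatchString(t.name)
			case ignoreCase:
				ok = strings.EqualFold(t.name, name)
			default:
				ok = t.name == name
			}
			if ok {
				sent[t.id] = true
				out <- t
			}
		}
	}()
	return out
}

// matchNames feeds a slice through the filter and collects the matching names.
func matchNames(taxa []taxon, name string, strict, ignoreCase bool) []string {
	in := make(chan taxon)
	go func() {
		defer close(in)
		for _, t := range taxa {
			in <- t
		}
	}()
	var names []string
	for t := range filterOnName(in, name, strict, ignoreCase) {
		names = append(names, t.name)
	}
	return names
}

func main() {
	taxa := []taxon{
		{"9606", "Homo sapiens"},
		{"9598", "Pan troglodytes"},
		{"9606", "Homo sapiens"}, // duplicate ID, emitted only once
		{"9689", "Panthera leo"},
	}
	fmt.Println(matchNames(taxa, "^Pan", false, false))
	fmt.Println(matchNames(taxa, "homo sapiens", true, true))
}
```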
@@ -0,0 +1,12 @@
# Semantic Description of `IFilterOnTaxRank` Functionality in the *obitax* Package
The `IFilterOnTaxRank` method enables semantic filtering of taxonomic data by rank (e.g., `"species"`, `"genus"`). It is implemented across multiple core types—`ITaxon`, `TaxonSet`, `TaxonSlice`, and `Taxonomy`—providing a unified interface for rank-based selection.
- **Core behavior**: Returns an `*ITaxon` iterator containing only taxa whose node's rank matches the input string.
- **Rank normalization**: Internally, it resolves the requested `rank` against the taxonomy's internal rank map via `ptax.ranks.Innerize(rank)`, ensuring consistent mapping and handling of case-insensitive or canonical representations.
- **Efficiency**: Reuses the resolved rank pointer (`prank`) across consecutive taxa from the same `Taxonomy`, avoiding redundant lookups.
- **Concurrency-safe iteration**: Uses a goroutine to stream filtered results into the new iterator's channel (`newIter.source`), enabling lazy evaluation and memory-efficient processing of large datasets.
- **Delegation**: The same-named methods on `TaxonSet`, `TaxonSlice`, and `Taxonomy` delegate to the base iterator implementation, preserving consistency across input types.
- **Non-destructive**: Does not mutate source collections; instead produces a new iterator, supporting functional-style chaining.
This design supports scalable taxonomic querying in phylogenetic or biodiversity analysis pipelines, where filtering by hierarchical rank is essential.
@@ -0,0 +1,31 @@
# Semantic Overview of `obitax` Filtering Functionalities
The `obitax` package provides composable, iterator-based filtering methods for taxonomic data structures. All filters return lazy or buffered iterators (`*ITaxon`) enabling efficient, streaming-style traversal without materializing full collections.
## Core Filtering Operation: `IFilterOnSubcladeOf`
- **Purpose**: Filters elements belonging to a specific taxonomic subtree.
- **Behavior**:
  - Accepts a `*Taxon` as reference root.
  - Yields only taxa for which `IsSubCladeOf(taxon)` returns true (i.e., descendants of the given taxon).
- **Receiver types**:
  - Defined on `*ITaxon`, `TaxonSet`, `TaxonSlice`, and `Taxonomy`; the container variants delegate to the iterator variant.
  - Ensures a consistent interface across container types.
## Composite Filtering: `IFilterBelongingSubclades`
- **Purpose**: Filters taxa belonging to *any* of a set of specified subclade roots.
- **Behavior**:
  - Accepts `*TaxonSet` of clades (roots).
  - Uses optimized path for single-clade case: reuses `IFilterOnSubcladeOf`.
  - For multiple clades, checks via `IsBelongingSubclades(clades)` in a goroutine.
  - Returns original iterator unchanged if input set is empty.
## Design Highlights
- **Iterator-Centric**: All operations are defined on `ITaxon`, promoting chaining and lazy evaluation.
- **Concurrency Support**: Filtering uses goroutines with buffered channels (`source`), enabling asynchronous stream processing.
- **Type Abstraction**: Unified API across `TaxonSet`, `TaxonSlice`, and full `Taxonomy` via delegation.
- **Performance Consideration**: Special handling for single-clade case avoids unnecessary iteration overhead.
These methods enable expressive, scalable taxonomic queries—ideal for phylogenetic analysis or biodiversity data pipelines.
@@ -0,0 +1,40 @@
# `obitax` Package: String Interning with Thread-Safe Storage
This Go package (`obitax`) provides a **thread-safe string interner**—a data structure that deduplicates identical strings by storing only one copy per unique value and returning shared references.
## Core Components
- **`InnerString` struct**
  Holds:
  - `index`: A map from string values to pointers (ensuring identity via pointer equality).
  - `lock`: A `sync.RWMutex` to guarantee safe concurrent access.
- **Constructor: `NewInnerString()`**
  Initializes an empty interner with a preallocated map.
- **Method: `Innerize(value string) *string`**
  - Stores a new unique value (after cloning via `strings.Clone`) if absent.
  - Returns the pointer to either:
    - The newly interned string, or
    - An existing one (if already present).
  - Ensures **no duplicate string data** is stored for equal values.
  - Fully thread-safe via write lock.
- **Method: `Slice() []string`**
  Returns a snapshot of all interned strings as a slice (copying values, not pointers).
  - Not safe for concurrent writes during iteration.
  - Suitable for inspection or debugging.
## Semantic Use Cases
- **Memory optimization**: Avoid repeated allocation of identical strings (e.g., in parsing, serialization).
- **Pointer-based identity checks**: Use `==` on returned pointers to test string equality efficiently.
- **Concurrent safety**: Designed for use in multi-goroutine environments (e.g., HTTP servers, pipelines).
## Design Notes
- Uses `strings.Clone()` to decouple interned strings from original input lifetimes.
- Interning is **append-only**—no removal mechanism provided (implied by semantics of a simple interner).
- Returns `*string` to enable fast equality comparisons and reduce memory footprint.
> **Note**: This is a minimal, efficient interner—ideal for read-heavy or batched deduplication scenarios.
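A minimal sketch of such an interner is shown here. The `interner` type and its `intern` and `values` methods are hypothetical names chosen for illustration; the sketch only assumes the structure described above (a map guarded by a `sync.RWMutex`, with values cloned through `strings.Clone`).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// interner is a hypothetical stand-in for the interning structure described above.
type interner struct {
	mu    sync.RWMutex
	index map[string]*string
}

func newInterner() *interner {
	return &interner{index: make(map[string]*string)}
}

// intern returns the canonical pointer for value, storing a private copy on first use.
func (in *interner) intern(value string) *string {
	in.mu.Lock()
	defer in.mu.Unlock()
	if p, ok := in.index[value]; ok {
		return p
	}
	v := strings.Clone(value) // detach from the caller's backing array
	in.index[v] = &v
	return &v
}

// values returns a sorted copy of every interned string.
func (in *interner) values() []string {
	in.mu.RLock()
	defer in.mu.RUnlock()
	out := make([]string, 0, len(in.index))
	for s := range in.index {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	in := newInterner()
	a := in.intern("species")
	b := in.intern("species")
	c := in.intern("genus")
	fmt.Println(a == b, a == c) // pointer comparison replaces string comparison
	fmt.Println(in.values())
}
```

Taking the write lock on every call keeps the sketch short; a read-lock fast path for values that are already present is a common refinement when lookups dominate.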
@@ -0,0 +1,19 @@
# Semantic Description of `obitax` Taxonomic Functions
The `obitax` package provides two core methods for hierarchical taxon relationship analysis:
- **`IsSubCladeOf(parent *Taxon) bool`**
  Determines whether the current taxon is a **descendant** (i.e., subclade) of a given parent taxon.
  - Ensures both taxa belong to the *same taxonomy* and fails with a fatal log if not.
  - Traverses upward via `taxon.IPath()` (iterative ancestor path) to check if any node matches the parent's ID.
  - Returns `true` iff a match is found, indicating lineage descent.
- **`IsBelongingSubclades(clades *TaxonSet) bool`**
  Checks whether the current taxon, or any of its **ancestors**, belongs to a specified set of clades (`TaxonSet`).
  - Starts by testing direct membership via `clades.Contains(taxon.Node.id)`.
  - Walks upward through the hierarchy (`taxon = taxon.Parent()`) until either:
    - A match is found, or
    - The root is reached.
  - Final check at the root ensures completeness (e.g., if only root belongs).
Both functions support **robust phylogenetic queries**, enabling classification validation, filtering by clade membership, and hierarchical consistency checks in taxonomic trees.
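Both checks reduce to walking the parent chain, as the following sketch shows. The `node` type and the `isSubCladeOf` and `belongsToAny` functions are hypothetical. The sketch assumes that the root has a nil parent and that a taxon counts as a subclade of itself; the package may differ on both points.

```go
package main

import "fmt"

// node is a hypothetical tree node; the root is assumed to have a nil parent.
type node struct {
	id     string
	parent *node
}

// isSubCladeOf reports whether ancestor lies on the path from t up to the root.
// In this sketch a node is considered a subclade of itself.
func isSubCladeOf(t, ancestor *node) bool {
	for n := t; n != nil; n = n.parent {
		if n.id == ancestor.id {
			return true
		}
	}
	return false
}

// belongsToAny reports whether t or any of its ancestors is in the clade set.
func belongsToAny(t *node, clades map[string]bool) bool {
	for n := t; n != nil; n = n.parent {
		if clades[n.id] {
			return true
		}
	}
	return false
}

func main() {
	root := &node{id: "1"}
	primates := &node{id: "9443", parent: root}
	carnivora := &node{id: "33554", parent: root}
	homo := &node{id: "9606", parent: primates}

	fmt.Println(isSubCladeOf(homo, primates))
	fmt.Println(isSubCladeOf(homo, carnivora))
	fmt.Println(belongsToAny(homo, map[string]bool{"33554": true, "9443": true}))
}
```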
@@ -0,0 +1,31 @@
# Semantic Description of `obitax` Package Functionalities
The `obitax` package provides a robust iterator-based API for traversing taxonomic data structures in Go. Its core component is the `ITaxon` type, which implements a lazy, concurrent-safe iterator over taxon instances (`*Taxon`). Key features include:
- **Iterator Creation**: `ITaxon` can be instantiated via `NewITaxon()` or derived from collections:
  - `TaxonSet.Iterator()`, `TaxonSlice.Iterator()` (sorted), and `Taxonomy.nodes.Iterator()`.
  - Goroutines feed taxa into a channel, enabling non-blocking iteration.
- **Control Methods**:
  - `Next()` advances to the next taxon, returning success/failure.
  - `Get()` retrieves the current taxon (must follow a successful `Next`).
  - `Finished()` checks if iteration is complete.
- **Channel Management**:
  - `Push(taxon)` sends a taxon into the iterator's channel.
  - `Close()` terminates iteration by closing the source channel.
- **Iterator Composition**:
  - `Split()`: creates a new iterator sharing the same source and termination status (useful for parallel consumption).
  - `Concat(...)`: merges multiple iterators sequentially into one.
- **Metadata Enrichment**:
  - `AddMetadata(name, value)` wraps the iterator to inject metadata into each taxon via `SetMetadata`.
- **Subtree Traversal**:
  - `ISubTaxonomy()` (on `*Taxon` or via `Taxonomy.ITaxon(taxid)`) performs a breadth-first traversal of descendant taxa, starting from the current taxon or given ID. It uses parent-child adjacency logic to expand the subtree incrementally.
- **Consumption Utility**:
  - `Consume()` exhausts an iterator without processing (e.g., for side-effect-only pipelines).
All iterators are designed to be composable, memory-efficient (via channels), and safe for concurrent use. The package integrates with `obiutils` to manage pipeline registration/unregistration during subtree expansion.
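The `Next`/`Get`/`Push`/`Close` protocol can be sketched with a channel-backed iterator over strings. The `iterator` type and `fromSlice` helper are hypothetical and much simpler than `ITaxon`; in particular, this sketch supports one consumer per iterator value.

```go
package main

import "fmt"

// iterator is a hypothetical channel-backed iterator over strings.
type iterator struct {
	source   chan string
	current  string
	finished bool
}

func newIterator(buffer int) *iterator {
	return &iterator{source: make(chan string, buffer)}
}

// Push sends a value to the consumer; Close signals that no more values follow.
func (it *iterator) Push(v string) { it.source <- v }
func (it *iterator) Close()        { close(it.source) }

// Next blocks until a value is available and reports whether one was received.
func (it *iterator) Next() bool {
	if it.finished {
		return false
	}
	v, ok := <-it.source
	if !ok {
		it.finished = true
		return false
	}
	it.current = v
	return true
}

// Get returns the value fetched by the last successful Next.
func (it *iterator) Get() string { return it.current }

// Finished reports whether the source has been exhausted.
func (it *iterator) Finished() bool { return it.finished }

// fromSlice feeds the values from a producer goroutine, then closes the iterator.
func fromSlice(values []string) *iterator {
	it := newIterator(0)
	go func() {
		defer it.Close()
		for _, v := range values {
			it.Push(v)
		}
	}()
	return it
}

func main() {
	it := fromSlice([]string{"Homo", "Pan", "Gorilla"})
	for it.Next() {
		fmt.Println(it.Get())
	}
	fmt.Println(it.Finished())
}
```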
@@ -0,0 +1,31 @@
# Semantic Description of `obitax.LCA()` Functionality
The `LCA` method computes the **Lowest Common Ancestor (LCA)** of two taxonomic entities (`Taxon` instances) within a shared hierarchical taxonomy.
- **Input**: A pointer to another `*Taxon` (`t2`) and the receiver taxon (`t1`).
- **Output**: A `*Taxon` representing their LCA, or an error detailing why computation failed.
### Core Logic
- **Nil Safety**: Handles cases where one or both taxa are `nil`, returning the non-nil taxon (or an error if *both* are nil or lack internal `Node` references).
- **Validation Checks**:
- Ensures both taxa belong to the *same* `Taxonomy`.
- Verifies that the taxonomy is **rooted** (i.e., has a defined root node).
- **Path-Based Traversal**:
- Retrieves the full path from each taxon to the root via `Path()` (assumed to return an ordered list of nodes).
- Traverses both paths *backwards* (from root toward leaves) until divergence is detected.
- The first divergent node marks the boundary; the LCA is the last *common* ancestor (i.e., `slice[i+1]` after loop exit).
### Semantic Meaning
- The LCA represents the most specific taxonomic node that *contains both taxa* in its subtree.
- This operation is foundational for tasks like:
- Taxonomic classification consistency checks,
- Phylogenetic inference (e.g., computing taxon distances),
- Hierarchical aggregation in biodiversity analyses.
### Error Handling
Explicit errors cover:
- Invalid inputs (`nil` taxa, missing nodes),
- Cross-taxonomy queries,
- Unrooted taxonomy (undefined root → no unique LCA possible).
This implementation assumes a **directed acyclic graph** (specifically, a tree) structure for the taxonomy hierarchy.
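The path-based traversal can be sketched as below. The `node` type and the `path` and `lca` functions are hypothetical; the sketch assumes paths are ordered from the taxon (index 0) to the root (last index), so the comparison starts at the end of both slices.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical tree node; the root has a nil parent in this sketch.
type node struct {
	id     string
	parent *node
}

// path returns the nodes from n up to the root (n first, root last).
func path(n *node) []*node {
	var p []*node
	for ; n != nil; n = n.parent {
		p = append(p, n)
	}
	return p
}

// lca returns the lowest common ancestor of a and b.
// A single nil argument yields the other one; two nil arguments are an error.
func lca(a, b *node) (*node, error) {
	switch {
	case a == nil && b == nil:
		return nil, errors.New("both taxa are nil")
	case a == nil:
		return b, nil
	case b == nil:
		return a, nil
	}
	pa, pb := path(a), path(b)
	i, j := len(pa)-1, len(pb)-1
	if pa[i] != pb[j] {
		return nil, errors.New("taxa do not share a root")
	}
	// Walk from the root toward the leaves while both paths agree.
	for i >= 0 && j >= 0 && pa[i] == pb[j] {
		i--
		j--
	}
	return pa[i+1], nil // last node the two paths had in common
}

func main() {
	root := &node{id: "root"}
	primates := &node{id: "Primates", parent: root}
	homo := &node{id: "Homo", parent: primates}
	pan := &node{id: "Pan", parent: primates}

	anc, err := lca(homo, pan)
	fmt.Println(anc.id, err)
}
```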
@@ -0,0 +1,41 @@
# `obitax` Package: Taxon String Parser
The `obitax` package provides a robust parser for structured taxonomic strings used in biodiversity data processing.
## Core Functionality
- **`ParseTaxonString(taxonStr string)`**
  Parses strings in the format: `code:taxid [scientific name]@rank`.
- **Input Format Requirements**
  - `code`: Taxonomy identifier (e.g., "GBIF", "NCBI")
  - `taxid`: Numeric or alphanumeric taxonomic ID (e.g., "123456")
  - `scientific name`: Enclosed in square brackets (e.g., "[Homo sapiens]")
  - `rank`: Optional taxonomic rank after `@` (e.g., "species", defaults to `"no rank"` if missing)
- **Robustness Features**
  - Trims whitespace around all components.
  - Handles multiple `@` symbols (returns error).
  - Validates bracket pairing and ordering.
  - Ensures `code:taxid` contains exactly one colon separator.
- **Error Handling**
  Returns descriptive errors for:
  - Missing or malformed brackets
  - Invalid number of `@` separators
  - Absent colon in code:taxid segment
  - Empty fields (code, taxid, or scientific name)
- **Use Cases**
  Ideal for parsing legacy biodiversity records (e.g., from OBIS, GBIF), where taxon strings are semi-structured and need reliable extraction before indexing or matching against reference databases.
## Example
Input: `"GBIF:248093 [Homo sapiens]@species"`
Output components:
- `code = "GBIF"`
- `taxid = "248093"`
- `scientificName = "Homo sapiens"`
- `rank = "species"`
Returns empty strings and an error for invalid inputs.
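A parser following these rules might look like the sketch below. The `parseTaxonString` function is a hypothetical reimplementation written from the description above, not the package's code. It assumes that text after the closing bracket is an error and that an empty rank after `@` falls back to `"no rank"`.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTaxonString splits "code:taxid [scientific name]@rank" into its parts.
// On error, every returned string is empty.
func parseTaxonString(s string) (code, taxid, name, rank string, err error) {
	fail := func(msg string) (string, string, string, string, error) {
		return "", "", "", "", fmt.Errorf("invalid taxon string %q: %s", s, msg)
	}

	parts := strings.Split(strings.TrimSpace(s), "@")
	if len(parts) > 2 {
		return fail("more than one '@' separator")
	}
	rank = "no rank"
	if len(parts) == 2 {
		if r := strings.TrimSpace(parts[1]); r != "" {
			rank = r
		}
	}

	body := parts[0]
	open := strings.Index(body, "[")
	closeIdx := strings.LastIndex(body, "]")
	if open < 0 || closeIdx < open {
		return fail("missing or misordered brackets")
	}
	if strings.TrimSpace(body[closeIdx+1:]) != "" {
		return fail("unexpected text after ']'")
	}

	fields := strings.Split(body[:open], ":")
	if len(fields) != 2 {
		return fail("expected exactly one ':' in code:taxid")
	}
	code = strings.TrimSpace(fields[0])
	taxid = strings.TrimSpace(fields[1])
	name = strings.TrimSpace(body[open+1 : closeIdx])
	if code == "" || taxid == "" || name == "" {
		return fail("empty code, taxid, or scientific name")
	}
	return code, taxid, name, rank, nil
}

func main() {
	fmt.Println(parseTaxonString("GBIF:248093 [Homo sapiens]@species"))
	fmt.Println(parseTaxonString("GBIF:248093 [Homo sapiens]"))
	_, _, _, _, err := parseTaxonString("GBIF248093 [Homo sapiens]")
	fmt.Println(err)
}
```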
@@ -0,0 +1,19 @@
# `obitax` Package: Taxonomic Identifier Handling
The `obitax` package provides a lightweight, type-safe abstraction for handling taxonomic identifiers (`Taxid`) in the OBITools4 ecosystem.
- **`Taxid` type**: A pointer to a string, representing an opaque taxonomic ID (e.g., NCBI TaxID).
- **`TaxidFactory`**: A factory for constructing `Taxid`s from strings or integers, enforcing validation and normalization.
Key features:
- **Code prefix enforcement**: `FromString` validates that the input string starts with a required taxonomy code (e.g., `"tx"`), returning an error otherwise.
- **String parsing**: Automatically strips leading whitespace and extracts the suffix after `':'`.
- **Alphabet filtering**: Uses an ASCII set to extract only valid characters (e.g., digits), ensuring clean, standardized IDs.
- **String interning**: Internally uses `Innerize` (via `InnerString`) to deduplicate strings—improving memory efficiency and comparison speed.
- **Type safety**: `Taxid` is a distinct type (not raw string), reducing misuse and enabling future extension.
Supported conversions:
- `FromString(string)`: Parses `"tx:12345"` → internalized `"12345"`.
- `FromInt(int)`: Converts e.g., `12345` → internalized `"12345"`.
Designed for high-performance pipelines where many taxonomic IDs are processed and reused.
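The conversion steps can be sketched as follows. The `taxidFactory` type and its methods are hypothetical stand-ins. The sketch assumes a digits-only alphabet, requires the `code:` prefix, and drops any character outside the alphabet; the package's exact rules may differ. For brevity the interning map is not protected against concurrent use.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// taxidFactory is a hypothetical factory producing interned taxid pointers.
type taxidFactory struct {
	code     string
	interned map[string]*string
}

func newTaxidFactory(code string) *taxidFactory {
	return &taxidFactory{code: code, interned: make(map[string]*string)}
}

// intern returns the shared pointer for id, creating it on first use.
func (f *taxidFactory) intern(id string) *string {
	if p, ok := f.interned[id]; ok {
		return p
	}
	f.interned[id] = &id
	return &id
}

// FromString parses "code:id", keeps only the digits of id, and interns the result.
func (f *taxidFactory) FromString(s string) (*string, error) {
	s = strings.TrimLeftFunc(s, unicode.IsSpace)
	prefix, rest, found := strings.Cut(s, ":")
	if !found || prefix != f.code {
		return nil, fmt.Errorf("taxid %q does not start with %q", s, f.code+":")
	}
	digits := strings.Map(func(r rune) rune {
		if r >= '0' && r <= '9' {
			return r
		}
		return -1 // drop characters outside the assumed alphabet
	}, rest)
	if digits == "" {
		return nil, fmt.Errorf("taxid %q contains no valid characters", s)
	}
	return f.intern(digits), nil
}

// FromInt converts an integer identifier to its interned string form.
func (f *taxidFactory) FromInt(n int) *string {
	return f.intern(strconv.Itoa(n))
}

func main() {
	f := newTaxidFactory("tx")
	a, err := f.FromString("  tx:12345")
	b := f.FromInt(12345)
	fmt.Println(*a, err, a == b)

	_, err = f.FromString("ncbi:12345")
	fmt.Println(err)
}
```

Because both constructors go through the same interning step, two taxids built from `"tx:12345"` and `12345` compare equal as pointers.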
@@ -0,0 +1,29 @@
# `obitax` Package: Taxonomic Data Model and Navigation
The `obitax` package provides a semantic model for representing, querying, and manipulating taxonomic hierarchies in biodiversity data processing. Its core abstraction is the `Taxon` type, which encapsulates both structural (node ID, parent/child relationships) and semantic (scientific name, rank, metadata) information.
### Core Features
- **Taxon Representation**: Each `Taxon` links to a taxonomy and its underlying node, supporting multiple name classes (e.g., "scientific name", "common name"), customizable ranks, and extensible metadata via key-value pairs.
- **String Interoperability**: Implements `String()` for human-readable output (`taxonomy:taxid [name]`) and provides typed accessors like `ScientificName()`, `Rank()`, or `IsRoot()`.
### Name Handling & Matching
- Flexible name retrieval via `Name(class)`, case-insensitive equality (`IsNameEqual`), and regex-based matching (`IsNameMatching`). Names are interned for memory efficiency.
### Hierarchical Navigation
- **Path Traversal**: `IPath()` yields an iterator from current taxon up to root; `Path()` materializes this as a slice. Enables efficient lineage queries.
- **Rank-Based Lookup**: Methods like `TaxonAtRank(rank)`, or convenience wrappers (`Species()`, `Genus()`, `Family()`), allow targeted retrieval of higher-level ancestors.
- **Child Management**: Supports dynamic tree extension via `AddChild()`, parsing taxon strings and enforcing taxonomy consistency.
### Metadata Support
- Rich metadata operations: `SetMetadata`, `GetMetadata`, key/value iteration, and typed conversion (`MetadataAsString`). Enables attaching arbitrary annotations (e.g., confidence scores, source references).
### Robustness & Safety
- Nil-safe accessors prevent panics; logging and error handling ensure correctness (e.g., fatal on missing root in `IPath()`).
- Interning of names/ranks/classes (`Innerize`) reduces duplication and speeds comparisons.
Designed for scalability in large-scale metabarcoding pipelines, `obitax` bridges raw taxonomic data with high-level analytical operations.
@@ -0,0 +1,36 @@
# `obitax` Package: Taxonomic Node Representation and Management
The `obitax` package provides a lightweight, pointer-based Go implementation for representing taxonomic nodes in biological classification systems.
## Core Data Structure
- **`TaxNode`**: Represents a single taxon (e.g., species, genus) with the following fields:
- `id`: Unique taxon identifier (pointer to string).
- `parent`: Identifier of the parent node in the taxonomy hierarchy.
- `rank`: Taxonomic rank (e.g., `"species"`, `"family"`).
- `scientificname`: Canonical scientific name (e.g., *Homo sapiens*).
- `alternatenames`: Map of alternative names keyed by name class (e.g., `"common_name"`, `"synonym"`).
## Key Functionalities
- **String Representation**
`String(taxonomyCode)` returns a formatted label like `"NCBI:12345 [Homo sapiens]@species"` (or raw ID if enabled via `obidefault.UseRawTaxids()`).
- **Accessors**
- `Id()`, `ParentId()`: Retrieve identifiers.
- `ScientificName()` / `Rank()`: Return name or rank (defaulting to `"NA"` if missing).
- `Name(class)`: Fetch name by class (`"scientific name"` or alternate).
- **Mutators**
- `SetName(name, class)`: Assign scientific name or add/update alternate names.
- **Name Matching & Validation**
- `IsNameEqual(name, ignoreCase)`: Exact or case-insensitive match against scientific/alternate names.
- `IsNameMatching(pattern)`: Regex-based pattern matching over all available names.
## Design Notes
- Uses pointers for optional fields (enables `nil` semantics).
- Graceful handling of missing data (`NA`, empty strings, safe dereferencing with `nil` checks).
- Integrates logging via Logrus (`log.Panic` on misuse, e.g., setting name of `nil` node).
- Designed for use in larger OBITools pipelines (e.g., with `obidefault` configuration).
@@ -0,0 +1,18 @@
# `obitax` Package: Taxonomic Data Management
The `obitax` package provides a robust framework for managing hierarchical taxonomic classifications. Its core component is the `Taxonomy` struct, which encapsulates metadata (name, code), taxon identifiers (`ids`, `ranks`), names and name classes (`names`, `nameclasses`), node hierarchy (`nodes`, `root`), indexing for fast lookup, and validation logic.
## Key Functionalities
- **Initialization**: `NewTaxonomy()` creates a new taxonomy with configurable identifier alphabet and initializes internal data structures.
- **Identifier Handling**: `Id()` validates and converts string-based taxon IDs to internal representations; `TaxidString()` retrieves formatted identifiers (e.g., `"code:id [name]"`).
- **Taxon Access**: `Taxon()` fetches a taxon by ID, returning whether it's an alias; `AsTaxonSet()` exposes the full taxonomic node collection.
- **Structure Management**:
- `AddTaxon()` inserts a new taxon with parent, rank, and root flags.
- `AddAlias()` maps alternative IDs to existing taxa (supporting replacement).
- **Metadata Queries**: Methods like `RankList()`, `Name()`, and `Code()` expose taxonomy metadata.
- **Root Control**: `SetRoot()`/`Root()` manage the root node; `HasRoot()` checks its presence.
- **Path Insertion**: `InsertPathString()` builds or extends a taxonomy from an ordered list of taxon strings, enforcing parent-child consistency.
- **Phylogenetic Export**: `AsPhyloTree()` converts the taxonomy into a phylogeny-compatible tree (`obiphylo.PhyloNode`), enabling downstream evolutionary analysis.
All operations gracefully handle `nil` receivers via an internal `.OrDefault()` helper, ensuring safe usage in pipelines. Error reporting is explicit and contextualized (e.g., duplicate taxon, missing parent).
@@ -0,0 +1,24 @@
# TaxonSet: Semantic Description of Functionality
The `TaxonSet` type manages a collection of taxonomic entities within a hierarchical taxonomy system. It stores mappings from unique identifiers (pointers to strings) to `TaxNode` instances, supporting both canonical taxa and aliases.
- **Construction**: Created via `(Taxonomy).NewTaxonSet()`, initializing an empty set and linking it to a specific taxonomy.
- **Basic Queries**:
- `Get(id)`: Retrieves the corresponding taxon (or nil).
- `Len()`: Returns count of *unique* taxa, excluding aliases.
- `Contains(id)`, `IsATaxon(id)`, and `IsAlias(id)` enable precise taxon/alias distinction.
- **Insertion & Management**:
- `Insert(node)`: Adds or updates a taxon node.
- `InsertTaxon(taxon)`: Safe insertion with taxonomy validation; auto-creates set if nil.
- `Alias(id, taxon)`: Registers an alias (non-canonical ID pointing to a real node), incrementing internal `nalias` counter.
- **Hierarchy & Iteration**:
- `Sort()`: Returns a topologically sorted slice of taxa (parents before children), respecting tree structure.
- `Taxonomy()`: Provides access to the parent taxonomy.
- **Phylogenetic Export**:
- `AsPhyloTree(root)`: Converts the set into a rooted phylogenetic tree (`obiphylo.PhyloNode`), embedding taxon names, ranks, and parent relationships as node attributes.
In essence, `TaxonSet` enables efficient storage, lookup, validation, and structural manipulation of taxonomic data—supporting both biological classification logic (e.g., alias resolution, hierarchy traversal) and downstream interoperability with phylogenetic tools.
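The "parents before children" ordering that `Sort()` is described as producing can be obtained with a depth-first visit that emits each parent before the node itself. The `sortParentsFirst` function and its map-based input are hypothetical; the sketch assumes that a root either maps to itself or to an ID outside the set.

```go
package main

import (
	"fmt"
	"sort"
)

// sortParentsFirst orders the IDs so that every parent present in the set
// precedes its children. parentOf maps a node ID to its parent ID.
func sortParentsFirst(parentOf map[string]string) []string {
	ids := make([]string, 0, len(parentOf))
	for id := range parentOf {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic starting order

	ordered := make([]string, 0, len(ids))
	done := make(map[string]bool, len(ids))

	var visit func(id string)
	visit = func(id string) {
		if done[id] {
			return
		}
		done[id] = true // marked first, so malformed cyclic input cannot loop forever
		if p := parentOf[id]; p != id {
			if _, inSet := parentOf[p]; inSet {
				visit(p) // emit the parent before this node
			}
		}
		ordered = append(ordered, id)
	}
	for _, id := range ids {
		visit(id)
	}
	return ordered
}

func main() {
	parentOf := map[string]string{
		"9606": "9605", // Homo sapiens -> Homo
		"9605": "9443", // Homo -> Primates
		"9443": "1",    // Primates -> root
		"1":    "1",    // root points to itself
	}
	fmt.Println(sortParentsFirst(parentOf))
}
```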
@@ -0,0 +1,25 @@
# `obitax` Package: Taxonomic Data Handling
The `obitax` package provides structured support for managing collections of taxon nodes in a biological taxonomy.
- **Core Type**: `TaxonSlice` encapsulates an ordered list of `*TaxNode`s and a reference to their parent `Taxonomy`.
- **Construction**: Created via `(taxonomy *Taxonomy).NewTaxonSlice(size, capacity)`, initializing a typed slice with optional pre-allocation.
- **Accessors**:
- `Get(i int) *TaxNode`: retrieves the raw node at index.
- `Taxon(i int) *Taxon`: wraps a node with its taxonomy context, enabling richer operations.
- `Len() int`: returns the current number of nodes.
- **Mutation Methods**:
- `Set(index, taxon)`: replaces a node at given index (taxonomy-mismatch panics).
- `Push(taxon)`: appends a taxon to the end (also enforces taxonomy consistency).
- `ReduceToSize(n)`: truncates slice to first *n* elements.
- **Utility Features**:
- `Reverse(inplace)`: reverses node order — either in-place or as a new slice.
- `String() string`: formats the entire path as `"id@sci_name@rank"` entries, separated by `|`, in *reverse* (leaf-to-root) order — ideal for lineage strings.
- **Safety & Semantics**:
- Nil-safety in all methods (returns `nil` or zero).
- Enforces taxonomy coherence: mixing taxa from different taxonomies triggers a panic.
This package enables efficient, type-safe manipulation of hierarchical biological classification paths (e.g., for sequence annotation or metabarcoding output).
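As an illustration of the lineage format produced by `String()`, the sketch below joins `id@sci_name@rank` entries with `|` while walking a slice from its last element to its first. The `node` type and `lineage` function are hypothetical, and which end of the slice holds the leaf is an assumption that depends on how the slice was built.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, flattened view of a taxon node.
type node struct {
	id, name, rank string
}

// lineage formats the nodes as "id@name@rank" entries joined by "|",
// emitting the slice in reverse (last element first).
func lineage(nodes []node) string {
	parts := make([]string, 0, len(nodes))
	for i := len(nodes) - 1; i >= 0; i-- {
		n := nodes[i]
		parts = append(parts, n.id+"@"+n.name+"@"+n.rank)
	}
	return strings.Join(parts, "|")
}

func main() {
	path := []node{
		{"9606", "Homo sapiens", "species"},
		{"9605", "Homo", "genus"},
		{"9604", "Hominidae", "family"},
	}
	fmt.Println(lineage(path))
}
```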