⬆️ version bump to v4.4.30

- Update obioptions.Version from "Release 4.4.29" to "Release 4.4.30"
- Update version.txt from 4.4.29 → 4.4.30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
+25
@@ -0,0 +1,25 @@
# Apat Package: Pattern Matching for Biological Sequences
The `obiapat` Go package provides high-performance pattern matching over biological sequences using the **Apat algorithm**, a C-based implementation wrapped in Go. It supports fuzzy matching (with mismatches and indels), reverse-complement patterns, memory-safe resource management via finalizers, and efficient filtering of non-overlapping matches.
## Core Types
- `ApatPattern`: Represents a compiled pattern (up to 64 bp), supporting IUPAC ambiguity codes (`W`, `[AT]`), negated bases (`!A`), and fixed positions (`#`).
- `ApatSequence`: Wraps a biological sequence (from `obiseq.BioSequence`) for fast matching, with optional circular topology support and memory recycling.
## Key Functions & Methods
- `MakeApatPattern(pattern string, errormax int, allowsIndel bool)`: Compiles a pattern with max error tolerance and optional indels.
- `ReverseComplement()`: Returns the reverse-complemented pattern (useful for DNA strand symmetry).
- `FindAllIndex(...)`: Returns all matches as `[start, end, errors]`, supporting partial sequence searches.
- `IsMatching(...)`: Boolean check for presence of at least one match in a region.
- `BestMatch(...)`: Finds the *best* (lowest-error) match, with local realignment for indel-containing patterns.
- `FilterBestMatch(...)`: Returns *non-overlapping* matches, prioritizing lower-error occurrences.
- `AllMatches(...)`: Filters and refines all valid matches (including indel-aware alignment).
- `Free()`, `Len()`: Explicit memory cleanup and length queries.
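
The non-overlapping selection described for `FilterBestMatch` can be illustrated with a greedy pass. This is only a sketch under stated assumptions, not the package's code: the `match` type and the `filterBest` function are hypothetical, matches are treated as half-open `[start, end)` intervals, and ties on error count are broken in favour of the leftmost match.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical stand-in for one [start, end, errors] triple.
type match struct{ start, end, errors int }

// filterBest keeps a set of non-overlapping matches, preferring matches
// with fewer errors and, on ties, the leftmost one.
func filterBest(ms []match) []match {
	// Rank a copy of the candidates: fewest errors first, then smallest start.
	ranked := append([]match(nil), ms...)
	sort.SliceStable(ranked, func(i, j int) bool {
		if ranked[i].errors != ranked[j].errors {
			return ranked[i].errors < ranked[j].errors
		}
		return ranked[i].start < ranked[j].start
	})
	var kept []match
	for _, m := range ranked {
		overlaps := false
		for _, k := range kept {
			// Half-open intervals overlap when each starts before the other ends.
			if m.start < k.end && k.start < m.end {
				overlaps = true
				break
			}
		}
		if !overlaps {
			kept = append(kept, m)
		}
	}
	// Report the survivors in sequence order.
	sort.Slice(kept, func(i, j int) bool { return kept[i].start < kept[j].start })
	return kept
}

func main() {
	ms := []match{{0, 10, 2}, {5, 15, 0}, {20, 30, 1}, {28, 38, 1}}
	fmt.Println(filterBest(ms)) // prints [{5 15 0} {20 30 1}]
}
```

The quadratic overlap check keeps the sketch short; an interval tree or a sweep over sorted positions would scale better when a sequence yields many candidate matches.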
## Implementation Notes
Internally, the package uses `cgo` to interface with C structures (`Pattern`, `Seq`) allocated via custom memory management. Finalizers ensure safe deallocation, while unsafe pointer arithmetic avoids data copying during search (e.g., `unsafe.SliceData`). Logging is integrated via Logrus.
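
The combination of an explicit `Free()` and a finalizer can be sketched in pure Go. The `resource` and `handle` types below are hypothetical placeholders for the C-allocated structures, and no `cgo` is involved; the point is only the pattern of an idempotent release with a finalizer as a safety net.

```go
package main

import (
	"fmt"
	"runtime"
)

// resource is a hypothetical stand-in for a C-allocated structure.
type resource struct {
	freed bool
}

// handle owns a resource and releases it at most once.
type handle struct {
	res *resource
}

// newHandle allocates the resource and registers a finalizer as a safety
// net in case the caller never calls Free.
func newHandle() *handle {
	h := &handle{res: &resource{}}
	runtime.SetFinalizer(h, func(h *handle) { h.Free() })
	return h
}

// Free releases the resource explicitly; calling it twice is harmless.
func (h *handle) Free() {
	if h.res == nil {
		return
	}
	h.res.freed = true // real code would call the C deallocator here
	h.res = nil
	runtime.SetFinalizer(h, nil) // the safety net is no longer needed
}

func main() {
	h := newHandle()
	r := h.res
	h.Free()
	h.Free() // second call is a no-op
	fmt.Println(r.freed)
}
```

Finalizers run at an unspecified time after the object becomes unreachable, and may not run at all before the program exits, so explicit `Free()` remains the dependable path for large or short-lived allocations.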
This package enables scalable, low-level pattern mining in NGS data preprocessing pipelines (e.g., primer detection, adapter trimming).
+32
@@ -0,0 +1,32 @@
# Semantic Description of `obiapat` Package Functionality
The `obiapat` package provides utilities for constructing and representing **approximate sequence patterns**—flexible biological or symbolic string templates supporting mismatches, insertions, and deletions.
## Core Functionality
- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool)`**
  Parses a pattern specification (e.g., `"A[T]C!GT"`) and returns an internal representation (`*ApatPattern`) suitable for approximate matching.
  - `pattern`: A string where:
    - Standard characters (e.g., `'A'`, `'C'`) denote exact matches.
    - Brackets such as `[AT]` list the bases accepted at one position, the explicit form of an IUPAC ambiguity code.
    - An exclamation mark negates the base that follows it: `!G` accepts any base except `G`.
  - `errormax`: Maximum number of allowed errors (mismatches or indels, depending on flags).
  - `allowsIndel`: Boolean flag enabling/disabling insertion/deletion operations.
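
To make the pattern syntax concrete, here is a minimal sketch of a parser for it. It is not the package's parser: `parsePattern`, `position`, and the bitmask encoding are hypothetical, and the sketch assumes that `!` precedes the symbol it negates and that `#` follows the symbol it marks as a fixed (error-free) position.

```go
package main

import (
	"fmt"
	"strings"
)

// iupac maps nucleotide codes to bitmasks over A=1, C=2, G=4, T=8.
var iupac = map[byte]uint8{
	'A': 1, 'C': 2, 'G': 4, 'T': 8,
	'R': 5, 'Y': 10, 'S': 6, 'W': 9, 'K': 12, 'M': 3,
	'B': 14, 'D': 13, 'H': 11, 'V': 7, 'N': 15,
}

// position is a hypothetical parsed pattern symbol.
type position struct {
	allowed uint8 // bitmask of accepted bases
	fixed   bool  // set by a trailing '#'
}

// parsePattern converts a pattern string into a list of positions.
func parsePattern(p string) ([]position, error) {
	p = strings.ToUpper(p)
	var out []position
	for i := 0; i < len(p); {
		negate := p[i] == '!'
		if negate {
			i++
			if i == len(p) {
				return nil, fmt.Errorf("dangling '!' at end of pattern")
			}
		}
		var mask uint8
		if p[i] == '[' {
			end := strings.IndexByte(p[i:], ']')
			if end < 0 {
				return nil, fmt.Errorf("unclosed '[' at offset %d", i)
			}
			// Union of every code listed between the brackets.
			for _, c := range []byte(p[i+1 : i+end]) {
				m, ok := iupac[c]
				if !ok {
					return nil, fmt.Errorf("invalid symbol %q in brackets", c)
				}
				mask |= m
			}
			i += end + 1
		} else {
			m, ok := iupac[p[i]]
			if !ok {
				return nil, fmt.Errorf("invalid symbol %q at offset %d", p[i], i)
			}
			mask = m
			i++
		}
		if negate {
			mask = ^mask & 15 // complement within the four bases
		}
		if mask == 0 {
			return nil, fmt.Errorf("a position accepts no base")
		}
		pos := position{allowed: mask}
		if i < len(p) && p[i] == '#' {
			pos.fixed = true
			i++
		}
		out = append(out, pos)
	}
	return out, nil
}

func main() {
	ps, err := parsePattern("A[CT]!GW#")
	fmt.Println(ps, err)
}
```

Encoding each position as a four-bit set makes the per-base test a single AND, which is one reason bit-parallel matchers favour short patterns.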
## Behavior & Semantics
- Returns a compiled pattern object (non-nil) on success; errors may arise from malformed input or invalid parameters.
- Supports three modes:
  - **Exact matching** (`errormax = 0`, `allowsIndel = false`).
  - **Substitution-only approximation** (`errormax > 0`, `allowsIndel = false`).
  - **Full approximate matching with indels** (`errormax > 0`, `allowsIndel = true`).
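
The difference between the three modes can be shown with two textbook distance computations. This sketch does not reproduce the Apat algorithm: `hammingBest`, `editBest`, and `matches` are hypothetical helpers that use a sliding Hamming window for the substitution-only case and a dynamic-programming edit distance with a free start in the text when indels are allowed.

```go
package main

import "fmt"

// hammingBest returns the fewest mismatches between pattern and any
// window of text with the same length, or -1 if text is too short.
func hammingBest(pattern, text string) int {
	best := -1
	for s := 0; s+len(pattern) <= len(text); s++ {
		d := 0
		for i := 0; i < len(pattern); i++ {
			if pattern[i] != text[s+i] {
				d++
			}
		}
		if best < 0 || d < best {
			best = d
		}
	}
	return best
}

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editBest returns the smallest edit distance (substitutions, insertions,
// deletions) between pattern and any substring of text.
func editBest(pattern, text string) int {
	prev := make([]int, len(pattern)+1)
	cur := make([]int, len(pattern)+1)
	for i := range prev {
		prev[i] = i // aligning a pattern prefix against empty text
	}
	best := prev[len(pattern)]
	for j := 1; j <= len(text); j++ {
		cur[0] = 0 // a match may start anywhere in text
		for i := 1; i <= len(pattern); i++ {
			cost := 1
			if pattern[i-1] == text[j-1] {
				cost = 0
			}
			cur[i] = min3(prev[i-1]+cost, prev[i]+1, cur[i-1]+1)
		}
		if cur[len(pattern)] < best {
			best = cur[len(pattern)]
		}
		prev, cur = cur, prev
	}
	return best
}

// matches applies the modes: errormax 0 is exact matching, allowsIndel
// switches between substitution-only and full approximate matching.
func matches(pattern, text string, errormax int, allowsIndel bool) bool {
	if allowsIndel {
		return editBest(pattern, text) <= errormax
	}
	d := hammingBest(pattern, text)
	return d >= 0 && d <= errormax
}

func main() {
	// The text contains ACGAC, i.e. the pattern with its T deleted.
	fmt.Println(matches("ACGTAC", "GGACGACGG", 1, false)) // false
	fmt.Println(matches("ACGTAC", "GGACGACGG", 1, true))  // true
}
```

In the usage above a single deletion costs three mismatches under the substitution-only model but one error once indels are allowed, which is why the same `errormax` can accept or reject a site depending on `allowsIndel`.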
## Testing Coverage
The provided test suite validates:
- Valid pattern parsing across different configurations.
- Correct handling of `nil` vs. non-nil output pointers.
- Robustness against error conditions (e.g., invalid inputs would trigger expected errors).
In summary, `obiapat` enables efficient definition and handling of *approximate regular expressions* tailored for sequence analysis in bioinformatics or pattern recognition contexts.
+27
@@ -0,0 +1,27 @@
# PCR Simulation Module (`obiapat`)
This Go package implements a **PCR (Polymerase Chain Reaction) simulation algorithm** for biological sequence analysis. It supports flexible primer matching, amplicon extraction with optional flanking extensions, and handles both linear and circular DNA topologies.
## Key Functionalities
- **Primer Matching**: Accepts forward/reverse primers with configurable mismatch tolerance (`OptionForwardPrimer`, `OptionReversePrimer`). Internally builds pattern objects and their reverse complements.
- **Amplicon Extraction**: Identifies valid amplicons bounded by primer pairs, respecting user-defined length constraints (`OptionMinLength`, `OptionMaxLength`).
- **Extension Support**: Optionally adds fixed-length flanking regions (`OptionWithExtension`), either keeping only amplicons with the full extension or also accepting partial, trimmed extensions.
- **Topology Handling**: Supports linear (`Circular: false`) and circular DNA sequences via `OptionCircular`.
- **Batch & Parallel Processing**: Configurable batch size (`OptionBatchSize`) and parallel workers count (`OptionParallelWorkers`), enabling efficient processing of large datasets.
- **Annotation-Rich Output**: Each amplicon includes detailed annotations (primer sequences, match positions, errors, direction), preserving original sequence metadata.
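
The primer-matching and amplicon-extraction steps listed above can be sketched with exact string search. This is a deliberately reduced illustration, not the module's algorithm: `revComp`, `allIndex`, and `amplicons` are hypothetical names, primers must match without errors, only the forward orientation of a linear sequence is scanned, and the length bounds are assumed to apply to the region between the primers.

```go
package main

import (
	"fmt"
	"strings"
)

// revComp returns the reverse complement of an upper-case DNA string.
// Symbols other than A, C, G and T become N.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// allIndex lists every start offset of sub in s, overlapping hits included.
func allIndex(s, sub string) []int {
	var hits []int
	if sub == "" {
		return hits
	}
	for from := 0; ; {
		i := strings.Index(s[from:], sub)
		if i < 0 {
			return hits
		}
		hits = append(hits, from+i)
		from += i + 1
	}
}

// amplicons returns every region lying between an exact forward-primer hit
// and a downstream hit of the reverse primer's reverse complement, keeping
// regions whose length is within [minLen, maxLen].
func amplicons(seq, fwd, rev string, minLen, maxLen int) []string {
	rcHits := allIndex(seq, revComp(rev))
	var out []string
	for _, f := range allIndex(seq, fwd) {
		begin := f + len(fwd)
		for _, r := range rcHits {
			if r < begin {
				continue // reverse site lies upstream or overlaps the primer
			}
			if n := r - begin; n >= minLen && n <= maxLen {
				out = append(out, seq[begin:r])
			}
		}
	}
	return out
}

func main() {
	seq := "GGACGTCCCCAAAATGCAAGG"
	fmt.Println(amplicons(seq, "ACGT", "TTGCA", 5, 20)) // prints [CCCCAAAA]
}
```

The reverse primer is given 5'→3' on the opposite strand, so the search on the forward strand looks for its reverse complement (`TGCAA` in the usage above).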
## Core API
- `PCRSim(sequence, options...)`: Simulates PCR on a single sequence.
- `PCRSlice(sequencesSlice, options...)`: Applies simulation across multiple sequences in a slice.
- `PCRSliceWorker(options...)`: Returns a reusable worker function for parallel execution via `obiseq.MakeISliceWorker`.
## Implementation Details
- Uses pattern-matching (`ApatPattern`) with fuzzy search to locate primers.
- Handles circular topology by wrapping indices around sequence boundaries.
- Reuses internal memory via `MakeApatSequence`/`Free`, supporting efficient GC and large-scale processing.
- Logs critical errors with `logrus`; debug-level details for amplicon generation.
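
Index wrapping for a circular topology comes down to modular arithmetic. The helper below, `circularSlice`, is a hypothetical illustration of that idea and makes no claim about how the module stores or copies circular sequences.

```go
package main

import "fmt"

// circularSlice returns length bases starting at start, wrapping past the
// end of seq as if it were circular. Negative starts count from the end.
func circularSlice(seq string, start, length int) string {
	n := len(seq)
	if n == 0 || length <= 0 {
		return ""
	}
	out := make([]byte, length)
	for i := 0; i < length; i++ {
		// Go's % keeps the sign of the dividend, so normalise to [0, n).
		out[i] = seq[((start+i)%n+n)%n]
	}
	return string(out)
}

func main() {
	seq := "ACGTTGCA"
	fmt.Println(circularSlice(seq, 6, 4))  // prints CAAC
	fmt.Println(circularSlice(seq, -2, 3)) // prints CAA
}
```

With this kind of wrapping, a primer site or amplicon that straddles the origin of a plasmid or organelle genome is read the same way as one in the middle of the sequence.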
Designed to integrate within the OBITools4 ecosystem, this module enables high-fidelity *in silico* PCR for metabarcoding and NGS data validation workflows.
+23
@@ -0,0 +1,23 @@
## Semantic Description of `IsPatternMatchSequence`
The function `IsPatternMatchSequence` defines a **sequence predicate** for pattern-based matching in biological sequences (e.g., DNA/RNA), supporting fuzzy and strand-aware search.
### Core Functionality:
- **Input Parameters**
  - `pattern`: A regular expression-like string describing the target pattern.
  - `errormax`: Maximum allowed mismatches (substitutions only by default).
  - `bothStrand`: If true, also search on the reverse-complement strand.
  - `allowIndels`: Enables insertion/deletion errors (beyond mismatches) when set to true.
- **Internal Workflow**
  - Parses the pattern into an automaton (`apat`) via `MakeApatPattern`.
  - Computes its reverse complement for dual-strand matching.
  - Returns a closure (`SequencePredicate`) that tests whether a given `BioSequence` matches the pattern (or its RC), within error tolerance.
- **Matching Logic**
  - Converts input sequence to `apat` format.
  - Checks for a match on the forward strand first; if that fails and `bothStrand=true`, tries the reverse complement.
  - Uses automaton-based matching (`IsMatching`) for efficient fuzzy search.
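
The closure structure described above can be sketched without the `apat` machinery. In this hypothetical version, `patternPredicate` and `sequencePredicate` are illustrative names, sequences are plain strings rather than `BioSequence` values, and matching is exact substring search, so `errormax` and `allowIndels` have no counterpart here.

```go
package main

import (
	"fmt"
	"strings"
)

// sequencePredicate is a hypothetical predicate over plain-string sequences.
type sequencePredicate func(seq string) bool

// revComp returns the reverse complement of an upper-case DNA string.
// Symbols other than A, C, G and T become N.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// patternPredicate prepares the pattern and its reverse complement once,
// then returns a closure that tests individual sequences.
func patternPredicate(pattern string, bothStrand bool) sequencePredicate {
	pattern = strings.ToUpper(pattern)
	rc := revComp(pattern) // computed once, captured by the closure
	return func(seq string) bool {
		seq = strings.ToUpper(seq)
		// Forward strand first.
		if strings.Contains(seq, pattern) {
			return true
		}
		// Fall back to the reverse complement only when requested.
		return bothStrand && strings.Contains(seq, rc)
	}
}

func main() {
	both := patternPredicate("ACGG", true)
	forwardOnly := patternPredicate("ACGG", false)
	fmt.Println(both("TTCCGTTT"), forwardOnly("TTCCGTTT")) // prints true false
}
```

Doing the pattern preparation outside the returned closure means the cost is paid once per predicate rather than once per sequence, which matters when the predicate filters millions of reads.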
### Semantic Use Case:
Enables flexible, error-tolerant detection of sequence motifs (e.g., primers, barcodes) in high-throughput sequencing data—supporting both *in silico* primer design validation and read filtering in metagenomic pipelines.