mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,88 @@
# `obiapat`: High-Performance Approximate Pattern Matching for Biological Sequences

The `obiapat` Go package delivers **fast, memory-safe approximate pattern matching** over biological sequences (DNA/RNA) by wrapping a C implementation of the **Apat algorithm**. Designed for NGS preprocessing (e.g., primer detection, adapter trimming), it supports fuzzy matching with mismatches and indels, reverse-complement search, circular topology handling, and efficient non-overlapping match filtering, and it integrates with the rest of the OBITools4 ecosystem.

## Core Concepts

- **`ApatPattern`**: Compiled pattern object (≤64 bp) supporting:
  - IUPAC ambiguity codes (`W`, `R`) and bracketed base sets (`[AT]`)
  - Negated bases (`!A` = "not A")
  - Fixed-position anchors (`#` after a symbol: no error is tolerated at that position)
- **`ApatSequence`**: Lightweight wrapper around `obiseq.BioSequence`, enabling optimized pattern scanning with optional circular indexing and memory recycling.
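
To make the pattern symbols above concrete, here is a minimal pure-Go sketch that compiles a pattern string into one bitmask per position. It is an illustration only, not the package's C automaton: the names `compile`, `position`, and the `base*` constants are hypothetical, and the handling of `!`, `[...]`, and `#` follows the descriptions in this README rather than the actual Apat source.

```go
package main

import (
	"fmt"
	"strings"
)

// One bit per base, so a set of bases fits in a single byte.
const (
	baseA uint8 = 1 << iota
	baseC
	baseG
	baseT
	anyBase = baseA | baseC | baseG | baseT
)

// iupac maps each nucleotide code to the set of bases it stands for.
var iupac = map[byte]uint8{
	'A': baseA, 'C': baseC, 'G': baseG, 'T': baseT, 'U': baseT,
	'R': baseA | baseG, 'Y': baseC | baseT, 'S': baseC | baseG,
	'W': baseA | baseT, 'K': baseG | baseT, 'M': baseA | baseC,
	'B': baseC | baseG | baseT, 'D': baseA | baseG | baseT,
	'H': baseA | baseC | baseT, 'V': baseA | baseC | baseG,
	'N': anyBase,
}

// position is one compiled pattern symbol (hypothetical type).
type position struct {
	mask  uint8 // bases accepted at this position
	exact bool  // true when the symbol is followed by '#'
}

// compile turns a pattern string into per-position masks.
func compile(pattern string) ([]position, error) {
	p := strings.ToUpper(pattern)
	var out []position
	for i := 0; i < len(p); {
		negate := p[i] == '!'
		if negate {
			i++
			if i >= len(p) {
				return nil, fmt.Errorf("dangling '!' in %q", pattern)
			}
		}
		var mask uint8
		if p[i] == '[' {
			end := strings.IndexByte(p[i:], ']')
			if end < 0 {
				return nil, fmt.Errorf("unclosed '[' in %q", pattern)
			}
			// Union of every code listed between the brackets.
			for _, c := range []byte(p[i+1 : i+end]) {
				m, ok := iupac[c]
				if !ok {
					return nil, fmt.Errorf("unknown symbol %q in %q", c, pattern)
				}
				mask |= m
			}
			i += end + 1
		} else {
			m, ok := iupac[p[i]]
			if !ok {
				return nil, fmt.Errorf("unknown symbol %q in %q", p[i], pattern)
			}
			mask = m
			i++
		}
		if negate {
			mask = ^mask & anyBase // "not X": every base outside the set
		}
		pos := position{mask: mask}
		if i < len(p) && p[i] == '#' {
			pos.exact = true
			i++
		}
		out = append(out, pos)
	}
	return out, nil
}

func main() {
	positions, err := compile("A[AT]C!GT#")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, p := range positions {
		fmt.Printf("position %d: mask=%04b exact=%v\n", i, p.mask, p.exact)
	}
}
```

A matcher built on such masks only has to test `mask & baseBit != 0` at each position, which is one reason bitmask encodings are a common choice for approximate matching.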

## Public API

### Pattern Construction & Transformation

- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool) (*ApatPattern, error)`**
  Compiles a pattern string into an executable automaton. Supports:
  - `errormax`: maximum number of errors allowed (substitutions only when `allowsIndel=false`; substitutions and indels otherwise).
  - Pattern syntax: e.g., `"A[AT]C!GT#"` matches `A`, then `A` or `T`, then `C`, then any base except `G` (`!G`), then a `T` that must match exactly (`T#`).
- **`ReverseComplement() *ApatPattern`**
  Returns a new pattern representing the reverse complement (essential for strand-agnostic DNA searches).
- **`Len() int`**
  Returns the pattern’s length in bases.
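
As an illustration of what reverse-complementing a pattern involves, the sketch below handles plain IUPAC strings in pure Go. It is not the package's `ReverseComplement` method: the function name and the decision to pass unknown symbols through unchanged are assumptions, and bracket sets, `!`, and `#` are deliberately left out.

```go
package main

import "fmt"

// complement pairs each IUPAC code with the code for the complementary bases.
var complement = map[byte]byte{
	'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C',
	'R': 'Y', 'Y': 'R', 'K': 'M', 'M': 'K',
	'B': 'V', 'V': 'B', 'D': 'H', 'H': 'D',
	'S': 'S', 'W': 'W', 'N': 'N',
}

// reverseComplement complements every symbol and reverses the order.
// Symbols missing from the table are copied unchanged (an assumption).
func reverseComplement(pattern string) string {
	out := make([]byte, len(pattern))
	for i := 0; i < len(pattern); i++ {
		c, ok := complement[pattern[i]]
		if !ok {
			c = pattern[i]
		}
		out[len(pattern)-1-i] = c
	}
	return string(out)
}

func main() {
	fmt.Println(reverseComplement("GGRTAC")) // GTAYCC
}
```

Applying the function twice returns the original pattern, which is a convenient sanity check for any complement table.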

### Matching & Search Operations

- **`FindAllIndex(seq *ApatSequence, start, end int) [][3]int`**
  Returns every match within `seq[start:end)` as a `[start_pos, end_pos, error_count]` triple.
  - Supports partial sequence scans (e.g., for sliding windows).
- **`IsMatching(seq *ApatSequence, start, end int) bool`**
  Fast boolean check: does the pattern match *anywhere* in `seq[start:end)` within the error tolerance?
- **`BestMatch(seq *ApatSequence, start, end int) (matchStart, matchEnd, errors int)`**
  Finds the *lowest-error* match in a region. For patterns that allow indels, it performs a local realignment to refine the match boundaries.
- **`FilterBestMatch(seq *ApatSequence, start, end int) [][3]int`**
  Returns **non-overlapping matches**, prioritizing lower-error occurrences (greedy selection from best to worst).
- **`AllMatches(seq *ApatSequence, start, end int) [][3]int`**
  Computes all valid matches (including indel-aware realignment), then reduces them to a non-overlapping set using `FilterBestMatch`.
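
The `[start, end, errors]` triples and the greedy non-overlapping selection can be sketched in pure Go as follows. This is a simplified stand-in, not the package's code: `findAll` and `filterBest` are hypothetical names, matching is limited to substitutions on plain bases (no IUPAC codes, no indels), and ties between equal-error matches are broken by position, which is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// findAll scans seq[start:end) and returns [start, end, errors] for every
// window that matches pattern with at most errormax substitutions.
func findAll(pattern, seq string, start, end, errormax int) [][3]int {
	if start < 0 {
		start = 0
	}
	if end > len(seq) {
		end = len(seq)
	}
	var hits [][3]int
	for i := start; i+len(pattern) <= end; i++ {
		errs := 0
		for j := 0; j < len(pattern) && errs <= errormax; j++ {
			if seq[i+j] != pattern[j] {
				errs++
			}
		}
		if errs <= errormax {
			hits = append(hits, [3]int{i, i + len(pattern), errs})
		}
	}
	return hits
}

// filterBest keeps a non-overlapping subset, taking matches greedily from
// the lowest error count to the highest.
func filterBest(hits [][3]int) [][3]int {
	sorted := append([][3]int(nil), hits...)
	sort.SliceStable(sorted, func(a, b int) bool { return sorted[a][2] < sorted[b][2] })
	var kept [][3]int
	for _, h := range sorted {
		overlaps := false
		for _, k := range kept {
			if h[0] < k[1] && k[0] < h[1] { // half-open intervals intersect
				overlaps = true
				break
			}
		}
		if !overlaps {
			kept = append(kept, h)
		}
	}
	// Report the survivors in sequence order.
	sort.Slice(kept, func(a, b int) bool { return kept[a][0] < kept[b][0] })
	return kept
}

func main() {
	hits := findAll("AAAA", "AAAAAT", 0, 6, 1)
	fmt.Println(hits)             // [[0 4 0] [1 5 0] [2 6 1]]
	fmt.Println(filterBest(hits)) // [[0 4 0]]
}
```

In the example, three overlapping windows match, and the greedy pass keeps only the first zero-error one because the other two intersect it.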

### Resource Management

- **`Free()`**
  Explicitly releases C-level resources. Finalizers release them automatically once the Go object is garbage-collected, but calling `Free()` manually is recommended in hot loops for predictable memory use.
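
The combination of a finalizer with an explicit `Free()` is a common Go idiom for wrapping foreign memory. The sketch below shows the general shape using only the standard library; the `resource` and `Pattern` types are hypothetical stand-ins (a boolean flag replaces the real C allocation), so this is not how `obiapat` itself is written.

```go
package main

import (
	"fmt"
	"runtime"
)

// resource stands in for a block of C-allocated memory (hypothetical).
type resource struct{ released bool }

// Pattern owns a resource that the Go garbage collector cannot free itself.
type Pattern struct{ res *resource }

// NewPattern allocates the resource and registers a finalizer as a safety net.
func NewPattern() *Pattern {
	p := &Pattern{res: &resource{}}
	runtime.SetFinalizer(p, func(p *Pattern) { p.Free() })
	return p
}

// Free releases the resource immediately. It is safe to call more than once.
func (p *Pattern) Free() {
	if p.res == nil {
		return // already released
	}
	p.res.released = true // real code would release the C memory here
	p.res = nil
	runtime.SetFinalizer(p, nil) // nothing left for the finalizer to do
}

func main() {
	for i := 0; i < 3; i++ {
		p := NewPattern()
		// ... use p inside the hot loop ...
		p.Free() // deterministic release instead of waiting for the GC
	}
	fmt.Println("all patterns released")
}
```

Finalizers run at an unspecified time after the object becomes unreachable, and possibly never before the program exits, which is why an explicit release gives more predictable memory use in tight loops.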

## PCR Simulation Module (`PCRSim` family)

Implements *in silico* PCR with configurable primer tolerance and amplicon constraints:

- **`PCRSim(seq obiseq.BioSequence, opts ...Option) []Amplicon`**
  Simulates PCR on a single sequence. Options include:
  - `OptionForwardPrimer(pattern string, errormax int)` / `OptionReversePrimer(...)`
  - `OptionMinLength(n)`, `OptionMaxLength(n)` → filter amplicons by size
  - `OptionWithExtension(len int, strict bool)` → add flanking regions (trim if `strict=false`)
  - `OptionCircular(bool)` → handle circular DNA topology
- **`PCRSlice(seqs []obiseq.BioSequence, opts ...Option) [][]Amplicon`**
  Batch PCR across multiple sequences.
- **`PCRSliceWorker(opts ...Option) func(int, obiseq.BioSequence) (int, interface{})`**
  Returns a reusable worker for parallel execution via `obiseq.MakeISliceWorker`.
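
The `opts ...Option` style and the basic amplicon search can be sketched with the standard library alone. Everything here is a simplified assumption rather than the package's behavior: `WithPrimers`, `WithLengthRange`, and `pcr` are hypothetical names, primers must match exactly, only the forward orientation is searched, the sequence is treated as linear, and the reported amplicon excludes both primers.

```go
package main

import (
	"fmt"
	"strings"
)

type config struct {
	forward, reverse string
	minLen, maxLen   int
}

// Option mutates the configuration (functional options pattern).
type Option func(*config)

func WithPrimers(forward, reverse string) Option {
	return func(c *config) { c.forward, c.reverse = forward, reverse }
}

func WithLengthRange(lo, hi int) Option {
	return func(c *config) { c.minLen, c.maxLen = lo, hi }
}

// Amplicon is the region between the two primer sites, as seq[Start:End).
type Amplicon struct {
	Start, End int
	Seq        string
}

// revComp handles plain A/C/G/T only.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// pcr pairs each forward-primer site with the nearest downstream site of
// the reverse primer's reverse complement.
func pcr(seq string, opts ...Option) []Amplicon {
	cfg := config{maxLen: len(seq)}
	for _, opt := range opts {
		opt(&cfg)
	}
	if cfg.forward == "" || cfg.reverse == "" {
		return nil
	}
	rc := revComp(cfg.reverse)
	var out []Amplicon
	for from := 0; ; {
		i := strings.Index(seq[from:], cfg.forward)
		if i < 0 {
			break
		}
		fStart := from + i
		inner := fStart + len(cfg.forward) // first base after the forward primer
		if j := strings.Index(seq[inner:], rc); j >= 0 && j >= cfg.minLen && j <= cfg.maxLen {
			out = append(out, Amplicon{Start: inner, End: inner + j, Seq: seq[inner : inner+j]})
		}
		from = fStart + 1
	}
	return out
}

func main() {
	seq := "TTACGTGGGGCCCCAAGGTT"
	for _, a := range pcr(seq, WithPrimers("ACGT", "CCTT")) {
		fmt.Printf("%d-%d %s\n", a.Start, a.End, a.Seq) // 6-14 GGGGCCCC
	}
}
```

Functional options keep the call site readable when most parameters have sensible defaults, which is presumably why the `PCRSim` family takes `opts ...Option` instead of a long positional argument list.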

### Output Format

Each amplicon includes:

- Coordinates, primer positions/errors/directions
- Flanking extensions (if requested)
- Original sequence metadata preserved

## Predicate Generator: `IsPatternMatchSequence`

Returns a **reusable function** for sequence filtering:

```go
func IsPatternMatchSequence(
	pattern string, errormax int,
	bothStrand bool, allowIndel bool,
) obiseq.SequencePredicate
```

- Internally builds `ApatPattern` + reverse complement (if needed).
- Predicate logic:
  ```go
  func(seq *obiseq.BioSequence) bool {
  	// Try the forward pattern first; the reverse complement is only
  	// consulted when bothStrand is set.
  	return pattern.IsMatching(...) ||
  		(bothStrand && rcPattern.IsMatching(...))
  }
  ```
- Ideal for high-throughput read filtering (e.g., barcode detection, primer contamination checks).
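
The closure-based design can be tried out without the C backend. The following sketch reproduces the same shape in pure Go under simplifying assumptions: `isPatternMatch`, `matches`, and the string-based `SequencePredicate` type are hypothetical, sequences are plain strings, and only substitutions count as errors.

```go
package main

import "fmt"

// SequencePredicate mirrors the idea of obiseq.SequencePredicate, but on
// plain strings (an assumption made to keep the sketch self-contained).
type SequencePredicate func(seq string) bool

// revComp handles plain A/C/G/T only.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// matches reports whether pattern occurs anywhere in seq with at most
// errormax substitutions.
func matches(pattern, seq string, errormax int) bool {
	for i := 0; i+len(pattern) <= len(seq); i++ {
		errs := 0
		for j := 0; j < len(pattern); j++ {
			if seq[i+j] != pattern[j] {
				errs++
			}
		}
		if errs <= errormax {
			return true
		}
	}
	return false
}

// isPatternMatch prepares both orientations once and returns a closure
// that can be applied to any number of sequences.
func isPatternMatch(pattern string, errormax int, bothStrand bool) SequencePredicate {
	rc := revComp(pattern)
	return func(seq string) bool {
		return matches(pattern, seq, errormax) ||
			(bothStrand && matches(rc, seq, errormax))
	}
}

func main() {
	keep := isPatternMatch("AAGC", 0, true)
	for _, read := range []string{"CCAAGCCC", "TTGCTTAA", "GGGGGGGG"} {
		fmt.Println(read, keep(read))
	}
}
```

Building the reverse complement once, outside the returned closure, means the per-read cost is limited to the scans themselves.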

## Implementation Highlights

- **C interoperability** via `cgo` with custom memory management (no Go heap copies).
- **Finalizers + manual `Free()`** prevent leaks in long-running pipelines.
- Uses `unsafe.SliceData` for zero-copy sequence access during matching.
- Logging via **Logrus** (errors at `ErrorLevel`, debug amplicon details at `DebugLevel`).
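
For readers unfamiliar with `unsafe.SliceData`, the sketch below shows what zero-copy access means in practice, using only the standard library (Go 1.20 or later). The helper names `dataPointer` and `viewOf` are hypothetical; in the real package the pointer would presumably be handed to C code rather than turned back into a Go slice.

```go
package main

import (
	"fmt"
	"unsafe"
)

// dataPointer returns the address of the slice's backing array without
// copying any bytes.
func dataPointer(seq []byte) *byte {
	return unsafe.SliceData(seq)
}

// viewOf rebuilds a slice of n bytes over the memory starting at p.
func viewOf(p *byte, n int) []byte {
	return unsafe.Slice(p, n)
}

func main() {
	seq := []byte("ACGTACGT")
	view := viewOf(dataPointer(seq), len(seq))
	view[0] = 'N'            // writes through: both slices share one backing array
	fmt.Println(string(seq)) // NCGTACGT
}
```

Because no copy is made, the original slice must stay alive and unmodified for as long as the pointer is in use, which is the usual trade-off of this technique.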