# `obiapat`: High-Performance Approximate Pattern Matching for Biological Sequences

The `obiapat` Go package provides **fast, memory-safe approximate pattern matching** over biological sequences (DNA/RNA), built on a C implementation of the **Apat algorithm**. Designed for NGS preprocessing (e.g., primer detection, adapter trimming), it supports fuzzy matching with mismatches and indels, reverse-complement search, circular topology handling, and efficient non-overlapping match filtering, and it integrates with the rest of the OBITools4 ecosystem.

## Core Concepts

- **`ApatPattern`**: Compiled pattern object (≤64 bp) supporting:
  - IUPAC ambiguity codes (e.g., `W`, `R`) and bracketed base sets (e.g., `[AT]`)
  - Negated bases (`!A` = "not A")
  - Fixed-position anchors (`#`): the marked position must match exactly
- **`ApatSequence`**: Lightweight wrapper around `obiseq.BioSequence`, enabling optimized pattern scanning with optional circular indexing and memory recycling.

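
To make the symbol classes above concrete, here is a minimal sketch of how ambiguity codes and negation can be modelled as 4-bit base sets. It is an illustration only and does not reproduce the encoding used inside `obiapat`; the names `iupacMask` and `symbolMatches` and the bit layout are assumptions made for this example.

```go
package main

import "fmt"

// Bit flags for the four nucleotides (hypothetical layout, illustration only).
const (
	bitA uint8 = 1 << iota
	bitC
	bitG
	bitT
)

// iupacMask maps a subset of IUPAC codes to the set of bases they stand for.
var iupacMask = map[byte]uint8{
	'A': bitA, 'C': bitC, 'G': bitG, 'T': bitT,
	'W': bitA | bitT, // weak: A or T
	'S': bitC | bitG, // strong: C or G
	'R': bitA | bitG, // purine: A or G
	'Y': bitC | bitT, // pyrimidine: C or T
	'N': bitA | bitC | bitG | bitT,
}

// symbolMatches reports whether a concrete base is accepted by a pattern
// symbol. When negated is true the accepted set is inverted ("!A" = not A).
func symbolMatches(symbol, base byte, negated bool) bool {
	set := iupacMask[symbol]
	if negated {
		set = ^set & (bitA | bitC | bitG | bitT)
	}
	return set&iupacMask[base] != 0
}

func main() {
	fmt.Println(symbolMatches('W', 'T', false)) // W accepts T
	fmt.Println(symbolMatches('R', 'C', false)) // R does not accept C
	fmt.Println(symbolMatches('A', 'G', true))  // !A accepts G
}
```

A bracketed set such as `[AT]` would simply be the union of the masks of its members, which is the same set as `W` in this model.
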
## Public API
### Pattern Construction & Transformation

- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool) (*ApatPattern, error)`**
  Compiles a pattern string into an executable automaton. Supports:
  - `errormax`: maximum number of errors allowed. Only substitutions count when `allowsIndel=false`; insertions and deletions are allowed as well when `allowsIndel=true`.
  - Pattern syntax: e.g., `"A[AT]C!GT#"` matches `A`, then A or T (`[AT]`), then `C`, then any base except G (`!G`), then a `T` that must match exactly (marked with `#`).
- **`ReverseComplement() *ApatPattern`**
  Returns a new pattern representing the reverse complement (essential for strand-agnostic DNA searches).
- **`Len() int`**
  Returns the pattern’s length in bases.

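
As an illustration of what reverse-complementing a pattern involves, the sketch below reverses a string of IUPAC codes and swaps each code for its complement. It is not the `ReverseComplement()` implementation: `reverseComplementIUPAC` is a hypothetical helper that handles plain IUPAC letters only and rejects the `[...]`, `!`, and `#` syntax.

```go
package main

import "fmt"

// complement maps each IUPAC code to the code of its complementary base set.
var complement = map[byte]byte{
	'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C',
	'R': 'Y', 'Y': 'R', // purine <-> pyrimidine
	'K': 'M', 'M': 'K', // keto <-> amino
	'B': 'V', 'V': 'B', // not A <-> not T
	'D': 'H', 'H': 'D', // not C <-> not G
	'W': 'W', 'S': 'S', 'N': 'N', // self-complementary codes
}

// reverseComplementIUPAC returns the reverse complement of a pattern made of
// plain IUPAC letters. Any other symbol is reported as an error.
func reverseComplementIUPAC(pattern string) (string, error) {
	out := make([]byte, len(pattern))
	for i := 0; i < len(pattern); i++ {
		c, ok := complement[pattern[i]]
		if !ok {
			return "", fmt.Errorf("unsupported symbol %q at position %d", pattern[i], i)
		}
		// Write the complement at the mirrored position to reverse the string.
		out[len(pattern)-1-i] = c
	}
	return string(out), nil
}

func main() {
	rc, err := reverseComplementIUPAC("GGRWAC")
	fmt.Println(rc, err)
}
```

Searching a read with both the original pattern and its reverse complement is what makes a search strand-agnostic.
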
### Matching & Search Operations

- **`FindAllIndex(seq *ApatSequence, start, end int) [][3]int`**
  Returns all valid matches within `seq[start:end)` as `[start_pos, end_pos, error_count]` triples.
  - Supports partial sequence scans (e.g., for sliding windows).
- **`IsMatching(seq *ApatSequence, start, end int) bool`**
  Fast boolean check: does the pattern match *anywhere* in `seq[start:end)` within the error tolerance?
- **`BestMatch(seq *ApatSequence, start, end int) (start, end, errors int)`**
  Finds the *lowest-error* match in a region. For indel patterns, performs local realignment to refine alignment boundaries.
- **`FilterBestMatch(seq *ApatSequence, start, end int) [][3]int`**
  Returns **non-overlapping matches**, prioritizing lower-error occurrences (greedy selection from best to worst).
- **`AllMatches(seq *ApatSequence, start, end int) [][3]int`**
  Computes all valid matches (including indel-aware realignment), then reduces them to a non-overlapping set using `FilterBestMatch`.

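
The `[start_pos, end_pos, error_count]` triples and the greedy non-overlapping selection described above can be sketched in plain Go. This is a simplified model, assuming substitutions only and exact base comparison rather than the Apat automaton; `findAllSubstitutions` and `filterNonOverlapping` are hypothetical names, and tie-breaking in the real package may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// findAllSubstitutions scans seq[start:end) and returns every window of
// len(pattern) whose Hamming distance to pattern is at most errormax.
// Each hit is a [start_pos, end_pos, error_count] triple.
func findAllSubstitutions(pattern, seq string, start, end, errormax int) [][3]int {
	var hits [][3]int
	for i := start; i+len(pattern) <= end; i++ {
		errs := 0
		for j := 0; j < len(pattern) && errs <= errormax; j++ {
			if seq[i+j] != pattern[j] {
				errs++
			}
		}
		if errs <= errormax {
			hits = append(hits, [3]int{i, i + len(pattern), errs})
		}
	}
	return hits
}

// filterNonOverlapping keeps hits greedily from fewest to most errors and
// drops any hit that overlaps one already kept. The result is ordered by
// start position.
func filterNonOverlapping(hits [][3]int) [][3]int {
	sorted := append([][3]int(nil), hits...)
	sort.SliceStable(sorted, func(a, b int) bool { return sorted[a][2] < sorted[b][2] })
	var kept [][3]int
	for _, h := range sorted {
		overlaps := false
		for _, k := range kept {
			// Half-open intervals [h0,h1) and [k0,k1) intersect.
			if h[0] < k[1] && k[0] < h[1] {
				overlaps = true
				break
			}
		}
		if !overlaps {
			kept = append(kept, h)
		}
	}
	sort.Slice(kept, func(a, b int) bool { return kept[a][0] < kept[b][0] })
	return kept
}

func main() {
	seq := "ACGTACGAACGT"
	hits := findAllSubstitutions("ACGT", seq, 0, len(seq), 1)
	fmt.Println(hits)
	fmt.Println(filterNonOverlapping([][3]int{{0, 4, 1}, {2, 6, 0}, {6, 10, 1}}))
}
```

In the second call the zero-error hit `[2, 6, 0]` is selected first, so the overlapping one-error hit `[0, 4, 1]` is discarded while the adjacent `[6, 10, 1]` is kept.
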
### Resource Management

- **`Free()`**
  Explicitly releases C-level resources. Finalizers clean up automatically, but calling `Free()` manually is recommended in hot loops for predictable memory use.

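
The combination of a finalizer with an explicit `Free()` is a common Go idiom for wrappers around native memory. The sketch below shows the idiom in pure Go, with a hypothetical `Resource` type and a `nativeHandle` placeholder standing in for C-allocated memory; it is not the `obiapat` code.

```go
package main

import (
	"fmt"
	"runtime"
)

// nativeHandle stands in for a pointer to C-allocated memory (hypothetical).
type nativeHandle struct {
	released bool
}

// Resource is a hypothetical wrapper that owns a native handle.
type Resource struct {
	handle *nativeHandle
}

// NewResource allocates the handle and registers a finalizer as a safety net
// in case the caller never calls Free.
func NewResource() *Resource {
	r := &Resource{handle: &nativeHandle{}}
	runtime.SetFinalizer(r, func(r *Resource) { r.Free() })
	return r
}

// Free releases the handle immediately. It is safe to call more than once.
func (r *Resource) Free() {
	if r.handle == nil {
		return
	}
	r.handle.released = true // a cgo wrapper would release the C memory here
	r.handle = nil
	runtime.SetFinalizer(r, nil) // the finalizer is no longer needed
}

func main() {
	for i := 0; i < 3; i++ {
		r := NewResource()
		// ... use r ...
		r.Free() // deterministic release inside the loop
	}
	fmt.Println("done")
}
```

Finalizers run at an unspecified time after an object becomes unreachable, which is why an explicit release inside a tight loop gives more predictable memory use.
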
## PCR Simulation Module (`PCRSim` family)

Implements *in silico* PCR with configurable primer tolerance and amplicon constraints:

- **`PCRSim(seq obiseq.BioSequence, opts ...Option) []Amplicon`**
  Simulates PCR on a single sequence. Options include:
  - `OptionForwardPrimer(pattern string, errormax int)` / `OptionReversePrimer(...)`
  - `OptionMinLength(n)`, `OptionMaxLength(n)` → filter amplicons by size
  - `OptionWithExtension(len int, strict bool)` → add flanking regions (trim if `strict=false`)
  - `OptionCircular(bool)` → handle circular DNA topology
- **`PCRSlice(seqs []obiseq.BioSequence, opts ...Option) [][]Amplicon`**
  Batch PCR across multiple sequences.
- **`PCRSliceWorker(opts ...Option) func(int, obiseq.BioSequence) (int, interface{})`**
  Returns a reusable worker for parallel execution via `obiseq.MakeISliceWorker`.

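
The `opts ...Option` parameters follow Go's functional options pattern. The sketch below shows how such options can be modelled and applied to a size filter. The option names are borrowed from the list above, but the `pcrConfig` struct, `newConfig`, `keepByLength`, and the default values are assumptions made for illustration, not the package's internals.

```go
package main

import (
	"fmt"
	"math"
)

// pcrConfig collects the settings that options can modify (hypothetical fields).
type pcrConfig struct {
	minLength, maxLength int
	circular             bool
}

// Option mutates a pcrConfig; each constructor below returns one.
type Option func(*pcrConfig)

func OptionMinLength(n int) Option { return func(c *pcrConfig) { c.minLength = n } }
func OptionMaxLength(n int) Option { return func(c *pcrConfig) { c.maxLength = n } }
func OptionCircular(b bool) Option { return func(c *pcrConfig) { c.circular = b } }

// newConfig applies the options in order on top of permissive defaults
// (the defaults are an assumption of this sketch).
func newConfig(opts ...Option) pcrConfig {
	cfg := pcrConfig{minLength: 0, maxLength: math.MaxInt}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

// keepByLength filters candidate amplicon lengths using the configured bounds.
func keepByLength(lengths []int, cfg pcrConfig) []int {
	var kept []int
	for _, n := range lengths {
		if n >= cfg.minLength && n <= cfg.maxLength {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	cfg := newConfig(OptionMinLength(50), OptionMaxLength(300))
	fmt.Println(keepByLength([]int{20, 120, 450, 300}, cfg))
}
```

Because every option is an ordinary value, the same option slice can be built once and reused for single-sequence, batch, and worker entry points.
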
### Output Format

Each amplicon includes:

- Coordinates, primer positions/errors/directions
- Flanking extensions (if requested)
- Original sequence metadata preserved

## Predicate Generator: `IsPatternMatchSequence`

Returns a **reusable function** for sequence filtering:

```go
func IsPatternMatchSequence(
	pattern string, errormax int,
	bothStrand bool, allowIndel bool,
) obiseq.SequencePredicate
```

- Internally builds `ApatPattern` + reverse complement (if needed).
- Predicate logic:

  ```go
  func(seq *obiseq.BioSequence) bool {
  	// The reverse-complement pattern is tried only when bothStrand is set.
  	return pattern.IsMatching(...) ||
  		(bothStrand && rcPattern.IsMatching(...))
  }
  ```

- Ideal for high-throughput read filtering (e.g., barcode detection, primer contamination checks).

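
The predicate-generator idea (compile once, return a closure that is reused for every read) can be tried out with the self-contained sketch below. It swaps in exact substring matching for Apat and plain strings for `obiseq.BioSequence`; `SequencePredicate`, `makeExactPredicate`, and `reverseComplement` are hypothetical stand-ins defined only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// SequencePredicate is a stand-in for a sequence filter type (hypothetical).
type SequencePredicate func(seq string) bool

// reverseComplement handles plain A/C/G/T only.
func reverseComplement(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// makeExactPredicate prepares both orientations once and returns a closure
// that can be reused across many reads. Exact matching replaces Apat here.
func makeExactPredicate(pattern string, bothStrand bool) SequencePredicate {
	rc := reverseComplement(pattern)
	return func(seq string) bool {
		return strings.Contains(seq, pattern) ||
			(bothStrand && strings.Contains(seq, rc))
	}
}

func main() {
	forwardOnly := makeExactPredicate("GATTC", false)
	both := makeExactPredicate("GATTC", true)
	read := "TTGAATCTT" // contains GAATC, the reverse complement of GATTC
	fmt.Println(forwardOnly(read), both(read))
}
```

The pattern work happens once, outside the closure, so the per-read cost is limited to the search itself.
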
## Implementation Highlights

- **C interoperability** via `cgo` with custom memory management (no Go heap copies).
- **Finalizers + manual `Free()`** prevent leaks in long-running pipelines.
- Uses `unsafe.SliceData` for zero-copy sequence access during matching.
- Logging via **Logrus** (errors at `ErrorLevel`, debug amplicon details at `DebugLevel`).
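
For readers unfamiliar with `unsafe.SliceData`, the pure-Go sketch below shows what "zero-copy access" means: the function returns a pointer to a slice's existing backing array instead of duplicating the bytes. The helper `aliasOf` is hypothetical, and the example stays within Go; in a `cgo` wrapper the pointer would be handed to C, subject to the cgo pointer-passing rules.

```go
package main

import (
	"fmt"
	"unsafe"
)

// aliasOf returns a second slice header over the same backing array as b.
// No bytes are copied. Requires Go 1.20 or later for unsafe.SliceData.
func aliasOf(b []byte) []byte {
	p := unsafe.SliceData(b) // pointer to the first element of b's backing array
	return unsafe.Slice(p, len(b))
}

func main() {
	seq := []byte("ACGTACGT")
	view := aliasOf(seq)

	// A write through the alias is visible in the original slice,
	// which demonstrates that both share the same memory.
	view[0] = 'N'
	fmt.Println(string(seq))
}
```
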