mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
89 lines
4.7 KiB
Markdown
# `obiapat`: High-Performance Approximate Pattern Matching for Biological Sequences

The `obiapat` Go package delivers **fast, memory-safe approximate pattern matching** over biological sequences (DNA/RNA), leveraging a C-based implementation of the **Apat algorithm**. Designed for NGS preprocessing (e.g., primer detection, adapter trimming), it supports fuzzy matching with mismatches/indels, reverse-complement search, circular topology handling, and efficient non-overlapping match filtering, and it integrates with the rest of the OBITools4 ecosystem.

## Core Concepts

- **`ApatPattern`**: Compiled pattern object (≤64 bp) supporting:
  - IUPAC ambiguity codes (`W`, `R`) and bracketed base sets (`[AT]`)
  - Negated bases (`!A` = "not A")
  - Fixed-position anchors (`#`), which mark a position that must match exactly
- **`ApatSequence`**: Lightweight wrapper around `obiseq.BioSequence`, enabling optimized pattern scanning with optional circular indexing and memory recycling.

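
To make the ambiguity and negation rules concrete, here is a minimal standalone sketch of how a single pattern symbol can be tested against a base using 4-bit sets. It does not use `obiapat` itself: `iupacMask` and `symbolMatches` are hypothetical names introduced for illustration, and the C implementation may encode symbols differently.

```go
package main

import "fmt"

// iupacMask maps each IUPAC nucleotide code to a 4-bit set
// (A=1, C=2, G=4, T=8). Hypothetical helper, not part of obiapat.
var iupacMask = map[byte]uint8{
	'A': 1, 'C': 2, 'G': 4, 'T': 8,
	'R': 1 | 4, 'Y': 2 | 8, 'S': 2 | 4, 'W': 1 | 8,
	'K': 4 | 8, 'M': 1 | 2,
	'B': 2 | 4 | 8, 'D': 1 | 4 | 8, 'H': 1 | 2 | 8, 'V': 1 | 2 | 4,
	'N': 15,
}

// symbolMatches reports whether a concrete base (A, C, G or T) is accepted
// by a pattern symbol. negate inverts the set, mimicking "!A" = "not A".
func symbolMatches(symbol, base byte, negate bool) bool {
	set := iupacMask[symbol]
	if negate {
		set = ^set & 15 // complement within the four bases
	}
	return set&iupacMask[base] != 0
}

func main() {
	fmt.Println(symbolMatches('W', 'T', false)) // W stands for A or T
	fmt.Println(symbolMatches('R', 'C', false)) // R stands for A or G
	fmt.Println(symbolMatches('A', 'G', true))  // "!A" accepts G
}
```

A bracketed set such as `[AT]` would simply be the union of the masks of its members, which is the same set as `W`.
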
## Public API

### Pattern Construction & Transformation

- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool) (*ApatPattern, error)`**
  Compiles a pattern string into an executable automaton. Supports:
  - `errormax`: Maximum number of errors allowed (substitutions only if `allowsIndel=false`; substitutions and indels otherwise).
  - Pattern syntax: e.g., `"A[AT]C!GT#"` matches `A`, then `A` or `T`, then `C`, then any base except `G`, then a `T` marked with `#`, which must match exactly.
- **`ReverseComplement() *ApatPattern`**
  Returns a new pattern representing the reverse complement (essential for strand-agnostic DNA searches).
- **`Len() int`**
  Returns the pattern’s length in bases.

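
The role of `errormax` with `allowsIndel=false` can be illustrated with a naive sliding-window scan that counts substitutions. This is only a sketch of the matching criterion, not the Apat algorithm and not the package's code; `countMismatches` and `findApprox` are hypothetical names, and the pattern is restricted to plain bases.

```go
package main

import "fmt"

// countMismatches returns the number of substitutions between pattern and
// the window of seq starting at pos. The caller guarantees that
// pos+len(pattern) <= len(seq).
func countMismatches(pattern, seq string, pos int) int {
	errs := 0
	for i := 0; i < len(pattern); i++ {
		if pattern[i] != seq[pos+i] {
			errs++
		}
	}
	return errs
}

// findApprox lists [start, end, errors] for every window of seq that matches
// pattern with at most errormax substitutions (end is exclusive).
func findApprox(pattern, seq string, errormax int) [][3]int {
	var hits [][3]int
	for pos := 0; pos+len(pattern) <= len(seq); pos++ {
		if e := countMismatches(pattern, seq, pos); e <= errormax {
			hits = append(hits, [3]int{pos, pos + len(pattern), e})
		}
	}
	return hits
}

func main() {
	// One exact hit at 2 and one single-substitution hit at 8.
	fmt.Println(findApprox("ACGT", "TTACGTTTACCTAA", 1)) // [[2 6 0] [8 12 1]]
}
```

Allowing indels changes the criterion from a fixed-width window comparison to an edit-distance style alignment, which this sketch deliberately leaves out.
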
### Matching & Search Operations

- **`FindAllIndex(seq *ApatSequence, start, end int) [][3]int`**
  Returns all valid matches within the half-open range `[start, end)` of `seq`, each as `[start_pos, end_pos, error_count]`.
  - Supports partial sequence scans (e.g., for sliding windows).
- **`IsMatching(seq *ApatSequence, start, end int) bool`**
  Fast boolean check: does the pattern match *anywhere* in the range `[start, end)` of `seq` within the error tolerance?
- **`BestMatch(seq *ApatSequence, start, end int) (matchStart, matchEnd, errors int)`**
  Finds the *lowest-error* match in a region. For patterns that allow indels, performs a local realignment to refine the match boundaries.
- **`FilterBestMatch(seq *ApatSequence, start, end int) [][3]int`**
  Returns **non-overlapping matches**, prioritizing lower-error occurrences (greedy selection from best to worst).
- **`AllMatches(seq *ApatSequence, start, end int) [][3]int`**
  Computes all valid matches (including indel-aware realignment), then reduces them to a non-overlapping set using `FilterBestMatch`.

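
The greedy "best to worst" selection described for `FilterBestMatch` can be sketched on plain `[start, end, errors]` triples. This is a minimal illustration under the assumption that `end` is exclusive and that ties are broken by input order; `filterNonOverlapping` is a hypothetical name, and the package's actual tie-breaking may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// filterNonOverlapping keeps the lowest-error matches first and drops any
// later match that overlaps one already kept. Each match is
// [start, end, errors] with end exclusive.
func filterNonOverlapping(matches [][3]int) [][3]int {
	ranked := append([][3]int(nil), matches...) // copy, leave the input untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i][2] < ranked[j][2]
	})
	var kept [][3]int
	for _, m := range ranked {
		overlaps := false
		for _, k := range kept {
			if m[0] < k[1] && k[0] < m[1] {
				overlaps = true
				break
			}
		}
		if !overlaps {
			kept = append(kept, m)
		}
	}
	// Report the survivors in sequence order.
	sort.Slice(kept, func(i, j int) bool { return kept[i][0] < kept[j][0] })
	return kept
}

func main() {
	matches := [][3]int{{0, 10, 2}, {5, 15, 0}, {20, 30, 1}, {14, 22, 1}}
	fmt.Println(filterNonOverlapping(matches)) // [[5 15 0] [20 30 1]]
}
```

The pairwise overlap check is quadratic in the number of kept matches, which is acceptable for the handful of primer hits expected per read.
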
### Resource Management

- **`Free()`**
  Explicitly releases C-level resources. Finalizers release them automatically at garbage collection, but calling `Free()` manually is recommended in hot loops for predictable memory use.

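
The combination of a finalizer with an explicit `Free()` is a common pattern for Go wrappers around C memory. The sketch below shows one way to make the two cooperate safely. The `handle` and `resource` types are hypothetical stand-ins (no cgo is involved), so this illustrates the pattern rather than the package's actual code.

```go
package main

import (
	"fmt"
	"runtime"
)

// handle stands in for memory that would be owned by C code.
type handle struct {
	freed bool
}

// resource is a hypothetical Go-side wrapper around a handle.
type resource struct {
	h *handle
}

// newResource allocates a wrapper and registers a finalizer as a safety net.
func newResource() *resource {
	r := &resource{h: &handle{}}
	runtime.SetFinalizer(r, func(r *resource) { r.Free() })
	return r
}

// Free releases the handle. It is idempotent, so calling it manually and
// then letting the finalizer run (or calling it twice) is harmless.
func (r *resource) Free() {
	if r.h == nil {
		return
	}
	r.h.freed = true // a real wrapper would call the C deallocator here
	r.h = nil
	runtime.SetFinalizer(r, nil) // nothing left for the finalizer to do
}

func main() {
	for i := 0; i < 3; i++ {
		r := newResource()
		h := r.h
		r.Free() // deterministic release inside the loop
		fmt.Println(i, h.freed)
	}
}
```

Finalizers run at an unspecified time after the object becomes unreachable, which is why an explicit release keeps memory use predictable in tight loops.
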
## PCR Simulation Module (`PCRSim` family)

Implements *in silico* PCR with configurable primer tolerance and amplicon constraints:

- **`PCRSim(seq obiseq.BioSequence, opts ...Option) []Amplicon`**
  Simulates PCR on a single sequence. Options include:
  - `OptionForwardPrimer(pattern string, errormax int)` / `OptionReversePrimer(...)`
  - `OptionMinLength(n)`, `OptionMaxLength(n)` → filter amplicons by size
  - `OptionWithExtension(len int, strict bool)` → add flanking regions (trim if `strict=false`)
  - `OptionCircular(bool)` → handle circular DNA topology
- **`PCRSlice(seqs []obiseq.BioSequence, opts ...Option) [][]Amplicon`**
  Batch PCR across multiple sequences.
- **`PCRSliceWorker(opts ...Option) func(int, obiseq.BioSequence) (int, interface{})`**
  Returns a reusable worker for parallel execution via `obiseq.MakeISliceWorker`.

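
The `opts ...Option` parameters follow Go's functional-options idiom. As a rough illustration of how such options can be collected and then used to filter amplicons by size, here is a standalone sketch. Every name in it (`pcrConfig`, `option`, `withForwardPrimer`, `withMinLength`, `withMaxLength`, `withCircular`, `newConfig`, `accepts`) is hypothetical and chosen to avoid confusion with the real `Option*` constructors, whose internals are not described here.

```go
package main

import "fmt"

// pcrConfig is a hypothetical settings struct; the real package may store
// its options differently.
type pcrConfig struct {
	forward    string
	forwardErr int
	minLength  int
	maxLength  int // 0 means "no upper bound" in this sketch
	circular   bool
}

// option mutates a pcrConfig; each constructor below returns one.
type option func(*pcrConfig)

func withForwardPrimer(pattern string, errormax int) option {
	return func(c *pcrConfig) { c.forward, c.forwardErr = pattern, errormax }
}

func withMinLength(n int) option { return func(c *pcrConfig) { c.minLength = n } }
func withMaxLength(n int) option { return func(c *pcrConfig) { c.maxLength = n } }
func withCircular(b bool) option { return func(c *pcrConfig) { c.circular = b } }

// newConfig applies the options in order on top of zero-value defaults.
func newConfig(opts ...option) pcrConfig {
	var cfg pcrConfig
	for _, apply := range opts {
		apply(&cfg)
	}
	return cfg
}

// accepts reports whether an amplicon of the given length passes the
// size filter.
func (c pcrConfig) accepts(length int) bool {
	return length >= c.minLength && (c.maxLength == 0 || length <= c.maxLength)
}

func main() {
	cfg := newConfig(
		withForwardPrimer("ACGTTGCA", 2),
		withMinLength(10),
		withMaxLength(150),
	)
	fmt.Println(cfg.accepts(5), cfg.accepts(100), cfg.accepts(200)) // false true false
}
```

The idiom keeps the call site readable and lets new settings be added without changing the signature of the simulation function.
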
### Output Format

Each amplicon includes:

- Coordinates, primer positions/errors/directions
- Flanking extensions (if requested)
- Original sequence metadata preserved

## Predicate Generator: `IsPatternMatchSequence`

Returns a **reusable function** for sequence filtering:

```go
func IsPatternMatchSequence(
	pattern string, errormax int,
	bothStrand bool, allowIndel bool,
) obiseq.SequencePredicate
```

- Internally builds `ApatPattern` + reverse complement (if needed).
- Predicate logic:
  ```go
  func(seq *obiseq.BioSequence) bool {
      // The reverse-complement pattern is consulted only when bothStrand is set.
      return pattern.IsMatching(...) ||
          (bothStrand && rcPattern.IsMatching(...))
  }
  ```
- Ideal for high-throughput read filtering (e.g., barcode detection, primer contamination checks).

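
The closure-based design can be tried out without the library by substituting exact substring search for approximate matching. The sketch below makes that simplification explicitly: `reverseComplement` and `makePredicate` are hypothetical helpers working on plain strings, not on `obiseq.BioSequence`, and they ignore `errormax` and indels entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// complement maps each unambiguous base to its Watson-Crick partner.
var complement = map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

// reverseComplement returns the reverse complement of an uppercase DNA
// string; symbols outside ACGT become 'N' in this sketch.
func reverseComplement(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := complement[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// makePredicate prepares both strands once and returns a reusable filter.
// Exact matching stands in for approximate matching here.
func makePredicate(pattern string, bothStrand bool) func(string) bool {
	rc := reverseComplement(pattern)
	return func(seq string) bool {
		return strings.Contains(seq, pattern) ||
			(bothStrand && strings.Contains(seq, rc))
	}
}

func main() {
	read := "TTTTACGGTTT" // carries ACGG, the reverse complement of CCGT
	fmt.Println(makePredicate("CCGT", false)(read)) // false
	fmt.Println(makePredicate("CCGT", true)(read))  // true
}
```

Building the reverse-complement pattern once, outside the returned function, is what makes the predicate cheap to call on millions of reads.
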
## Implementation Highlights

- **C interoperability** via `cgo` with custom memory management (no Go heap copies).
- **Finalizers + manual `Free()`** prevent leaks in long-running pipelines.
- Uses `unsafe.SliceData` for zero-copy sequence access during matching.
- Logging via **Logrus** (errors at `ErrorLevel`, debug amplicon details at `DebugLevel`).
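
To show what "zero-copy" means for `unsafe.SliceData`, the following sketch obtains the pointer to a slice's backing array and rebuilds a second slice over the same memory. It assumes Go 1.20 or later; `borrow` is a hypothetical helper, and in the real package the pointer would presumably be handed to C rather than wrapped again in Go.

```go
package main

import (
	"fmt"
	"unsafe"
)

// borrow returns a slice that aliases b's backing array without copying.
// unsafe.SliceData yields the pointer to the first element, and
// unsafe.Slice rebuilds a slice header around it.
func borrow(b []byte) []byte {
	return unsafe.Slice(unsafe.SliceData(b), len(b))
}

func main() {
	seq := []byte("ACGT")
	view := borrow(seq)
	view[0] = 'T' // writes through to seq because the memory is shared
	fmt.Println(string(seq)) // TCGT
}
```

Because both slices share storage, the original must stay alive and unmodified for as long as the borrowed view (or a C routine holding the pointer) is in use.
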