Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# `obiapat`: High-Performance Approximate Pattern Matching for Biological Sequences
The `obiapat` Go package provides **fast, memory-safe approximate pattern matching** over biological sequences (DNA/RNA), built on a C implementation of the **Apat algorithm**. Designed for NGS preprocessing (e.g., primer detection, adapter trimming), it supports fuzzy matching with mismatches and indels, reverse-complement search, circular topology handling, and non-overlapping match filtering, and it integrates with the rest of the OBITools4 ecosystem.
## Core Concepts
- **`ApatPattern`**: Compiled pattern object (≤64 bp) supporting:
  - IUPAC ambiguity codes (`W`, `R`) and bracketed base sets (`[AT]`)
  - Negated bases (`!A` = "not A")
  - Fixed-position anchors (`#`)
- **`ApatSequence`**: Lightweight wrapper around `obiseq.BioSequence`, enabling optimized pattern scanning with optional circular indexing and memory recycling.
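
One way to picture the pattern syntax listed above is as a set of allowed bases per position. The sketch below is not the package's C implementation; it assumes a one-bit-per-base encoding, and `iupacMask`, `negate`, and `allows` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// One bit per nucleotide; an ambiguity code is the OR of the bases it allows.
const (
	bitA uint8 = 1 << iota
	bitC
	bitG
	bitT
)

// iupacMask maps a subset of IUPAC codes to bit sets (hypothetical helper).
var iupacMask = map[byte]uint8{
	'A': bitA, 'C': bitC, 'G': bitG, 'T': bitT,
	'R': bitA | bitG, // purine
	'Y': bitC | bitT, // pyrimidine
	'W': bitA | bitT, // weak
	'S': bitC | bitG, // strong
	'N': bitA | bitC | bitG | bitT,
}

// negate returns the mask for "not <code>", e.g. !A accepts C, G or T.
func negate(mask uint8) uint8 {
	return ^mask & (bitA | bitC | bitG | bitT)
}

// allows reports whether a pattern position (given as a mask) accepts a
// sequence base. Unknown bases map to an empty mask and never match.
func allows(mask uint8, base byte) bool {
	return mask&iupacMask[base] != 0
}

func main() {
	w := iupacMask['W']
	fmt.Println(allows(w, 'A'), allows(w, 'G'))

	notA := negate(iupacMask['A'])
	fmt.Println(allows(notA, 'A'), allows(notA, 'T'))
}
```

With this encoding a bracketed set such as `[AT]` is simply `bitA | bitT`, the same mask as `W`.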
## Public API
### Pattern Construction & Transformation
- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool) (*ApatPattern, error)`**
  Compiles a pattern string into an executable automaton. Supports:
  - `errormax`: Max allowed errors (substitutions only if `allowsIndel=false`; indels included otherwise).
  - Pattern syntax: e.g., `"A[AT]C!GT#"` reads as `A`, then A or T (`[AT]`), then `C`, then any base except G (`!G`), then a `T` anchored by `#`, which must match exactly.
- **`ReverseComplement() *ApatPattern`**
Returns a new pattern representing the reverse complement (essential for strand-agnostic DNA searches).
- **`Len() int`**
Returns the pattern's length in bases.
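
To illustrate what a reverse-complemented pattern looks like, here is a minimal sketch on plain strings. It is an assumption-laden stand-in, not `ReverseComplement()` itself: it handles bare IUPAC letters only (no brackets, `!`, or `#`), and `complement` and `reverseComplement` are hypothetical names.

```go
package main

import "fmt"

// complement maps each IUPAC code to the code it pairs with on the
// opposite strand.
var complement = map[byte]byte{
	'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C',
	'R': 'Y', 'Y': 'R', // purine <-> pyrimidine
	'K': 'M', 'M': 'K',
	'W': 'W', 'S': 'S', // self-complementary codes
	'N': 'N',
}

// reverseComplement reverses s and complements every symbol.
// Symbols missing from the table are replaced by 'N'.
func reverseComplement(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := complement[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

func main() {
	fmt.Println(reverseComplement("GGRTTC")) // prints "GAAYCC"
}
```

Searching a sequence with both the pattern and its reverse complement finds primer sites regardless of which strand was sequenced.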
### Matching & Search Operations
- **`FindAllIndex(seq *ApatSequence, start, end int) [][3]int`**
Returns every match as a `[start_pos, end_pos, error_count]` triple, restricted to the half-open range `[start, end)` of `seq`.
  - Supports partial sequence scans (e.g., for sliding windows).
- **`IsMatching(seq *ApatSequence, start, end int) bool`**
Fast boolean check: does the pattern match *anywhere* in `seq[start:end)` within error tolerance?
- **`BestMatch(seq *ApatSequence, start, end int) (start, end, errors int)`**
Finds the *lowest-error* match in a region. For indel patterns, performs local realignment to refine alignment boundaries.
- **`FilterBestMatch(seq *ApatSequence, start, end int) [][3]int`**
Returns **non-overlapping matches**, prioritizing lower-error occurrences (greedy selection from best to worst).
- **`AllMatches(seq *ApatSequence, start, end int) [][3]int`**
Computes all valid matches (including indel-aware realignment), then filters to non-overlapping set using `FilterBestMatch`.
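
The relationship between "find all matches" and "keep the best non-overlapping ones" can be sketched in pure Go. This is a simplified model under stated assumptions, not the Apat automaton: it counts substitutions only (no indels, no IUPAC), expects `0 <= start <= end <= len(seq)`, and `findAll` and `filterBest` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// findAll slides pattern over seq[start:end) and reports every window with
// at most errormax substitutions as [matchStart, matchEnd, errors].
func findAll(pattern, seq string, start, end, errormax int) [][3]int {
	var hits [][3]int
	for i := start; i+len(pattern) <= end; i++ {
		errs := 0
		for j := 0; j < len(pattern) && errs <= errormax; j++ {
			if seq[i+j] != pattern[j] {
				errs++
			}
		}
		if errs <= errormax {
			hits = append(hits, [3]int{i, i + len(pattern), errs})
		}
	}
	return hits
}

// filterBest keeps a non-overlapping subset: hits are visited from fewest to
// most errors (ties broken by position) and a hit is dropped when it
// overlaps one that was already kept. The result is sorted by position.
func filterBest(hits [][3]int) [][3]int {
	ranked := append([][3]int(nil), hits...) // do not reorder the caller's slice
	sort.SliceStable(ranked, func(a, b int) bool {
		if ranked[a][2] != ranked[b][2] {
			return ranked[a][2] < ranked[b][2]
		}
		return ranked[a][0] < ranked[b][0]
	})
	var kept [][3]int
	for _, h := range ranked {
		overlaps := false
		for _, k := range kept {
			if h[0] < k[1] && k[0] < h[1] {
				overlaps = true
				break
			}
		}
		if !overlaps {
			kept = append(kept, h)
		}
	}
	sort.Slice(kept, func(a, b int) bool { return kept[a][0] < kept[b][0] })
	return kept
}

func main() {
	seq := "AAAAAT"
	hits := findAll("AAAA", seq, 0, len(seq), 1)
	fmt.Println("all hits:", hits)
	fmt.Println("filtered:", filterBest(hits))
}
```

Greedy selection by error count is cheap and matches the "best to worst" description above, but it is not guaranteed to keep the largest possible number of matches.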
### Resource Management
- **`Free()`**
Explicitly releases C-level resources. Finalizers clean up automatically at garbage collection, but calling `Free()` manually is recommended in hot loops for predictable memory use.
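
The combination of an explicit `Free()` with a finalizer as a safety net can be sketched without cgo. Everything below is hypothetical: `handle`, `newHandle`, and the `liveHandles` counter merely stand in for a wrapper around C-allocated memory that the Go garbage collector cannot account for.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// liveHandles counts handles not yet released. It stands in for memory owned
// by C code. Finalizers run on their own goroutine, hence the atomic counter.
var liveHandles int64

// live returns the current number of unreleased handles.
func live() int64 { return atomic.LoadInt64(&liveHandles) }

// handle mimics a Go wrapper around a C-allocated object.
type handle struct {
	pattern  string
	released bool
}

// newHandle "allocates" the resource and registers a finalizer as a safety net.
func newHandle(pattern string) *handle {
	atomic.AddInt64(&liveHandles, 1)
	h := &handle{pattern: pattern}
	runtime.SetFinalizer(h, func(h *handle) { h.Free() })
	return h
}

// Free releases the resource immediately. It is idempotent, so a second
// explicit call or a later finalizer run is harmless.
func (h *handle) Free() {
	if h.released {
		return
	}
	h.released = true
	atomic.AddInt64(&liveHandles, -1)
	runtime.SetFinalizer(h, nil) // nothing left for the finalizer to do
}

func main() {
	for i := 0; i < 1000; i++ {
		h := newHandle("ACGT")
		// ... use the handle ...
		h.Free() // release now instead of waiting for a GC cycle
	}
	fmt.Println("live handles:", live()) // prints "live handles: 0"
}
```

Without the explicit `Free()` call, the counter would only drop once the garbage collector happened to run the finalizers, which is why manual release gives more predictable memory use in tight loops.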
## PCR Simulation Module (`PCRSim` family)
Implements *in silico* PCR with configurable primer tolerance and amplicon constraints:
- **`PCRSim(seq obiseq.BioSequence, opts ...Option) []Amplicon`**
Simulates PCR on a single sequence. Options include:
- `OptionForwardPrimer(pattern string, errormax int)` / `OptionReversePrimer(...)`
- `OptionMinLength(n)`, `OptionMaxLength(n)` → filter amplicons by size
- `OptionWithExtension(len int, strict bool)` → add flanking regions (trim if `strict=false`)
- `OptionCircular(bool)` → handle circular DNA topology
- **`PCRSlice(seqs []obiseq.BioSequence, opts ...Option) [][]Amplicon`**
Batch PCR across multiple sequences.
- **`PCRSliceWorker(opts ...Option) func(int, obiseq.BioSequence) (int, interface{})`**
Returns a reusable worker for parallel execution via `obiseq.MakeISliceWorker`.
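
The core idea of in silico PCR can be sketched on plain strings. This is a deliberately reduced model, not `PCRSim`: it takes plain arguments instead of options, handles linear sequences only, counts substitutions only, and reports the region between the primers. `sites`, `pcr`, and `amplicon` are hypothetical names.

```go
package main

import "fmt"

var comp = map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

// revComp returns the reverse complement of a plain ACGT string.
func revComp(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// sites lists every position where primer matches seq with at most maxErr
// substitutions.
func sites(seq, primer string, maxErr int) []int {
	var pos []int
	for i := 0; i+len(primer) <= len(seq); i++ {
		errs := 0
		for j := 0; j < len(primer); j++ {
			if seq[i+j] != primer[j] {
				errs++
			}
		}
		if errs <= maxErr {
			pos = append(pos, i)
		}
	}
	return pos
}

// amplicon is the region between the two primer sites (primers excluded).
type amplicon struct {
	Start, End int
	Seq        string
}

// pcr pairs each forward site with each downstream reverse site and keeps
// amplicons whose length lies in [minLen, maxLen].
func pcr(seq, fwd, rev string, maxErr, minLen, maxLen int) []amplicon {
	var out []amplicon
	// The reverse primer anneals to the opposite strand, so on the forward
	// strand it appears as its reverse complement.
	revSites := sites(seq, revComp(rev), maxErr)
	for _, f := range sites(seq, fwd, maxErr) {
		start := f + len(fwd)
		for _, r := range revSites {
			if n := r - start; r >= start && n >= minLen && n <= maxLen {
				out = append(out, amplicon{start, r, seq[start:r]})
			}
		}
	}
	return out
}

func main() {
	seq := "TTACGTGGGGCCCCAATTGG"
	for _, a := range pcr(seq, "ACGT", "CAATT", 0, 1, 100) {
		fmt.Println(a.Start, a.End, a.Seq)
	}
}
```

A circular topology would additionally require matching across the end of the sequence, for example by scanning a copy extended with its own prefix.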
### Output Format
Each amplicon includes:
- Coordinates, primer positions/errors/directions
- Flanking extensions (if requested)
- Original sequence metadata preserved
## Predicate Generator: `IsPatternMatchSequence`
Returns a **reusable function** for sequence filtering:
```go
func IsPatternMatchSequence(
	pattern string, errormax int,
	bothStrand bool, allowIndel bool,
) obiseq.SequencePredicate
```
- Internally builds `ApatPattern` + reverse complement (if needed).
- Predicate logic:
```go
func(seq *obiseq.BioSequence) bool {
	// The reverse complement is only tested when bothStrand is set.
	return pattern.IsMatching(...) ||
		(bothStrand && rcPattern.IsMatching(...))
}
```
- Ideal for high-throughput read filtering (e.g., barcode detection, primer contamination checks).
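
The predicate-generator idea, compiling once and reusing the closure for every read, can be sketched with the standard library alone. The sketch assumes a simplified predicate over strings: `sequencePredicate` stands in for `obiseq.SequencePredicate`, exact matching via `strings.Contains` stands in for approximate `IsMatching`, and `makePredicate` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// sequencePredicate is a simplified stand-in for obiseq.SequencePredicate.
type sequencePredicate func(seq string) bool

var comp = map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

// revComp returns the reverse complement of a plain ACGT string.
func revComp(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// makePredicate prepares the pattern (and, if requested, its reverse
// complement) once and returns a closure that can be applied to many reads.
func makePredicate(pattern string, bothStrand bool) sequencePredicate {
	rc := ""
	if bothStrand {
		rc = revComp(pattern) // computed once, reused for every sequence
	}
	return func(seq string) bool {
		return strings.Contains(seq, pattern) ||
			(bothStrand && strings.Contains(seq, rc))
	}
}

func main() {
	reads := []string{"TTACGGAT", "TTATCCGT", "GGGGGGGG"}
	keep := makePredicate("ACGGAT", true)
	for _, r := range reads {
		if keep(r) {
			fmt.Println(r)
		}
	}
}
```

Because the closure captures the prepared patterns, the per-read cost is limited to the matching itself.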
## Implementation Highlights
- **C interoperability** via `cgo` with custom memory management (no Go heap copies).
- **Finalizers + manual `Free()`** prevent leaks in long-running pipelines.
- Uses `unsafe.SliceData` for zero-copy sequence access during matching.
- Logging via **Logrus** (errors at `ErrorLevel`, debug amplicon details at `DebugLevel`).
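
As an illustration of the zero-copy access mentioned above, the following sketch shows what `unsafe.SliceData` provides: a pointer to a slice's backing array that can be handed to foreign code without copying. It needs Go 1.20 or later, involves no cgo, and `view` is a hypothetical helper that rebuilds a slice from the raw pointer to show that both share memory.

```go
package main

import (
	"fmt"
	"unsafe"
)

// view returns a second slice header over the same backing array as b,
// built from the raw pointer a C function would receive.
func view(b []byte) []byte {
	if len(b) == 0 {
		return nil
	}
	p := unsafe.SliceData(b) // *byte to the first element, no copy
	return unsafe.Slice(p, len(b))
}

func main() {
	seq := []byte("acgt")
	v := view(seq)
	v[0] = 'n' // write through the view
	fmt.Println(string(seq)) // prints "ncgt": both slices share memory
}
```

The trade-off is that the pointer is only valid while the original slice is alive and not reallocated, so code using this technique must control the lifetime of the buffer carefully.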