obitools4/autodoc/docmd/pkg/obicorazick/worker.md
2026-04-07 08:36:50 +02:00
# Aho-Corasick-Based Sequence Analysis in `obicorazick`
This Go package provides efficient pattern-matching utilities for biological sequence data, leveraging the Aho-Corasick algorithm.
## Core Components
- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**
  Builds *multiple* Aho-Corasick matchers in parallel (batched to limit memory use), then returns a `SeqWorker` function that:
  - Scans each sequence in the *forward* orientation and as its reverse complement.
  - Counts total matches (`slot`), forward-strand matches (`_Fwd`), and reverse-complement matches (`_Rev`).
  - Attaches these match counts to the sequence as attributes.
- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**
  Compiles a *single* matcher and returns a predicate function.
  - Returns `true` if the number of matches ≥ `minMatches`.
  - Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
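The counting behaviour described above can be illustrated with a small, self-contained sketch. It does not use the `obicorazick` or `obiseq` APIs: the `node` type and the `build`, `count`, `revComp`, `countBothStrands`, and `keepSequence` functions are hypothetical names introduced only for this example, and the automaton is a minimal textbook Aho-Corasick, not the matcher the package actually relies on. It also assumes lower-case `acgt` sequences and that the predicate counts matches on both strands, neither of which is stated in this document.

```go
package main

import "fmt"

// node is one automaton state (hypothetical type, for illustration only).
type node struct {
	next map[byte]int // goto transitions
	fail int          // failure link
	hits int          // patterns ending here, including those reached via failure links
}

// build compiles non-empty patterns into a minimal Aho-Corasick automaton.
func build(patterns []string) []node {
	nodes := []node{{next: map[byte]int{}}}
	for _, p := range patterns {
		cur := 0
		for i := 0; i < len(p); i++ {
			nxt, ok := nodes[cur].next[p[i]]
			if !ok {
				nxt = len(nodes)
				nodes = append(nodes, node{next: map[byte]int{}})
				nodes[cur].next[p[i]] = nxt
			}
			cur = nxt
		}
		nodes[cur].hits++
	}
	// Breadth-first traversal: depth-1 states keep fail = 0 (the root).
	queue := []int{}
	for _, child := range nodes[0].next {
		queue = append(queue, child)
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for c, child := range nodes[cur].next {
			f := nodes[cur].fail
			for {
				if t, ok := nodes[f].next[c]; ok {
					nodes[child].fail = t
					break
				}
				if f == 0 {
					break // no proper suffix matches: fail stays at the root
				}
				f = nodes[f].fail
			}
			nodes[child].hits += nodes[nodes[child].fail].hits
			queue = append(queue, child)
		}
	}
	return nodes
}

// count returns the number of pattern occurrences in seq (overlaps included).
func count(nodes []node, seq string) int {
	total, cur := 0, 0
	for i := 0; i < len(seq); i++ {
		for {
			if t, ok := nodes[cur].next[seq[i]]; ok {
				cur = t
				break
			}
			if cur == 0 {
				break
			}
			cur = nodes[cur].fail
		}
		total += nodes[cur].hits
	}
	return total
}

// revComp returns the reverse complement of a lower-case DNA string;
// any symbol other than a, c, g, t becomes 'n'.
func revComp(seq string) string {
	comp := map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		b, ok := comp[seq[i]]
		if !ok {
			b = 'n'
		}
		out[len(seq)-1-i] = b
	}
	return string(out)
}

// countBothStrands mirrors the worker's forward / reverse-complement counts.
func countBothStrands(nodes []node, seq string) (fwd, rev int) {
	return count(nodes, seq), count(nodes, revComp(seq))
}

// keepSequence mirrors the predicate: true when matches >= minMatches.
// Summing both strands is an assumption of this sketch.
func keepSequence(nodes []node, seq string, minMatches int) bool {
	fwd, rev := countBothStrands(nodes, seq)
	return fwd+rev >= minMatches
}

func main() {
	m := build([]string{"acgt", "ggc"})
	fwd, rev := countBothStrands(m, "ttacgtgcc")
	fmt.Printf("total=%d fwd=%d rev=%d\n", fwd+rev, fwd, rev) // prints: total=3 fwd=1 rev=2
	fmt.Println(keepSequence(m, "ttacgtgcc", 2))             // prints: true
}
```

In this example the reverse complement of `ttacgtgcc` is `ggcacgtaa`, so `ggc` is found only on the reverse strand, while the palindromic `acgt` is found on both.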
## Technical Highlights
- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns/batch) to avoid memory overload.
- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
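The batching and parallel construction described above might look roughly like the following sketch. It is not the package's code: `matcher`, `compile`, `splitBatches`, and `buildAll` are hypothetical names, `compile` is a stand-in for real automaton construction, and the `workers` argument plays the role that `obidefault.ParallelWorkers()` is said to play in the package. A buffered channel is used as a semaphore to cap the number of concurrent builds, which is one common way to do this in Go and not necessarily the approach taken in `obicorazick`.

```go
package main

import (
	"fmt"
	"sync"
)

// matcher is a hypothetical stand-in for a compiled Aho-Corasick automaton.
type matcher struct {
	nPatterns int
}

// compile is a placeholder for the (expensive) automaton construction.
func compile(batch []string) matcher {
	return matcher{nPatterns: len(batch)}
}

// splitBatches cuts patterns into consecutive chunks of at most size items.
func splitBatches(patterns []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var batches [][]string
	for start := 0; start < len(patterns); start += size {
		end := start + size
		if end > len(patterns) {
			end = len(patterns)
		}
		batches = append(batches, patterns[start:end])
	}
	return batches
}

// buildAll compiles one matcher per batch, running at most `workers`
// compilations at the same time.
func buildAll(patterns []string, batchSize, workers int) []matcher {
	if workers < 1 {
		workers = 1
	}
	batches := splitBatches(patterns, batchSize)
	matchers := make([]matcher, len(batches))
	sem := make(chan struct{}, workers) // semaphore limiting concurrency
	var wg sync.WaitGroup
	for i, batch := range batches {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` builds are in flight
		go func(i int, batch []string) {
			defer wg.Done()
			defer func() { <-sem }()
			matchers[i] = compile(batch) // each goroutine writes its own slot
		}(i, batch)
	}
	wg.Wait()
	return matchers
}

func main() {
	patterns := []string{"acgt", "ggc", "ttag", "cca", "gat"}
	for i, m := range buildAll(patterns, 2, 2) {
		fmt.Printf("batch %d: %d patterns\n", i, m.nPatterns)
	}
}
```

A sequence would then be scanned against every matcher in the returned slice and the per-batch counts summed, so the batch size trades peak memory per automaton against the number of passes over each sequence.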
## Use Cases
- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
- Filtering or annotating sequences based on pattern presence/abundance.