mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
29 lines
1.5 KiB
Markdown
# Aho-Corasick-Based Sequence Analysis in `obicorazick`
This Go package provides pattern-matching utilities for biological sequence data, using the Aho-Corasick algorithm to search each sequence for many patterns at once.
## Core Components
- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**
  Builds *multiple* Aho-Corasick matchers in parallel (batched to manage memory), then returns a `SeqWorker` function.
  - Scans each sequence *forward* and as its reverse complement.
  - Counts total matches (`slot`), forward-only matches (`_Fwd`), and reverse-complement-specific matches (`_Rev`).
  - Attaches the match counts as sequence attributes.
- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**
  Compiles a *single* matcher and returns a predicate function.
  - Returns `true` if the number of matches is ≥ `minMatches`.
  - Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
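
The counting behaviour described above can be illustrated with a small, self-contained sketch. It does not use the `obicorazick` API or its underlying matcher library: `buildMatcher`, `countMatches`, `reverseComplement`, and `makePredicate` are hypothetical names, the automaton is a minimal textbook Aho-Corasick, and it counts overlapping occurrences, which may differ from how the package counts. Whether the real predicate scans one strand or both is not stated here, so the sketch assumes forward only.

```go
package main

import "fmt"

// node is one automaton state (hypothetical minimal layout).
type node struct {
	next map[byte]int // goto transitions
	fail int          // failure link
	out  int          // patterns ending here, including via failure links
}

// buildMatcher compiles patterns into an Aho-Corasick automaton.
func buildMatcher(patterns []string) []node {
	nodes := []node{{next: map[byte]int{}}}
	for _, p := range patterns { // 1. build the trie
		cur := 0
		for i := 0; i < len(p); i++ {
			nx, ok := nodes[cur].next[p[i]]
			if !ok {
				nodes = append(nodes, node{next: map[byte]int{}})
				nx = len(nodes) - 1
				nodes[cur].next[p[i]] = nx
			}
			cur = nx
		}
		nodes[cur].out++
	}
	queue := []int{} // 2. breadth-first pass to set failure links
	for _, child := range nodes[0].next {
		queue = append(queue, child) // depth-1 states fail to the root
	}
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		for c, v := range nodes[u].next {
			f := nodes[u].fail
			for f != 0 {
				if _, ok := nodes[f].next[c]; ok {
					break
				}
				f = nodes[f].fail
			}
			if w, ok := nodes[f].next[c]; ok {
				f = w
			}
			nodes[v].fail = f
			nodes[v].out += nodes[f].out
			queue = append(queue, v)
		}
	}
	return nodes
}

// countMatches returns the number of pattern occurrences in seq.
func countMatches(m []node, seq string) int {
	state, total := 0, 0
	for i := 0; i < len(seq); i++ {
		for state != 0 {
			if _, ok := m[state].next[seq[i]]; ok {
				break
			}
			state = m[state].fail
		}
		if nx, ok := m[state].next[seq[i]]; ok {
			state = nx
		}
		total += m[state].out
	}
	return total
}

var comp = map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}

// reverseComplement handles lowercase a/c/g/t; anything else becomes 'n'.
func reverseComplement(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		c, ok := comp[seq[i]]
		if !ok {
			c = 'n'
		}
		out[len(seq)-1-i] = c
	}
	return string(out)
}

// makePredicate is true when forward-strand matches reach minMatches
// (forward-only scanning is an assumption of this sketch).
func makePredicate(m []node, minMatches int) func(string) bool {
	return func(seq string) bool { return countMatches(m, seq) >= minMatches }
}

func main() {
	m := buildMatcher([]string{"acgt", "gga"}) // hypothetical patterns
	seq := "ttacgttcc"
	fwd := countMatches(m, seq)
	rev := countMatches(m, reverseComplement(seq))
	fmt.Println("total:", fwd+rev, "fwd:", fwd, "rev:", rev)
	fmt.Println("passes minMatches=1:", makePredicate(m, 1)(seq))
}
```

In this toy input the forward strand contains `acgt` once, while the reverse complement (`ggaacgtaa`) contains both `gga` and `acgt`, which is why forward and reverse counts are worth recording separately.
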
## Technical Highlights
- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns per batch), and one matcher is built per chunk to limit peak memory use.
- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
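
As a rough illustration of the batching and parallel construction described above, the following sketch splits a pattern list into fixed-size chunks and compiles each chunk in its own goroutine, with a cap on concurrent builds. It is not the package's implementation: `matcher`, `compile`, and `buildBatched` are hypothetical stand-ins, and `runtime.NumCPU()` is used here only as a placeholder for `obidefault.ParallelWorkers()`.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// matcher is a hypothetical stand-in for a compiled Aho-Corasick automaton.
type matcher struct{ patterns []string }

// compile is a placeholder for the real (expensive) matcher construction.
func compile(batch []string) *matcher { return &matcher{patterns: batch} }

// buildBatched splits patterns into chunks of at most batchSize and compiles
// them concurrently, with at most `workers` compilations in flight.
func buildBatched(patterns []string, batchSize, workers int) []*matcher {
	if batchSize < 1 {
		batchSize = 1
	}
	if workers < 1 {
		workers = 1
	}
	n := (len(patterns) + batchSize - 1) / batchSize // ceiling division
	matchers := make([]*matcher, n)
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		start := i * batchSize
		end := start + batchSize
		if end > len(patterns) {
			end = len(patterns)
		}
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` builds are running
		go func(i int, batch []string) {
			defer wg.Done()
			defer func() { <-sem }()
			matchers[i] = compile(batch) // each goroutine writes its own slot
		}(i, patterns[start:end])
	}
	wg.Wait()
	return matchers
}

func main() {
	patterns := []string{"acgt", "gga", "ttc", "cag", "tgca"}
	// A tiny batch size is used for demonstration; the documented default is 10^7.
	matchers := buildBatched(patterns, 2, runtime.NumCPU())
	for i, m := range matchers {
		fmt.Printf("matcher %d holds %d patterns\n", i, len(m.patterns))
	}
}
```

Because every sequence must then be scanned once per matcher, the batch size trades peak memory during construction against scan time.
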
## Use Cases
- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
- Filtering or annotating sequences based on pattern presence/abundance.