Files
obitools4/autodoc/docmd/pkg_obicorazick.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

71 lines
3.4 KiB
Markdown

# `obicorazick`: Aho-Corasick-Based Sequence Analysis Package
`obicorazick` is a high-performance Go library for rapid pattern detection in biological sequences (e.g., FASTA/FASTQ), designed to scale efficiently with large pattern sets. Built on the Aho-Corasick algorithm, it enables concurrent scanning of sequences against thousands to millions of patterns—ideal for primer screening, contamination checks, or taxonomic classification.
## Public API
### `AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`
Constructs a **sequence worker function** that scans input sequences for matches against the provided `patterns`, using *multiple* Aho-Corasick automata compiled in parallel (batched internally to manage memory).
- **Input**:
- `slot`: Name of the attribute field where match counts will be stored (e.g., `"primer_hits"`).
- `patterns`: List of DNA/RNA patterns (strings) to search for.
- **Behavior**:
- Splits `patterns` into batches of ≤10⁷ items (configurable via environment).
- Compiles one Aho-Corasick matcher per batch in parallel (using `obidefault.ParallelWorkers()`).
- For each sequence: scans both the forward strand and its reverse complement.
- Records three counts as attributes on the sequence:
```text
<slot> → total matches (forward + rev-comp)
<slot>_Fwd → forward-strand-only matches
<slot>_Rev → rev-comp-specific (i.e., not found on forward) matches
```
- Logs match counts at debug level (via Logrus).
- **Use case**: Annotating sequences with pattern-hit statistics for downstream analysis (e.g., reporting primer coverage per read).
---
### `AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`
Returns a **boolean predicate function** that tests whether sequences contain ≥ `minMatches` occurrences of any pattern.
- **Input**:
- `minMatches`: Minimum number of total matches required to pass the predicate.
- `patterns`: List of patterns (same format as above).
- **Behavior**:
- Compiles a *single* Aho-Corasick matcher (no batching—assumes pattern set is moderate-sized or memory-safe).
- Scans only the forward strand (for efficiency in filtering contexts where rev-comp is unnecessary).
- Returns `true` if match count ≥ `minMatches`; otherwise `false`.
- **Use case**: Filtering sequences—e.g., retain only reads containing ≥2 barcode primers, or discard those matching known contaminants.
---
## Implementation Notes (Non-Exported)
While not part of the public API, internal behavior includes:
- **Batching logic**: Splits patterns to avoid memory exhaustion during automaton construction.
- **Parallel compilation**: Uses goroutines + sync.WaitGroup, respecting `GOMAXPROCS`.
- **Progress feedback**: Optional CLI progress bar (via `progressbar/v3`) when enabled globally.
- **Logging**: Info/debug messages via Logrus (e.g., “Built 3 matchers in parallel” or “Sequence X: 5 total matches”).
## Typical Workflows
1. **Annotation pipeline**:
```go
worker := AhoCorazickWorker("contam", contaminantDB)
annotatedSeqs := obiseq.Map(worker, inputSequences)
```
2. **Filtering pipeline**:
```go
filter := AhoCorazickPredicate(1, barcodePatterns)
filteredSeqs := obiseq.Filter(filter, inputSequences)
```
Designed for speed and memory efficiency in large-scale NGS data processing.