⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,28 @@
# Aho-Corasick-Based Sequence Analysis in `obicorazick`
This Go package provides efficient pattern-matching utilities for biological sequence data, leveraging the Aho-Corasick algorithm.
## Core Components
- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**
Builds *multiple* Aho-Corasick matchers in parallel (batched to manage memory), then returns a `SeqWorker` function.
  - Scans each sequence on the *forward* strand and on its reverse complement.
  - Counts total matches (`slot`), forward-strand matches (`_Fwd`), and reverse-complement matches (`_Rev`).
- Attaches match counts as sequence attributes.
- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**
Compiles a *single* matcher and returns a predicate function.
  - The predicate returns `true` when the number of matches is at least `minMatches`.
- Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
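The two-strand counting and the threshold predicate described above can be sketched in a self-contained way. This is only an illustration, not the package's code: `matcher`, `newMatcher`, `annotate`, and `atLeast` are hypothetical names, and it does not use the `obiseq` types. It also assumes that the attribute keys are `slot`, `slot + "_Fwd"`, and `slot + "_Rev"`, that the total is the sum of both strands, and that the predicate scans the forward strand only.

```go
package main

import "fmt"

// node is one state of a byte-level Aho-Corasick automaton.
type node struct {
	next map[byte]int // trie edges
	fail int          // failure link
	hits int          // patterns ending here, directly or via failure links
}

// matcher is a hypothetical stand-in for a compiled pattern set.
type matcher struct{ nodes []node }

// step follows edge c from state s, falling back along failure links.
func (m *matcher) step(s int, c byte) int {
	for {
		if t, ok := m.nodes[s].next[c]; ok {
			return t
		}
		if s == 0 {
			return 0
		}
		s = m.nodes[s].fail
	}
}

// newMatcher builds the trie, then sets failure links breadth-first.
func newMatcher(patterns []string) *matcher {
	m := &matcher{nodes: []node{{next: map[byte]int{}}}}
	for _, p := range patterns {
		if p == "" {
			continue // empty patterns are ignored in this sketch
		}
		cur := 0
		for i := 0; i < len(p); i++ {
			nxt, ok := m.nodes[cur].next[p[i]]
			if !ok {
				nxt = len(m.nodes)
				m.nodes = append(m.nodes, node{next: map[byte]int{}})
				m.nodes[cur].next[p[i]] = nxt
			}
			cur = nxt
		}
		m.nodes[cur].hits++
	}
	queue := []int{}
	for _, child := range m.nodes[0].next {
		queue = append(queue, child) // depth-1 states keep fail = 0
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for c, child := range m.nodes[cur].next {
			f := m.step(m.nodes[cur].fail, c)
			m.nodes[child].fail = f
			m.nodes[child].hits += m.nodes[f].hits
			queue = append(queue, child)
		}
	}
	return m
}

// count returns the number of pattern occurrences in seq (overlaps included).
func (m *matcher) count(seq string) int {
	state, total := 0, 0
	for i := 0; i < len(seq); i++ {
		state = m.step(state, seq[i])
		total += m.nodes[state].hits
	}
	return total
}

// reverseComplement handles lowercase a/c/g/t; anything else becomes 'n'.
func reverseComplement(seq string) string {
	comp := map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		c, ok := comp[seq[i]]
		if !ok {
			c = 'n'
		}
		out[len(seq)-1-i] = c
	}
	return string(out)
}

// annotate returns the three counts; the key naming is an assumption.
func annotate(m *matcher, slot, seq string) map[string]int {
	fwd := m.count(seq)
	rev := m.count(reverseComplement(seq))
	return map[string]int{slot: fwd + rev, slot + "_Fwd": fwd, slot + "_Rev": rev}
}

// atLeast builds a predicate over forward-strand matches only (assumption).
func atLeast(m *matcher, minMatches int) func(string) bool {
	return func(seq string) bool { return m.count(seq) >= minMatches }
}

func main() {
	m := newMatcher([]string{"acgt", "ttag"})
	counts := annotate(m, "hits", "acgtctaa")
	fmt.Println(counts["hits"], counts["hits_Fwd"], counts["hits_Rev"]) // prints: 3 1 2
	fmt.Println(atLeast(m, 2)("acgtctaa"))                              // prints: false
}
```

In the example, the reverse complement of `acgtctaa` is `ttagacgt`, which contains both patterns, so the reverse count exceeds the forward count.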
## Technical Highlights
- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns/batch) to avoid memory overload.
- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
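Batched, parallel construction might look roughly like the following sketch. It is not the package's implementation: `batchMatcher`, `splitBatches`, and `buildAll` are hypothetical names, a plain set of patterns stands in for a compiled matcher, and `runtime.NumCPU()` stands in for `obidefault.ParallelWorkers()`.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// batchMatcher is a hypothetical placeholder for a compiled matcher.
type batchMatcher map[string]struct{}

// splitBatches cuts patterns into consecutive chunks of at most batchSize.
func splitBatches(patterns []string, batchSize int) [][]string {
	if batchSize < 1 {
		batchSize = 1
	}
	var batches [][]string
	for start := 0; start < len(patterns); start += batchSize {
		end := start + batchSize
		if end > len(patterns) {
			end = len(patterns)
		}
		batches = append(batches, patterns[start:end])
	}
	return batches
}

// buildAll builds one matcher per batch, running at most `workers` builds at once.
func buildAll(patterns []string, batchSize, workers int) []batchMatcher {
	if workers < 1 {
		workers = 1
	}
	batches := splitBatches(patterns, batchSize)
	out := make([]batchMatcher, len(batches))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, b := range batches {
		wg.Add(1)
		sem <- struct{}{} // wait for a free worker slot
		go func(i int, b []string) {
			defer wg.Done()
			defer func() { <-sem }()
			m := make(batchMatcher, len(b))
			for _, p := range b {
				m[p] = struct{}{}
			}
			out[i] = m // each goroutine writes its own index only
		}(i, b)
	}
	wg.Wait()
	return out
}

func main() {
	patterns := []string{"acgt", "ttag", "ggcc", "atat", "cgcg"}
	matchers := buildAll(patterns, 2, runtime.NumCPU())
	fmt.Println(len(matchers)) // prints: 3
}
```

Bounding concurrency with a buffered channel keeps the number of matchers under construction, and so the peak memory, proportional to the worker count rather than to the number of batches.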
## Use Cases
- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
- Filtering or annotating sequences based on pattern presence/abundance.