Files
obitools4/autodoc/docmd/pkg_obicorazick.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

3.4 KiB

obicorazick: Aho-Corasick-Based Sequence Analysis Package

obicorazick is a high-performance Go library for rapid pattern detection in biological sequences (e.g., FASTA/FASTQ), designed to scale efficiently with large pattern sets. Built on the Aho-Corasick algorithm, it enables concurrent scanning of sequences against thousands to millions of patterns—ideal for primer screening, contamination checks, or taxonomic classification.

Public API

AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker

Constructs a sequence worker function that scans input sequences for matches against the provided patterns, using multiple Aho-Corasick automata compiled in parallel (batched internally to manage memory).

  • Input:

    • slot: Name of the attribute field where match counts will be stored (e.g., "primer_hits").
    • patterns: List of DNA/RNA patterns (strings) to search for.
  • Behavior:

    • Splits patterns into batches of ≤10⁷ items (configurable via environment).
    • Compiles one Aho-Corasick matcher per batch in parallel (using obidefault.ParallelWorkers()).
    • For each sequence: scans both the forward strand and its reverse complement.
    • Records three counts as attributes on the sequence:
      <slot>         → total matches (forward + rev-comp)
      <slot>_Fwd     → forward-strand-only matches
      <slot>_Rev     → rev-comp-specific (i.e., not found on forward) matches
      
    • Logs match counts at debug level (via Logrus).
  • Use case: Annotating sequences with pattern-hit statistics for downstream analysis (e.g., reporting primer coverage per read).


AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate

Returns a boolean predicate function that tests whether sequences contain ≥ minMatches occurrences of any pattern.

  • Input:

    • minMatches: Minimum number of total matches required to pass the predicate.
    • patterns: List of patterns (same format as above).
  • Behavior:

    • Compiles a single Aho-Corasick matcher (no batching—assumes pattern set is moderate-sized or memory-safe).
    • Scans only the forward strand (for efficiency in filtering contexts where rev-comp is unnecessary).
    • Returns true if match count ≥ minMatches; otherwise false.
  • Use case: Filtering sequences—e.g., retain only reads containing ≥2 barcode primers, or discard those matching known contaminants.


Implementation Notes (Non-Exported)

While not part of the public API, internal behavior includes:

  • Batching logic: Splits patterns to avoid memory exhaustion during automaton construction.
  • Parallel compilation: Uses goroutines + sync.WaitGroup, respecting GOMAXPROCS.
  • Progress feedback: Optional CLI progress bar (via progressbar/v3) when enabled globally.
  • Logging: Info/debug messages via Logrus (e.g., “Built 3 matchers in parallel” or “Sequence X: 5 total matches”).

Typical Workflows

  1. Annotation pipeline:

    worker := AhoCorazickWorker("contam", contaminantDB)
    annotatedSeqs := obiseq.Map(worker, inputSequences)
    
  2. Filtering pipeline:

    filter := AhoCorazickPredicate(1, barcodePatterns)
    filteredSeqs := obiseq.Filter(filter, inputSequences)
    

Designed for speed and memory efficiency in large-scale NGS data processing.