obitools4/autodoc/docmd/pkg_obitools_obisplit.md
obisplit Package: Semantic Description

The obisplit package splits biological sequences at user-defined pattern pairs (e.g., primers, barcodes). It supports approximate matching and annotates every resulting fragment, which makes it suited to demultiplexing in metabarcoding or amplicon sequencing pipelines.

Core Concepts

  • SplitSequence: Represents a pattern pair (forward/reverse) with an associated group name. Used to define searchable molecular tags.
  • Pattern_match: Encapsulates a detected pattern instance, including name, coordinates on the scanned sequence (1-based), error count, and orientation.
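
The two concepts above can be pictured as plain records. The following is a minimal Python sketch, not the package's Go types: the class and field names (`SplitSequence.forward`, `PatternMatch.start`, and so on) are assumptions chosen to mirror the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitSequence:
    """A forward/reverse pattern pair with a group name (field names are hypothetical)."""
    name: str
    forward: str
    reverse: str


@dataclass(frozen=True)
class PatternMatch:
    """One detected pattern occurrence (field names are hypothetical)."""
    name: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    errors: int     # number of errors tolerated to obtain this match
    forward: bool   # True when the match lies on the forward strand

    def length(self) -> int:
        # With 1-based inclusive coordinates the length is end - start + 1.
        return self.end - self.start + 1


pair = SplitSequence(name="pool_1", forward="ACGTAC", reverse="TTGACA")
hit = PatternMatch(name=pair.name, start=3, end=8, errors=1, forward=True)
```

Keeping coordinates 1-based and inclusive, as the description states, means a match covering positions 3 to 8 spans six bases.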

Pattern Detection (LocatePatterns)

Scans a sequence for all forward/reverse pattern occurrences using fuzzy matching (mismatches and optionally indels):

  • Accepts raw or indexed sequences for efficient lookup.
  • Detects matches with configurable error tolerance (default: ≤4 mismatches).
  • Normalizes coordinates and reverse-complements backward-strand matches.
  • Deduplicates overlapping hits by retaining the match with fewer errors.
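
The detection steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the LocatePatterns implementation: it counts mismatches only (a Hamming scan, no indels), the function names `hamming_hits` and `locate` are hypothetical, and reverse-strand hits are found by scanning for the reverse complement of the pattern.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming_hits(seq, pattern, max_errors):
    """Yield (start, end, errors) for each window within max_errors mismatches.

    Coordinates are 1-based and inclusive.
    """
    n = len(pattern)
    for i in range(len(seq) - n + 1):
        errors = sum(a != b for a, b in zip(seq[i:i + n], pattern))
        if errors <= max_errors:
            yield (i + 1, i + n, errors)


def locate(seq, patterns, max_errors=4):
    """Find patterns on both strands, then drop overlapping hits.

    patterns maps a name to a pattern string. Among overlapping hits,
    the one with fewer errors is kept.
    """
    hits = []
    for name, pat in patterns.items():
        for forward, p in ((True, pat), (False, reverse_complement(pat))):
            for start, end, err in hamming_hits(seq, p, max_errors):
                hits.append({"name": name, "start": start, "end": end,
                             "errors": err, "forward": forward})
    hits.sort(key=lambda h: (h["start"], h["errors"]))
    kept = []
    for h in hits:
        if kept and h["start"] <= kept[-1]["end"]:
            # Overlap with the last kept hit: keep whichever has fewer errors.
            if h["errors"] < kept[-1]["errors"]:
                kept[-1] = h
        else:
            kept.append(h)
    return kept


seq = "GGACGTACGGGGGGGTACGTGG"
found = locate(seq, {"p": "ACGTAC"}, max_errors=1)
```

In this example the exact forward hit at positions 3 to 8 overlaps two one-mismatch hits on the reverse strand; only the exact hit survives deduplication, alongside a separate exact reverse-strand hit at positions 15 to 20.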

Sequence Splitting (SplitPattern)

Divides input sequences into fragments between matched pattern pairs, producing annotated output:

  • Each fragment is labeled with:
    • obisplit_frg: Fragment number (1-based).
    • obisplit_nfrg: Total fragment count.
    • obisplit_group: Pattern-pair name (e.g., "primerA-primerB"), or "extremity" for terminal regions.
    • obisplit_set: Relevant pattern set (e.g., "primerA"), or "NA".
    • obisplit_location: Span of the fragment on the input sequence (1-based, inclusive).
  • Includes left/right pattern metadata: name, matched substring, and error count.
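
To make the annotation scheme concrete, here is a hedged Python sketch of the splitting step. It assumes that fragments exclude the matched patterns themselves, that a group name joins the left and right pattern names with a hyphen, and that obisplit_set is filled only when both flanking patterns share a name; none of these rules is confirmed by the description, and `split_fragments` is a hypothetical name.

```python
def split_fragments(seq, hits):
    """Cut seq into the stretches lying between pattern hits.

    hits: list of (name, start, end), sorted by start, non-overlapping,
    with 1-based inclusive coordinates.
    """
    # Sentinels stand for the two ends of the sequence.
    bounds = [(None, 0, 0)] + list(hits) + [(None, len(seq) + 1, len(seq) + 1)]
    frags = []
    for left, right in zip(bounds, bounds[1:]):
        start, end = left[2] + 1, right[1] - 1
        if start > end:
            continue  # adjacent patterns leave nothing in between
        terminal = left[0] is None or right[0] is None
        group = "extremity" if terminal else f"{left[0]}-{right[0]}"
        same_set = not terminal and left[0] == right[0]
        frags.append({
            "sequence": seq[start - 1:end],
            "obisplit_group": group,
            "obisplit_set": left[0] if same_set else "NA",
            "obisplit_location": (start, end),
        })
    # Fragment numbers are 1-based; every fragment carries the total count.
    for i, frag in enumerate(frags, 1):
        frag["obisplit_frg"] = i
        frag["obisplit_nfrg"] = len(frags)
    return frags


frags = split_fragments("AAAACCCCGGGGTTTTAAAA",
                        [("primerA", 5, 8), ("primerB", 13, 16)])
```

With two hits in a 20-base sequence, the sketch yields three fragments: two terminal ones labeled "extremity" and one internal fragment at positions 9 to 12 labeled "primerA-primerB".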

Pipeline Integration

  • SplitPatternWorker: Wraps splitting logic as a reusable SeqWorker, compatible with OBITools4's streaming infrastructure.
  • CLISplitPipeline: CLI entry point integrating pattern detection and splitting into a parallelizable, configurable pipeline.
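
The worker idea can be illustrated generically: a worker is a function from one sequence to zero or more output sequences, and a pipeline applies it lazily to a stream. The Python sketch below is an analogy only. `make_worker` and `run_pipeline` are hypothetical names, exact `str.split` stands in for fuzzy pattern splitting, and no parallelism is shown.

```python
from typing import Callable, Iterable, Iterator, List

# A worker maps one input sequence to a list of output sequences.
SeqWorker = Callable[[str], List[str]]


def make_worker(delimiter: str) -> SeqWorker:
    """Build a worker that cuts a sequence at every exact delimiter occurrence."""
    def worker(seq: str) -> List[str]:
        return [part for part in seq.split(delimiter) if part]
    return worker


def run_pipeline(seqs: Iterable[str], worker: SeqWorker) -> Iterator[str]:
    """Apply the worker to each sequence of a stream, yielding results lazily."""
    for seq in seqs:
        yield from worker(seq)


fragments = list(run_pipeline(["AAXXCC", "GG"], make_worker("XX")))
```

Closing over the configuration (here the delimiter) keeps the worker a single-argument function, so the same streaming loop can run any worker without knowing its settings.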

Configuration & Usage

  • CSV-based config: Maps tag sequences to pcr_pool identifiers (required column: tag; optional column: reverse_tag).
  • CLI flags:
    • -C, --config: Load pattern definitions from CSV.
    • --template: Output sample config for rapid setup.
    • --pattern-error N: Max mismatches allowed (default: 4).
    • --allows-indels: Enable insertion/deletion-aware matching.
  • Error handling: Validates config structure, pattern compilation, and file access; logs fatal issues.
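
As an illustration of the configuration format and its validation, the following Python sketch reads such a CSV and rejects files without the required column. The exact layout is an assumption: the description names tag, reverse_tag, and pcr_pool, but treating pcr_pool as a third column is a guess, and `load_patterns` is a hypothetical helper rather than part of obisplit.

```python
import csv
import io


def load_patterns(text):
    """Parse a pattern config given as CSV text.

    Requires a 'tag' column; 'reverse_tag' and 'pcr_pool' are treated
    as optional (assumed layout).
    """
    reader = csv.DictReader(io.StringIO(text))
    if "tag" not in (reader.fieldnames or []):
        raise ValueError("config is missing the required 'tag' column")
    patterns = []
    for row in reader:
        forward = (row["tag"] or "").strip()
        if not forward:
            raise ValueError(f"empty tag on line {reader.line_num}")
        patterns.append({
            "pcr_pool": (row.get("pcr_pool") or "").strip(),
            "tag": forward,
            # A missing or empty reverse tag is recorded as None.
            "reverse_tag": (row.get("reverse_tag") or "").strip() or None,
        })
    return patterns


config = "tag,reverse_tag,pcr_pool\nACGTAC,TTGACA,pool_1\nGGCCTA,,pool_2\n"
patterns = load_patterns(config)
```

Failing early on a malformed header mirrors the behavior described above, where structural problems in the config are treated as fatal rather than silently skipped.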

Design Goals

Designed for high-throughput amplicon processing, obisplit couples pattern detection with fragment extraction and makes few assumptions about tag layout, so it can accommodate diverse molecular tagging schemes.