mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
obisplit Package: Semantic Description
The obisplit package enables targeted splitting of biological sequences using user-defined pattern pairs (e.g., primers, barcodes), supporting approximate matching and robust annotation of resulting fragments—ideal for demultiplexing in metabarcoding or amplicon sequencing pipelines.
Core Concepts
- SplitSequence: Represents a pattern pair (forward/reverse) with an associated group name. Used to define searchable molecular tags.
- Pattern_match: Encapsulates a detected pattern instance, including name, genomic coordinates (1-based), error count, and orientation.
Pattern Detection (LocatePatterns)
Scans a sequence for all forward/reverse pattern occurrences using fuzzy matching (mismatches and optionally indels):
- Accepts raw or indexed sequences for efficient lookup.
- Detects matches with configurable error tolerance (default: ≤4 mismatches).
- Normalizes coordinates and reverse-complements backward-strand matches.
- Deduplicates overlapping hits by retaining the match with fewer errors.
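The detection and deduplication steps above can be sketched in a few lines of Go. This is a minimal illustration, not the package's implementation: the `Match` type and the `locate` and `dedup` functions are hypothetical names, matching is limited to substitutions (no indels), and reverse-strand handling is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// Match is a hypothetical, simplified stand-in for Pattern_match.
type Match struct {
	Name   string
	Start  int // 1-based, inclusive
	End    int // 1-based, inclusive
	Errors int
}

// locate reports every window of seq that matches pattern with at most
// maxErr mismatches (substitutions only; indels are not handled here).
func locate(seq, pattern, name string, maxErr int) []Match {
	if len(pattern) == 0 {
		return nil
	}
	var hits []Match
	for i := 0; i+len(pattern) <= len(seq); i++ {
		errs := 0
		// Stop comparing as soon as the error budget is exceeded.
		for j := 0; j < len(pattern) && errs <= maxErr; j++ {
			if seq[i+j] != pattern[j] {
				errs++
			}
		}
		if errs <= maxErr {
			hits = append(hits, Match{name, i + 1, i + len(pattern), errs})
		}
	}
	return hits
}

// dedup sorts hits by start position (in place) and, among overlapping
// hits, keeps the one with the fewest errors.
func dedup(hits []Match) []Match {
	sort.SliceStable(hits, func(a, b int) bool { return hits[a].Start < hits[b].Start })
	var kept []Match
	for _, h := range hits {
		n := len(kept)
		if n > 0 && h.Start <= kept[n-1].End {
			if h.Errors < kept[n-1].Errors {
				kept[n-1] = h
			}
			continue
		}
		kept = append(kept, h)
	}
	return kept
}

func main() {
	hits := locate("AAACGTAAACCTAAA", "ACGT", "primerA", 1)
	for _, h := range dedup(hits) {
		fmt.Printf("%s %d..%d errors=%d\n", h.Name, h.Start, h.End, h.Errors)
	}
}
```

A sliding-window comparison like this is quadratic in the worst case; it is used here only because it makes the error-tolerance and overlap rules easy to read.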
Sequence Splitting (SplitPattern)
Divides input sequences into fragments between matched pattern pairs, producing annotated output:
- Each fragment is labeled with:
  - obisplit_frg: Fragment number (1-based).
  - obisplit_nfrg: Total fragment count.
  - obisplit_group: Pattern-pair name (e.g., "primerA-primerB"), or "extremity" for terminal regions.
  - obisplit_set: Relevant pattern set (e.g., "primerA"), or "NA".
  - obisplit_location: Genomic span (1-based, inclusive).
- Includes left/right pattern metadata: name, matched substring, and error count.
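To make the annotation scheme concrete, here is a rough Go sketch of cutting a sequence between already-located matches. It rests on several assumptions: the `Match` and `Fragment` types and the `split` function are hypothetical, the group name is built by joining the flanking pattern names, and the `from..to` location format is a guess. The obisplit_set key and the left/right pattern metadata are left out for brevity.

```go
package main

import (
	"fmt"
	"strconv"
)

// Match is a hypothetical, simplified pattern hit (1-based, inclusive).
type Match struct {
	Name       string
	Start, End int
}

// Fragment carries a subsequence and a subset of the annotations above.
type Fragment struct {
	Seq         string
	Annotations map[string]string
}

// split cuts seq around each match (assumed sorted and non-overlapping)
// and returns the non-empty stretches before, between and after matches.
func split(seq string, matches []Match) []Fragment {
	var frags []Fragment
	add := func(from, to int, left, right string) {
		if from > to {
			return // nothing between two adjacent matches
		}
		group := "extremity"
		if left != "" && right != "" {
			group = left + "-" + right // assumed naming rule
		}
		frags = append(frags, Fragment{
			Seq: seq[from-1 : to],
			Annotations: map[string]string{
				"obisplit_group":    group,
				"obisplit_location": fmt.Sprintf("%d..%d", from, to), // assumed format
			},
		})
	}
	prevEnd, prevName := 0, ""
	for _, m := range matches {
		add(prevEnd+1, m.Start-1, prevName, m.Name)
		prevEnd, prevName = m.End, m.Name
	}
	add(prevEnd+1, len(seq), prevName, "")
	// Fragment numbering is only known once all fragments are collected.
	for i := range frags {
		frags[i].Annotations["obisplit_frg"] = strconv.Itoa(i + 1)
		frags[i].Annotations["obisplit_nfrg"] = strconv.Itoa(len(frags))
	}
	return frags
}

func main() {
	seq := "TTTTACGTGGGGGGTTGCAAAA"
	matches := []Match{{"primerA", 5, 8}, {"primerB", 15, 18}}
	for _, f := range split(seq, matches) {
		fmt.Println(f.Seq, f.Annotations)
	}
}
```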
Pipeline Integration
- SplitPatternWorker: Wraps splitting logic as a reusable SeqWorker, compatible with OBITools4’s streaming infrastructure.
- CLISplitPipeline: CLI entry point integrating pattern detection and splitting into a parallelizable, configurable pipeline.
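The worker-plus-pipeline arrangement can be pictured with the following self-contained Go sketch. None of these types come from OBITools4: `Sequence`, `SeqWorker`, `splitWorker`, and `pipeline` are simplified, hypothetical stand-ins, and the worker splits on exact tag occurrences only. The point is the shape: a closure captures the splitting parameters, and several goroutines apply it to a stream of sequences.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Sequence is a hypothetical, minimal sequence record.
type Sequence struct {
	ID, Seq string
}

// SeqWorker is an assumed worker shape: one sequence in, zero or more out.
type SeqWorker func(Sequence) ([]Sequence, error)

// splitWorker returns a worker that cuts a sequence at every exact
// occurrence of tag and drops empty fragments.
func splitWorker(tag string) SeqWorker {
	return func(s Sequence) ([]Sequence, error) {
		if tag == "" {
			return nil, fmt.Errorf("empty tag")
		}
		var out []Sequence
		for _, part := range strings.Split(s.Seq, tag) {
			if part == "" {
				continue
			}
			id := fmt.Sprintf("%s_frg%d", s.ID, len(out)+1)
			out = append(out, Sequence{ID: id, Seq: part})
		}
		return out, nil
	}
}

// pipeline applies w to every sequence from in using nworkers goroutines.
// Output order is not guaranteed when nworkers > 1.
func pipeline(in <-chan Sequence, w SeqWorker, nworkers int) <-chan Sequence {
	out := make(chan Sequence)
	var wg sync.WaitGroup
	for k := 0; k < nworkers; k++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range in {
				frags, err := w(s)
				if err != nil {
					continue // a real pipeline would report the error
				}
				for _, f := range frags {
					out <- f
				}
			}
		}()
	}
	// Close the output once every worker has drained the input.
	go func() { wg.Wait(); close(out) }()
	return out
}

func main() {
	in := make(chan Sequence, 2)
	in <- Sequence{"r1", "TTACGTGGACGTAA"}
	in <- Sequence{"r2", "ACGTCC"}
	close(in)
	for f := range pipeline(in, splitWorker("ACGT"), 2) {
		fmt.Println(f.ID, f.Seq)
	}
}
```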
Configuration & Usage
- CSV-based config: Maps tag sequences to pcr_pool identifiers (required columns: tag, optionally reverse_tag).
- CLI flags:
  - -C, --config: Load pattern definitions from CSV.
  - --template: Output sample config for rapid setup.
  - --pattern-error N: Max mismatches allowed (default: 4).
  - --allows-indels: Enable insertion/deletion-aware matching.
- Error handling: Validates config structure, pattern compilation, and file access; logs fatal issues.
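As an illustration of the config validation described above, the sketch below reads a CSV with Go's standard encoding/csv package and rejects files that lack the required tag column. The `TagPair` type and the `readConfig` and `parseConfig` functions are hypothetical, and treating pcr_pool as a column header is an assumption drawn from the description, not from the package's code.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// TagPair is a hypothetical in-memory form of one config row.
type TagPair struct {
	Pool, Forward, Reverse string
}

// readConfig parses a CSV whose header must contain "tag" and may contain
// "reverse_tag" and "pcr_pool" (the latter is an assumed column name).
func readConfig(r io.Reader) ([]TagPair, error) {
	// ReadAll also rejects rows whose field count differs from the header.
	rows, err := csv.NewReader(r).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("empty config")
	}
	col := map[string]int{}
	for i, name := range rows[0] {
		col[strings.TrimSpace(name)] = i
	}
	tagCol, ok := col["tag"]
	if !ok {
		return nil, fmt.Errorf("missing required column %q", "tag")
	}
	poolCol, hasPool := col["pcr_pool"]
	revCol, hasRev := col["reverse_tag"]
	var pairs []TagPair
	for _, row := range rows[1:] {
		p := TagPair{Forward: row[tagCol]}
		if hasPool {
			p.Pool = row[poolCol]
		}
		if hasRev {
			p.Reverse = row[revCol]
		}
		pairs = append(pairs, p)
	}
	return pairs, nil
}

// parseConfig is a convenience wrapper for in-memory config text.
func parseConfig(text string) ([]TagPair, error) {
	return readConfig(strings.NewReader(text))
}

func main() {
	pairs, err := parseConfig("pcr_pool,tag,reverse_tag\npool1,ACGT,TTGC\n")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for _, p := range pairs {
		fmt.Printf("%s: %s / %s\n", p.Pool, p.Forward, p.Reverse)
	}
}
```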
Design Goals
Optimized for high-throughput amplicon processing, obisplit bridges pattern detection and fragment extraction with minimal assumptions—ensuring flexibility for diverse molecular tagging schemes.