mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
49 lines
2.6 KiB
Markdown
# `obisplit` Package: Semantic Description
The `obisplit` package performs **targeted splitting of biological sequences** at user-defined pattern pairs (e.g., primers, barcodes). It supports approximate matching and annotates every resulting fragment, which makes it well suited to demultiplexing in metabarcoding or amplicon sequencing pipelines.
## Core Concepts
- **`SplitSequence`**: Represents a pattern pair (forward/reverse) with an associated group name. Used to define searchable molecular tags.
- **`Pattern_match`**: Encapsulates a detected pattern instance, including name, genomic coordinates (1-based), error count, and orientation.
## Pattern Detection (`LocatePatterns`)
Scans a sequence for all forward/reverse pattern occurrences using **fuzzy matching** (mismatches and optionally indels):
- Accepts raw or indexed sequences for efficient lookup.
- Detects matches with configurable error tolerance (default: ≤4 mismatches).
- Normalizes coordinates and reverse-complements backward-strand matches.
- Deduplicates overlapping hits by retaining the match with fewer errors.
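The detection steps listed above can be illustrated with a minimal sketch. It is not the package's implementation: the `Match` type, the helper names, and the plain mismatch-only (Hamming) scan are assumptions made for illustration, and indel-aware matching is left out. The reverse strand is handled by scanning for the reverse complement of the pattern, so all coordinates refer to the input sequence.

```go
package main

import (
	"fmt"
	"sort"
)

// Match is a hypothetical stand-in for a detected pattern occurrence.
type Match struct {
	Name    string
	Start   int  // 1-based, inclusive
	End     int  // 1-based, inclusive
	Errors  int  // number of mismatches
	Forward bool // false when the reverse complement of the pattern matched
}

// revComp returns the reverse complement of an upper-case DNA string.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N' // anything outside ACGT is treated as unknown
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// scan reports every window of seq that differs from pat by at most maxErr bases.
func scan(seq, pat, name string, maxErr int, forward bool) []Match {
	var hits []Match
	if pat == "" {
		return hits
	}
	for i := 0; i+len(pat) <= len(seq); i++ {
		errs := 0
		for j := 0; j < len(pat) && errs <= maxErr; j++ {
			if seq[i+j] != pat[j] {
				errs++
			}
		}
		if errs <= maxErr {
			hits = append(hits, Match{Name: name, Start: i + 1, End: i + len(pat), Errors: errs, Forward: forward})
		}
	}
	return hits
}

// locate searches both strands and, among overlapping hits, keeps the one with fewer errors.
func locate(seq, pat, name string, maxErr int) []Match {
	hits := scan(seq, pat, name, maxErr, true)
	hits = append(hits, scan(seq, revComp(pat), name, maxErr, false)...)
	sort.SliceStable(hits, func(a, b int) bool { return hits[a].Start < hits[b].Start })
	var kept []Match
	for _, h := range hits {
		if n := len(kept); n > 0 && h.Start <= kept[n-1].End {
			if h.Errors < kept[n-1].Errors {
				kept[n-1] = h
			}
			continue
		}
		kept = append(kept, h)
	}
	return kept
}

func main() {
	for _, m := range locate("GGACGTAGCCCCCAACGTGG", "ACGTTG", "primerA", 1) {
		fmt.Printf("%+v\n", m)
	}
}
```

In the sample input, an exact reverse-strand hit at positions 13 to 18 overlaps a one-mismatch forward hit at positions 15 to 20, so only the exact hit is retained alongside the forward hit at positions 3 to 8.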
## Sequence Splitting (`SplitPattern`)
Divides input sequences into fragments **between matched pattern pairs**, producing annotated output:
- Each fragment is labeled with:
  - `obisplit_frg`: Fragment number (1-based).
  - `obisplit_nfrg`: Total fragment count.
  - `obisplit_group`: Pattern-pair name (e.g., `"primerA-primerB"`), or `"extremity"` for terminal regions.
  - `obisplit_set`: Relevant pattern set (e.g., `"primerA"`), or `"NA"`.
  - `obisplit_location`: Genomic span (1-based, inclusive).
- Includes left/right pattern metadata: name, matched substring, and error count.
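As a rough illustration of the annotation scheme above, the sketch below cuts a sequence between already-located matches and attaches a subset of the `obisplit_*` keys. Several points are assumptions rather than documented behavior: the `Hit` and `Fragment` types, the exclusion of the matched patterns from the fragments, the skipping of empty regions, and the `start..end` text used for `obisplit_location`. The `obisplit_set` key and the left/right pattern metadata are omitted for brevity.

```go
package main

import "fmt"

// Hit is a hypothetical, simplified pattern match (1-based, inclusive).
type Hit struct {
	Name       string
	Start, End int
}

// Fragment pairs a subsequence with its annotations.
type Fragment struct {
	Seq   string
	Annot map[string]any
}

// split cuts seq between consecutive hits.
// hits must be sorted by position and must not overlap.
func split(seq string, hits []Hit) []Fragment {
	var frags []Fragment
	left, from := "", 1 // left pattern name ("" at the sequence start), first base of the region

	// emit records the region from..to, bounded on the right by pattern `right`.
	emit := func(to int, right string) {
		if to < from {
			return // nothing between two adjacent patterns
		}
		group := "extremity"
		if left != "" && right != "" {
			group = left + "-" + right
		}
		frags = append(frags, Fragment{
			Seq: seq[from-1 : to],
			Annot: map[string]any{
				"obisplit_group":    group,
				"obisplit_location": fmt.Sprintf("%d..%d", from, to), // assumed format
			},
		})
	}

	for _, h := range hits {
		emit(h.Start-1, h.Name)
		left, from = h.Name, h.End+1
	}
	emit(len(seq), "") // trailing extremity

	// Fragment numbering is only known once all fragments exist.
	for i := range frags {
		frags[i].Annot["obisplit_frg"] = i + 1
		frags[i].Annot["obisplit_nfrg"] = len(frags)
	}
	return frags
}

func main() {
	hits := []Hit{{"primerA", 5, 7}, {"primerB", 13, 15}}
	for _, f := range split("AAAACCCGGGGGTTTAA", hits) {
		fmt.Println(f.Seq, f.Annot)
	}
}
```

The two terminal regions are tagged `"extremity"`, while the region enclosed by both patterns receives the pair name `"primerA-primerB"`.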
## Pipeline Integration
- **`SplitPatternWorker`**: Wraps splitting logic as a reusable `SeqWorker`, compatible with OBITools4’s streaming infrastructure.
- **`CLISplitPipeline`**: CLI entry point integrating pattern detection and splitting into a parallelizable, configurable pipeline.
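The worker pattern described above can be sketched with local stand-in types. Nothing here is the OBITools4 API: `Seq`, `SeqWorker`, `splitWorker`, and `runParallel` are hypothetical names, and the worker cuts at exact tag occurrences only to keep the example short. The point is the shape: configuration is captured once in a closure, and the resulting function can be applied to many sequences concurrently.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Seq and SeqWorker are simplified stand-ins, not the OBITools4 types.
type Seq struct{ ID, Bases string }

type SeqWorker func(Seq) ([]Seq, error)

// splitWorker returns a worker that cuts each sequence at every exact
// occurrence of tag. The tag is captured once by the closure.
func splitWorker(tag string) SeqWorker {
	return func(s Seq) ([]Seq, error) {
		if tag == "" {
			return nil, fmt.Errorf("empty tag")
		}
		var out []Seq
		for i, part := range strings.Split(s.Bases, tag) {
			if part != "" {
				out = append(out, Seq{ID: fmt.Sprintf("%s_frg%d", s.ID, i+1), Bases: part})
			}
		}
		return out, nil
	}
}

// runParallel applies w to every input using n goroutines and keeps input order.
func runParallel(in []Seq, w SeqWorker, n int) ([][]Seq, error) {
	if n < 1 {
		n = 1
	}
	out := make([][]Seq, len(in))
	errs := make([]error, len(in))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for k := 0; k < n; k++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each goroutine writes to its own slot, so no lock is needed.
				out[i], errs[i] = w(in[i])
			}
		}()
	}
	for i := range in {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	for _, e := range errs {
		if e != nil {
			return nil, e
		}
	}
	return out, nil
}

func main() {
	reads := []Seq{{"r1", "AAATAGCCC"}, {"r2", "GGG"}}
	res, err := runParallel(reads, splitWorker("TAG"), 2)
	fmt.Println(res, err)
}
```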
## Configuration & Usage
- **CSV-based config**: Maps `tag` sequences to `pcr_pool` identifiers (required column: `tag`; optional column: `reverse_tag`).
- **CLI flags**:
  - `-C, --config`: Load pattern definitions from CSV.
  - `--template`: Output sample config for rapid setup.
  - `--pattern-error N`: Max mismatches allowed (default: 4).
  - `--allows-indels`: Enable insertion/deletion-aware matching.
- **Error handling**: Validates config structure, pattern compilation, and file access; logs fatal issues.
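To make the config validation concrete, here is a small sketch of reading such a CSV with Go's standard `encoding/csv` package. The exact file layout is an assumption: the sketch expects a header row and treats `pcr_pool` as a column name next to `tag` and `reverse_tag`. The authoritative layout is whatever `--template` prints. `PatternPair` and `parseConfig` are hypothetical names.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// PatternPair is a hypothetical in-memory form of one config row.
type PatternPair struct {
	Pool    string // value of the assumed pcr_pool column
	Forward string // value of the required tag column
	Reverse string // value of the optional reverse_tag column, "" if absent
}

// parseConfig reads CSV text with a header row and checks the required column.
func parseConfig(text string) ([]PatternPair, error) {
	rows, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, fmt.Errorf("reading config: %w", err)
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("config is empty")
	}

	// Map header names to column indexes.
	col := map[string]int{}
	for i, name := range rows[0] {
		col[strings.TrimSpace(name)] = i
	}
	tagCol, ok := col["tag"]
	if !ok {
		return nil, fmt.Errorf("missing required column %q", "tag")
	}
	poolCol, hasPool := col["pcr_pool"]
	revCol, hasRev := col["reverse_tag"]

	// encoding/csv rejects rows whose field count differs from the header,
	// so indexing by header position is safe here.
	var pairs []PatternPair
	for _, row := range rows[1:] {
		p := PatternPair{Forward: strings.TrimSpace(row[tagCol])}
		if hasPool {
			p.Pool = strings.TrimSpace(row[poolCol])
		}
		if hasRev {
			p.Reverse = strings.TrimSpace(row[revCol])
		}
		pairs = append(pairs, p)
	}
	return pairs, nil
}

func main() {
	cfg := "pcr_pool,tag,reverse_tag\npool1,ACGT,TTGA\npool2,GGCC,\n"
	pairs, err := parseConfig(cfg)
	fmt.Println(pairs, err)
}
```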
## Design Goals
Optimized for **high-throughput amplicon processing**, `obisplit` combines pattern detection and fragment extraction while making minimal assumptions about the tagging scheme, so it adapts to diverse molecular tagging designs.