mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
# `obingslibrary`: High-Throughput Sequencing Demultiplexing Library

`obingslibrary` is a Go package for **sample assignment in amplicon-based NGS workflows**, using dual-indexed barcodes flanked by PCR primers. It provides configurable demultiplexing of sequencing reads that tolerates mismatches and indels: it matches primer–tag patterns in each read and assigns samples via tag lookup.

---

## Core Functionalities

### 1. **Primer & Tag Configuration**

- `Marker`: Defines a primer pair (forward/reverse), including:
  - Primer sequences (`Forward`, `Reverse`) and reverse-complement variants.
  - Tag specifications: lengths, spacers (e.g., `N` or fixed nucleotides), delimiters.
  - Mismatch/indel tolerance per direction (`SetAllowedMismatch`, `SetTagIndels`).
- **Compilation**:
  - `Compile()` / `Compile2()`: Builds internal pattern indexes (via `obiapat.ApatPattern`) for fast, error-tolerant matching.
  - Supports `"strict"`, `"hamming"` (substitutions only), or `"indel"` (Levenshtein) matching modes.

### 2. **Sequence Matching & Demultiplexing**

- `Match(sequence)`: Scans a `BioSequence` for valid primer bindings:
  - Prioritizes forward-primer detection; falls back to reverse orientation.
  - Returns `DemultiplexMatch` with:
    - Primer positions, mismatches, orientation (`IsDirect`).
    - Barcode coordinates (`BarcodeStart`, `BarcodeEnd`) and validity flag.
- **Primer dimer detection**: If `BarcodeStart > BarcodeEnd`, the read is flagged as invalid.
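
The coordinate logic behind the dimer check can be sketched in plain Go. This is a minimal illustration, not the package's implementation: it handles the direct orientation only, tolerates substitutions only, and every name in it (`findPrimer`, `locate`, the reduced `match` struct) is hypothetical. The real matcher relies on `obiapat.ApatPattern`.

```go
package main

import "fmt"

// mismatches counts differing positions between two equal-length strings.
func mismatches(a, b string) int {
	n := 0
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			n++
		}
	}
	return n
}

// findPrimer returns the start of the first window of seq that matches
// primer with at most maxMismatch substitutions, or -1 if none does.
func findPrimer(seq, primer string, maxMismatch int) int {
	for i := 0; i+len(primer) <= len(seq); i++ {
		if mismatches(seq[i:i+len(primer)], primer) <= maxMismatch {
			return i
		}
	}
	return -1
}

// match is a hypothetical, reduced stand-in for DemultiplexMatch.
type match struct {
	BarcodeStart, BarcodeEnd int
	Valid                    bool
}

// locate finds both primer sites on a read in direct orientation.
// reverseRC is the reverse primer as it appears on the read, that is,
// already reverse-complemented. The boolean reports whether both sites exist.
func locate(seq, forward, reverseRC string, maxMismatch int) (match, bool) {
	f := findPrimer(seq, forward, maxMismatch)
	r := findPrimer(seq, reverseRC, maxMismatch)
	if f < 0 || r < 0 {
		return match{}, false
	}
	m := match{BarcodeStart: f + len(forward), BarcodeEnd: r}
	// Overlapping primer sites leave no insert: treat the read as a dimer.
	m.Valid = m.BarcodeStart <= m.BarcodeEnd
	return m, true
}

func main() {
	m, _ := locate("TTACGTACGGGGCCTTGCAAAA", "ACGTAC", "TTGCAA", 0)
	fmt.Println(m.BarcodeStart, m.BarcodeEnd, m.Valid) // 8 14 true

	d, _ := locate("GGACGTACTTGCGG", "ACGTAC", "ACTTGC", 0)
	fmt.Println(d.BarcodeStart, d.BarcodeEnd, d.Valid) // 8 6 false
}
```

In the second read the two primer sites overlap by two bases, so the barcode would start after it ends, which is the condition the library flags as invalid.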

### 3. **Tag Extraction & Annotation**

- `ExtractBarcode(sequence, inplace)`:
  - Extracts the barcode region between forward/reverse primers.
  - Reverse-complements the barcode if the read is in reverse orientation (`IsDirect == false`).
  - Annotates the sequence with:
    - Primer names, positions, mismatches.
    - Sample/experiment info (if tag assignment succeeds).
    - Error messages (`Unassigned`, `NoMatch`, etc.).
- **Tag extraction strategies**:
  - `Fixed`: Fixed-length tags.
  - `Delimited`: Tags flanked by exact delimiters (e.g., `"NN"`).
  - `Rescue`: Tolerates indels in delimiter or tag boundaries.
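
As a rough sketch of the extraction step, the code below slices a barcode out of a read, reverse-complements it when the read is not in direct orientation, and pulls a fixed-length tag from just upstream of a primer. It works on plain strings rather than `BioSequence`, and the function names and the `tag + spacer + primer` layout are assumptions made for illustration. The `Delimited` and `Rescue` strategies are not covered.

```go
package main

import "fmt"

// complement maps upper-case nucleotides to their complement.
var complement = map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}

// reverseComplement returns the reverse complement of s.
// Symbols outside A, C, G, T, N are replaced by N in this sketch.
func reverseComplement(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := complement[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// extractBarcode returns seq[start:end], reverse-complemented when the
// read is in reverse orientation (isDirect == false).
func extractBarcode(seq string, start, end int, isDirect bool) (string, error) {
	if start < 0 || end > len(seq) || start > end {
		return "", fmt.Errorf("invalid barcode coordinates [%d, %d)", start, end)
	}
	barcode := seq[start:end]
	if !isDirect {
		barcode = reverseComplement(barcode)
	}
	return barcode, nil
}

// fixedTag returns the tagLen bases that sit spacer bases upstream of the
// primer starting at primerStart (assumed layout: tag, spacer, primer).
func fixedTag(seq string, primerStart, tagLen, spacer int) (string, bool) {
	end := primerStart - spacer
	begin := end - tagLen
	if begin < 0 || end > len(seq) || begin > end {
		return "", false
	}
	return seq[begin:end], true
}

func main() {
	barcode, _ := extractBarcode("TTAACGTT", 2, 6, false)
	fmt.Println(barcode) // CGTT

	tag, ok := fixedTag("ACGTNNGGGG", 6, 4, 2)
	fmt.Println(tag, ok) // ACGT true
}
```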

### 4. **Sample Registration & Lookup**

- `GetPCR(tagPair)`: Retrieves the PCR reaction indexed by a tag pair (case-insensitive), registering a new one if none exists.
- `NGSLibrary.Markers`: Map of primer pairs → `Marker` objects.
  - Lazy initialization via `GetMarker()` for new primers.
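
The get-or-register behaviour described for `GetPCR` can be illustrated with a map keyed by a normalised tag pair. The types and the method below are hypothetical stand-ins, and lower-casing the key is one assumed way to obtain case-insensitive lookup. The library's own types and signatures may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// tagPair is a hypothetical key type: forward and reverse tags.
type tagPair struct{ Forward, Reverse string }

// pcr is a hypothetical record for one PCR reaction.
type pcr struct{ Experiment, Sample string }

// marker holds the PCR reactions registered for one primer pair.
type marker struct {
	samples map[tagPair]*pcr
}

// getPCR returns the reaction registered for the tag pair, creating an
// empty one if needed. The boolean reports whether it already existed.
func (m *marker) getPCR(forward, reverse string) (*pcr, bool) {
	key := tagPair{strings.ToLower(forward), strings.ToLower(reverse)}
	if p, ok := m.samples[key]; ok {
		return p, true
	}
	if m.samples == nil {
		m.samples = make(map[tagPair]*pcr) // lazy initialisation
	}
	p := &pcr{}
	m.samples[key] = p
	return p, false
}

func main() {
	m := &marker{}
	p, existed := m.getPCR("ACGT", "ttga")
	p.Sample = "sample_01"
	fmt.Println(existed) // false

	q, existed := m.getPCR("acgt", "TTGA")
	fmt.Println(existed, q.Sample) // true sample_01
}
```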

### 5. **Validation & Consistency Checks**

- `CheckTagLength()`: Ensures all registered tags have uniform length per direction.
- `CheckPrimerUnicity()`: Validates no primer is reused across markers; prevents self-complementary pairs.
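
A simplified version of these two checks might look as follows. The function names mirror the ones above in lower case but are otherwise hypothetical, the inputs are plain slices rather than library objects, and the primer check only detects exact reuse (case-insensitive). It does not test for reverse-complementarity.

```go
package main

import (
	"fmt"
	"strings"
)

// checkTagLength returns the common length of all tags for one direction,
// or an error naming the first tag that deviates from it.
func checkTagLength(tags []string) (int, error) {
	if len(tags) == 0 {
		return 0, nil
	}
	want := len(tags[0])
	for _, t := range tags[1:] {
		if len(t) != want {
			return 0, fmt.Errorf("tag %q has length %d, expected %d", t, len(t), want)
		}
	}
	return want, nil
}

// primerPair is a hypothetical forward/reverse primer pair.
type primerPair struct{ Forward, Reverse string }

// checkPrimerUnicity fails if the same primer sequence appears twice,
// either across pairs or as both members of one pair.
func checkPrimerUnicity(pairs []primerPair) error {
	seen := make(map[string]int) // primer -> index of the pair using it
	for i, p := range pairs {
		for _, primer := range []string{p.Forward, p.Reverse} {
			key := strings.ToLower(primer)
			if j, dup := seen[key]; dup {
				return fmt.Errorf("primer %q used by pairs %d and %d", key, j, i)
			}
			seen[key] = i
		}
	}
	return nil
}

func main() {
	n, err := checkTagLength([]string{"acgtacgt", "ttgacctg"})
	fmt.Println(n, err) // 8 <nil>

	err = checkPrimerUnicity([]primerPair{
		{"GGTCAACAAATCATAAAGATATTGG", "TAAACTTCAGGGTGACCAAAAAATCA"},
		{"ggtcaacaaatcataaagatattgg", "ACTTTCGTTCTTGATTAA"},
	})
	fmt.Println(err != nil) // true: the forward primer is reused
}
```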

### 6. **Batch Processing & Parallelism**

- `ExtractBarcodeSlice(sequences, options)`: Processes a slice of reads.
  - Configurable via `Options` (fluent API):
    - Mismatch/indel budgets.
    - Error handling (`discardErrors`, `OptionUnidentified`).
    - Parallel workers, batch size.
- `ExtractBarcodeSliceWorker()`: Returns a reusable worker for concurrent pipelines.
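
The combination of functional option setters and a pool of workers can be sketched generically. Nothing below is the package's API: the option names (`withWorkers`, `withDiscardErrors`, `withMaxMismatch`), the defaults, and the toy `firstFour` step that stands in for barcode extraction are all assumptions chosen to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// options collects the settings; option is a functional setter.
type options struct {
	maxMismatch   int
	discardErrors bool
	workers       int
}

type option func(*options)

func withMaxMismatch(n int) option    { return func(o *options) { o.maxMismatch = n } }
func withDiscardErrors(b bool) option { return func(o *options) { o.discardErrors = b } }
func withWorkers(n int) option        { return func(o *options) { o.workers = n } }

// makeOptions applies setters on top of defaults (arbitrary in this sketch).
func makeOptions(setters ...option) options {
	o := options{maxMismatch: 2, discardErrors: true, workers: 1}
	for _, set := range setters {
		set(&o)
	}
	return o
}

// firstFour is a toy per-read step standing in for barcode extraction.
func firstFour(read string) (string, error) {
	if len(read) < 4 {
		return "", fmt.Errorf("read %q is shorter than 4 bases", read)
	}
	return read[:4], nil
}

// processSlice applies fn to every read with opt.workers goroutines and
// returns the results in input order. Failed reads are dropped when
// discardErrors is set and passed through unchanged otherwise.
func processSlice(reads []string, fn func(string) (string, error), opt options) []string {
	type result struct {
		value string
		err   error
	}
	results := make([]result, len(reads))
	jobs := make(chan int)
	workers := opt.workers
	if workers < 1 {
		workers = 1
	}
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				v, err := fn(reads[i])
				results[i] = result{v, err} // each index is written once
			}
		}()
	}
	for i := range reads {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	out := make([]string, 0, len(reads))
	for i, r := range results {
		switch {
		case r.err == nil:
			out = append(out, r.value)
		case !opt.discardErrors:
			out = append(out, reads[i])
		}
	}
	return out
}

func main() {
	reads := []string{"ACGTAA", "AC", "TTGACC"}
	fmt.Println(processSlice(reads, firstFour, makeOptions(withWorkers(3), withMaxMismatch(1))))
	// [ACGT TTGA]
	fmt.Println(processSlice(reads, firstFour, makeOptions(withDiscardErrors(false))))
	// [ACGT AC TTGA]
}
```

Writing each result into its own slot of a pre-sized slice keeps the output in input order without extra synchronisation beyond the `sync.WaitGroup`.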

### 7. **Distance Metrics**

- `Hamming(s1, s2)`: Counts mismatches between equal-length strings.
- `Levenshtein(s1, s2)`: Computes edit distance (supports indels).
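
For reference, both metrics can be written in a few lines. These are textbook implementations, not the package's code: the lower-case names, the error returned for unequal lengths, and the two-row dynamic-programming layout are choices made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// hamming counts the positions at which two equal-length strings differ.
func hamming(s1, s2 string) (int, error) {
	if len(s1) != len(s2) {
		return 0, errors.New("hamming: strings differ in length")
	}
	d := 0
	for i := 0; i < len(s1); i++ {
		if s1[i] != s2[i] {
			d++
		}
	}
	return d, nil
}

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the minimum number of substitutions, insertions and
// deletions that turn s1 into s2, keeping only two rows of the DP table.
func levenshtein(s1, s2 string) int {
	prev := make([]int, len(s2)+1)
	curr := make([]int, len(s2)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of s1
	}
	for i := 1; i <= len(s1); i++ {
		curr[0] = i
		for j := 1; j <= len(s2); j++ {
			cost := 1
			if s1[i-1] == s2[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(s2)]
}

func main() {
	d, _ := hamming("ACGTAC", "ACGAAC")
	fmt.Println(d) // 1

	fmt.Println(levenshtein("ACGTAC", "ACTAC")) // 1 (one deletion)
}
```

The second call shows why an indel-aware metric matters for tags: a single deleted base is one edit under Levenshtein, while Hamming distance is not even defined for strings of different lengths.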

### 8. **Sample Identification**

- `TagExtractor`: Extracts forward/reverse tags from primer-flanked regions.
- `SampleIdentifier`:
  - Matches extracted tags to known samples using configured strategy (`strict`, `hamming`, or `indel`).
  - Returns best-matching sample, distance, and proposed tags.
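
One plausible way to realise the `hamming` strategy is a nearest-tag search with a distance budget. The sketch below is an assumption-laden illustration: the `assignment` type, the `identify` function, the single-tag (rather than tag-pair) lookup, and the rule that ties are rejected as ambiguous are not taken from the library.

```go
package main

import "fmt"

// hamming returns the number of mismatches, or -1 if the lengths differ.
func hamming(a, b string) int {
	if len(a) != len(b) {
		return -1
	}
	d := 0
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			d++
		}
	}
	return d
}

// assignment is a hypothetical result: the proposed tag, its sample,
// the distance to the observed tag, and whether the match is accepted.
type assignment struct {
	Sample   string
	Tag      string
	Distance int
	OK       bool
}

// identify finds the known tag closest to observed. A match is accepted
// only if it is within maxDist and no other tag is equally close.
func identify(observed string, samples map[string]string, maxDist int) assignment {
	best := assignment{Distance: maxDist + 1}
	ties := 0
	for tag, sample := range samples {
		d := hamming(observed, tag)
		if d < 0 {
			continue // incomparable under substitutions only
		}
		switch {
		case d < best.Distance:
			best = assignment{Sample: sample, Tag: tag, Distance: d}
			ties = 1
		case d == best.Distance:
			ties++
		}
	}
	best.OK = ties == 1 && best.Distance <= maxDist
	return best
}

func main() {
	samples := map[string]string{"ACGT": "sample_01", "TTGA": "sample_02"}

	a := identify("ACGA", samples, 1)
	fmt.Println(a.Sample, a.Tag, a.Distance, a.OK) // sample_01 ACGT 1 true

	b := identify("GGGG", samples, 1)
	fmt.Println(b.OK) // false: every known tag is more than 1 mismatch away
}
```

A `strict` strategy corresponds to `maxDist = 0`, and an `indel` strategy would swap the distance function for an edit distance.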

---

## Design Highlights

- **Memory-efficient**: Uses reference-counted sequences (`Recycle()`).
- **Error-aware**: Rich error propagation (stored in `DemultiplexMatch.Error` or annotations).
- **Flexible tag design**: Supports fixed, delimited (exact), and indel-resilient tags.
- **Extensible via options**: Functional setters for clean, testable configuration.

---

## Use Case

The package is intended for **metabarcoding or targeted amplicon sequencing**, where samples are multiplexed using unique dual barcodes. Unique tag pairs provide specificity, and error-tolerant matching provides sensitivity.