Files
obitools4/autodoc/docmd/pkg_obingslibrary.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

78 lines
3.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# `obingslibrary`: High-Throughput Sequencing Demultiplexing Library
`obingslibrary` is a Go package for **sample assignment in amplicon-based NGS workflows**, using dual-indexed barcodes flanked by PCR primers. It enables robust, configurable demultiplexing of sequencing reads—even in the presence of errors or indels—by matching primertag patterns and assigning samples via tag lookup.
---
## Core Functionalities
### 1. **Primer & Tag Configuration**
- `Marker`: Defines a primer pair (forward/reverse), including:
- Primer sequences (`Forward`, `Reverse`) and reverse-complement variants.
- Tag specifications: lengths, spacers (e.g., `N` or fixed nucleotides), delimiters.
- Mismatch/indel tolerance per direction (`SetAllowedMismatch`, `SetTagIndels`).
- **Compilation**:
- `Compile()` / `Compile2()`: Builds internal pattern indexes (via `obiapat.ApatPattern`) for fast, error-tolerant matching.
- Supports `"strict"`, `"hamming"` (substitutions only), or `"indel"` (Levenshtein) matching modes.
### 2. **Sequence Matching & Demultiplexing**
- `Match(sequence)`: Scans a `BioSequence` for valid primer bindings:
- Prioritizes forward-primer detection; falls back to reverse orientation.
- Returns `DemultiplexMatch` with:
- Primer positions, mismatches, orientation (`IsDirect`).
- Barcode coordinates (`BarcodeStart`, `BarcodeEnd`) and validity flag.
- **Primer dimer detection**: If `BarcodeStart > BarcodeEnd`, the read is flagged as invalid.
### 3. **Tag Extraction & Annotation**
- `ExtractBarcode(sequence, inplace)`:
- Extracts the barcode region between forward/reverse primers.
- Reverse-complements if read is in reverse orientation (`IsDirect == false`).
- Annotates the sequence with:
- Primer names, positions, mismatches.
- Sample/experiment info (if tag assignment succeeds).
- Error messages (`Unassigned`, `NoMatch`, etc.).
- **Tag extraction strategies**:
- `Fixed`: Fixed-length tags.
- `Delimited`: Tags flanked by exact delimiters (e.g., `"NN"`).
- `Rescue`: Tolerates indels in delimiter or tag boundaries.
### 4. **Sample Registration & Lookup**
- `GetPCR(tagPair)`: Retrieves or registers a new PCR reaction indexed by tag pair (case-insensitive).
- `NGSLibrary.Markers`: Map of primer pairs → `Marker` objects.
- Lazy initialization via `GetMarker()` for new primers.
### 5. **Validation & Consistency Checks**
- `CheckTagLength()`: Ensures all registered tags have uniform length per direction.
- `CheckPrimerUnicity()`: Validates no primer is reused across markers; prevents self-complementary pairs.
### 6. **Batch Processing & Parallelism**
- `ExtractBarcodeSlice(sequences, options)`: Processes a slice of reads.
- Configurable via `Options` (fluent API):
- Mismatch/indel budgets.
- Error handling (`discardErrors`, `OptionUnidentified`).
- Parallel workers, batch size.
- `ExtractBarcodeSliceWorker()`: Returns a reusable worker for concurrent pipelines.
### 7. **Distance Metrics**
- `Hamming(s1, s2)`: Counts mismatches between equal-length strings.
- `Levenshtein(s1, s2)`: Computes edit distance (supports indels).
### 8. **Sample Identification**
- `TagExtractor`: Extracts forward/reverse tags from primer-flanked regions.
- `SampleIdentifier`:
- Matches extracted tags to known samples using configured strategy (`strict`, `hamming`, or `indel`).
- Returns best-matching sample, distance, and proposed tags.
---
## Design Highlights
- **Memory-efficient**: Uses reference-counted sequences (`Recycle()`).
- **Error-aware**: Rich error propagation (stored in `DemultiplexMatch.Error` or annotations).
- **Flexible tag design**: Supports fixed, delimited (exact), and indel-resilient tags.
- **Extensible via options**: Functional setters for clean, testable configuration.
---
## Use Case
Ideal for **metabarcoding or targeted amplicon sequencing**, where samples are multiplexed using unique dual barcodes. Ensures high specificity (unique tag pairs) and sensitivity (error-tolerant matching).