mirror of
https://github.com/metabarcoding/obitools4.git
obimicrosat: Microsatellite Detection Module for OBITools4
This Go package provides a modular, CLI-integrated framework to detect and annotate simple sequence repeats (SSRs), also known as microsatellites, in biological DNA sequences. It is designed for integration into sequence processing pipelines—especially those focused on marker discovery, PCR primer design, or genomic feature annotation.
Core Capabilities
1. Flexible Microsatellite Detection
- Detects tandem repeats of DNA motifs (units) with user-defined constraints:
  - Unit length range (`minUnitLength` to `maxUnitLength`, typically 1–6 bp)
  - Minimum repeat count (`minUnits`)
  - Total microsatellite length threshold (`minLength`)
- Uses robust regex-based scanning via `regexp2`, followed by precise boundary refinement.
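The constraints above can be illustrated with a minimal sketch. It deliberately swaps the package's `regexp2` scan for a plain loop so that it runs with the standard library only; `Microsat` and `FindMicrosat` are hypothetical names chosen for this example, not the package's API, and the half-open coordinates are an assumption.

```go
package main

import "fmt"

// Microsat describes one detected repeat (illustrative type, not the package's).
type Microsat struct {
	From, To  int    // half-open interval [From, To) in the sequence
	Unit      string // repeated motif as found in the sequence
	UnitCount int    // number of complete copies of Unit
}

// FindMicrosat returns the leftmost tandem repeat whose unit length lies in
// [minUnitLen, maxUnitLen], repeated at least minUnits times and spanning at
// least minLength bases. It is a brute-force scan, not the regexp2 approach.
func FindMicrosat(seq string, minUnitLen, maxUnitLen, minUnits, minLength int) (Microsat, bool) {
	if minUnitLen < 1 {
		minUnitLen = 1 // a zero-length unit would never advance the scan
	}
	for start := 0; start < len(seq); start++ {
		for ul := minUnitLen; ul <= maxUnitLen && start+ul <= len(seq); ul++ {
			unit := seq[start : start+ul]
			count := 1
			// Count consecutive copies of the unit following the first one.
			for pos := start + ul; pos+ul <= len(seq) && seq[pos:pos+ul] == unit; pos += ul {
				count++
			}
			if count >= minUnits && count*ul >= minLength {
				return Microsat{From: start, To: start + count*ul, Unit: unit, UnitCount: count}, true
			}
		}
	}
	return Microsat{}, false
}

func main() {
	seq := "TTGG" + "ACACACACAC" + "ACACACACAC" + "GGTA"
	if m, ok := FindMicrosat(seq, 1, 6, 5, 20); ok {
		fmt.Printf("unit=%s count=%d span=[%d,%d)\n", m.Unit, m.UnitCount, m.From, m.To)
	}
}
```

Because shorter units are tried first, a homopolymer run is reported with a 1 bp unit rather than as a repeat of `AA`. The sketch does not reduce a unit such as `ACAC` to `AC` when `minUnitLen` is raised above the true period.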
2. Canonical Unit Normalization
- Determines the lexicographically smallest rotation of the detected unit.
- Optionally computes its reverse complement to define orientation (`direct` or `reverse`).
- If enabled, reorients the full microsatellite region to its canonical (smallest) form.
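As a rough illustration of the normalization described above, the sketch below picks the smallest rotation of a unit and compares it with the smallest rotation of its reverse complement. The function names and the tie-breaking rule (prefer `direct` when both are equal) are assumptions made for this example; the package's internal logic may differ.

```go
package main

import "fmt"

// MinRotation returns the lexicographically smallest rotation of unit.
func MinRotation(unit string) string {
	best := unit
	doubled := unit + unit // every rotation is a window of the doubled string
	for i := 1; i < len(unit); i++ {
		if r := doubled[i : i+len(unit)]; r < best {
			best = r
		}
	}
	return best
}

// ReverseComplement returns the reverse complement of an upper-case DNA
// string; any symbol other than A, C, G, T is mapped to N.
func ReverseComplement(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c, ok := comp[s[i]]
		if !ok {
			c = 'N'
		}
		out[len(s)-1-i] = c
	}
	return string(out)
}

// CanonicalUnit returns the smaller of the two normalized forms together
// with the orientation it came from ("direct" or "reverse").
func CanonicalUnit(unit string) (string, string) {
	direct := MinRotation(unit)
	reverse := MinRotation(ReverseComplement(unit))
	if reverse < direct {
		return reverse, "reverse"
	}
	return direct, "direct"
}

func main() {
	for _, u := range []string{"CA", "TG", "GATA"} {
		norm, orient := CanonicalUnit(u)
		fmt.Println(u, norm, orient)
	}
}
```

With this rule, `CA` and `TG` both normalize to `AC`, the first as `direct` and the second as `reverse`, so repeats found on either strand are grouped under one canonical unit.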
3. Flanking Sequence Validation
- Ensures sufficient unique sequence on both sides of the repeat (`minflankLength`).
- Stores flanking regions as `microsat_left` and `microsat_right`.
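The flank check can be sketched as follows, assuming the repeat occupies the half-open interval `[from, to)` of the sequence. `Flanks` is a hypothetical helper written for this example and only checks flank length, not uniqueness.

```go
package main

import "fmt"

// Flanks splits seq around the repeat located at [from, to) and reports
// whether both flanks are at least minFlank bases long.
func Flanks(seq string, from, to, minFlank int) (left, right string, ok bool) {
	if from < 0 || to > len(seq) || from > to {
		return "", "", false // coordinates do not describe a valid region
	}
	left, right = seq[:from], seq[to:]
	if len(left) < minFlank || len(right) < minFlank {
		return "", "", false
	}
	return left, right, true
}

func main() {
	seq := "TTGG" + "ACACACACAC" + "ACACACACAC" + "GGTA"
	left, right, ok := Flanks(seq, 4, 24, 4)
	fmt.Println(left, right, ok)
}
```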
4. Structured Annotation Output
Each detected microsatellite enriches the input BioSequence with standardized attributes:
- `microsat_unit_length`, `microsat_unit_count`
- `seq_length` (full repeat region length), `microsat` (repeat sequence)
- Positions: `microsat_from`, `microsat_to`
- Canonical unit: `microsat_unit_normalized`
- Orientation flag (`direct`/`reverse`) and flanks
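To make the attribute set concrete, here is a sketch that assembles these keys into a plain Go map. It is an assumption that positions are reported 1-based and inclusive, and the key used for the orientation flag is not stated in this document, so it is left out. `Annotate` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Annotate builds the attribute map for a repeat located at the half-open
// interval [from, to) of seq. unit must be non-empty.
func Annotate(seq string, from, to int, unit, normalized string) map[string]any {
	region := seq[from:to]
	return map[string]any{
		"microsat_unit_length":     len(unit),
		"microsat_unit_count":      len(region) / len(unit),
		"seq_length":               len(region),
		"microsat":                 region,
		"microsat_from":            from + 1, // assumed 1-based start
		"microsat_to":              to,       // assumed 1-based inclusive end
		"microsat_unit_normalized": normalized,
		"microsat_left":            seq[:from],
		"microsat_right":           seq[to:],
	}
}

func main() {
	seq := "TTGG" + "ACACACACAC" + "ACACACACAC" + "GGTA"
	for key, value := range Annotate(seq, 4, 24, "AC", "AC") {
		fmt.Printf("%s=%v\n", key, value)
	}
}
```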
5. CLI Integration & Pipeline Compatibility
- `MicroSatelliteOptionSet()` registers all detection parameters for CLI use (via `go-getoptions`).
- Supported flags:
  - `-m, --min-unit-length`: min unit size (default: `1`)
  - `-M, --max-unit-length`: max unit size (default: `6`)
  - `--min-unit-count`: min repeat count (default: `5`)
  - `-l, --min-length`: total SSR length threshold (default: `20`)
  - `-f, --min-flank-length`: required flanking length (default: `0`)
  - `-n, --not-reoriented`: disable sequence reorientation
- Helper functions (e.g., `CLIMinUnitCount()`, `CLIReoriented()`) expose runtime config.
- `MakeMicrosatWorker()` returns a reusable `SeqWorker` for parallel, iterator-based processing.
- `CLIAnnotateMicrosat()` integrates the worker into a conversion pipeline, filtering out sequences without qualifying SSRs.
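The flag set and its defaults can be mirrored in a small standalone sketch. It uses Go's standard `flag` package in place of `go-getoptions` so that it runs without external dependencies; `Options` and `ParseOptions` are hypothetical names, and the validation rule at the end is an assumption rather than documented behavior.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// Options holds the detection parameters listed above (illustrative type).
type Options struct {
	MinUnitLength, MaxUnitLength int
	MinUnitCount, MinLength      int
	MinFlankLength               int
	Reoriented                   bool
}

// ParseOptions parses args using the documented flag names and defaults.
func ParseOptions(args []string) (Options, error) {
	var o Options
	var notReoriented bool
	fs := flag.NewFlagSet("obimicrosat", flag.ContinueOnError)
	// Long and short spellings are bound to the same variable.
	fs.IntVar(&o.MinUnitLength, "min-unit-length", 1, "min unit size")
	fs.IntVar(&o.MinUnitLength, "m", 1, "shorthand for --min-unit-length")
	fs.IntVar(&o.MaxUnitLength, "max-unit-length", 6, "max unit size")
	fs.IntVar(&o.MaxUnitLength, "M", 6, "shorthand for --max-unit-length")
	fs.IntVar(&o.MinUnitCount, "min-unit-count", 5, "min repeat count")
	fs.IntVar(&o.MinLength, "min-length", 20, "total SSR length threshold")
	fs.IntVar(&o.MinLength, "l", 20, "shorthand for --min-length")
	fs.IntVar(&o.MinFlankLength, "min-flank-length", 0, "required flanking length")
	fs.IntVar(&o.MinFlankLength, "f", 0, "shorthand for --min-flank-length")
	fs.BoolVar(&notReoriented, "not-reoriented", false, "disable reorientation")
	fs.BoolVar(&notReoriented, "n", false, "shorthand for --not-reoriented")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	o.Reoriented = !notReoriented
	// Assumed sanity check: the unit length range must be non-empty.
	if o.MinUnitLength < 1 || o.MaxUnitLength < o.MinUnitLength {
		return o, errors.New("invalid unit length range")
	}
	return o, nil
}

func main() {
	o, err := ParseOptions([]string{"--min-unit-count", "8", "-n"})
	fmt.Printf("%+v %v\n", o, err)
}
```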
6. Dependencies & Ecosystem Integration
- Built on `obitools4` core types (`BioSequence`, iterators, default annotation schema).
- Its only external dependency is `github.com/dlclark/regexp2`, used for advanced regex support.
- Fully compatible with the existing `obiconvert.OptionSet`; extends it via `OptionSet()`.
Use Cases
- Identification of polymorphic SSR markers for population genetics.
- Preprocessing step in PCR primer design tools (to avoid repeat-rich regions).
- Quality control: flagging low-complexity sequences in NGS data.
Note: Only public APIs are documented. Internal helpers (e.g., `min_unit`, rotation logic) remain implementation details.