mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,205 @@

# NAME

obimicrosat - looks for microsatellite sequences in a sequence file

---

# SYNOPSIS

```
obimicrosat [options] [<filename>...]
```

---

# DESCRIPTION

`obimicrosat` scans DNA sequences for simple sequence repeats (SSRs), also called
microsatellites: tandem repetitions of a short motif (1–6 bp by default). For each
sequence containing a qualifying repeat, the command annotates it with the location,
unit sequence, repeat count, and flanking regions, then writes it to output. Sequences
with no detected microsatellite are silently discarded.

Detection works in two passes. A first regular expression finds any tandem repeat
satisfying the unit-length and repeat-count constraints. The true minimal repeat unit
is then determined, and a second scan refines the exact boundaries. The repeat unit is
normalized to the lexicographically smallest string among all rotations of the unit
and of its reverse complement, which allows equivalent loci to be grouped consistently
across samples.
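
The two-pass idea can be sketched in Python, whose `re` module supports backreferences. This is only an illustration under stated assumptions, not the tool's actual code: the function names, the 0-based return coordinates, and the choice to reject a match whose minimal unit is shorter than the requested minimum are all hypothetical.

```python
import re


def minimal_unit(motif):
    """Return the shortest prefix whose repetition rebuilds the motif."""
    for size in range(1, len(motif) + 1):
        if len(motif) % size == 0 and motif[:size] * (len(motif) // size) == motif:
            return motif[:size]
    return motif


def find_microsatellite(seq, min_unit=1, max_unit=6, min_count=5, min_length=20):
    """Return (start, end, unit) of the first qualifying repeat, or None.

    Coordinates are 0-based with an exclusive end (an assumption of this sketch).
    """
    seq = seq.lower()
    # Pass 1: any motif of min_unit..max_unit bases repeated at least min_count times.
    coarse = re.compile(r"([acgt]{%d,%d})\1{%d,}" % (min_unit, max_unit, min_count - 1))
    hit = coarse.search(seq)
    if hit is None:
        return None
    # The greedy motif may itself be periodic ("acacac"), so reduce it.
    unit = minimal_unit(hit.group(1))
    if len(unit) < min_unit:
        return None  # assumption: such matches are rejected
    # Pass 2: rescan with the minimal unit to get the exact boundaries.
    fine = re.compile(r"(?:%s){%d,}" % (unit, min_count))
    refined = fine.search(seq, hit.start())
    if refined is None or refined.end() - refined.start() < min_length:
        return None
    return refined.start(), refined.end(), unit


if __name__ == "__main__":
    left = "agtcgaacttgcatgccttcagggcaagtctagcttacg"
    right = "cgatagtcatgcaagtcttgcggcatagatcgttacca"
    print(find_microsatellite(left + "ac" * 16 + right))
```

On this input the first pass matches the greedy 6 bp motif `acacac` over 30 bp; the second pass, using the minimal unit `ac`, extends the match to the full 32 bp repeat.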

By default, when the canonical form of a unit requires the reverse complement, the
whole sequence is reoriented so that the microsatellite is always reported on the
direct strand of the normalized unit. This behaviour can be disabled with
`--not-reoriented`.
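
A minimal Python sketch of the normalization and reorientation rule described above. The helper names and the `keep_orientation` parameter are hypothetical, and resolving ties in favour of the direct strand is an assumption of this sketch.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse-complement a lowercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(unit):
    """All cyclic rotations of a repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def normalize_unit(unit):
    """Return (canonical_unit, orientation).

    The canonical unit is the smallest string among the rotations of the
    unit and the rotations of its reverse complement.
    """
    direct = min(rotations(unit))
    reverse = min(rotations(reverse_complement(unit)))
    if reverse < direct:
        return reverse, "reverse"
    return direct, "direct"  # assumption: ties stay on the direct strand


def reorient(seq_id, seq, unit, keep_orientation=False):
    """Reverse-complement the record when the canonical unit is on the reverse strand."""
    _, orientation = normalize_unit(unit)
    if orientation == "reverse" and not keep_orientation:
        return seq_id + "_cmp", reverse_complement(seq)
    return seq_id, seq
```

With these definitions, `normalize_unit("gt")` gives `("ac", "reverse")`, so a GT-repeat record is reverse-complemented and renamed with the `_cmp` suffix unless `keep_orientation` is set.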

A common use case is identifying polymorphic SSR markers for population genetics, or
flagging repeat-rich regions before designing PCR primers.

---

# INPUT

Accepts one or more sequence files on the command line. If no file is given, sequences
are read from standard input. Supported formats include FASTA, FASTQ, JSON/OBI, GenBank,
EMBL, ecoPCR output, and CSV. Compressed files (gzip) are handled transparently.
Format is detected automatically unless overridden by input flags.

---

# OUTPUT

Outputs only the sequences in which a microsatellite was found. Each retained sequence
carries the following additional attributes:

| Attribute | Content |
|---|---|
| `microsat` | Full repeat region as a string |
| `microsat_from` | 1-based start position of the repeat |
| `microsat_to` | End position of the repeat (inclusive) |
| `microsat_unit` | Repeat unit as observed in the sequence |
| `microsat_unit_normalized` | Lexicographically smallest canonical form |
| `microsat_unit_orientation` | `direct` or `reverse` |
| `microsat_unit_length` | Length of the repeat unit (bp) |
| `microsat_unit_count` | Number of complete unit repetitions |
| `seq_length` | Total length of the (possibly reoriented) sequence |
| `microsat_left` | Flanking sequence to the left of the repeat |
| `microsat_right` | Flanking sequence to the right of the repeat |

When a sequence is reoriented (reverse-complemented), `_cmp` is appended to its
identifier.

The output format follows the same rules as the rest of OBITools4: FASTQ when quality
scores are present, FASTA or JSON/OBI otherwise, configurable via output flags.

## Observed output example

```
>seq001 {"definition":"dinucleotide AC repeat 16x with 40bp non-repetitive flanks","microsat":"acacacacacacacacacacacacacacacac","microsat_from":40,"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg","microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca","microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"direct","seq_length":109}
agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca
cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca
>seq006_cmp {"definition":"GT repeat 16x with 40bp non-repetitive flanks canonical form is AC","microsat":"acacacacacacacacacacacacacacacac","microsat_from":39,"microsat_left":"tggtaacgatctatgccgcaagacttgcatgactatcg","microsat_right":"cgtaagctagacttgccctgaaggcatgcaagttcgact","microsat_to":70,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"reverse","seq_length":109}
tggtaacgatctatgccgcaagacttgcatgactatcgacacacacacacacacacacac
acacacacaccgtaagctagacttgccctgaaggcatgcaagttcgact
```
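
The JSON-encoded header shown above can be read back with a few lines of Python. The sketch below assumes the header layout of this example (`>id {json}`) and uses an abridged copy of the `seq001` header; `parse_header` and `check_coordinates` are hypothetical helper names, not part of OBITools4.

```python
import json


def parse_header(header):
    """Split a '>id {json}' FASTA title line into (id, annotations)."""
    ident, _, annotation = header.lstrip(">").partition(" ")
    return ident, (json.loads(annotation) if annotation else {})


def check_coordinates(annotations, sequence):
    """Check that the 1-based inclusive coordinates agree with the other attributes."""
    start = annotations["microsat_from"]  # 1-based
    end = annotations["microsat_to"]      # inclusive
    return (
        sequence[start - 1:end] == annotations["microsat"]
        and sequence[:start - 1] == annotations["microsat_left"]
        and sequence[end:] == annotations["microsat_right"]
    )


# Abridged header of seq001 from the example (the "definition" field is omitted).
HEADER = (
    '>seq001 {"microsat":"acacacacacacacacacacacacacacacac",'
    '"microsat_from":40,'
    '"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg",'
    '"microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca",'
    '"microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,'
    '"microsat_unit_length":2,"seq_length":109}'
)
SEQUENCE = (
    "agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca"
    "cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca"
)

if __name__ == "__main__":
    ident, annotations = parse_header(HEADER)
    print(ident, check_coordinates(annotations, SEQUENCE))
```

Because `microsat_from` is 1-based and `microsat_to` is inclusive, the repeat corresponds to the Python slice `sequence[from - 1:to]`.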

---

# OPTIONS

## Microsatellite detection

**`--min-unit-length` / `-m`**
- Default: `1`
- Minimum length in base pairs of the repeated motif. Set to `2` to exclude
  mononucleotide repeats, `3` to also exclude dinucleotide repeats, etc.

**`--max-unit-length` / `-M`**
- Default: `6`
- Maximum length in base pairs of the repeated motif. Increasing this value also
  detects longer repeat units (minisatellites), at the cost of a more complex search
  pattern.

**`--min-unit-count`**
- Default: `5`
- Minimum number of times the motif must be repeated. A value of `5` with a 2 bp unit
  requires at least 10 bp of pure repeat.

**`--min-length` / `-l`**
- Default: `20`
- Minimum total length (in bp) of the repeat region. This filter applies after the
  unit-count filter and is useful for excluding very short repeats that still satisfy
  the unit-count threshold.

**`--min-flank-length` / `-f`**
- Default: `0`
- Minimum length of the flanking sequence on each side of the repeat. Sequences with
  flanks shorter than this threshold are discarded, which is useful when the output
  will feed a primer-design step.

**`--not-reoriented` / `-n`**
- Default: `false` (reorientation is active by default)
- When set, sequences are never reverse-complemented to match the canonical orientation
  of the repeat unit. The microsatellite is reported as found, in its original
  orientation.
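
How the count, length, and flank thresholds above interact can be worked out with a little arithmetic. The Python sketch below is illustrative only: it assumes the repeat region consists of whole units, so that its length is the unit length times the unit count, and both function names are hypothetical.

```python
import math


def effective_min_count(unit_length, min_unit_count=5, min_length=20):
    """Smallest repeat count that satisfies both the count and the length filters."""
    return max(min_unit_count, math.ceil(min_length / unit_length))


def passes_filters(unit_length, unit_count, left_flank, right_flank,
                   min_unit_count=5, min_length=20, min_flank_length=0):
    """Apply the count, total-length, and flank-length thresholds in turn."""
    if unit_count < min_unit_count:
        return False
    if unit_length * unit_count < min_length:  # assumes whole units only
        return False
    return min(left_flank, right_flank) >= min_flank_length


if __name__ == "__main__":
    for unit_length in range(1, 7):
        print(unit_length, effective_min_count(unit_length))
```

Under the default settings this gives an effective minimum of 20 repeats for a 1 bp unit, 10 for 2 bp, 7 for 3 bp, and 5 for units of 4 to 6 bp, which is why a 2 bp unit repeated 5 times (10 bp) passes `--min-unit-count` but not `--min-length`.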

## Input / output

Inherited from the standard OBITools4 conversion layer. Common flags include:

- **`--input-OBI-header`**: parse OBI-style FASTA/FASTQ headers.
- **`--input-json-header`**: parse JSON-encoded headers.
- **`--skip-empty`**: skip sequences with no nucleotides.
- **`--u-to-t`**: convert U to T (RNA → DNA).
- **`--output-json-header`**: write JSON-encoded headers.
- **`--output-OBI-header`**: write OBI-style headers.
- **`--gzip`**: compress output with gzip.
- **`--workers` / `-p`**: number of parallel processing workers.

---

# EXAMPLES

```bash
# Detect default microsatellites (unit 1–6 bp, ≥5 repeats, ≥20 bp total)
obimicrosat sequences.fasta > out_default.fasta
```

**Expected output:** 6 sequences written to `out_default.fasta`.

```bash
# Restrict to di- and trinucleotide repeats only
obimicrosat -m 2 -M 3 sequences.fasta > out_dinucleotide.fasta
```

**Expected output:** 4 sequences written to `out_dinucleotide.fasta`
(mononucleotide and tetranucleotide repeats excluded).

```bash
# Require at least 30 bp flanking sequence on each side (for primer design)
obimicrosat -f 30 sequences.fasta > out_primer_ready.fasta
```

**Expected output:** 3 sequences written to `out_primer_ready.fasta`
(sequences with flanks shorter than 30 bp are discarded).

```bash
# Keep sequences in their original orientation (no reverse-complement)
obimicrosat --not-reoriented sequences.fasta > out_no_reorient.fasta
```

**Expected output:** 6 sequences written to `out_no_reorient.fasta`
(GT-repeat sequence kept as-is without `_cmp` suffix; `microsat_unit_orientation` is `reverse`).

```bash
# Require at least 8 repeat units and a minimum repeat length of 30 bp
obimicrosat --min-unit-count 8 -l 30 sequences.fasta > out_strict.fasta
```

**Expected output:** 4 sequences written to `out_strict.fasta`
(short or low-count repeats excluded).

---

# SEE ALSO

- `obigrep`: filter sequences by annotation after microsatellite detection.
- `obiannotate`: add or modify sequence annotations.
- `obiconvert`: format conversion for sequence files.

---

# NOTES

- Only sequences with at least one qualifying microsatellite are written to output;
  all others are silently filtered out.
- The normalization algorithm considers all rotations of the unit and of its reverse
  complement, selecting the lexicographically smallest string. This ensures consistent
  grouping of loci regardless of which strand was sequenced.
- When reorientation is active (the default), sequences whose canonical unit falls on
  the reverse strand are reverse-complemented and their ID receives the suffix `_cmp`.
  Coordinate attributes (`microsat_from`, `microsat_to`) always refer to the
  (possibly reoriented) output sequence.
- Repetitive low-complexity sequences may match multiple overlapping patterns; only the
  first match is reported per sequence.
- Flanking sequences must be **non-repetitive**; otherwise the tool may detect a tandem
  repeat within the flank instead of the intended SSR. When designing synthetic test
  data, ensure flanking regions do not contain tandem repeat motifs of their own.
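
For the last note, one possible way to build synthetic test loci is to draw random flanks and reject any that contain a tandem repeat or that would extend the repeat at the junction. The Python sketch below is hypothetical test-data tooling, not part of OBITools4; the pattern mirrors the default thresholds (unit 1–6 bp, at least 5 repeats), and all names are illustrative.

```python
import random
import re

# Any 1-6 bp motif repeated at least 5 times (the default detection thresholds).
TANDEM = re.compile(r"([acgt]{1,6})\1{4,}")


def has_tandem_repeat(seq):
    """True if the sequence contains a repeat the default settings could pick up."""
    return TANDEM.search(seq.lower()) is not None


def random_flank(length, rng, forbidden_first="", forbidden_last="", max_tries=1000):
    """Draw a random flank, rejecting repetitive ones and bad junction bases."""
    for _ in range(max_tries):
        flank = "".join(rng.choice("acgt") for _ in range(length))
        if has_tandem_repeat(flank):
            continue
        if forbidden_first and flank.startswith(forbidden_first):
            continue
        if forbidden_last and flank.endswith(forbidden_last):
            continue
        return flank
    raise RuntimeError("could not draw a repeat-free flank")


def synthetic_locus(unit, count, flank_length, seed=0):
    """Return (left, repeat, right) for a synthetic microsatellite locus."""
    rng = random.Random(seed)
    # A left flank ending with the unit's last base (or a right flank starting
    # with its first base) would lengthen the repeat by a rotated partial unit.
    left = random_flank(flank_length, rng, forbidden_last=unit[-1])
    right = random_flank(flank_length, rng, forbidden_first=unit[0])
    return left, unit * count, right


if __name__ == "__main__":
    left, repeat, right = synthetic_locus("ac", 16, 40, seed=1)
    print(">synthetic_ac16")
    print(left + repeat + right)
```

Rejection sampling keeps the sketch short; for very long flanks a constructive approach may be preferable, since the chance of drawing a repeat-free flank drops as the length grows.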