obitools4/autodoc/cmd/obimicrosat.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


NAME

obimicrosat - looks for microsatellite sequences in a sequence file


SYNOPSIS

obimicrosat [options] [<filename>...]

DESCRIPTION

obimicrosat scans DNA sequences for simple sequence repeats (SSRs), also called microsatellites: tandem repetitions of a short motif (1 to 6 bp by default). For each sequence containing a qualifying repeat, the command annotates it with the location, unit sequence, repeat count, and flanking regions, then writes it to output. Sequences with no detected microsatellite are silently discarded.

The detection works in two passes. A first regular expression finds any tandem repeat satisfying the unit-length and repeat-count constraints. The true minimal repeat unit is then determined, and a second scan with that unit refines the exact boundaries of the repeat. Finally, the repeat unit is normalized to the lexicographically smallest string among all rotations of the unit and of its reverse complement, which allows equivalent loci to be grouped consistently across samples.
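The two passes can be illustrated with a minimal Python sketch. This is not the OBITools4 implementation (which is written in Go); the function names `find_microsat` and `minimal_unit` and the exact regular expressions are assumptions chosen for illustration, and the `--min-length` and flank filters are left out.

```python
import re

def minimal_unit(unit):
    """Return the shortest prefix whose repetition rebuilds `unit`."""
    for size in range(1, len(unit) + 1):
        if len(unit) % size == 0 and unit[:size] * (len(unit) // size) == unit:
            return unit[:size]
    return unit

def find_microsat(seq, min_unit=1, max_unit=6, min_count=5):
    """Return (start, end, unit) of the first repeat, 0-based, end exclusive.

    Hypothetical helper: the real tool may scan differently.
    """
    seq = seq.lower()
    # Pass 1: a unit of min_unit..max_unit bases followed by at least
    # min_count - 1 further copies of itself.
    coarse = re.compile(r"([acgt]{%d,%d})\1{%d,}" % (min_unit, max_unit, min_count - 1))
    hit = coarse.search(seq)
    if hit is None:
        return None
    # The greedy first pass may capture a multiple of the true unit
    # (for example "acacac" instead of "ac"), so reduce it.
    unit = minimal_unit(hit.group(1))
    # Pass 2: rescan with the minimal unit to get the full extent.
    fine = re.compile(r"(?:%s){%d,}" % (re.escape(unit), min_count))
    refined = fine.search(seq)
    return refined.start(), refined.end(), unit
```

On an AC repeat, the first pass typically captures a longer unit such as `acacac` and stops short of the true end of the repeat, which is why the second scan with the reduced unit is needed.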

By default, when the canonical form of a unit requires the reverse complement, the whole sequence is reoriented so that the microsatellite is always reported on the direct strand of the normalized unit. This behaviour can be disabled with --not-reoriented.
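As a rough illustration of the normalization and reorientation rules, the sketch below picks the smallest rotation of the unit or of its reverse complement and reverse-complements the sequence when the canonical form comes from the reverse strand. The helper names (`normalize_unit`, `reorient`) and the tie-breaking choice (a tie is reported as `direct`) are assumptions, not taken from the OBITools4 source.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Reverse-complement a lowercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def normalize_unit(unit):
    """Return (canonical_unit, orientation) for a repeat unit."""
    unit = unit.lower()
    rc = reverse_complement(unit)
    # Smallest rotation on each strand.
    direct = min(unit[i:] + unit[:i] for i in range(len(unit)))
    reverse = min(rc[i:] + rc[:i] for i in range(len(rc)))
    if reverse < direct:
        return reverse, "reverse"
    return direct, "direct"  # assumption: ties count as direct

def reorient(seq_id, seq, unit, reorient_enabled=True):
    """Mimic the default reorientation; pass False for --not-reoriented."""
    canonical, orientation = normalize_unit(unit)
    if orientation == "reverse" and reorient_enabled:
        return seq_id + "_cmp", reverse_complement(seq.lower()), canonical, orientation
    return seq_id, seq, canonical, orientation
```

With this sketch, a GT repeat normalizes to `ac` with orientation `reverse`, while AC and CA repeats both normalize to `ac` with orientation `direct`.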

A common use case is identifying polymorphic SSR markers for population genetics, or flagging repeat-rich regions before designing PCR primers.


INPUT

Accepts one or more sequence files on the command line. If no file is given, sequences are read from standard input. Supported formats include FASTA, FASTQ, JSON/OBI, GenBank, EMBL, ecoPCR output, and CSV. Compressed files (gzip) are handled transparently. Format is detected automatically unless overridden by input flags.


OUTPUT

Outputs only the sequences in which a microsatellite was found. Each retained sequence carries the following additional attributes:

| Attribute | Content |
|---|---|
| microsat | Full repeat region as a string |
| microsat_from | 1-based start position of the repeat |
| microsat_to | End position of the repeat (inclusive) |
| microsat_unit | Repeat unit as observed in the sequence |
| microsat_unit_normalized | Lexicographically smallest canonical form |
| microsat_unit_orientation | direct or reverse |
| microsat_unit_length | Length of the repeat unit (bp) |
| microsat_unit_count | Number of complete unit repetitions |
| seq_length | Total length of the (possibly reoriented) sequence |
| microsat_left | Flanking sequence to the left of the repeat |
| microsat_right | Flanking sequence to the right of the repeat |

When a sequence is reoriented (reverse-complemented), _cmp is appended to its identifier.

The output format follows the same rules as the rest of OBITools4: FASTQ when quality scores are present, FASTA or JSON/OBI otherwise, configurable via output flags.

Observed output example

>seq001 {"definition":"dinucleotide AC repeat 16x with 40bp non-repetitive flanks","microsat":"acacacacacacacacacacacacacacacac","microsat_from":40,"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg","microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca","microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"direct","seq_length":109}
agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca
cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca
>seq006_cmp {"definition":"GT repeat 16x with 40bp non-repetitive flanks canonical form is AC","microsat":"acacacacacacacacacacacacacacacac","microsat_from":39,"microsat_left":"tggtaacgatctatgccgcaagacttgcatgactatcg","microsat_right":"cgtaagctagacttgccctgaaggcatgcaagttcgact","microsat_to":70,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"reverse","seq_length":109}
tggtaacgatctatgccgcaagacttgcatgactatcgacacacacacacacacacacac
acacacacaccgtaagctagacttgccctgaaggcatgcaagttcgact
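To make the coordinate convention concrete, the following Python sketch parses a JSON-annotated FASTA header like the ones above and checks that the 1-based, inclusive `microsat_from` / `microsat_to` positions agree with the repeat and flank attributes. The function names are hypothetical, and the sketch assumes the header is written as an identifier followed by a single JSON object.

```python
import json

def parse_header(line):
    """Split '>id {json}' into (id, annotations dict)."""
    ident, _, payload = line[1:].partition(" ")
    return ident, (json.loads(payload) if payload else {})

def check_coordinates(annot, sequence):
    """Check 1-based inclusive coordinates against repeat and flanks."""
    start, end = annot["microsat_from"], annot["microsat_to"]
    return (
        sequence[start - 1:end] == annot["microsat"]        # repeat region
        and sequence[:start - 1] == annot["microsat_left"]  # left flank
        and sequence[end:] == annot["microsat_right"]       # right flank
        and len(sequence) == annot["seq_length"]
    )
```

For the seq001 record shown above, the left flank is 39 bp long, so the repeat starts at position 40 and its 32 bp end at position 71.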

OPTIONS

Microsatellite detection

--min-unit-length / -m

  • Default: 1
  • Minimum length in base pairs of the repeated motif. Set to 2 to exclude mononucleotide repeats, 3 to also exclude dinucleotide repeats, and so on.

--max-unit-length / -M

  • Default: 6
  • Maximum length in base pairs of the repeated motif. Increasing this value also detects longer repeat units (minisatellites), at the cost of a more complex search pattern.

--min-unit-count

  • Default: 5
  • Minimum number of times the motif must be repeated. A value of 5 with a 2 bp unit requires at least 10 bp of pure repeat.

--min-length / -l

  • Default: 20
  • Minimum total length (in bp) of the repeat region. This filter applies after the unit-count filter and is useful to exclude very short but technically qualifying repeats.

--min-flank-length / -f

  • Default: 0
  • Minimum length of the flanking sequence on each side of the repeat. Sequences with flanks shorter than this threshold are discarded, which is useful when the output will feed a primer-design step.

--not-reoriented / -n

  • Default: false (reorientation is active by default)
  • When set, sequences are never reverse-complemented to match the canonical orientation of the repeat unit. The microsatellite is reported as found, in its original orientation.
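The way these thresholds combine can be sketched as a single predicate. This is only an illustration using the defaults listed above: the function name `qualifies` is hypothetical, and it assumes the repeat length equals unit length times unit count, ignoring any trailing partial copy.

```python
def qualifies(unit_length, unit_count, left_flank, right_flank,
              min_unit_length=1, max_unit_length=6, min_unit_count=5,
              min_length=20, min_flank_length=0):
    """Return True if a repeat passes all detection thresholds."""
    # Assumption: no trailing partial copy of the unit.
    repeat_length = unit_length * unit_count
    return (
        min_unit_length <= unit_length <= max_unit_length
        and unit_count >= min_unit_count
        and repeat_length >= min_length          # --min-length
        and left_flank >= min_flank_length       # --min-flank-length
        and right_flank >= min_flank_length
    )
```

Under these defaults, a dinucleotide repeated 5 times satisfies the unit-count constraint but spans only 10 bp, so the 20 bp minimum length still rejects it; 10 copies are needed.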

Input / output

Inherited from the standard OBITools4 conversion layer. Common flags include:

  • --input-OBI-header: parse OBI-style FASTA/FASTQ headers.
  • --input-json-header: parse JSON-encoded headers.
  • --skip-empty: skip sequences with no nucleotides.
  • --u-to-t: convert U to T (RNA → DNA).
  • --output-json-header: write JSON-encoded headers.
  • --output-obi-header: write OBI-style headers.
  • --gzip: compress output with gzip.
  • --workers / -p: number of parallel processing workers.


EXAMPLES

# Detect default microsatellites (unit 1 to 6 bp, ≥5 repeats, ≥20 bp total)
obimicrosat sequences.fasta > out_default.fasta

Expected output: 6 sequences written to out_default.fasta.

# Restrict to di- and trinucleotide repeats only
obimicrosat -m 2 -M 3 sequences.fasta > out_dinucleotide.fasta

Expected output: 4 sequences written to out_dinucleotide.fasta (mononucleotide and tetranucleotide repeats excluded).

# Require at least 30 bp flanking sequence on each side (for primer design)
obimicrosat -f 30 sequences.fasta > out_primer_ready.fasta

Expected output: 3 sequences written to out_primer_ready.fasta (sequences with flanks shorter than 30 bp are discarded).

# Keep sequences in their original orientation (no reverse-complement)
obimicrosat --not-reoriented sequences.fasta > out_no_reorient.fasta

Expected output: 6 sequences written to out_no_reorient.fasta (GT-repeat sequence kept as-is without _cmp suffix; microsat_unit_orientation is reverse).

# Require at least 8 repeat units and a minimum repeat length of 30 bp
obimicrosat --min-unit-count 8 -l 30 sequences.fasta > out_strict.fasta

Expected output: 4 sequences written to out_strict.fasta (short or low-count repeats excluded).


SEE ALSO

  • obigrep: filter sequences by annotation after microsatellite detection.
  • obiannotate: add or modify sequence annotations.
  • obiconvert: format conversion for sequence files.


NOTES

  • Only sequences with at least one qualifying microsatellite are written to output; all others are silently filtered out.
  • The normalization algorithm considers all rotations of the unit and their reverse complements, selecting the lexicographically smallest string. This ensures consistent grouping of loci regardless of which strand was sequenced.
  • When reorientation is active (the default), sequences whose canonical unit falls on the reverse strand are reverse-complemented and their ID receives the suffix _cmp. Coordinate attributes (microsat_from, microsat_to) always refer to the (possibly reoriented) output sequence.
  • Repetitive low-complexity sequences may match multiple overlapping patterns; only the first match is reported per sequence.
  • Flanking sequences must be non-repetitive to avoid the tool detecting a tandem repeat within the flank instead of the intended SSR. When designing synthetic test data, ensure flanking regions do not contain tandem repeat motifs of their own.
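For the last note, one possible way to build synthetic test sequences with non-repetitive flanks is sketched below. The generator, its names, and the rejection rule are assumptions for illustration only; the tandem-repeat check mirrors the default thresholds (unit of 1 to 6 bp, at least 5 copies), not the tool's actual code.

```python
import random
import re

# Approximation of the default detection pattern: unit 1-6 bp, >= 5 copies.
TANDEM = re.compile(r"([acgt]{1,6})\1{4,}")

def random_flank(length, rng, reject=lambda flank: False):
    """Draw random flanks until one has no tandem repeat and is not rejected."""
    while True:
        flank = "".join(rng.choice("acgt") for _ in range(length))
        if TANDEM.search(flank) is None and not reject(flank):
            return flank

def synthetic_record(unit, count, flank_length, seed=0):
    """Return left flank + unit * count + right flank."""
    rng = random.Random(seed)
    # Avoid flanks that would extend the repeat by a full or partial unit.
    left = random_flank(flank_length, rng, lambda f: f.endswith(unit[-1]))
    right = random_flank(flank_length, rng, lambda f: f.startswith(unit[0]))
    return left + unit * count + right
```

A record built with `synthetic_record("ac", 16, 40)` is 112 bp long, with the repeat occupying positions 41 to 72 (1-based, inclusive).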