⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,205 @@
+# NAME
+
+obimicrosat — looks for microsatellite sequences in sequence files
+
+---
+
+# SYNOPSIS
+
+```
+obimicrosat [options] [<filename>...]
+```
+
+---
+
+# DESCRIPTION
+
+`obimicrosat` scans DNA sequences for simple sequence repeats (SSRs), also called
+microsatellites — tandem repetitions of a short motif (1–6 bp by default). For each
+sequence containing a qualifying repeat, the command annotates it with the location,
+unit sequence, repeat count, and flanking regions, then writes it to output. Sequences
+with no detected microsatellite are silently discarded.
+
+The detection works in two passes. A first regular expression finds any tandem repeat
+satisfying the unit-length and repeat-count constraints. The true minimal repeat unit
+is then determined, and a second scan refines the exact boundaries. The repeat unit is
+normalized to the lexicographically smallest string among all rotations of the unit and
+all rotations of its reverse complement, which allows equivalent loci to be grouped
+consistently across samples.
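The two-pass detection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expressions, the helper names `minimal_unit` and `find_microsat`, and the decision to examine only the first hit are hypothetical and are not taken from the obimicrosat source.

```python
import re


def minimal_unit(unit: str) -> str:
    """Return the shortest prefix that rebuilds `unit` when repeated."""
    n = len(unit)
    for size in range(1, n + 1):
        if n % size == 0 and unit[:size] * (n // size) == unit:
            return unit[:size]
    return unit


def find_microsat(seq, min_unit=1, max_unit=6, min_count=5, min_length=20):
    """Return (start, end, unit) of the first qualifying repeat, or None.

    Coordinates are 1-based and inclusive. Hypothetical sketch only.
    """
    seq = seq.lower()
    # Pass 1: any motif of min_unit..max_unit bases repeated >= min_count times.
    coarse = re.compile(
        r"([acgt]{%d,%d})\1{%d,}" % (min_unit, max_unit, min_count - 1)
    )
    hit = coarse.search(seq)
    if hit is None:
        return None
    # The captured motif may itself be periodic (e.g. "acacac"); reduce it.
    unit = minimal_unit(hit.group(1))
    if len(unit) < min_unit:
        return None
    # Pass 2: rescan from the same position with the minimal unit,
    # which can extend the match to the true end of the repeat.
    fine = re.compile(r"(?:%s){%d,}" % (unit, min_count))
    hit = fine.search(seq, hit.start())
    if hit is None or hit.end() - hit.start() < min_length:
        return None
    return hit.start() + 1, hit.end(), unit
```

In this sketch the first pass may capture a periodic motif such as `acacac`; reducing it to `ac` and rescanning is what lets the second pass cover the whole repeat region.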
+
+By default, when the canonical form of a unit requires the reverse complement, the
+whole sequence is reoriented so that the microsatellite is always reported on the
+direct strand of the normalized unit. This behaviour can be disabled with
+`--not-reoriented`.
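A minimal Python sketch of the normalization and reorientation logic, assuming lowercase `acgt` input. The function names (`normalize_unit`, `reorient`) and the `reorient_enabled` parameter are hypothetical; only the behaviour described in this page is modelled.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a lowercase DNA string."""
    return seq.lower().translate(COMPLEMENT)[::-1]


def rotations(unit: str) -> list:
    """All cyclic rotations of the unit, e.g. 'ca' -> ['ca', 'ac']."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def normalize_unit(unit: str):
    """Return (canonical_unit, orientation).

    The canonical unit is the smallest string among the rotations of the
    unit and the rotations of its reverse complement.
    """
    unit = unit.lower()
    direct = min(rotations(unit))
    reverse = min(rotations(reverse_complement(unit)))
    if reverse < direct:
        return reverse, "reverse"
    return direct, "direct"


def reorient(seq_id: str, seq: str, unit: str, reorient_enabled: bool = True):
    """Reverse-complement the sequence when the canonical unit is on the
    reverse strand; `reorient_enabled=False` mimics `--not-reoriented`."""
    canonical, orientation = normalize_unit(unit)
    if orientation == "reverse" and reorient_enabled:
        return seq_id + "_cmp", reverse_complement(seq), canonical, orientation
    return seq_id, seq.lower(), canonical, orientation
```

With this sketch, a `gt` unit normalizes to `ac` with orientation `reverse`, so the sequence is reverse-complemented and its identifier gains the `_cmp` suffix.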
+
+A common use case is identifying polymorphic SSR markers for population genetics, or
+flagging repeat-rich regions before designing PCR primers.
+
+---
+
+# INPUT
+
+Accepts one or more sequence files on the command line. If no file is given, sequences
+are read from standard input. Supported formats include FASTA, FASTQ, JSON/OBI, GenBank,
+EMBL, ecoPCR output, and CSV. Compressed files (gzip) are handled transparently.
+Format is detected automatically unless overridden by input flags.
+
+---
+
+# OUTPUT
+
+Outputs only the sequences in which a microsatellite was found. Each retained sequence
+carries the following additional attributes:
+
+| Attribute | Content |
+|---|---|
+| `microsat` | Full repeat region as a string |
+| `microsat_from` | 1-based start position of the repeat |
+| `microsat_to` | End position of the repeat (inclusive) |
+| `microsat_unit` | Repeat unit as observed in the sequence |
+| `microsat_unit_normalized` | Lexicographically smallest canonical form |
+| `microsat_unit_orientation` | `direct` or `reverse` |
+| `microsat_unit_length` | Length of the repeat unit (bp) |
+| `microsat_unit_count` | Number of complete unit repetitions |
+| `seq_length` | Total length of the (possibly reoriented) sequence |
+| `microsat_left` | Flanking sequence to the left of the repeat |
+| `microsat_right` | Flanking sequence to the right of the repeat |
+
+When a sequence is reoriented (reverse-complemented), `_cmp` is appended to its
+identifier.
+
+The output format follows the same rules as the rest of OBITools4: FASTQ when quality
+scores are present, FASTA or JSON/OBI otherwise, configurable via output flags.
+
+## Observed output example
+
+```
+>seq001 {"definition":"dinucleotide AC repeat 16x with 40bp non-repetitive flanks","microsat":"acacacacacacacacacacacacacacacac","microsat_from":40,"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg","microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca","microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"direct","seq_length":109}
+agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca
+cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca
+>seq006_cmp {"definition":"GT repeat 16x with 40bp non-repetitive flanks canonical form is AC","microsat":"acacacacacacacacacacacacacacacac","microsat_from":39,"microsat_left":"tggtaacgatctatgccgcaagacttgcatgactatcg","microsat_right":"cgtaagctagacttgccctgaaggcatgcaagttcgact","microsat_to":70,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"reverse","seq_length":109}
+tggtaacgatctatgccgcaagacttgcatgactatcgacacacacacacacacacacac
+acacacacaccgtaagctagacttgccctgaaggcatgcaagttcgact
+```
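The annotations in the example above are a JSON object following the identifier, so they can be read with a standard JSON parser. The sketch below is hypothetical (`parse_header` and `coordinates_consistent` are illustrative names) and assumes a header of the form `>id {json}`. The coordinate relations it checks hold for the `seq001` record shown; treating them as general is an assumption.

```python
import json


def parse_header(line: str):
    """Split '>id {json}' into (id, annotations dict)."""
    ident, _, annotations = line.lstrip(">").partition(" ")
    if not annotations:
        return ident, {}
    return ident, json.loads(annotations)


def coordinates_consistent(ann: dict) -> bool:
    """Check that 1-based inclusive coordinates agree with the flanks."""
    repeat_len = ann["microsat_to"] - ann["microsat_from"] + 1
    left, right = ann["microsat_left"], ann["microsat_right"]
    return (
        repeat_len == len(ann["microsat"])
        and ann["microsat_from"] - 1 == len(left)
        and ann["seq_length"] == len(left) + repeat_len + len(right)
    )


# Header rebuilt from the seq001 record above (definition field omitted).
HEADER = (
    '>seq001 {"microsat":"' + "ac" * 16 + '",'
    '"microsat_from":40,'
    '"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg",'
    '"microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca",'
    '"microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,'
    '"microsat_unit_length":2,"seq_length":109}'
)

ident, ann = parse_header(HEADER)
print(ident, ann["microsat_unit"], coordinates_consistent(ann))
```

For `seq001`, the left flank is 39 bp, the repeat spans positions 40 to 71 (32 bp), and the right flank is 38 bp, which adds up to the reported `seq_length` of 109.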
+
+---
+
+# OPTIONS
+
+## Microsatellite detection
+
+**`--min-unit-length` / `-m`**
+- Default: `1`
+- Minimum length in base pairs of the repeated motif. Set to `2` to exclude
+  mononucleotide repeats, or to `3` to exclude both mono- and dinucleotide repeats.
+
+**`--max-unit-length` / `-M`**
+- Default: `6`
+- Maximum length in base pairs of the repeated motif. Increasing this value detects
+  longer repeat units (minisatellites) at the cost of more complex patterns.
+
+**`--min-unit-count`**
+- Default: `5`
+- Minimum number of times the motif must be repeated. A value of `5` with a 2 bp unit
+  requires at least 10 bp of pure repeat.
+
+**`--min-length` / `-l`**
+- Default: `20`
+- Minimum total length (in bp) of the repeat region. This filter applies after the
+  unit-count filter and is useful to exclude very short but technically qualifying
+  repeats.
+
+**`--min-flank-length` / `-f`**
+- Default: `0`
+- Minimum length of the flanking sequence on each side of the repeat. Sequences with
+  flanks shorter than this threshold are discarded, which is useful when the output
+  will feed a primer-design step.
+
+**`--not-reoriented` / `-n`**
+- Default: `false` (reorientation is active by default)
+- When set, sequences are never reverse-complemented to match the canonical orientation
+  of the repeat unit. The microsatellite is reported as found, in its original
+  orientation.
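How the detection thresholds combine can be summarized in a small Python predicate. This is a sketch under assumptions: the `Thresholds` class and the `passes` function are hypothetical, and the repeat region is assumed to consist of a whole number of units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Defaults mirror the option defaults documented above."""
    min_unit_length: int = 1
    max_unit_length: int = 6
    min_unit_count: int = 5
    min_length: int = 20
    min_flank_length: int = 0


def passes(unit_length, unit_count, left_flank, right_flank, t=None):
    """Return True when a candidate repeat satisfies every threshold."""
    t = t or Thresholds()
    repeat_length = unit_length * unit_count
    return (
        t.min_unit_length <= unit_length <= t.max_unit_length
        and unit_count >= t.min_unit_count
        and repeat_length >= t.min_length
        and left_flank >= t.min_flank_length
        and right_flank >= t.min_flank_length
    )


# A 2 bp unit repeated 5 times is only 10 bp: count passes, length does not.
print(passes(2, 5, 39, 38))
# Ten repeats reach the 20 bp default minimum length.
print(passes(2, 10, 39, 38))
```

With the defaults, `--min-length` is the binding constraint for short motifs: a dinucleotide needs 10 repeats and a mononucleotide needs 20 to reach 20 bp, even though `--min-unit-count` alone would accept 5.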
+
+## Input / output
+
+Inherited from the standard OBITools4 conversion layer. Common flags include:
+
+- **`--input-OBI-header`** — parse OBI-style FASTA/FASTQ headers.
+- **`--input-json-header`** — parse JSON-encoded headers.
+- **`--skip-empty`** — skip sequences with no nucleotides.
+- **`--u-to-t`** — convert U to T (RNA → DNA).
+- **`--output-json-header`** — write JSON-encoded headers.
+- **`--output-obi-header`** — write OBI-style headers.
+- **`--gzip`** — compress output with gzip.
+- **`--workers` / `-p`** — number of parallel processing workers.
+
+---
+
+# EXAMPLES
+
+```bash
+# Detect default microsatellites (unit 1–6 bp, ≥5 repeats, ≥20 bp total)
+obimicrosat sequences.fasta > out_default.fasta
+```
+
+**Expected output:** 6 sequences written to `out_default.fasta`.
+
+```bash
+# Restrict to di- and trinucleotide repeats only
+obimicrosat -m 2 -M 3 sequences.fasta > out_dinucleotide.fasta
+```
+
+**Expected output:** 4 sequences written to `out_dinucleotide.fasta`
+(mononucleotide and tetranucleotide repeats excluded).
+
+```bash
+# Require at least 30 bp flanking sequence on each side (for primer design)
+obimicrosat -f 30 sequences.fasta > out_primer_ready.fasta
+```
+
+**Expected output:** 3 sequences written to `out_primer_ready.fasta`
+(sequences with flanks shorter than 30 bp are discarded).
+
+```bash
+# Keep sequences in their original orientation (no reverse-complement)
+obimicrosat --not-reoriented sequences.fasta > out_no_reorient.fasta
+```
+
+**Expected output:** 6 sequences written to `out_no_reorient.fasta`
+(GT-repeat sequence kept as-is without `_cmp` suffix; `microsat_unit_orientation` is `reverse`).
+
+```bash
+# Require at least 8 repeat units and a minimum repeat length of 30 bp
+obimicrosat --min-unit-count 8 -l 30 sequences.fasta > out_strict.fasta
+```
+
+**Expected output:** 4 sequences written to `out_strict.fasta`
+(short or low-count repeats excluded).
+
+---
+
+# SEE ALSO
+
+- `obigrep` — filter sequences by annotation after microsatellite detection.
+- `obiannotate` — add or modify sequence annotations.
+- `obiconvert` — format conversion for sequence files.
+
+---
+
+# NOTES
+
+- Only sequences with at least one qualifying microsatellite are written to output;
+  all others are silently filtered out.
+- The normalization algorithm considers all rotations of the unit and their reverse
+  complements, selecting the lexicographically smallest string. This ensures consistent
+  grouping of loci regardless of which strand was sequenced.
+- When reorientation is active (the default), sequences whose canonical unit falls on
+  the reverse strand are reverse-complemented and their ID receives the suffix `_cmp`.
+  Coordinate attributes (`microsat_from`, `microsat_to`) always refer to the
+  (possibly reoriented) output sequence.
+- Repetitive low-complexity sequences may match multiple overlapping patterns; only the
+  first match is reported per sequence.
+- Flanking sequences must be **non-repetitive** to avoid the tool detecting a tandem
+  repeat within the flank instead of the intended SSR. When designing synthetic test
+  data, ensure flanking regions do not contain tandem repeat motifs of their own.
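When building synthetic test data, the last note can be checked before running the tool. The Python sketch below flags flanks that contain a tandem repeat of their own. The function name `has_tandem_repeat` and its defaults are hypothetical; the check ignores the minimum-length filter, so it is deliberately stricter than the tool's defaults.

```python
import re


def has_tandem_repeat(flank: str, max_unit: int = 6, min_count: int = 5) -> bool:
    """Return True if `flank` contains a motif of 1..max_unit bases
    repeated at least `min_count` times in tandem."""
    pattern = r"([acgt]{1,%d})\1{%d,}" % (max_unit, min_count - 1)
    return re.search(pattern, flank.lower()) is not None


# Flanks taken from the seq001 record in the output example.
left = "agtcgaacttgcatgccttcagggcaagtctagcttacg"
right = "cgatagtcatgcaagtcttgcggcatagatcgttacca"
print(has_tandem_repeat(left), has_tandem_repeat(right))
```

Lowering `min_count` makes the check more conservative, at the price of rejecting flanks with short homopolymer runs that the tool itself would ignore.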