862 lines
32 KiB
Plaintext
862 lines
32 KiB
Plaintext
---
|
||
title: "Designing Fish primers"
|
||
format: revealjs
|
||
editor: visual
|
||
---
|
||
|
||
```{r setup, include=FALSE}
|
||
library(knitr)
|
||
```
|
||
|
||
# Preparing the data
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## What do we need ?
|
||
|
||
To design a new animal DNA metabarcode we download from the NCBI the following data
|
||
|
||
- The complete set of whole mitochondrial genomes
|
||
- The NCBI taxonomy
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## We need also:
|
||
|
||
- a Unix computer: a Mac or a Linux box
|
||
- A unix terminal window for typing commands
|
||
- Installed on the computer
|
||
- The OBITools --- <http://github.com/metabarcoding/obitools4>
|
||
- ecoPrimers --- <http://metabarcoding.org/ecoprimers>
|
||
- R --- <http://wwww.r-project.org>
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Downloading the mitochondrial genomes
|
||
|
||
We can use an internet browser and download the files from NCBI FTP website
|
||
|
||
{fig-align="center"}
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Downloading the mitochondrial genomes
|
||
|
||
We can use an internet browser and download the files from NCBI FTP website
|
||
|
||
or run the following command lines
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
curl 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/mitochondrion.1.genomic.gbff.gz' \
|
||
> mito.all.gb.gz
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
zless mito.all.gb.gz
|
||
```
|
||
|
||
```
|
||
LOCUS NW_009243181 45189 bp DNA linear CON 06-OCT-2014
|
||
DEFINITION Fonticula alba strain ATCC 38817 mitochondrial scaffold
|
||
supercont2.211, whole genome shotgun sequence.
|
||
ACCESSION NW_009243181 NZ_AROH01000000
|
||
VERSION NW_009243181.1
|
||
DBLINK BioProject: PRJNA262900
|
||
Assembly: GCF_000388065.1
|
||
KEYWORDS WGS; RefSeq.
|
||
SOURCE mitochondrion Fonticula alba
|
||
ORGANISM Fonticula alba
|
||
Eukaryota; Rotosphaerida; Fonticulaceae; Fonticula.
|
||
REFERENCE 1 (bases 1 to 45189)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Downloading the complete taxonomy
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obitaxonomy --download-ncbi
|
||
```
|
||
|
||
```
|
||
INFO[0000] Number of workers set 16
|
||
INFO[0000] Downloading NCBI Taxdump to ncbitaxo_20250211.tgz
|
||
downloading 100% ████████████████████████████████████████| (66/66 MB, 5.1 MB/s)
|
||
```
|
||
|
||
The NCBI taxonomy contains all the relationship between taxa. Each taxon is identified by a unique numerical id: `taxid`
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## The archive contains several files
|
||
|
||
file: `nodes.dmp`
|
||
|
||
```
|
||
1 | 1 | no rank | | 8 | 0 | ...
|
||
2 | 131567 | superkingdom | | 0 | 0 |
|
||
6 | 335928 | genus | | 0 | 1 |
|
||
7 | 6 | species | AC | 0 | 1 |
|
||
9 | 32199 | species | BA | 0 |
|
||
10 | 135621 | genus | | 0 |
|
||
11 | 1707 | species | CG | 0 | 1 |
|
||
13 | 203488 | genus | | 0 | 1 |
|
||
14 | 13 | species | DT | 0 | 1 |
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## file: `names.dmp`
|
||
|
||
```
|
||
1 | root | | scientific name |
|
||
2 | Bacteria | Bacteria <prokaryote> | scientific name |
|
||
2 | Monera | Monera <Bacteria> | in-part |
|
||
2 | Procaryotae | Procaryotae <Bacteria> | in-part |
|
||
2 | Prokaryota | Prokaryota <Bacteria> | in-part |
|
||
2 | Prokaryotae | Prokaryotae <Bacteria> | in-part |
|
||
2 | bacteria | bacteria <blast2> | blast name |
|
||
2 | eubacteria | | genbank common name |
|
||
2 | prokaryote | prokaryote <Bacteria> | in-part |
|
||
...
|
||
10 | Cellvibrio | | scientific name |
|
||
11 | [Cellvibrio] gilvus | | scientific name |
|
||
13 | Dictyoglomus | | scientific name |
|
||
14 | Dictyoglomus thermophilum | | scientific name |
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Preparing the set of complete genomes
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obiconvert --skip-empty \
|
||
--update-taxid \
|
||
-t ncbitaxo_20250211.tgz \
|
||
mito.all.gb.gz \
|
||
> mito.all.fasta
|
||
head -5 mito.all.fasta
|
||
```
|
||
|
||
five first lines of the new `mito.all.fasta` file
|
||
|
||
```
|
||
>NC_072933 {"definition":"Echinosophora koreensis mitochondrion, complete genome.","scientific_name":"mitochondrion Echinosophora koreensis","taxid":228658}
|
||
ctttcgggtcggaaatagaagatctggattagatcccttctcgatagctttagtcagagc
|
||
tcatccctcgaaaaagggagtagtgagatgagaaaagggtgactagaatacggaaattca
|
||
actagtgaagtcagatccgggaattccactattgaagttatccgtcttaggcttcaagca
|
||
agctatctttcaaggaagtcagtctaagccctaagccaagatctgctttttgccagtcaa
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## We want:
|
||
|
||
- annotate sequences by their species `taxid`
|
||
- keep a single genome per species
|
||
- extract only vertebrate genome
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Looking for the **Vertebrata**'s taxid
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obitaxonomy -t ncbitaxo_20250211.tgz \
|
||
--fixed \
|
||
'vertebrata'
|
||
```
|
||
|
||
``` csv
|
||
taxid,parent,taxonomic_rank,scientific_name
|
||
taxon:1261581 [Vertebrata]@genus,taxon:2008651 [Polysiphonioideae]@subfamily,genus,Vertebrata
|
||
taxon:7742 [Vertebrata]@clade,taxon:89593 [Craniata]@subphylum,clade,Vertebrata
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Looking for the **Vertebrata**'s taxid
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obitaxonomy -t ncbitaxo_20250211.tgz \
|
||
--fixed \
|
||
'vertebrata' \
|
||
| csvlook
|
||
```
|
||
|
||
``` csv
|
||
| taxid | parent | taxonomic_rank | scientific_name |
|
||
| -------------------------------- | ------------------------------------------- | -------------- | --------------- |
|
||
| taxon:1261581 [Vertebrata]@genus | taxon:2008651 [Polysiphonioideae]@subfamily | genus | Vertebrata |
|
||
| taxon:7742 [Vertebrata]@clade | taxon:89593 [Craniata]@subphylum | clade | Vertebrata |
|
||
```
|
||
|
||
## A genus called **Vertebrata**
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obitaxonomy -t ncbitaxo_20250211.tgz \
|
||
-p 2008651 \
|
||
| csvlook
|
||
```
|
||
|
||
``` csv
|
||
| taxid | parent | taxonomic_rank | scientific_name |
|
||
| ------------------------------------------- | ------------------------------------------- | -------------- | ------------------ |
|
||
| taxon:2008651 [Polysiphonioideae]@subfamily | taxon:2803 [Rhodomelaceae]@family | subfamily | Polysiphonioideae |
|
||
| taxon:2803 [Rhodomelaceae]@family | taxon:2802 [Ceramiales]@order | family | Rhodomelaceae |
|
||
| taxon:2802 [Ceramiales]@order | taxon:2045261 [Rhodymeniophycidae]@subclass | order | Ceramiales |
|
||
| taxon:2045261 [Rhodymeniophycidae]@subclass | taxon:2806 [Florideophyceae]@class | subclass | Rhodymeniophycidae |
|
||
| taxon:2806 [Florideophyceae]@class | taxon:2763 [Rhodophyta]@phylum | class | Florideophyceae |
|
||
| taxon:2763 [Rhodophyta]@phylum | taxon:2759 [Eukaryota]@superkingdom | phylum | Rhodophyta |
|
||
| taxon:2759 [Eukaryota]@superkingdom | taxon:131567 [cellular organisms]@no rank | superkingdom | Eukaryota |
|
||
| taxon:131567 [cellular organisms]@no rank | taxon:1 [root]@no rank | no rank | cellular organisms |
|
||
| taxon:1 [root]@no rank | taxon:1 [root]@no rank | no rank | root |
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Reannotation and selection of the genomes
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obiannotate -t ncbitaxo_20250211.tgz \
|
||
--with-taxon-at-rank=species \
|
||
mito.all.fasta | \
|
||
obiannotate -S 'ori_taxid=annotations.taxid' | \
|
||
obiannotate -S 'taxid=annotations.species_taxid' | \
|
||
obiuniq -c taxid > mito.one.fasta
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Species representation
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
|
||
obicsv -k taxid mito.one.fasta \
|
||
| tail -n +2 \
|
||
| sort \
|
||
| uniq -c \
|
||
| sort -nk1 \
|
||
| cut -w -f 2 \
|
||
| uplot count
|
||
```
|
||
|
||
```
|
||
┌ ┐
|
||
1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 17769.0
|
||
2 ┤ 90.0
|
||
3 ┤ 17.0
|
||
4 ┤ 5.0
|
||
5 ┤ 4.0
|
||
6 ┤ 2.0
|
||
7 ┤ 1.0
|
||
└ ┘
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Selection of the vertebrata genomes
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
|
||
obigrep -t ncbitaxo_20250211.tgz \
|
||
-r 7742 \
|
||
mito.one.fasta > mito.vert.fasta
|
||
```
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
|
||
obicount mito.vert.fasta \
|
||
| csvlook
|
||
```
|
||
|
||
```
|
||
| entities | n |
|
||
| -------- | ----------- |
|
||
| variants | 7,822 |
|
||
| reads | 7,823 |
|
||
| symbols | 131,378,756 |
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Prepare data for ecoPrimers 1/3
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
mkdir ncbitaxo_20250211
|
||
cd ncbitaxo_20250211
|
||
tar zxvf ../ncbitaxo_20250211.tgz
|
||
cd ..
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Prepare data for ecoPrimers 2/3
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obiconvert -O mito.vert.fasta > mito.vert.old.fasta
|
||
```
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
head -5 mito.vert.old.fasta
|
||
```
|
||
|
||
``` csv
|
||
>NC_071784 taxid=taxon:2065826 [Sineleotris saccharae]@species; count=1; ori_taxid=taxon:2065826 [Sineleotris saccharae]@species; scientific_name=mitochondrion Sineleotris saccharae; species_name=Sineleotris saccharae; species_taxid=taxon:2065826 [Sineleotris saccharae]@species; Sineleotris saccharae mitochondrion, complete genome.
|
||
gctagcgtagcttaaccaaagcataacactgaagatgttaagatgggccctagaaagccc
|
||
cgcaagcacaaaagcttggtcctggctttactatcagcttaggctaaacttacacatgca
|
||
agtatccgcatccccgtgagaatgcccttaagctcccaccgctaacaggagtcaaggagc
|
||
cggtatcaggcacaaccctgagttagcccacgacaccttgctcagccacacccccaaggg
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Prepare data for ecoPrimers 3/3
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
ecoPCRFormat -t ncbitaxo_20250211 \
|
||
-f \
|
||
-n vertebrata \
|
||
mito.vert.old.fasta
|
||
```
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
ls -l vertebrata*
|
||
```
|
||
|
||
```
|
||
-rw-r--r--@ 1 coissac staff 260899785 Feb 11 11:53 vertabrata.ndx
|
||
-rw-r--r--@ 1 coissac staff 546 Feb 11 11:53 vertabrata.rdx
|
||
-rw-r--r--@ 1 coissac staff 121379751 Feb 11 11:53 vertabrata.tdx
|
||
-rw-r--r--@ 1 coissac staff 40446318 Feb 11 11:54 vertabrata_001.sdx
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Looking for the *Teleostei* `taxid`
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obitaxonomy -t ncbitaxo_20250211.tgz \
|
||
--fixed \
|
||
'Teleostei' \
|
||
| csvlook
|
||
```
|
||
|
||
``` csv
|
||
| taxid | parent | taxonomic_rank | scientific_name |
|
||
| ---------------------------------- | ---------------------------------- | -------------- | --------------- |
|
||
| taxon:32443 [Teleostei]@infraclass | taxon:41665 [Neopterygii]@subclass | infraclass | Teleostei |
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Selecting the best primer pairs
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
ecoPrimers -d vertebrata \
|
||
-e 3 -3 2 \
|
||
-l 30 -L 150 \
|
||
-r 32443 \
|
||
-c > Teleostei.ecoprimers
|
||
```
|
||
|
||
- Total pair count : 9407
|
||
- Total good pair count : 407
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
head -35 Teleostei.ecoprimers
|
||
```
|
||
|
||
``` csv
|
||
#
|
||
# ecoPrimer version 0.5
|
||
# Rank level optimisation : species
|
||
# max error count by oligonucleotide : 3
|
||
#
|
||
# Restricted to taxon:
|
||
# 32443 : Teleostei (infraclass)
|
||
#
|
||
# strict primer quorum : 0.70
|
||
# example quorum : 0.90
|
||
# counterexample quorum : 0.10
|
||
#
|
||
# database : vertebrata
|
||
# Database is constituted of 3909 examples corresponding to 3876 species
|
||
# and 0 counterexamples corresponding to 0 species
|
||
#
|
||
# amplifiat length between [30,150] bp
|
||
# DB sequences are considered as circular
|
||
# Pairs having specificity less than 0.60 will be ignored
|
||
#
|
||
0 AGAGTGACGGGCGGTGTG CGTCAGGTCGAGGTGTAG 62.8 42.4 57.5 34.1 12 11 GG 3864 0 0.988 3832 0 0.989 2731 0.713 134 146 138.22
|
||
1 CGTCAGGTCGAGGTGTAG GAGTGACGGGCGGTGTGT 57.5 34.1 63.1 42.9 11 12 GG 3863 0 0.988 3831 0 0.988 2730 0.713 133 145 137.22
|
||
2 CGTCAGGTCGAGGTGTAG GGGAGAGTGACGGGCGGT 57.5 34.1 64.5 37.0 11 13 GG 3811 0 0.975 3779 0 0.975 2689 0.712 137 149 141.22
|
||
3 CGTCAGGTCGAGGTGTAG GGGGAGAGTGACGGGCGG 57.5 34.1 65.5 38.4 11 14 GG 3804 0 0.973 3772 0 0.973 2682 0.711 138 149 142.22
|
||
4 ACACCGCCCGTCACTCTC ACCTTCCGGTACACTTAC 62.5 36.8 54.0 16.6 12 9 GG 3850 0 0.985 3818 0 0.985 2658 0.696 46 132 66.51
|
||
5 AACGTCAGGTCGAGGTGT AGAGTGACGGGCGGTGTG 58.8 28.4 62.8 41.7 10 12 GG 3779 0 0.967 3746 0 0.966 2653 0.708 137 148 140.23
|
||
6 ACACCGCCCGTCACTCTC CACCTTCCGGTACACTTA 62.5 36.8 54.0 16.6 12 9 GG 3846 0 0.984 3814 0 0.984 2654 0.696 47 133 67.51
|
||
7 AACGTCAGGTCGAGGTGT GAGTGACGGGCGGTGTGT 58.8 28.4 63.1 42.1 10 12 GG 3778 0 0.966 3745 0 0.966 2652 0.708 136 147 139.23
|
||
8 ACCTTCCGGTACACTTAC CACACCGCCCGTCACTCT 54.0 16.6 62.8 37.3 9 12 GG 3845 0 0.984 3813 0 0.984 2653 0.696 47 133 67.51
|
||
9 ACACCGCCCGTCACTCTC TCCGGTACACTTACCATG 62.5 36.8 54.1 18.1 12 9 GG 3851 0 0.985 3819 0 0.985 2651 0.694 42 128 62.51
|
||
10 ACACCGCCCGTCACTCTC CCGGTACACTTACCATGT 62.5 36.8 54.4 18.6 12 9 GG 3851 0 0.985 3819 0 0.985 2651 0.694 41 127 61.51
|
||
11 ACACCGCCCGTCACTCTC CCAAGTGCACCTTCCGGT 62.5 36.8 60.7 28.9 12 11 GG 3837 0 0.982 3805 0 0.982 2650 0.696 54 140 74.51
|
||
12 ACACCGCCCGTCACTCTC GCACCTTCCGGTACACTT 62.5 36.8 57.7 22.5 12 10 GG 3842 0 0.983 3810 0 0.983 2650 0.696 48 134 68.51
|
||
13 ACACCGCCCGTCACTCTC CGGTACACTTACCATGTT 62.5 36.8 52.4 15.7 12 8 GG 3850 0 0.985 3818 0 0.985 2650 0.694 40 126 60.51
|
||
14 ACACCGCCCGTCACTCTC CACTTACCATGTTACGAC 62.5 36.8 51.1 27.7 12 8 GG 3850 0 0.985 3817 0 0.985 2649 0.694 35 121 55.51
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
- Primer ID : 11
|
||
|
||
|
||
|
||
| Primer | sequence | tm max | tm min | GC count |
|
||
|---------|--------------------|--------|--------|----------|
|
||
| Forward | ACACCGCCCGTCACTCTC | 62.5 | 36.8 | 12 |
|
||
| Reverse | CCAAGTGCACCTTCCGGT | 60.7 | 28.9 | 11 |
|
||
|
||
|
||
|
||
- amplifying 3837/3909 sequences\
|
||
- identify 2650/3876 Species
|
||
- Size ranging from 54bp to 140bp (mean: 74.75 bp)
|
||
|
||
## Testing the new primer pair
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obipcr --forward ACACCGCCCGTCACTCTC \
|
||
--reverse CCAAGTGCACCTTCCGGT \
|
||
-e 5 \
|
||
-l 30 -L 150 \
|
||
-c \
|
||
mito.vert.fasta \
|
||
> Teleostei_11.fasta
|
||
```
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
head Teleostei_11.fasta
|
||
```
|
||
|
||
``` csv
|
||
>NC_022183_sub[925..998] {"count":1,"definition":"Acrossocheilus hemispinus mitochondrion, complete genome.","direction":"forward","forward_error":1,"forward_match":"acaccgcccgtcaccctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:356810 [Acrossocheilus hemispinus]@species","reverse_error":0,"reverse_match":"ccaagtgcaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Acrossocheilus hemispinus","species_name":"Acrossocheilus hemispinus","species_taxid":"taxon:356810 [Acrossocheilus hemispinus]@species","taxid":"taxon:356810 [Acrossocheilus hemispinus]@species"}
|
||
cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgt
|
||
aacatggtaagtgt
|
||
>NC_018560_sub[916..988] {"count":1,"definition":"Astatotilapia calliptera mitochondrion, complete genome.","direction":"forward","forward_error":0,"forward_match":"acaccgcccgtcactctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:8154 [Astatotilapia calliptera]@species","reverse_error":1,"reverse_match":"ccaagtacaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Astatotilapia calliptera (eastern happy)","species_name":"Astatotilapia calliptera","species_taxid":"taxon:8154 [Astatotilapia calliptera]@species","taxid":"taxon:8154 [Astatotilapia calliptera]@species"}
|
||
cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgta
|
||
acatggtaagtgt
|
||
>NC_056117_sub[923..997] {"count":1,"definition":"Pseudocrossocheilus tridentis mitochondrion, complete genome.","direction":"forward","forward_error":0,"forward_match":"acaccgcccgtcactctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species","reverse_error":0,"reverse_match":"ccaagtgcaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Pseudocrossocheilus tridentis","species_name":"Pseudocrossocheilus tridentis","species_taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species","taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species"}
|
||
ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcg
|
||
taacacggtaagtgt
|
||
>NC_045904_sub[919..997] {"count":1,"definition":"Eospalax fontanierii mitochondrion, complete genome.","direction":"forward","forward_error":1,"forward_match":"acaccgcccgtcgctctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:146134 [Eospalax fontanierii]@species","reverse_error":4,"reverse_match":"ccaagcacactttccagt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Eospalax fontanierii","species_name":"Eospalax fontanierii","species_taxid":"taxon:146134 [Eospalax fontanierii]@species","taxid":"taxon:146134 [Eospalax fontanierii]@species"}
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
convert the fasta file to csv
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obicsv --auto -s -i Teleostei_11.fasta > Teleostei_11.csv
|
||
```
|
||
|
||
and display the begining of the table
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
head Teleostei_11.csv | csvlook
|
||
```
|
||
|
||
``` csv
|
||
| id | count | direction | forward_error | forward_match | forward_primer | ori_taxid | reverse_error | reverse_match | reverse_primer | scientific_name | species_name | species_taxid | taxid | sequence |
|
||
| ------------------------- | ----- | --------- | ------------- | ------------------ | ------------------ | ---------------------------------------------------- | ------------- | ------------------ | ------------------ | ------------------------------------------------------ | ----------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------- |
|
||
| NC_022183_sub[925..998] | True | forward | True | acaccgcccgtcaccctc | ACACCGCCCGTCACTCTC | taxon:356810 [Acrossocheilus hemispinus]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Acrossocheilus hemispinus | Acrossocheilus hemispinus | taxon:356810 [Acrossocheilus hemispinus]@species | taxon:356810 [Acrossocheilus hemispinus]@species | cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgtaacatggtaagtgt |
|
||
| NC_018560_sub[916..988] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:8154 [Astatotilapia calliptera]@species | 1 | ccaagtacaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Astatotilapia calliptera (eastern happy) | Astatotilapia calliptera | taxon:8154 [Astatotilapia calliptera]@species | taxon:8154 [Astatotilapia calliptera]@species | cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgtaacatggtaagtgt |
|
||
| NC_056117_sub[923..997] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:887881 [Pseudocrossocheilus tridentis]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Pseudocrossocheilus tridentis | Pseudocrossocheilus tridentis | taxon:887881 [Pseudocrossocheilus tridentis]@species | taxon:887881 [Pseudocrossocheilus tridentis]@species | ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcgtaacacggtaagtgt |
|
||
| NC_045904_sub[919..997] | True | forward | True | acaccgcccgtcgctctc | ACACCGCCCGTCACTCTC | taxon:146134 [Eospalax fontanierii]@species | 4 | ccaagcacactttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Eospalax fontanierii | Eospalax fontanierii | taxon:146134 [Eospalax fontanierii]@species | taxon:146134 [Eospalax fontanierii]@species | ctcaagtacataaacttggatatattcttaataacccaacaaaaatattagaggagataagtcgtaacaaggtaagcat |
|
||
| NC_018546_sub[916..987] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:30732 [Oryzias melastigma]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Oryzias melastigma (Indian medaka) | Oryzias melastigma | taxon:30732 [Oryzias melastigma]@species | taxon:30732 [Oryzias melastigma]@species | cccgacccattttaaaaattaaataaaagatttcaggaactaaggggaggcaagtcgtaacatggtaagtgt |
|
||
| NC_044151_sub[922..993] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597641 [Sicyopterus squamosissimus]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus squamosissimus (cling goby) | Sicyopterus squamosissimus | taxon:2597641 [Sicyopterus squamosissimus]@species | taxon:2597641 [Sicyopterus squamosissimus]@species | cccaaaacaaacacacacataaataagaaaaaatgaaaataaaggggaggcaagtcgtaacatggtaagtgt |
|
||
| NC_044152_sub[922..994] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597642 [Sicyopterus stiphodonoides]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus stiphodonoides (cling goby) | Sicyopterus stiphodonoides | taxon:2597642 [Sicyopterus stiphodonoides]@species | taxon:2597642 [Sicyopterus stiphodonoides]@species | cccaaaacaaacacacacataaataagaaaaaantgaaaataaaggggaggcaagtcgtaacatggtaagtgt |
|
||
| NC_026976_sub[1453..1531] | True | forward | True | acaccgcccgtcactccc | ACACCGCCCGTCACTCTC | taxon:9545 [Macaca nemestrina]@species | 1 | ccaagtgcaccttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Macaca nemestrina (pig-tailed macaque) | Macaca nemestrina | taxon:9545 [Macaca nemestrina]@species | taxon:9545 [Macaca nemestrina]@species | ctcaaatatatttaaggaacatcttaactaaacgccctaatatttatatagaggggataagtcgtaacatggtaagtgt |
|
||
| NC_031553_sub[921..995] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:643337 [Puntioplites proctozystron]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Puntioplites proctozystron | Puntioplites proctozystron | taxon:643337 [Puntioplites proctozystron]@species | taxon:643337 [Puntioplites proctozystron]@species | ccctgtcaaaacgcactaaaaatatctaatacaaaagcaccgacaaggggaggcaagtcgtaacacggtaagtgt |
|
||
```
|
||
|
||
# We are now switching to R
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Preparing our R session
|
||
|
||
First we have to download the two follong libraries
|
||
|
||
```{r}
|
||
#| echo: true
|
||
library(tidyverse)
|
||
library(ggpubr)
|
||
library(ROBITools4)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Loading the data
|
||
|
||
```{r}
|
||
#| echo: true
|
||
|
||
fish <- read_csv('Teleostei_11.csv', show_col_types = FALSE)
|
||
taxo <- read_ncbi_taxdump('ncbitaxo_20250211')
|
||
assign_default_taxonomy(taxo)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
Looking for Teleostei taxid
|
||
|
||
```{r}
|
||
#| echo: true
|
||
teleo_taxid <- ecofind('Teleostei')
|
||
teleo_taxid
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Format taxids
|
||
|
||
```{r}
|
||
#| echo: true
|
||
fish %>% mutate(
|
||
taxid = as_taxid(
|
||
as.integer(
|
||
str_split_fixed(
|
||
str_split_fixed(
|
||
taxid,pattern = " ",
|
||
n = 2)[,1],
|
||
":",
|
||
2)[,2]))
|
||
) %>%
|
||
as_tbl_obipcr() %>%
|
||
mutate(across(taxon(),
|
||
.names="category",
|
||
.fn=taxonomy_classifier(Teleostei = 32443))) %>%
|
||
group_by(category) %>%
|
||
mutate(weight = taxonomic_weights(taxid,taxo)) %>%
|
||
ungroup() -> fish
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## The fish tibble
|
||
|
||
```{r}
|
||
#| echo: true
|
||
head(fish,n = 4)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Identifying which sequences belongs fish
|
||
|
||
```{r}
|
||
#| echo: true
|
||
table(fish$category)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Testing the conservation of the priming sites
|
||
|
||
```{r}
|
||
#| echo: true
|
||
pssm_forward <- pssm(fish$forward_match,
|
||
weights = fish$weight,
|
||
categories = fish$category)
|
||
|
||
pssm_reverse <- pssm(fish$reverse_match,
|
||
weights = fish$weight,
|
||
categories = fish$category)
|
||
```
|
||
|
||
```{r}
|
||
#| echo: true
|
||
pssm_forward
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Rescaling the matrix as Shanon entropy
|
||
|
||
$$
|
||
H = - \sum_{i \in \{A,C,G,T\}} p_i \times \frac{\log(p_i)}{\log(2)}
|
||
$$
|
||
|
||
```{r}
|
||
#| echo: true
|
||
pssm_forward <- pssm_scale_shanon(pssm_forward)
|
||
pssm_reverse <- pssm_scale_shanon(pssm_reverse)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Display the rescaled matrix
|
||
|
||
```{r}
|
||
#| echo: true
|
||
pssm_forward
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## The DNA logo of our primer pair
|
||
|
||
```{r}
|
||
#| echo: true
|
||
flogo <- ggbarcodelogo(pssm_forward) +
|
||
xlab("Forward primer") + ylab("Bits")
|
||
rlogo <- ggbarcodelogo(pssm_reverse) +
|
||
xlab("Reverse primer") + ylab("Bits")
|
||
|
||
ggarrange(flogo,rlogo,ncol=2) -> dnaplot
|
||
|
||
dnaplot
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## How many mismatches ?
|
||
|
||
```{r}
|
||
#| echo: true
|
||
ggbarcodemistmatch(fish$forward_error,
|
||
fish$reverse_error,
|
||
otu=fish$species_name,
|
||
categories=fish$category) + theme_minimal()
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Are we discriminate taxa ?
|
||
|
||
```{r}
|
||
#| echo: true
|
||
with(fish %>% filter(category == "Teleostei"),
|
||
discriminated_at_rank(taxid,
|
||
c("species","genus","family","order"),
|
||
sequence))
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## How many sequences will provide information at rank ?
|
||
|
||
```{r}
|
||
#| echo: true
|
||
with(fish %>% filter(category == "Teleostei"),
|
||
discriminant_at_rank(taxid,
|
||
c("species","genus","family","order"),
|
||
sequence))
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Is it the same for *Cyprinidae* ?
|
||
|
||
```{r}
|
||
#| echo: true
|
||
cyprinidae_taxid <- ecofind('Cyprinidae')
|
||
cyprinidae_taxid
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Classify according to both categories
|
||
|
||
```{r}
|
||
#| echo: true
|
||
fish %>%
|
||
mutate(across(taxon(),
|
||
.names="category2",
|
||
.fn=taxonomy_classifier(Teleostei = 32443,
|
||
Cyprinidae = 7953))) -> fish
|
||
|
||
table(fish$category2)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Are we discriminate taxa ?
|
||
|
||
```{r}
|
||
#| echo: true
|
||
with(fish %>% filter(category2 == "Cyprinidae"),
|
||
discriminated_at_rank(taxid,
|
||
c("species","genus"),
|
||
sequence))
|
||
```
|
||
|
||
# Go back to unix
|
||
|
||
## We run ecoPrimers and ecoPCR on the select primer pair
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
ecoPrimers -d vertebrata \
|
||
-e 3 -3 2 \
|
||
-l 30 -L 150 \
|
||
-r 7953 -c > Cyprinidae.ecoprimers
|
||
```
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obipcr --forward ACGGCGTAAAGGGTGGTT \
|
||
--reverse TATCTAATCCCAGTTTGT \
|
||
-e 5 \
|
||
-l 30 -L 500 \
|
||
-c \
|
||
mito.vert.fasta \
|
||
> Cyprinidae_14.fasta
|
||
```
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| echo: true
|
||
obicsv --auto -s -i Cyprinidae_14.fasta > Cyprinidae_14.csv
|
||
```
|
||
|
||
# Go back to R
|
||
|
||
```{r}
|
||
cyprinidae <- read_csv('Cyprinidae_14.csv',
|
||
show_col_types = FALSE) %>%
|
||
mutate(
|
||
taxid = as_taxid(
|
||
as.integer(
|
||
str_split_fixed(
|
||
str_split_fixed(
|
||
taxid,pattern = " ",
|
||
n = 2)[,1],
|
||
":",
|
||
2)[,2]))
|
||
) %>%
|
||
as_tbl_obipcr() %>%
|
||
mutate(across(taxon(),
|
||
.names="category",
|
||
.fn=taxonomy_classifier(Teleostei = 32443,
|
||
Cyprinidae = 7953))) %>%
|
||
group_by(category) %>%
|
||
mutate(weight = taxonomic_weights(taxid,taxo)) %>%
|
||
ungroup()
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Identifying which sequences belongs fish and *Cyprinidae*
|
||
|
||
```{r}
|
||
#| echo: true
|
||
table(cyprinidae$category)
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Looking for conservation
|
||
|
||
```{r}
|
||
#| echo: true
|
||
pssm_forward <- pssm(cyprinidae$forward_match,
|
||
weights = cyprinidae$weight,
|
||
categories = cyprinidae$category) %>%
|
||
pssm_scale_shanon()
|
||
|
||
pssm_reverse <- pssm(cyprinidae$reverse_match,
|
||
weights = cyprinidae$weight,
|
||
categories = cyprinidae$category) %>%
|
||
pssm_scale_shanon()
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Plot the new DNA logo
|
||
|
||
```{r}
|
||
flogo <- ggbarcodelogo(pssm_forward) +
|
||
xlab("Forward primer") + ylab("Bits")
|
||
rlogo <- ggbarcodelogo(pssm_reverse) +
|
||
xlab("Reverse primer") + ylab("Bits")
|
||
|
||
ggarrange(flogo,rlogo,ncol=2) -> dnaplot
|
||
|
||
dnaplot
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## How many mismatches ?
|
||
|
||
```{r}
|
||
#| echo: true
|
||
ggbarcodemistmatch(cyprinidae$forward_error,
|
||
cyprinidae$reverse_error,
|
||
otu=cyprinidae$species_name,
|
||
categories=cyprinidae$category) + theme_minimal()
|
||
```
|
||
|
||
------------------------------------------------------------------------
|
||
|
||
## Are we discriminate *Cyprinidae* taxa ?
|
||
|
||
```{r}
|
||
#| echo: true
|
||
with(cyprinidae %>% filter(category == "Cyprinidae"),
|
||
discriminated_at_rank(taxid,
|
||
c("species","genus"),
|
||
sequence))
|
||
``` |