Files
OBIJupyterHub/fish-primers.qmd

862 lines
32 KiB
Plaintext
Raw Permalink Normal View History

2025-11-16 14:27:42 +01:00
---
title: "Designing Fish primers"
format: revealjs
editor: visual
---
```{r setup, include=FALSE}
library(knitr)
```
# Preparing the data
------------------------------------------------------------------------
## What do we need ?
To design a new animal DNA metabarcode we download from the NCBI the following data
- The complete set of whole mitochondrial genomes
- The NCBI taxonomy
------------------------------------------------------------------------
## We need also:
- a Unix computer: a Mac or a Linux box
- A unix terminal window for typing commands
- Installed on the computer
- The OBITools --- <http://github.com/metabarcoding/obitools4>
- ecoPrimers --- <http://metabarcoding.org/ecoprimers>
- R --- <http://wwww.r-project.org>
------------------------------------------------------------------------
## Downloading the mitochondrial genomes
We can use an internet browser and download the files from NCBI FTP website
![](ncbi-ftp.png){fig-align="center"}
------------------------------------------------------------------------
## Downloading the mitochondrial genomes
We can use an internet browser and download the files from NCBI FTP website
or run the following command lines
```{bash}
#| eval: false
#| echo: true
curl 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/mitochondrion.1.genomic.gbff.gz' \
> mito.all.gb.gz
```
------------------------------------------------------------------------
```{bash}
#| eval: false
#| echo: true
zless mito.all.gb.gz
```
```
LOCUS NW_009243181 45189 bp DNA linear CON 06-OCT-2014
DEFINITION Fonticula alba strain ATCC 38817 mitochondrial scaffold
supercont2.211, whole genome shotgun sequence.
ACCESSION NW_009243181 NZ_AROH01000000
VERSION NW_009243181.1
DBLINK BioProject: PRJNA262900
Assembly: GCF_000388065.1
KEYWORDS WGS; RefSeq.
SOURCE mitochondrion Fonticula alba
ORGANISM Fonticula alba
Eukaryota; Rotosphaerida; Fonticulaceae; Fonticula.
REFERENCE 1 (bases 1 to 45189)
```
------------------------------------------------------------------------
## Downloading the complete taxonomy
```{bash}
#| eval: false
#| echo: true
obitaxonomy --download-ncbi
```
```
INFO[0000] Number of workers set 16
INFO[0000] Downloading NCBI Taxdump to ncbitaxo_20250211.tgz
downloading 100% ████████████████████████████████████████| (66/66 MB, 5.1 MB/s)
```
The NCBI taxonomy contains all the relationship between taxa. Each taxon is identified by a unique numerical id: `taxid`
------------------------------------------------------------------------
## The archive contains several files
file: `nodes.dmp`
```
1 | 1 | no rank | | 8 | 0 | ...
2 | 131567 | superkingdom | | 0 | 0 |
6 | 335928 | genus | | 0 | 1 |
7 | 6 | species | AC | 0 | 1 |
9 | 32199 | species | BA | 0 |
10 | 135621 | genus | | 0 |
11 | 1707 | species | CG | 0 | 1 |
13 | 203488 | genus | | 0 | 1 |
14 | 13 | species | DT | 0 | 1 |
```
------------------------------------------------------------------------
## file: `names.dmp`
```
1 | root | | scientific name |
2 | Bacteria | Bacteria <prokaryote> | scientific name |
2 | Monera | Monera <Bacteria> | in-part |
2 | Procaryotae | Procaryotae <Bacteria> | in-part |
2 | Prokaryota | Prokaryota <Bacteria> | in-part |
2 | Prokaryotae | Prokaryotae <Bacteria> | in-part |
2 | bacteria | bacteria <blast2> | blast name |
2 | eubacteria | | genbank common name |
2 | prokaryote | prokaryote <Bacteria> | in-part |
...
10 | Cellvibrio | | scientific name |
11 | [Cellvibrio] gilvus | | scientific name |
13 | Dictyoglomus | | scientific name |
14 | Dictyoglomus thermophilum | | scientific name |
```
------------------------------------------------------------------------
## Preparing the set of complete genomes
```{bash}
#| eval: false
#| echo: true
obiconvert --skip-empty \
--update-taxid \
-t ncbitaxo_20250211.tgz \
mito.all.gb.gz \
> mito.all.fasta
head -5 mito.all.fasta
```
five first lines of the new `mito.all.fasta` file
```
>NC_072933 {"definition":"Echinosophora koreensis mitochondrion, complete genome.","scientific_name":"mitochondrion Echinosophora koreensis","taxid":228658}
ctttcgggtcggaaatagaagatctggattagatcccttctcgatagctttagtcagagc
tcatccctcgaaaaagggagtagtgagatgagaaaagggtgactagaatacggaaattca
actagtgaagtcagatccgggaattccactattgaagttatccgtcttaggcttcaagca
agctatctttcaaggaagtcagtctaagccctaagccaagatctgctttttgccagtcaa
```
------------------------------------------------------------------------
## We want:
- annotate sequences by their species `taxid`
- keep a single genome per species
- extract only vertebrate genome
------------------------------------------------------------------------
## Looking for the **Vertebrata**'s taxid
```{bash}
#| eval: false
#| echo: true
obitaxonomy -t ncbitaxo_20250211.tgz \
--fixed \
'vertebrata'
```
``` csv
taxid,parent,taxonomic_rank,scientific_name
taxon:1261581 [Vertebrata]@genus,taxon:2008651 [Polysiphonioideae]@subfamily,genus,Vertebrata
taxon:7742 [Vertebrata]@clade,taxon:89593 [Craniata]@subphylum,clade,Vertebrata
```
------------------------------------------------------------------------
## Looking for the **Vertebrata**'s taxid
```{bash}
#| eval: false
#| echo: true
obitaxonomy -t ncbitaxo_20250211.tgz \
--fixed \
'vertebrata' \
| csvlook
```
``` csv
| taxid | parent | taxonomic_rank | scientific_name |
| -------------------------------- | ------------------------------------------- | -------------- | --------------- |
| taxon:1261581 [Vertebrata]@genus | taxon:2008651 [Polysiphonioideae]@subfamily | genus | Vertebrata |
| taxon:7742 [Vertebrata]@clade | taxon:89593 [Craniata]@subphylum | clade | Vertebrata |
```
## A genus called **Vertebrata**
```{bash}
#| eval: false
#| echo: true
obitaxonomy -t ncbitaxo_20250211.tgz \
-p 2008651 \
| csvlook
```
``` csv
| taxid | parent | taxonomic_rank | scientific_name |
| ------------------------------------------- | ------------------------------------------- | -------------- | ------------------ |
| taxon:2008651 [Polysiphonioideae]@subfamily | taxon:2803 [Rhodomelaceae]@family | subfamily | Polysiphonioideae |
| taxon:2803 [Rhodomelaceae]@family | taxon:2802 [Ceramiales]@order | family | Rhodomelaceae |
| taxon:2802 [Ceramiales]@order | taxon:2045261 [Rhodymeniophycidae]@subclass | order | Ceramiales |
| taxon:2045261 [Rhodymeniophycidae]@subclass | taxon:2806 [Florideophyceae]@class | subclass | Rhodymeniophycidae |
| taxon:2806 [Florideophyceae]@class | taxon:2763 [Rhodophyta]@phylum | class | Florideophyceae |
| taxon:2763 [Rhodophyta]@phylum | taxon:2759 [Eukaryota]@superkingdom | phylum | Rhodophyta |
| taxon:2759 [Eukaryota]@superkingdom | taxon:131567 [cellular organisms]@no rank | superkingdom | Eukaryota |
| taxon:131567 [cellular organisms]@no rank | taxon:1 [root]@no rank | no rank | cellular organisms |
| taxon:1 [root]@no rank | taxon:1 [root]@no rank | no rank | root |
```
------------------------------------------------------------------------
## Reannotation and selection of the genomes
```{bash}
#| eval: false
#| echo: true
obiannotate -t ncbitaxo_20250211.tgz \
--with-taxon-at-rank=species \
mito.all.fasta | \
obiannotate -S 'ori_taxid=annotations.taxid' | \
obiannotate -S 'taxid=annotations.species_taxid' | \
obiuniq -c taxid > mito.one.fasta
```
------------------------------------------------------------------------
## Species representation
```{bash}
#| eval: false
#| echo: true
obicsv -k taxid mito.one.fasta \
| tail -n +2 \
| sort \
| uniq -c \
| sort -nk1 \
| cut -w -f 2 \
| uplot count
```
```
┌ ┐
1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 17769.0
2 ┤ 90.0
3 ┤ 17.0
4 ┤ 5.0
5 ┤ 4.0
6 ┤ 2.0
7 ┤ 1.0
└ ┘
```
------------------------------------------------------------------------
## Selection of the vertebrata genomes
```{bash}
#| eval: false
#| echo: true
obigrep -t ncbitaxo_20250211.tgz \
-r 7742 \
mito.one.fasta > mito.vert.fasta
```
```{bash}
#| eval: false
#| echo: true
obicount mito.vert.fasta \
| csvlook
```
```
| entities | n |
| -------- | ----------- |
| variants | 7,822 |
| reads | 7,823 |
| symbols | 131,378,756 |
```
------------------------------------------------------------------------
## Prepare data for ecoPrimers 1/3
```{bash}
#| eval: false
#| echo: true
mkdir ncbitaxo_20250211
cd ncbitaxo_20250211
tar zxvf ../ncbitaxo_20250211.tgz
cd ..
```
------------------------------------------------------------------------
## Prepare data for ecoPrimers 2/3
```{bash}
#| eval: false
#| echo: true
obiconvert -O mito.vert.fasta > mito.vert.old.fasta
```
```{bash}
#| eval: false
#| echo: true
head -5 mito.vert.old.fasta
```
``` csv
>NC_071784 taxid=taxon:2065826 [Sineleotris saccharae]@species; count=1; ori_taxid=taxon:2065826 [Sineleotris saccharae]@species; scientific_name=mitochondrion Sineleotris saccharae; species_name=Sineleotris saccharae; species_taxid=taxon:2065826 [Sineleotris saccharae]@species; Sineleotris saccharae mitochondrion, complete genome.
gctagcgtagcttaaccaaagcataacactgaagatgttaagatgggccctagaaagccc
cgcaagcacaaaagcttggtcctggctttactatcagcttaggctaaacttacacatgca
agtatccgcatccccgtgagaatgcccttaagctcccaccgctaacaggagtcaaggagc
cggtatcaggcacaaccctgagttagcccacgacaccttgctcagccacacccccaaggg
```
------------------------------------------------------------------------
## Prepare data for ecoPrimers 3/3
```{bash}
#| eval: false
#| echo: true
ecoPCRFormat -t ncbitaxo_20250211 \
-f \
-n vertebrata \
mito.vert.old.fasta
```
```{bash}
#| eval: false
#| echo: true
ls -l vertebrata*
```
```
-rw-r--r--@ 1 coissac staff 260899785 Feb 11 11:53 vertabrata.ndx
-rw-r--r--@ 1 coissac staff 546 Feb 11 11:53 vertabrata.rdx
-rw-r--r--@ 1 coissac staff 121379751 Feb 11 11:53 vertabrata.tdx
-rw-r--r--@ 1 coissac staff 40446318 Feb 11 11:54 vertabrata_001.sdx
```
------------------------------------------------------------------------
## Looking for the *Teleostei* `taxid`
```{bash}
#| eval: false
#| echo: true
obitaxonomy -t ncbitaxo_20250211.tgz \
--fixed \
'Teleostei' \
| csvlook
```
``` csv
| taxid | parent | taxonomic_rank | scientific_name |
| ---------------------------------- | ---------------------------------- | -------------- | --------------- |
| taxon:32443 [Teleostei]@infraclass | taxon:41665 [Neopterygii]@subclass | infraclass | Teleostei |
```
------------------------------------------------------------------------
## Selecting the best primer pairs
```{bash}
#| eval: false
#| echo: true
ecoPrimers -d vertebrata \
-e 3 -3 2 \
-l 30 -L 150 \
-r 32443 \
-c > Teleostei.ecoprimers
```
- Total pair count : 9407
- Total good pair count : 407
------------------------------------------------------------------------
```{bash}
#| eval: false
#| echo: true
head -35 Teleostei.ecoprimers
```
``` csv
#
# ecoPrimer version 0.5
# Rank level optimisation : species
# max error count by oligonucleotide : 3
#
# Restricted to taxon:
# 32443 : Teleostei (infraclass)
#
# strict primer quorum : 0.70
# example quorum : 0.90
# counterexample quorum : 0.10
#
# database : vertebrata
# Database is constituted of 3909 examples corresponding to 3876 species
# and 0 counterexamples corresponding to 0 species
#
# amplifiat length between [30,150] bp
# DB sequences are considered as circular
# Pairs having specificity less than 0.60 will be ignored
#
0 AGAGTGACGGGCGGTGTG CGTCAGGTCGAGGTGTAG 62.8 42.4 57.5 34.1 12 11 GG 3864 0 0.988 3832 0 0.989 2731 0.713 134 146 138.22
1 CGTCAGGTCGAGGTGTAG GAGTGACGGGCGGTGTGT 57.5 34.1 63.1 42.9 11 12 GG 3863 0 0.988 3831 0 0.988 2730 0.713 133 145 137.22
2 CGTCAGGTCGAGGTGTAG GGGAGAGTGACGGGCGGT 57.5 34.1 64.5 37.0 11 13 GG 3811 0 0.975 3779 0 0.975 2689 0.712 137 149 141.22
3 CGTCAGGTCGAGGTGTAG GGGGAGAGTGACGGGCGG 57.5 34.1 65.5 38.4 11 14 GG 3804 0 0.973 3772 0 0.973 2682 0.711 138 149 142.22
4 ACACCGCCCGTCACTCTC ACCTTCCGGTACACTTAC 62.5 36.8 54.0 16.6 12 9 GG 3850 0 0.985 3818 0 0.985 2658 0.696 46 132 66.51
5 AACGTCAGGTCGAGGTGT AGAGTGACGGGCGGTGTG 58.8 28.4 62.8 41.7 10 12 GG 3779 0 0.967 3746 0 0.966 2653 0.708 137 148 140.23
6 ACACCGCCCGTCACTCTC CACCTTCCGGTACACTTA 62.5 36.8 54.0 16.6 12 9 GG 3846 0 0.984 3814 0 0.984 2654 0.696 47 133 67.51
7 AACGTCAGGTCGAGGTGT GAGTGACGGGCGGTGTGT 58.8 28.4 63.1 42.1 10 12 GG 3778 0 0.966 3745 0 0.966 2652 0.708 136 147 139.23
8 ACCTTCCGGTACACTTAC CACACCGCCCGTCACTCT 54.0 16.6 62.8 37.3 9 12 GG 3845 0 0.984 3813 0 0.984 2653 0.696 47 133 67.51
9 ACACCGCCCGTCACTCTC TCCGGTACACTTACCATG 62.5 36.8 54.1 18.1 12 9 GG 3851 0 0.985 3819 0 0.985 2651 0.694 42 128 62.51
10 ACACCGCCCGTCACTCTC CCGGTACACTTACCATGT 62.5 36.8 54.4 18.6 12 9 GG 3851 0 0.985 3819 0 0.985 2651 0.694 41 127 61.51
11 ACACCGCCCGTCACTCTC CCAAGTGCACCTTCCGGT 62.5 36.8 60.7 28.9 12 11 GG 3837 0 0.982 3805 0 0.982 2650 0.696 54 140 74.51
12 ACACCGCCCGTCACTCTC GCACCTTCCGGTACACTT 62.5 36.8 57.7 22.5 12 10 GG 3842 0 0.983 3810 0 0.983 2650 0.696 48 134 68.51
13 ACACCGCCCGTCACTCTC CGGTACACTTACCATGTT 62.5 36.8 52.4 15.7 12 8 GG 3850 0 0.985 3818 0 0.985 2650 0.694 40 126 60.51
14 ACACCGCCCGTCACTCTC CACTTACCATGTTACGAC 62.5 36.8 51.1 27.7 12 8 GG 3850 0 0.985 3817 0 0.985 2649 0.694 35 121 55.51
```
------------------------------------------------------------------------
- Primer ID : 11
 
| Primer | sequence | tm max | tm min | GC count |
|---------|--------------------|--------|--------|----------|
| Forward | ACACCGCCCGTCACTCTC | 62.5 | 36.8 | 12 |
| Reverse | CCAAGTGCACCTTCCGGT | 60.7 | 28.9 | 11 |
 
- amplifying 3837/3909 sequences\
- identify 2650/3876 Species
- Size ranging from 54bp to 140bp (mean: 74.75 bp)
## Testing the new primer pair
```{bash}
#| eval: false
#| echo: true
obipcr --forward ACACCGCCCGTCACTCTC \
--reverse CCAAGTGCACCTTCCGGT \
-e 5 \
-l 30 -L 150 \
-c \
mito.vert.fasta \
> Teleostei_11.fasta
```
```{bash}
#| eval: false
#| echo: true
head Teleostei_11.fasta
```
``` csv
>NC_022183_sub[925..998] {"count":1,"definition":"Acrossocheilus hemispinus mitochondrion, complete genome.","direction":"forward","forward_error":1,"forward_match":"acaccgcccgtcaccctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:356810 [Acrossocheilus hemispinus]@species","reverse_error":0,"reverse_match":"ccaagtgcaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Acrossocheilus hemispinus","species_name":"Acrossocheilus hemispinus","species_taxid":"taxon:356810 [Acrossocheilus hemispinus]@species","taxid":"taxon:356810 [Acrossocheilus hemispinus]@species"}
cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgt
aacatggtaagtgt
>NC_018560_sub[916..988] {"count":1,"definition":"Astatotilapia calliptera mitochondrion, complete genome.","direction":"forward","forward_error":0,"forward_match":"acaccgcccgtcactctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:8154 [Astatotilapia calliptera]@species","reverse_error":1,"reverse_match":"ccaagtacaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Astatotilapia calliptera (eastern happy)","species_name":"Astatotilapia calliptera","species_taxid":"taxon:8154 [Astatotilapia calliptera]@species","taxid":"taxon:8154 [Astatotilapia calliptera]@species"}
cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgta
acatggtaagtgt
>NC_056117_sub[923..997] {"count":1,"definition":"Pseudocrossocheilus tridentis mitochondrion, complete genome.","direction":"forward","forward_error":0,"forward_match":"acaccgcccgtcactctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species","reverse_error":0,"reverse_match":"ccaagtgcaccttccggt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Pseudocrossocheilus tridentis","species_name":"Pseudocrossocheilus tridentis","species_taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species","taxid":"taxon:887881 [Pseudocrossocheilus tridentis]@species"}
ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcg
taacacggtaagtgt
>NC_045904_sub[919..997] {"count":1,"definition":"Eospalax fontanierii mitochondrion, complete genome.","direction":"forward","forward_error":1,"forward_match":"acaccgcccgtcgctctc","forward_primer":"ACACCGCCCGTCACTCTC","ori_taxid":"taxon:146134 [Eospalax fontanierii]@species","reverse_error":4,"reverse_match":"ccaagcacactttccagt","reverse_primer":"CCAAGTGCACCTTCCGGT","scientific_name":"mitochondrion Eospalax fontanierii","species_name":"Eospalax fontanierii","species_taxid":"taxon:146134 [Eospalax fontanierii]@species","taxid":"taxon:146134 [Eospalax fontanierii]@species"}
```
------------------------------------------------------------------------
convert the fasta file to csv
```{bash}
#| eval: false
#| echo: true
obicsv --auto -s -i Teleostei_11.fasta > Teleostei_11.csv
```
and display the begining of the table
```{bash}
#| eval: false
#| echo: true
head Teleostei_11.csv | csvlook
```
``` csv
| id | count | direction | forward_error | forward_match | forward_primer | ori_taxid | reverse_error | reverse_match | reverse_primer | scientific_name | species_name | species_taxid | taxid | sequence |
| ------------------------- | ----- | --------- | ------------- | ------------------ | ------------------ | ---------------------------------------------------- | ------------- | ------------------ | ------------------ | ------------------------------------------------------ | ----------------------------- | ---------------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------- |
| NC_022183_sub[925..998] | True | forward | True | acaccgcccgtcaccctc | ACACCGCCCGTCACTCTC | taxon:356810 [Acrossocheilus hemispinus]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Acrossocheilus hemispinus | Acrossocheilus hemispinus | taxon:356810 [Acrossocheilus hemispinus]@species | taxon:356810 [Acrossocheilus hemispinus]@species | cccgtcaaaatacaccaaaaatacttaatacaataacactaacaaggggaggcaagtcgtaacatggtaagtgt |
| NC_018560_sub[916..988] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:8154 [Astatotilapia calliptera]@species | 1 | ccaagtacaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Astatotilapia calliptera (eastern happy) | Astatotilapia calliptera | taxon:8154 [Astatotilapia calliptera]@species | taxon:8154 [Astatotilapia calliptera]@species | cccaagccaacaacatcctataaataatacattttaccggtaaaggggaggcaagtcgtaacatggtaagtgt |
| NC_056117_sub[923..997] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:887881 [Pseudocrossocheilus tridentis]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Pseudocrossocheilus tridentis | Pseudocrossocheilus tridentis | taxon:887881 [Pseudocrossocheilus tridentis]@species | taxon:887881 [Pseudocrossocheilus tridentis]@species | ccctgtcaaaaagcatcaaatatatataataaattagcaatgacaaggggaggcaagtcgtaacacggtaagtgt |
| NC_045904_sub[919..997] | True | forward | True | acaccgcccgtcgctctc | ACACCGCCCGTCACTCTC | taxon:146134 [Eospalax fontanierii]@species | 4 | ccaagcacactttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Eospalax fontanierii | Eospalax fontanierii | taxon:146134 [Eospalax fontanierii]@species | taxon:146134 [Eospalax fontanierii]@species | ctcaagtacataaacttggatatattcttaataacccaacaaaaatattagaggagataagtcgtaacaaggtaagcat |
| NC_018546_sub[916..987] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:30732 [Oryzias melastigma]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Oryzias melastigma (Indian medaka) | Oryzias melastigma | taxon:30732 [Oryzias melastigma]@species | taxon:30732 [Oryzias melastigma]@species | cccgacccattttaaaaattaaataaaagatttcaggaactaaggggaggcaagtcgtaacatggtaagtgt |
| NC_044151_sub[922..993] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597641 [Sicyopterus squamosissimus]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus squamosissimus (cling goby) | Sicyopterus squamosissimus | taxon:2597641 [Sicyopterus squamosissimus]@species | taxon:2597641 [Sicyopterus squamosissimus]@species | cccaaaacaaacacacacataaataagaaaaaatgaaaataaaggggaggcaagtcgtaacatggtaagtgt |
| NC_044152_sub[922..994] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:2597642 [Sicyopterus stiphodonoides]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Sicyopterus stiphodonoides (cling goby) | Sicyopterus stiphodonoides | taxon:2597642 [Sicyopterus stiphodonoides]@species | taxon:2597642 [Sicyopterus stiphodonoides]@species | cccaaaacaaacacacacataaataagaaaaaantgaaaataaaggggaggcaagtcgtaacatggtaagtgt |
| NC_026976_sub[1453..1531] | True | forward | True | acaccgcccgtcactccc | ACACCGCCCGTCACTCTC | taxon:9545 [Macaca nemestrina]@species | 1 | ccaagtgcaccttccagt | CCAAGTGCACCTTCCGGT | mitochondrion Macaca nemestrina (pig-tailed macaque) | Macaca nemestrina | taxon:9545 [Macaca nemestrina]@species | taxon:9545 [Macaca nemestrina]@species | ctcaaatatatttaaggaacatcttaactaaacgccctaatatttatatagaggggataagtcgtaacatggtaagtgt |
| NC_031553_sub[921..995] | True | forward | False | acaccgcccgtcactctc | ACACCGCCCGTCACTCTC | taxon:643337 [Puntioplites proctozystron]@species | 0 | ccaagtgcaccttccggt | CCAAGTGCACCTTCCGGT | mitochondrion Puntioplites proctozystron | Puntioplites proctozystron | taxon:643337 [Puntioplites proctozystron]@species | taxon:643337 [Puntioplites proctozystron]@species | ccctgtcaaaacgcactaaaaatatctaatacaaaagcaccgacaaggggaggcaagtcgtaacacggtaagtgt |
```
# We are now switching to R
------------------------------------------------------------------------
## Preparing our R session
First we have to download the two follong libraries
```{r}
#| echo: true
library(tidyverse)
library(ggpubr)
library(ROBITools4)
```
------------------------------------------------------------------------
## Loading the data
```{r}
#| echo: true
fish <- read_csv('Teleostei_11.csv', show_col_types = FALSE)
taxo <- read_ncbi_taxdump('ncbitaxo_20250211')
assign_default_taxonomy(taxo)
```
------------------------------------------------------------------------
Looking for Teleostei taxid
```{r}
#| echo: true
teleo_taxid <- ecofind('Teleostei')
teleo_taxid
```
------------------------------------------------------------------------
## Format taxids
```{r}
#| echo: true
fish %>% mutate(
taxid = as_taxid(
as.integer(
str_split_fixed(
str_split_fixed(
taxid,pattern = " ",
n = 2)[,1],
":",
2)[,2]))
) %>%
as_tbl_obipcr() %>%
mutate(across(taxon(),
.names="category",
.fn=taxonomy_classifier(Teleostei = 32443))) %>%
group_by(category) %>%
mutate(weight = taxonomic_weights(taxid,taxo)) %>%
ungroup() -> fish
```
------------------------------------------------------------------------
## The fish tibble
```{r}
#| echo: true
head(fish,n = 4)
```
------------------------------------------------------------------------
## Identifying which sequences belongs fish
```{r}
#| echo: true
table(fish$category)
```
------------------------------------------------------------------------
## Testing the conservation of the priming sites
```{r}
#| echo: true
pssm_forward <- pssm(fish$forward_match,
weights = fish$weight,
categories = fish$category)
pssm_reverse <- pssm(fish$reverse_match,
weights = fish$weight,
categories = fish$category)
```
```{r}
#| echo: true
pssm_forward
```
------------------------------------------------------------------------
## Rescaling the matrix as Shanon entropy
$$
H = - \sum_{i \in \{A,C,G,T\}} p_i \times \frac{\log(p_i)}{\log(2)}
$$
```{r}
#| echo: true
pssm_forward <- pssm_scale_shanon(pssm_forward)
pssm_reverse <- pssm_scale_shanon(pssm_reverse)
```
------------------------------------------------------------------------
## Display the rescaled matrix
```{r}
#| echo: true
pssm_forward
```
------------------------------------------------------------------------
## The DNA logo of our primer pair
```{r}
#| echo: true
flogo <- ggbarcodelogo(pssm_forward) +
xlab("Forward primer") + ylab("Bits")
rlogo <- ggbarcodelogo(pssm_reverse) +
xlab("Reverse primer") + ylab("Bits")
ggarrange(flogo,rlogo,ncol=2) -> dnaplot
dnaplot
```
------------------------------------------------------------------------
## How many mismatches ?
```{r}
#| echo: true
ggbarcodemistmatch(fish$forward_error,
fish$reverse_error,
otu=fish$species_name,
categories=fish$category) + theme_minimal()
```
------------------------------------------------------------------------
## Are we discriminate taxa ?
```{r}
#| echo: true
with(fish %>% filter(category == "Teleostei"),
discriminated_at_rank(taxid,
c("species","genus","family","order"),
sequence))
```
------------------------------------------------------------------------
## How many sequences will provide information at rank ?
```{r}
#| echo: true
with(fish %>% filter(category == "Teleostei"),
discriminant_at_rank(taxid,
c("species","genus","family","order"),
sequence))
```
------------------------------------------------------------------------
## Is it the same for *Cyprinidae* ?
```{r}
#| echo: true
cyprinidae_taxid <- ecofind('Cyprinidae')
cyprinidae_taxid
```
------------------------------------------------------------------------
## Classify according to both categories
```{r}
#| echo: true
fish %>%
mutate(across(taxon(),
.names="category2",
.fn=taxonomy_classifier(Teleostei = 32443,
Cyprinidae = 7953))) -> fish
table(fish$category2)
```
------------------------------------------------------------------------
## Are we discriminate taxa ?
```{r}
#| echo: true
with(fish %>% filter(category2 == "Cyprinidae"),
discriminated_at_rank(taxid,
c("species","genus"),
sequence))
```
# Go back to unix
## We run ecoPrimers and ecoPCR on the select primer pair
```{bash}
#| eval: false
#| echo: true
ecoPrimers -d vertebrata \
-e 3 -3 2 \
-l 30 -L 150 \
-r 7953 -c > Cyprinidae.ecoprimers
```
```{bash}
#| eval: false
#| echo: true
obipcr --forward ACGGCGTAAAGGGTGGTT \
--reverse TATCTAATCCCAGTTTGT \
-e 5 \
-l 30 -L 500 \
-c \
mito.vert.fasta \
> Cyprinidae_14.fasta
```
```{bash}
#| eval: false
#| echo: true
obicsv --auto -s -i Cyprinidae_14.fasta > Cyprinidae_14.csv
```
# Go back to R
```{r}
cyprinidae <- read_csv('Cyprinidae_14.csv',
show_col_types = FALSE) %>%
mutate(
taxid = as_taxid(
as.integer(
str_split_fixed(
str_split_fixed(
taxid,pattern = " ",
n = 2)[,1],
":",
2)[,2]))
) %>%
as_tbl_obipcr() %>%
mutate(across(taxon(),
.names="category",
.fn=taxonomy_classifier(Teleostei = 32443,
Cyprinidae = 7953))) %>%
group_by(category) %>%
mutate(weight = taxonomic_weights(taxid,taxo)) %>%
ungroup()
```
------------------------------------------------------------------------
## Identifying which sequences belongs fish and *Cyprinidae*
```{r}
#| echo: true
table(cyprinidae$category)
```
------------------------------------------------------------------------
## Looking for conservation
```{r}
#| echo: true
pssm_forward <- pssm(cyprinidae$forward_match,
weights = cyprinidae$weight,
categories = cyprinidae$category) %>%
pssm_scale_shanon()
pssm_reverse <- pssm(cyprinidae$reverse_match,
weights = cyprinidae$weight,
categories = cyprinidae$category) %>%
pssm_scale_shanon()
```
------------------------------------------------------------------------
## Plot the new DNA logo
```{r}
flogo <- ggbarcodelogo(pssm_forward) +
xlab("Forward primer") + ylab("Bits")
rlogo <- ggbarcodelogo(pssm_reverse) +
xlab("Reverse primer") + ylab("Bits")
ggarrange(flogo,rlogo,ncol=2) -> dnaplot
dnaplot
```
------------------------------------------------------------------------
## How many mismatches ?
```{r}
#| echo: true
ggbarcodemistmatch(cyprinidae$forward_error,
cyprinidae$reverse_error,
otu=cyprinidae$species_name,
categories=cyprinidae$category) + theme_minimal()
```
------------------------------------------------------------------------
## Are we discriminate *Cyprinidae* taxa ?
```{r}
#| echo: true
with(cyprinidae %>% filter(category == "Cyprinidae"),
discriminated_at_rank(taxid,
c("species","genus"),
sequence))
```