Relecture de Pierre + ajout fabrication base de référence
This commit is contained in:
@ -46,7 +46,7 @@ The data needed to run the tutorial are the following:
|
|||||||
|
|
||||||
- the file describing the primers and tags used for all samples sequenced:
|
- the file describing the primers and tags used for all samples sequenced:
|
||||||
|
|
||||||
* ``wolf_ngsfilter.txt``
|
* ``wolf_diet_ngsfilter.txt``
|
||||||
The tags correspond to short and specific sequences added on the 5' end of each primer to distinguish the different samples
|
The tags correspond to short and specific sequences added on the 5' end of each primer to distinguish the different samples
|
||||||
|
|
||||||
- the file containing the reference database in fasta format:
|
- the file containing the reference database in fasta format:
|
||||||
@ -91,7 +91,7 @@ In our case, the command is:
|
|||||||
|
|
||||||
The :py:mod:`--score-min` option allows to avoid returning badly aligned sequence. If the alignment score is below 40, the
|
The :py:mod:`--score-min` option allows to avoid returning badly aligned sequence. If the alignment score is below 40, the
|
||||||
forward and reverse reads are not aligned but concatenated, and the value of the :py:mod:`mode` attribute is set to :py:mod:`joined`
|
forward and reverse reads are not aligned but concatenated, and the value of the :py:mod:`mode` attribute is set to :py:mod:`joined`
|
||||||
instead of :py:mod:`alignement`
|
instead of :py:mod:`alignment`
|
||||||
|
|
||||||
Remove not aligned sequence records
|
Remove not aligned sequence records
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
@ -133,27 +133,27 @@ Assign each sequence record to its sample
|
|||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
Each sequence record is assigned to its original sample and to its experiment by using the information
|
Each sequence record is assigned to its original sample and to its experiment by using the information
|
||||||
provided in a text file (here ``wolf_ngsfilter.txt``). This text file contains one line per sample, with the name
|
provided in a text file (here ``wolf_diet_ngsfilter.txt``). This text file contains one line per sample, with the name
|
||||||
of the experiment (several experiment can be indicated in the same file), the name of the tags (for example: ``aattaac`` if the
|
of the experiment (several experiment can be indicated in the same file), the name of the tags (for example: ``aattaac`` if the
|
||||||
same tag has been used on each extremity of the PCR products, or ``aattaac:gaagtag`` if the tags are different), the sequence of the
|
same tag has been used on each extremity of the PCR products, or ``aattaac:gaagtag`` if the tags are different), the sequence of the
|
||||||
forward primer, the sequence of the reverse primer, the letter ``F`` or ``T`` for sample identification using the forward primer and tag
|
forward primer, the sequence of the reverse primer, the letter ``F`` or ``T`` for sample identification using the forward primer and tag
|
||||||
only or using both primers and both tags, respectively.
|
only or using both primers and both tags, respectively (see :doc:`ngsfilter <scripts/ngsfilter>` for details).
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.ngs.fastq
|
> ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.assigned.fastq
|
||||||
|
|
||||||
This command creates two files:
|
This command creates two files:
|
||||||
|
|
||||||
- ``unidentified.fastq`` containing all the sequence records that were not assigned to a sample
|
- ``unidentified.fastq`` containing all the sequence records that were not assigned to a sample
|
||||||
|
|
||||||
- ``wolf.ali.ngs.fastq`` containing all the sequence records that were properly assigned to a sample
|
- ``wolf.ali.assigned.fastq`` containing all the sequence records that were properly assigned to a sample
|
||||||
|
|
||||||
Note that each sequence record of the ``wolf.ali.ngs.fastq`` file contains only the barcode sequence
|
Note that each sequence record of the ``wolf.ali.assigned.fastq`` file contains only the barcode sequence
|
||||||
as the sequences of primers and tags were removed. The information concerning the experiment, the sample,
|
as the sequences of primers and tags were removed. The information concerning the experiment, the sample,
|
||||||
primers and the tags are added as several attributes in the sequence heading.
|
primers and the tags are added as several attributes in the sequence heading.
|
||||||
|
|
||||||
The first sequence record of ``wolf.ali.ngs.fastq`` is:
|
The first sequence record of ``wolf.ali.assigned.fastq`` is:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
@ -193,7 +193,7 @@ it is convenient to work with uniq *sequences* instead of *reads*. To *dereplica
|
|||||||
| original dataset (in this way, all duplicated reads are |
|
| original dataset (in this way, all duplicated reads are |
|
||||||
| removed) |
|
| removed) |
|
||||||
| |
|
| |
|
||||||
| Definition adapted from [#]_ |
|
| Definition adapted from Seguritan and Rohwer (2001) |
|
||||||
+-------------------------------------------------------------+
|
+-------------------------------------------------------------+
|
||||||
|
|
||||||
|
|
||||||
@ -201,26 +201,23 @@ We use the :doc:`obiuniq <scripts/obiuniq>` command with the `-m sample`. The `-
|
|||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obiuniq -m sample wolf.ali.ngs.fastq > wolf.ali.ngs.uniq.fasta
|
> obiuniq -m sample wolf.ali.assigned.fastq > wolf.ali.assigned.uniq.fasta
|
||||||
|
|
||||||
|
|
||||||
The first sequence record of ``wolf.ali.ngs.uniq.fasta`` is:
|
The first sequence record of ``wolf.ali.assigned.uniq.fasta`` is:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB ali_length=61;
|
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP ali_length=61; seq_ab_match=47;
|
||||||
seq_ab_match=47; sminR=40.0; tail_quality=67.0;
|
sminR=40.0; tail_quality=67.0; reverse_match=ttagataccccactatgc; seq_a_deletion=1;
|
||||||
reverse_match=tagaacaggctcctctag; seq_a_deletion=1;
|
forward_match=tagaacaggctcctctag; forward_primer=tagaacaggctcctctag; reverse_primer=ttagataccccactatgc;
|
||||||
forward_match=ttagataccccactatgc; forward_primer=ttagataccccactatgc;
|
sminL=40.0; merged_sample={'29a_F260619': 1}; forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct;
|
||||||
reverse_primer=tagaacaggctcctctag; sminL=40.0; merged_sample={'29a_F260619': 1};
|
seq_b_mismatch=7; score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634;
|
||||||
forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct; seq_b_mismatch=7;
|
seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0; direction=reverse; seq_b_insertion=0;
|
||||||
score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634;
|
experiment=wolf_diet; seq_b_deletion=1; seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct;
|
||||||
seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0;
|
count=1; seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46;
|
||||||
direction=forward; seq_b_insertion=0; experiment=wolf_diet; seq_b_deletion=1;
|
aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
|
||||||
seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct; count=1;
|
aatgattttgttataataattacttgtgtttagggctaa
|
||||||
seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46;
|
|
||||||
ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
|
|
||||||
ggttaaaactcaaaggacttggcggtgctttataccctt
|
|
||||||
|
|
||||||
The run of :doc:`obiuniq <scripts/obiuniq>` has added two key=values entries in the header of the fasta sequence :
|
The run of :doc:`obiuniq <scripts/obiuniq>` has added two key=values entries in the header of the fasta sequence :
|
||||||
- :py:mod:`merged_sample={'29a_F260619': 1}` : this sequence have been found once in a single sample
|
- :py:mod:`merged_sample={'29a_F260619': 1}` : this sequence have been found once in a single sample
|
||||||
@ -232,46 +229,37 @@ To keep only these two ``key=value`` informations, we can use the :doc:`obiannot
|
|||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obiannotate -k count -k merged_sample \
|
> obiannotate -k count -k merged_sample \
|
||||||
wolf.ali.ngs.uniq.fasta > $$ ; mv $$ wolf.ali.ngs.uniq.fasta
|
wolf.ali.assigned.uniq.fasta > $$ ; mv $$ wolf.ali.assigned.uniq.fasta
|
||||||
|
|
||||||
|
|
||||||
The first five sequence records of ``wolf.ali.ngs.uniq.fasta`` becomes:
|
The first five sequence records of ``wolf.ali.assigned.uniq.fasta`` becomes:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB
|
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 1}; count=1;
|
||||||
merged_sample={'29a_F260619': 1}; count=1;
|
aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
|
||||||
ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
|
aatgattttgttataataattacttgtgtttagggctaa
|
||||||
ggttaaaactcaaaggacttggcggtgctttataccctt
|
>HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9;
|
||||||
>HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB
|
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||||
merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9;
|
gaacaattttgttatattaattacttgtgtttagggctaa
|
||||||
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaat
|
>HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9;
|
||||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
aagggtataaagcaccgccaagtcctttgagttttaagctcttgccggtagtactctggc
|
||||||
>HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB
|
gaataattttgttatattaattacttgtgtttagggctaa
|
||||||
merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9;
|
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335;
|
||||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaag
|
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
gaataattttgttatattaattacttgtgtttagggctaa
|
||||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP
|
>HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB_CMP merged_sample={'26a_F040644': 10490}; count=10490;
|
||||||
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335;
|
agggatgtaaagcaccgccaagtcctttgagtttcaggctgttgctagtagtactctggc
|
||||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
gaacattcttgtttattgaatgtttatgtttagggctaa
|
||||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
|
||||||
>HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB
|
|
||||||
merged_sample={'26a_F040644': 10490}; count=10490;
|
|
||||||
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
|
|
||||||
gcctgaaactcaaaggacttggcggtgctttacatccct
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Denoising the sequence dataset
|
Denoising the sequence dataset
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
To have a set of sequences assigned to their original samples does not mean that all sequences
|
To have a set of sequences assigned to their original samples does not mean that all sequences
|
||||||
are *biologically* meaningful i.e. some of these sequences can contains
|
are *biologically* meaningful i.e. some of these sequences can contains PCR and/or sequencing
|
||||||
PCR and/or sequencing errors, or chimeras.
|
errors, or chimeras. To remove such sequences as much as possible, we first remove rare sequences
|
||||||
To remove such sequences as much as possible, we first remove rare sequences and then remove
|
and then remove sequences variants that likely correspond to artifacts.
|
||||||
sequences variants that likely correspond to artifacts from the original dataset.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -279,11 +267,11 @@ Get the counts statistics
|
|||||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
In that case, we use :doc:`obistat <scripts/obistat>` to get the counting statistics on the 'count' attribute (the count attribute has been set by the :doc:`obiuniq <scripts/obiuniq>` command). By piping
|
In that case, we use :doc:`obistat <scripts/obistat>` to get the counting statistics on the 'count' attribute (the count attribute has been set by the :doc:`obiuniq <scripts/obiuniq>` command). By piping
|
||||||
the result in the unix commands ``sort`` and ``head`` we keep only the counting statistics for the 20 lowest values of the 'count' attributes.
|
the result in the *Unix* commands ``sort`` and ``head`` we keep only the counting statistics for the 20 lowest values of the 'count' attributes.
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obistat -c count wolf.ali.ngs.uniq.fasta | \
|
> obistat -c count wolf.ali.assigned.uniq.fasta | \
|
||||||
sort -nk1 | head -20
|
sort -nk1 | head -20
|
||||||
|
|
||||||
This print the output:
|
This print the output:
|
||||||
@ -318,107 +306,162 @@ The dataset contains 3504 sequences occurring only once.
|
|||||||
Keep only the sequences having a count greater or equal to 10 and a length shorter than 80 bp
|
Keep only the sequences having a count greater or equal to 10 and a length shorter than 80 bp
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Based on the previous observation, we set the cut-off for keeping sequences for further analysis to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
|
Based on the previous observation, we set the cut-off for keeping sequences for further analysis
|
||||||
The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must be evaluated to :py:mod:`True` for each sequence to be kept. We also remove
|
to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
|
||||||
sequences with a length shorter than 80 bp (option -l).
|
The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must be
|
||||||
|
evaluated to :py:mod:`True` for each sequence to be kept. Based on previous knowledge we also remove
|
||||||
|
sequences with a length shorter than 80 bp (option -l) as we know that the amplified 12S-V5 barcode
|
||||||
|
for vertebrate must have a length arround 100bp.
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obigrep -l 80 -p 'count>=10' wolf.ali.ngs.uniq.fasta \
|
> obigrep -l 80 -p 'count>=10' wolf.ali.assigned.uniq.fasta \
|
||||||
> wolf.ali.ngs.uniq.c10.l80.fasta
|
> wolf.ali.assigned.uniq.c10.l80.fasta
|
||||||
|
|
||||||
|
|
||||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.fasta`` is:
|
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.fasta`` is:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP count=12335;
|
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB count=12335; merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
|
||||||
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
|
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
gaataattttgttatattaattacttgtgtttagggctaa
|
||||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Clean the sequences for PCR/sequencing errors (sequence variants)
|
Clean the sequences for PCR/sequencing errors (sequence variants)
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
As a final step of denoising, using the :doc:`obiclean <scripts/obiclean>` we keep the `Head` sequences (``-H`` option) that are sequences with no variants with greater count or
|
As a final step of denoising, using the :doc:`obiclean <scripts/obiclean>` we keep the `Head` sequences
|
||||||
sequences with no variants with 20-fold greater (``-r 0.05`` option).
|
(``-H`` option) that are sequences with no variants with greater count or sequences with no variants
|
||||||
|
with 20-fold greater (``-r 0.05`` option).
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obiclean -s merged_sample -r 0.05 -H \
|
> obiclean -s merged_sample -r 0.05 -H \
|
||||||
wolf.ali.ngs.uniq.c10.l80.fasta > wolf.ali.ngs.uniq.c10.l80.clean.fasta
|
wolf.ali.assigned.uniq.c10.l80.fasta > wolf.ali.assigned.uniq.c10.l80.clean.fasta
|
||||||
|
|
||||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
|
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.fasta`` is:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP
|
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB
|
||||||
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
|
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
|
||||||
obiclean_count={'29a_F260619': 5438, '15a_F730814': 8642}; obiclean_head=True;
|
obiclean_count={'29a_F260619': 5438, '15a_F730814': 8642}; obiclean_head=True;
|
||||||
obiclean_cluster={'29a_F260619':
|
obiclean_cluster={'29a_F260619': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB', '15a_F730814': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB'};
|
||||||
'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP',
|
count=12335; obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814': 'h'};
|
||||||
'15a_F730814':
|
obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0;
|
||||||
'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP'}; count=12335;
|
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||||
obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814':
|
gaataattttgttatattaattacttgtgtttagggctaa
|
||||||
'h'}; obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0;
|
|
||||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
|
||||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
|
||||||
|
|
||||||
|
|
||||||
Taxonomic assignment of sequences
|
Taxonomic assignment of sequences
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
The taxonomic assignement of sequences requires a reference database compiling all possible species to be identified in the sample. The assignment is then done
|
Once denoising has been done, the next step in diet analysis is to relate the barcodes to their respective
|
||||||
based on sequence comparisons between the sample sequences and the reference sequences.
|
species in order to get the list of species associated to each sample.
|
||||||
|
|
||||||
|
The taxonomic assignement of sequences requires a reference database compiling all possible species to be
|
||||||
|
identified in the sample. The assignment is then done based on sequence comparisons between the sample
|
||||||
|
sequences and the reference sequences.
|
||||||
|
|
||||||
|
|
||||||
Build a reference database
|
Build a reference database
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
To build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two
|
As a rough way to build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate
|
||||||
primers (`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each read to its sample (file ``wolf_ngsfilter.txt``). Note that the primers must
|
a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two primers
|
||||||
be in the same order both in ``wolf_ngsfilter.txt`` and in the :doc`ecoPCR <scripts/ecoPCR>` command.
|
(`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each
|
||||||
|
read to its sample (file ``wolf_diet_ngsfilter.txt``).
|
||||||
|
|
||||||
|
The full list of steps to do in order to build this reference database would then be:
|
||||||
|
|
||||||
|
1. Download the whole set of EMBL sequences (available from: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/)
|
||||||
|
2. Download the NCBI taxonomy (available from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
|
||||||
|
3. Format them into the ecoPCR format (set :doc:`obiconvert <scripts/obiconvert>` for how you can produce
|
||||||
|
ecoPCR compatible files)
|
||||||
|
4. Use ecoPCR to simulate amplification and build a reference database on the basis of putatively
|
||||||
|
amplified barcodes together with their recorded taxonomic information
|
||||||
|
|
||||||
|
As step 1 and step 3 can be really time consuming (about one day) we provide the reference database
|
||||||
|
produce by the following commands so that you can skip the following steps. Note that as both the EMBL database
|
||||||
|
and the taxonomic data can evolve dayly, if you run the following commands you may end up with quite different
|
||||||
|
results.
|
||||||
|
|
||||||
|
|
||||||
|
Note that any utility that allows downloading of files from a ftp site can be used. In the following commands,
|
||||||
|
we use the commonly used ``wget`` *Unix* command.
|
||||||
|
|
||||||
|
Download the sequences
|
||||||
|
......................
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
> mkdir EMBL
|
||||||
|
> cd EMBL
|
||||||
|
> wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
|
||||||
|
> cd ..
|
||||||
|
|
||||||
|
Download the taxonomy
|
||||||
|
.....................
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
> mkdir TAXO
|
||||||
|
> cd TAXO
|
||||||
|
> wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
|
||||||
|
> tar -zxvf taxdump.tar.gz
|
||||||
|
> cd ..
|
||||||
|
|
||||||
|
Format the data
|
||||||
|
...............
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
> obiconvert --embl -t ./TAXO --ecopcrDB-output=embl_last ./EMBL/*.dat
|
||||||
|
|
||||||
|
|
||||||
Retrieve the sequences
|
Retrieve the sequences
|
||||||
......................
|
......................
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> ecoPCR -d /Volumes/R0/Barcode-Leca/R117/embl_r117 -e 3 -l 50 -L 150 \
|
> ecoPCR -d ./ECODB/embl_last -e 3 -l 50 -L 150 \
|
||||||
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05_r117.ecopcr
|
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
|
||||||
|
|
||||||
|
|
||||||
|
Note that the primers must be in the same order both
|
||||||
|
in ``wolf_diet_ngsfilter.txt`` and in the :doc:`ecoPCR <scripts/ecoPCR>` command.
|
||||||
|
|
||||||
|
|
||||||
Clean the database
|
Clean the database
|
||||||
..................
|
..................
|
||||||
|
|
||||||
|
|
||||||
1. filter the sequences so that they have a good taxonomic description at the species,
|
1. filter the sequences so that they have a good taxonomic description at the species,
|
||||||
genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
|
genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
|
||||||
|
|
||||||
2. remove redundant sequences (:doc:`obiuniq <scripts/obiuniq>` command below).
|
2. remove redundant sequences (:doc:`obiuniq <scripts/obiuniq>` command below).
|
||||||
|
|
||||||
3. ensure that the dereplicated sequences have a taxid at the family level
|
3. ensure that the dereplicated sequences have a taxid at the family level
|
||||||
(:doc:`obigrep <scripts/obigrep>` command below).
|
(:doc:`obigrep <scripts/obigrep>` command below).
|
||||||
|
4. ensure that sequences each have a uniq identification (:doc:`obiannotate <scripts/obiannotate>`
|
||||||
4. ensure that sequences each have a uniq identification (`awk` command below)
|
command below)
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=species \
|
> obigrep -d embl_last --require-rank=species \
|
||||||
--require-rank=genus --require-rank=family v05_r117.ecopcr > v05_r117_clean.fasta
|
--require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta
|
||||||
|
|
||||||
> obiuniq -d /Volumes/R0/Barcode-Leca/R117/embl_r117 \
|
> obiuniq -d embl_last \
|
||||||
v05_r117_clean.fasta > v05_r117_clean_uniq.fasta
|
v05_clean.fasta > v05_clean_uniq.fasta
|
||||||
|
|
||||||
> obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=family \
|
> obigrep -d embl_last --require-rank=family \
|
||||||
v05_r117_clean_uniq.fasta > v05_r117_clean_uniq_clean.fasta
|
v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta
|
||||||
|
|
||||||
> awk '/^>/ && ($1 in nb) {nb[$1]++;$1=$1"."nb[$1];print;next;}/^>/{nb[$1]=0;}1' \
|
> obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta
|
||||||
v05_r117_clean_uniq_clean.fasta > db_v05_r117.fasta
|
|
||||||
|
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
From now, for the sake of clarity, the following commands will use the filenames of the provided data.
|
||||||
|
If you decided to run the last steps and use the files you have produced, you'll have to use
|
||||||
|
``db_v05.ecopcr`` instead of ``db_v05_r117.ecopcr`` and ``embl_last`` instead of ``embl_r117``
|
||||||
|
|
||||||
|
|
||||||
Assign each sequence to a taxon
|
Assign each sequence to a taxon
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
@ -428,8 +471,8 @@ the :doc:`ecotag <scripts/ecotag>` command.
|
|||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.ngs.uniq.c10.l80.clean.fasta > \
|
> ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.assigned.uniq.c10.l80.clean.fasta > \
|
||||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta
|
wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta
|
||||||
|
|
||||||
|
|
||||||
The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among them are :
|
The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among them are :
|
||||||
@ -437,10 +480,10 @@ The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among th
|
|||||||
- best_match=ACCESSION where ACCESSION is the id of one the sequence in the reference database that best align to the query sequence;
|
- best_match=ACCESSION where ACCESSION is the id of one the sequence in the reference database that best align to the query sequence;
|
||||||
- best_identity=FLOAT where FLOAT*100 is the percentage identity between the best match sequence and the query sequence;
|
- best_identity=FLOAT where FLOAT*100 is the percentage identity between the best match sequence and the query sequence;
|
||||||
- taxid=TAXID where TAXID is the final assignation of the sequence by :doc:`ecotag <scripts/ecotag>`
|
- taxid=TAXID where TAXID is the final assignation of the sequence by :doc:`ecotag <scripts/ecotag>`
|
||||||
(may be different if the query sequence math to several sequences in the reference database);
|
|
||||||
- scientific_name=NAME where NAME is the scientific name of the assigned taxid.
|
- scientific_name=NAME where NAME is the scientific name of the assigned taxid.
|
||||||
|
|
||||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
|
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta`` is:
|
||||||
|
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
@ -477,11 +520,11 @@ Some unuseful attributes can be removed.
|
|||||||
--delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
|
--delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
|
||||||
--delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount \
|
--delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount \
|
||||||
--delete-tag=id_status --delete-tag=rank_by_db --delete-tag=order_name \
|
--delete-tag=id_status --delete-tag=rank_by_db --delete-tag=order_name \
|
||||||
--delete-tag=order wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta > \
|
--delete-tag=order wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta > \
|
||||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta
|
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta
|
||||||
|
|
||||||
|
|
||||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta`` is:
|
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta`` is:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
@ -501,10 +544,10 @@ The sequences can be sorted in decreasing order of `count`.
|
|||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obisort -k count -r wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta > \
|
> obisort -k count -r wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta > \
|
||||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta
|
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta
|
||||||
|
|
||||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:
|
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
@ -523,8 +566,8 @@ Finally, a tab delimited file that can be open by excel or R is generated.
|
|||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
> obitab -o wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta > \
|
> obitab -o wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta > \
|
||||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.tab
|
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.tab
|
||||||
|
|
||||||
|
|
||||||
This file contains 26 sequences. You can deduce the diet of each sample:
|
This file contains 26 sequences. You can deduce the diet of each sample:
|
||||||
@ -545,6 +588,8 @@ References
|
|||||||
- Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E (2011) ecoPrimers:
|
- Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E (2011) ecoPrimers:
|
||||||
inference of new DNA barcode markers from whole genome sequence analysis. Nucleic
|
inference of new DNA barcode markers from whole genome sequence analysis. Nucleic
|
||||||
Acids Research, 39, e145.
|
Acids Research, 39, e145.
|
||||||
|
- Seguritan V, Rohwer F. (2001) FastGroup: a program to dereplicate libraries of
|
||||||
|
16S rDNA sequences. BMC Bioinformatics. 2001;2:9. Epub 2001 Oct 16.
|
||||||
|
|
||||||
|
|
||||||
Contact
|
Contact
|
||||||
|
Reference in New Issue
Block a user