Proofreading by Pierre + added construction of the reference database
@ -46,7 +46,7 @@ The data needed to run the tutorial are the following:
|
||||
|
||||
- the file describing the primers and tags used for all samples sequenced:
|
||||
|
||||
* ``wolf_ngsfilter.txt``
|
||||
* ``wolf_diet_ngsfilter.txt``
|
||||
The tags correspond to short and specific sequences added on the 5' end of each primer to distinguish the different samples
|
||||
|
||||
- the file containing the reference database in fasta format:
|
||||
@ -91,7 +91,7 @@ In our case, the command is:
|
||||
|
||||
The :py:mod:`--score-min` option avoids returning badly aligned sequences. If the alignment score is below 40, the
|
||||
forward and reverse reads are not aligned but concatenated, and the value of the :py:mod:`mode` attribute is set to :py:mod:`joined`
|
||||
instead of :py:mod:`alignement`
|
||||
instead of :py:mod:`alignment`
|
||||
|
||||
Remove unaligned sequence records
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
@ -133,27 +133,27 @@ Assign each sequence record to its sample
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Each sequence record is assigned to its original sample and to its experiment by using the information
|
||||
provided in a text file (here ``wolf_ngsfilter.txt``). This text file contains one line per sample, with the name
|
||||
provided in a text file (here ``wolf_diet_ngsfilter.txt``). This text file contains one line per sample, with the name
|
||||
of the experiment (several experiments can be listed in the same file), the name of the tags (for example: ``aattaac`` if the
|
||||
same tag has been used on each extremity of the PCR products, or ``aattaac:gaagtag`` if the tags are different), the sequence of the
|
||||
forward primer, the sequence of the reverse primer, the letter ``F`` or ``T`` for sample identification using the forward primer and tag
|
||||
only or using both primers and both tags, respectively.
|
||||
only or using both primers and both tags, respectively (see :doc:`ngsfilter <scripts/ngsfilter>` for details).
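As a small illustration of the tag notation described above, the following Python sketch splits a tag field
into its two tags. The function name ``split_tags`` is hypothetical and is not part of the OBITools. It also
assumes that, when two tags are given, the first one belongs to the forward primer and the second one to the
reverse primer, which the text above does not state explicitly.

.. code-block:: python

    def split_tags(tag_field):
        """Return (first_tag, second_tag) from a field such as
        'aattaac' or 'aattaac:gaagtag'."""
        first, separator, second = tag_field.strip().partition(":")
        if not separator:
            # A single tag means the same tag was used on both extremities.
            second = first
        return first, second


    print(split_tags("aattaac"))          # same tag on both extremities
    print(split_tags("aattaac:gaagtag"))  # two different tags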
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.ngs.fastq
|
||||
> ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.assigned.fastq
|
||||
|
||||
This command creates two files:
|
||||
|
||||
- ``unidentified.fastq`` containing all the sequence records that were not assigned to a sample
|
||||
|
||||
- ``wolf.ali.ngs.fastq`` containing all the sequence records that were properly assigned to a sample
|
||||
- ``wolf.ali.assigned.fastq`` containing all the sequence records that were properly assigned to a sample
|
||||
|
||||
Note that each sequence record of the ``wolf.ali.ngs.fastq`` file contains only the barcode sequence
|
||||
Note that each sequence record of the ``wolf.ali.assigned.fastq`` file contains only the barcode sequence
|
||||
as the sequences of primers and tags were removed. The information concerning the experiment, the sample,
the primers and the tags is added as several attributes in the sequence header.
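To make the ``key=value;`` layout of these header attributes concrete, here is a minimal Python sketch that
splits a header line into an identifier and a dictionary of attributes. It is not the OBITools parser: the
function name ``parse_header`` is hypothetical, and the sketch assumes that attributes are separated by ``;``
and that no value contains a ``;`` itself. The example header is shortened from the records shown in this
tutorial.

.. code-block:: python

    import ast


    def parse_header(header):
        """Split a fasta/fastq header into (identifier, attributes)."""
        header = header.lstrip(">@").strip()
        seq_id, _, rest = header.partition(" ")
        attributes = {}
        for field in rest.split(";"):
            field = field.strip()
            if not field:
                continue
            key, _, raw = field.partition("=")
            try:
                # Numbers, booleans and dictionaries are converted to Python values.
                value = ast.literal_eval(raw.strip())
            except (ValueError, SyntaxError):
                # Anything else (e.g. a primer sequence) is kept as plain text.
                value = raw.strip()
            attributes[key.strip()] = value
        return seq_id, attributes


    header = ("@HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB "
              "experiment=wolf_diet; forward_tag=gcctcct; reverse_tag=gcctcct; "
              "direction=forward; seq_length=99;")
    seq_id, attributes = parse_header(header)
    print(seq_id)
    print(attributes)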
|
||||
|
||||
The first sequence record of ``wolf.ali.ngs.fastq`` is:
|
||||
The first sequence record of ``wolf.ali.assigned.fastq`` is:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
@ -193,7 +193,7 @@ it is convenient to work with uniq *sequences* instead of *reads*. To *dereplica
|
||||
| original dataset (in this way, all duplicated reads are |
|
||||
| removed) |
|
||||
| |
|
||||
| Definition adapted from [#]_ |
|
||||
| Definition adapted from Seguritan and Rohwer (2001) |
|
||||
+-------------------------------------------------------------+
|
||||
|
||||
|
||||
@ -201,26 +201,23 @@ We use the :doc:`obiuniq <scripts/obiuniq>` command with the `-m sample`. The `-
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obiuniq -m sample wolf.ali.ngs.fastq > wolf.ali.ngs.uniq.fasta
|
||||
> obiuniq -m sample wolf.ali.assigned.fastq > wolf.ali.assigned.uniq.fasta
|
||||
|
||||
|
||||
The first sequence record of ``wolf.ali.ngs.uniq.fasta`` is:
|
||||
The first sequence record of ``wolf.ali.assigned.uniq.fasta`` is:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB ali_length=61;
|
||||
seq_ab_match=47; sminR=40.0; tail_quality=67.0;
|
||||
reverse_match=tagaacaggctcctctag; seq_a_deletion=1;
|
||||
forward_match=ttagataccccactatgc; forward_primer=ttagataccccactatgc;
|
||||
reverse_primer=tagaacaggctcctctag; sminL=40.0; merged_sample={'29a_F260619': 1};
|
||||
forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct; seq_b_mismatch=7;
|
||||
score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634;
|
||||
seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0;
|
||||
direction=forward; seq_b_insertion=0; experiment=wolf_diet; seq_b_deletion=1;
|
||||
seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct; count=1;
|
||||
seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46;
|
||||
ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
|
||||
ggttaaaactcaaaggacttggcggtgctttataccctt
|
||||
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP ali_length=61; seq_ab_match=47;
|
||||
sminR=40.0; tail_quality=67.0; reverse_match=ttagataccccactatgc; seq_a_deletion=1;
|
||||
forward_match=tagaacaggctcctctag; forward_primer=tagaacaggctcctctag; reverse_primer=ttagataccccactatgc;
|
||||
sminL=40.0; merged_sample={'29a_F260619': 1}; forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct;
|
||||
seq_b_mismatch=7; score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634;
|
||||
seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0; direction=reverse; seq_b_insertion=0;
|
||||
experiment=wolf_diet; seq_b_deletion=1; seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct;
|
||||
count=1; seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46;
|
||||
aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
|
||||
aatgattttgttataataattacttgtgtttagggctaa
|
||||
|
||||
Running :doc:`obiuniq <scripts/obiuniq>` has added two ``key=value`` entries to the header of the fasta sequence:
- :py:mod:`merged_sample={'29a_F260619': 1}`: this sequence has been found once in a single sample
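The idea behind dereplication can be sketched in a few lines of Python: identical sequences are grouped, and
the number of reads is recorded both in total and per sample. This is only an illustration of the principle,
not the :doc:`obiuniq <scripts/obiuniq>` implementation. The function name ``dereplicate`` and the short toy
sequences are hypothetical; the sample names are taken from the records shown in this tutorial.

.. code-block:: python

    from collections import Counter, defaultdict


    def dereplicate(reads):
        """reads: iterable of (sample, sequence) pairs.

        Returns {sequence: {"count": int, "merged_sample": {sample: int}}}.
        """
        per_sequence = defaultdict(Counter)
        for sample, sequence in reads:
            per_sequence[sequence][sample] += 1
        return {
            sequence: {"count": sum(samples.values()),
                       "merged_sample": dict(samples)}
            for sequence, samples in per_sequence.items()
        }


    reads = [
        ("29a_F260619", "ttagccctaaac"),
        ("29a_F260619", "ttagccctaaac"),
        ("15a_F730814", "ttagccctaaac"),
        ("26a_F040644", "ttagccctaaat"),
    ]
    unique = dereplicate(reads)
    for sequence, info in unique.items():
        print(sequence, info)

Each distinct sequence is reported once, with a total count and a per-sample breakdown playing the role of
the ``count`` and ``merged_sample`` attributes.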
|
||||
@ -232,46 +229,37 @@ To keep only these two ``key=value`` informations, we can use the :doc:`obiannot
|
||||
.. code-block:: bash
|
||||
|
||||
> obiannotate -k count -k merged_sample \
|
||||
wolf.ali.ngs.uniq.fasta > $$ ; mv $$ wolf.ali.ngs.uniq.fasta
|
||||
wolf.ali.assigned.uniq.fasta > $$ ; mv $$ wolf.ali.assigned.uniq.fasta
|
||||
|
||||
|
||||
The first five sequence records of ``wolf.ali.ngs.uniq.fasta`` becomes:
|
||||
The first five sequence records of ``wolf.ali.assigned.uniq.fasta`` become:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB
|
||||
merged_sample={'29a_F260619': 1}; count=1;
|
||||
ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
|
||||
ggttaaaactcaaaggacttggcggtgctttataccctt
|
||||
>HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB
|
||||
merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9;
|
||||
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaat
|
||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
||||
>HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB
|
||||
merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9;
|
||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaag
|
||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP
|
||||
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335;
|
||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
||||
>HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB
|
||||
merged_sample={'26a_F040644': 10490}; count=10490;
|
||||
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
|
||||
gcctgaaactcaaaggacttggcggtgctttacatccct
|
||||
|
||||
|
||||
|
||||
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 1}; count=1;
|
||||
aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
|
||||
aatgattttgttataataattacttgtgtttagggctaa
|
||||
>HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9;
|
||||
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||
gaacaattttgttatattaattacttgtgtttagggctaa
|
||||
>HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9;
|
||||
aagggtataaagcaccgccaagtcctttgagttttaagctcttgccggtagtactctggc
|
||||
gaataattttgttatattaattacttgtgtttagggctaa
|
||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335;
|
||||
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||
gaataattttgttatattaattacttgtgtttagggctaa
|
||||
>HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB_CMP merged_sample={'26a_F040644': 10490}; count=10490;
|
||||
agggatgtaaagcaccgccaagtcctttgagtttcaggctgttgctagtagtactctggc
|
||||
gaacattcttgtttattgaatgtttatgtttagggctaa
|
||||
|
||||
|
||||
Denoising the sequence dataset
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Having a set of sequences assigned to their original samples does not mean that all sequences
|
||||
are *biologically* meaningful i.e. some of these sequences can contains
|
||||
PCR and/or sequencing errors, or chimeras.
|
||||
To remove such sequences as much as possible, we first remove rare sequences and then remove
|
||||
sequences variants that likely correspond to artifacts from the original dataset.
|
||||
are *biologically* meaningful, i.e. some of these sequences can contain PCR and/or sequencing
errors, or chimeras. To remove such sequences as much as possible, we first remove rare sequences
and then remove sequence variants that likely correspond to artifacts.
|
||||
|
||||
|
||||
|
||||
@ -279,11 +267,11 @@ Get the counts statistics
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
In that case, we use :doc:`obistat <scripts/obistat>` to get the counting statistics on the 'count' attribute (the count attribute has been set by the :doc:`obiuniq <scripts/obiuniq>` command). By piping
|
||||
the result in the unix commands ``sort`` and ``head`` we keep only the counting statistics for the 20 lowest values of the 'count' attributes.
|
||||
the result into the *Unix* commands ``sort`` and ``head``, we keep only the counting statistics for the 20 lowest values of the 'count' attribute.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obistat -c count wolf.ali.ngs.uniq.fasta | \
|
||||
> obistat -c count wolf.ali.assigned.uniq.fasta | \
|
||||
sort -nk1 | head -20
|
||||
|
||||
This prints the output:
|
||||
@ -318,107 +306,162 @@ The dataset contains 3504 sequences occurring only once.
|
||||
Keep only the sequences having a count greater or equal to 10 and a length of at least 80 bp
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Based on the previous observation, we set the cut-off for keeping sequences for further analysis to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
|
||||
The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must be evaluated to :py:mod:`True` for each sequence to be kept. We also remove
|
||||
sequences with a length shorter than 80 bp (option -l).
|
||||
Based on the previous observation, we set the cut-off for keeping sequences for further analysis
|
||||
to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
|
||||
The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must be
|
||||
evaluated to :py:mod:`True` for each sequence to be kept. Based on previous knowledge, we also remove
sequences with a length shorter than 80 bp (option ``-l``), as we know that the amplified 12S-V5 barcode
for vertebrates should have a length of around 100 bp.
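The two conditions combine as a simple predicate, sketched below in Python. The record layout (a dictionary
with ``id``, ``count`` and ``sequence`` keys) and the function name ``keep`` are hypothetical and only serve
the illustration. Following the text above, the sketch assumes that the length threshold is a minimum length,
so that sequences shorter than 80 bp are discarded.

.. code-block:: python

    def keep(record, min_count=10, min_length=80):
        """True if the record passes both the count and the length filters."""
        return (record["count"] >= min_count
                and len(record["sequence"]) >= min_length)


    records = [
        {"id": "seq1", "count": 12335, "sequence": "a" * 99},  # kept
        {"id": "seq2", "count": 9, "sequence": "a" * 100},     # count too low
        {"id": "seq3", "count": 10490, "sequence": "a" * 60},  # too short
    ]
    kept = [record["id"] for record in records if keep(record)]
    print(kept)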
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obigrep -l 80 -p 'count>=10' wolf.ali.ngs.uniq.fasta \
|
||||
> wolf.ali.ngs.uniq.c10.l80.fasta
|
||||
> obigrep -l 80 -p 'count>=10' wolf.ali.assigned.uniq.fasta \
|
||||
> wolf.ali.assigned.uniq.c10.l80.fasta
|
||||
|
||||
|
||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.fasta`` is:
|
||||
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.fasta`` is:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP count=12335;
|
||||
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
|
||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
||||
|
||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB count=12335; merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
|
||||
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||
gaataattttgttatattaattacttgtgtttagggctaa
|
||||
|
||||
|
||||
Clean the sequences for PCR/sequencing errors (sequence variants)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
As a final step of denoising, using the :doc:`obiclean <scripts/obiclean>` we keep the `Head` sequences (``-H`` option) that are sequences with no variants with greater count or
|
||||
sequences with no variants with 20-fold greater (``-r 0.05`` option).
|
||||
As a final denoising step, we use the :doc:`obiclean <scripts/obiclean>` command to keep the `Head` sequences
(``-H`` option), that is, sequences with no variant of greater count or, given the ``-r 0.05`` option,
no variant with a 20-fold greater count.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obiclean -s merged_sample -r 0.05 -H \
|
||||
wolf.ali.ngs.uniq.c10.l80.fasta > wolf.ali.ngs.uniq.c10.l80.clean.fasta
|
||||
wolf.ali.assigned.uniq.c10.l80.fasta > wolf.ali.assigned.uniq.c10.l80.clean.fasta
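The role of the ``-r 0.05`` ratio can be illustrated with a deliberately simplified Python sketch. It is not
the :doc:`obiclean <scripts/obiclean>` algorithm: it handles a single sample, it treats two sequences as
variants of each other only when they have the same length and differ by exactly one substitution, and the
choice of ``<=`` for the threshold comparison is an assumption. The function names and the toy sequences are
hypothetical.

.. code-block:: python

    def one_substitution(a, b):
        """True if a and b have the same length and differ at exactly one position."""
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


    def head_sequences(counts, ratio=0.05):
        """counts: {sequence: count} for one sample.

        A sequence is kept when it has no variant so much more abundant
        that count(sequence) / count(variant) <= ratio.
        """
        heads = []
        for sequence, count in counts.items():
            is_probable_error = any(
                one_substitution(sequence, other) and count / other_count <= ratio
                for other, other_count in counts.items()
            )
            if not is_probable_error:
                heads.append(sequence)
        return heads


    counts = {
        "acgtacgt": 1000,  # abundant sequence
        "acgtacgg": 30,    # 30 / 1000 = 0.03 <= 0.05: treated as an error
        "acgtacga": 200,   # 200 / 1000 = 0.2 > 0.05: kept
        "ttttcccc": 15,    # no close variant: kept
    }
    print(head_sequences(counts))

With a ratio of 0.05, a sequence is discarded only when one of its variants is at least 20 times more
abundant, which matches the 20-fold rule stated above.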
|
||||
|
||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
|
||||
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.fasta`` is:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP
|
||||
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB
|
||||
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
|
||||
obiclean_count={'29a_F260619': 5438, '15a_F730814': 8642}; obiclean_head=True;
|
||||
obiclean_cluster={'29a_F260619':
|
||||
'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP',
|
||||
'15a_F730814':
|
||||
'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP'}; count=12335;
|
||||
obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814':
|
||||
'h'}; obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0;
|
||||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
||||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
||||
|
||||
obiclean_cluster={'29a_F260619': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB', '15a_F730814': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB'};
|
||||
count=12335; obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814': 'h'};
|
||||
obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0;
|
||||
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
|
||||
gaataattttgttatattaattacttgtgtttagggctaa
|
||||
|
||||
Taxonomic assignment of sequences
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The taxonomic assignement of sequences requires a reference database compiling all possible species to be identified in the sample. The assignment is then done
|
||||
based on sequence comparisons between the sample sequences and the reference sequences.
|
||||
Once denoising has been done, the next step in diet analysis is to relate the barcodes to their respective
|
||||
species in order to get the list of species associated with each sample.
|
||||
|
||||
The taxonomic assignment of sequences requires a reference database compiling all possible species to be
|
||||
identified in the sample. The assignment is then done based on sequence comparisons between the sample
|
||||
sequences and the reference sequences.
|
||||
|
||||
|
||||
Build a reference database
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two
|
||||
primers (`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each read to its sample (file ``wolf_ngsfilter.txt``). Note that the primers must
|
||||
be in the same order both in ``wolf_ngsfilter.txt`` and in the :doc`ecoPCR <scripts/ecoPCR>` command.
|
||||
As a rough way to build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate
|
||||
a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two primers
|
||||
(`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each
|
||||
read to its sample (file ``wolf_diet_ngsfilter.txt``).
|
||||
|
||||
The full list of steps to do in order to build this reference database would then be:
|
||||
|
||||
1. Download the whole set of EMBL sequences (available from: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/)
|
||||
2. Download the NCBI taxonomy (available from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
|
||||
3. Format them into the ecoPCR format (see :doc:`obiconvert <scripts/obiconvert>` for how you can produce
   ecoPCR compatible files)
|
||||
4. Use ecoPCR to simulate amplification and build a reference database on the basis of putatively
|
||||
amplified barcodes together with their recorded taxonomic information
|
||||
|
||||
As step 1 and step 3 can be really time consuming (about one day), we provide the reference database
produced by the following commands so that you can skip these steps. Note that as both the EMBL database
and the taxonomic data can change daily, if you run the following commands you may end up with quite different
results.
|
||||
|
||||
|
||||
Note that any utility able to download files from an FTP site can be used. In the following commands,
we use the common ``wget`` *Unix* command.
|
||||
|
||||
Download the sequences
|
||||
......................
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> mkdir EMBL
|
||||
> cd EMBL
|
||||
> wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
|
||||
> cd ..
|
||||
|
||||
Download the taxonomy
|
||||
.....................
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> mkdir TAXO
|
||||
> cd TAXO
|
||||
> wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
|
||||
> tar -zxvf taxdump.tar.gz
|
||||
> cd ..
|
||||
|
||||
Format the data
|
||||
...............
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obiconvert --embl -t ./TAXO --ecopcrDB-output=embl_last ./EMBL/*.dat
|
||||
|
||||
|
||||
Retrieve the sequences
|
||||
......................
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> ecoPCR -d /Volumes/R0/Barcode-Leca/R117/embl_r117 -e 3 -l 50 -L 150 \
|
||||
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05_r117.ecopcr
|
||||
|
||||
> ecoPCR -d ./ECODB/embl_last -e 3 -l 50 -L 150 \
|
||||
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
|
||||
|
||||
|
||||
Note that the primers must be in the same order both
|
||||
in ``wolf_diet_ngsfilter.txt`` and in the :doc:`ecoPCR <scripts/ecoPCR>` command.
|
||||
|
||||
|
||||
Clean the database
|
||||
..................
|
||||
|
||||
|
||||
1. filter the sequences so that they have a good taxonomic description at the species,
|
||||
genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
|
||||
|
||||
genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
|
||||
2. remove redundant sequences (:doc:`obiuniq <scripts/obiuniq>` command below).
|
||||
|
||||
3. ensure that the dereplicated sequences have a taxid at the family level
|
||||
(:doc:`obigrep <scripts/obigrep>` command below).
|
||||
|
||||
4. ensure that sequences each have a uniq identification (`awk` command below)
|
||||
(:doc:`obigrep <scripts/obigrep>` command below).
|
||||
4. ensure that each sequence has a unique identifier (:doc:`obiannotate <scripts/obiannotate>`
   command below)
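One simple way to guarantee unique identifiers, which is what the ``awk`` one-liner in this section does, is to
leave the first occurrence of an identifier unchanged and to append a counter to each later occurrence. The
Python sketch below reproduces that logic. The function name ``make_ids_unique`` and the example identifiers
are hypothetical, and nothing here is meant to describe how ``obiannotate --uniq-id`` names the records
internally.

.. code-block:: python

    def make_ids_unique(ids):
        """Append '.1', '.2', ... to the repeated occurrences of an identifier."""
        seen = {}
        unique = []
        for seq_id in ids:
            if seq_id in seen:
                seen[seq_id] += 1
                unique.append(f"{seq_id}.{seen[seq_id]}")
            else:
                seen[seq_id] = 0
                unique.append(seq_id)
        return unique


    print(make_ids_unique(["AB0001", "AB0002", "AB0001", "AB0001"]))

Note that this simple scheme does not check whether a generated name such as ``AB0001.1`` already exists in
the input.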
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=species \
|
||||
--require-rank=genus --require-rank=family v05_r117.ecopcr > v05_r117_clean.fasta
|
||||
> obigrep -d embl_last --require-rank=species \
|
||||
--require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta
|
||||
|
||||
> obiuniq -d /Volumes/R0/Barcode-Leca/R117/embl_r117 \
|
||||
v05_r117_clean.fasta > v05_r117_clean_uniq.fasta
|
||||
> obiuniq -d embl_last \
|
||||
v05_clean.fasta > v05_clean_uniq.fasta
|
||||
|
||||
> obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=family \
|
||||
v05_r117_clean_uniq.fasta > v05_r117_clean_uniq_clean.fasta
|
||||
> obigrep -d embl_last --require-rank=family \
|
||||
v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta
|
||||
|
||||
> awk '/^>/ && ($1 in nb) {nb[$1]++;$1=$1"."nb[$1];print;next;}/^>/{nb[$1]=0;}1' \
|
||||
v05_r117_clean_uniq_clean.fasta > db_v05_r117.fasta
|
||||
> obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta
|
||||
|
||||
|
||||
.. warning::
|
||||
    From now on, for the sake of clarity, the following commands will use the filenames of the provided data.
    If you decided to run the last steps and use the files you have produced, you'll have to use
    ``db_v05.fasta`` instead of ``db_v05_r117.fasta`` and ``embl_last`` instead of ``embl_r117``.
|
||||
|
||||
|
||||
Assign each sequence to a taxon
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
@ -428,8 +471,8 @@ the :doc:`ecotag <scripts/ecotag>` command.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.ngs.uniq.c10.l80.clean.fasta > \
|
||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta
|
||||
> ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.assigned.uniq.c10.l80.clean.fasta > \
|
||||
wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta
|
||||
|
||||
|
||||
The :doc:`ecotag <scripts/ecotag>` command adds several `key=value` attributes, among them:
|
||||
@ -437,10 +480,10 @@ The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among th
|
||||
- best_match=ACCESSION where ACCESSION is the id of one of the sequences in the reference database that best aligns with the query sequence;
- best_identity=FLOAT where FLOAT*100 is the percentage of identity between the best match sequence and the query sequence;
- taxid=TAXID where TAXID is the final assignment of the sequence by :doc:`ecotag <scripts/ecotag>`
  (it may be different if the query sequence matches several sequences in the reference database);
- scientific_name=NAME where NAME is the scientific name of the assigned taxid.
|
||||
|
||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
|
||||
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta`` is:
|
||||
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
@ -477,11 +520,11 @@ Some unuseful attributes can be removed.
|
||||
--delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
|
||||
--delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount \
|
||||
--delete-tag=id_status --delete-tag=rank_by_db --delete-tag=order_name \
|
||||
--delete-tag=order wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta > \
|
||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta
|
||||
--delete-tag=order wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta > \
|
||||
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta
|
||||
|
||||
|
||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta`` is:
|
||||
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta`` is:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
@ -501,10 +544,10 @@ The sequences can be sorted in decreasing order of `count`.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obisort -k count -r wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta > \
|
||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta
|
||||
> obisort -k count -r wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta > \
|
||||
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta
|
||||
|
||||
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:
|
||||
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
@ -523,8 +566,8 @@ Finally, a tab delimited file that can be open by excel or R is generated.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
> obitab -o wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta > \
|
||||
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.tab
|
||||
> obitab -o wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta > \
|
||||
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.tab
|
||||
|
||||
|
||||
This file contains 26 sequences. You can deduce the diet of each sample:
|
||||
@ -545,6 +588,8 @@ References
|
||||
- Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E (2011) ecoPrimers:
|
||||
inference of new DNA barcode markers from whole genome sequence analysis. Nucleic
|
||||
Acids Research, 39, e145.
|
||||
- Seguritan V, Rohwer F (2001) FastGroup: a program to dereplicate libraries of
  16S rDNA sequences. BMC Bioinformatics, 2, 9.
|
||||
|
||||
|
||||
Contact
|
||||
|