Relecture de Pierre + ajout fabrication base de référence

2014-04-08 15:24:56 +00:00
parent 70449d4638
commit 21860789a9
1 changed files with 160 additions and 115 deletions
--- a/doc/sphinx/source/wolves.rst
+++ b/doc/sphinx/source/wolves.rst
@ -46,7 +46,7 @@ The data needed to run the tutorial are the following:
    
 - the file describing the primers and tags used for all samples sequenced:

-    * ``wolf_ngsfilter.txt``
+    * ``wolf_diet_ngsfilter.txt``
      The tags correspond to short and specific sequences added on the 5' end of each primer to distinguish the different samples
    
 - the file containing the reference database in fasta format:
@ -91,7 +91,7 @@ In our case, the command is:

 The :py:mod:`--score-min` option allows to avoid returning badly aligned sequence. If the alignment score is below 40, the 
 forward and reverse reads are not aligned but concatenated, and the value of the :py:mod:`mode` attribute is set to :py:mod:`joined` 
-instead of :py:mod:`alignement`   
+instead of :py:mod:`alignment`   

 Remove not aligned sequence records
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -133,27 +133,27 @@ Assign each sequence record to its sample
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Each sequence record is assigned to its original sample and to its experiment by using the information
-provided in a text file (here ``wolf_ngsfilter.txt``). This text file contains one line per sample, with the name
+provided in a text file (here ``wolf_diet_ngsfilter.txt``). This text file contains one line per sample, with the name
 of the experiment (several experiment can be indicated in the same file), the name of the tags (for example: ``aattaac`` if the 
 same tag has been used on each extremity of the PCR products, or ``aattaac:gaagtag`` if the tags are different), the sequence of the 
 forward primer, the sequence of the reverse primer, the letter ``F`` or ``T`` for sample identification using the forward primer and tag 
-only or using both primers and both tags, respectively.
+only or using both primers and both tags, respectively (see :doc:`ngsfilter  <scripts/ngsfilter>` for details).

 .. code-block:: bash

-   > ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.ngs.fastq
+   > ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.assigned.fastq

 This command creates two files:

 - ``unidentified.fastq`` containing all the sequence records that were not assigned to a sample

- ``wolf.ali.ngs.fastq`` containing all the sequence records that were properly assigned to a sample
+- ``wolf.ali.assigned.fastq`` containing all the sequence records that were properly assigned to a sample

-Note that each sequence record of the ``wolf.ali.ngs.fastq`` file contains only the barcode sequence 
+Note that each sequence record of the ``wolf.ali.assigned.fastq`` file contains only the barcode sequence 
 as the sequences of primers and tags were removed. The information concerning the experiment, the sample, 
 primers and the tags are added as several attributes in the sequence heading.

-The first sequence record of ``wolf.ali.ngs.fastq`` is:
+The first sequence record of ``wolf.ali.assigned.fastq`` is:

 .. code-block:: bash

@ -193,7 +193,7 @@ it is convenient to work with uniq *sequences* instead of *reads*. To *dereplica
 |    original dataset (in this way, all duplicated reads are  |
 |    removed)                                                 |
 |                                                             |
-| Definition adapted from [#]_                                |
+| Definition adapted from Seguritan and Rohwer (2001)         |
 +-------------------------------------------------------------+


@ -201,26 +201,23 @@ We use the :doc:`obiuniq <scripts/obiuniq>` command with the `-m sample`. The `-

 .. code-block:: bash

-   > obiuniq -m sample wolf.ali.ngs.fastq > wolf.ali.ngs.uniq.fasta
+   > obiuniq -m sample wolf.ali.assigned.fastq > wolf.ali.assigned.uniq.fasta


-The first sequence record of ``wolf.ali.ngs.uniq.fasta`` is:
+The first sequence record of ``wolf.ali.assigned.uniq.fasta`` is:

 .. code-block:: bash

-	>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB ali_length=61; 
-	seq_ab_match=47; sminR=40.0; tail_quality=67.0; 
-	reverse_match=tagaacaggctcctctag; seq_a_deletion=1; 
-	forward_match=ttagataccccactatgc; forward_primer=ttagataccccactatgc; 
-	reverse_primer=tagaacaggctcctctag; sminL=40.0; merged_sample={'29a_F260619': 1}; 
-	forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct; seq_b_mismatch=7; 
-	score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634; 
-	seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0; 
-	direction=forward; seq_b_insertion=0; experiment=wolf_diet; seq_b_deletion=1; 
-	seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct; count=1; 
-	seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46; 
-	ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
-	ggttaaaactcaaaggacttggcggtgctttataccctt
+   >HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP ali_length=61; seq_ab_match=47; 
+   sminR=40.0; tail_quality=67.0; reverse_match=ttagataccccactatgc; seq_a_deletion=1; 
+   forward_match=tagaacaggctcctctag; forward_primer=tagaacaggctcctctag; reverse_primer=ttagataccccactatgc; 
+   sminL=40.0; merged_sample={'29a_F260619': 1}; forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct; 
+   seq_b_mismatch=7; score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634; 
+   seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0; direction=reverse; seq_b_insertion=0; 
+   experiment=wolf_diet; seq_b_deletion=1; seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct; 
+   count=1; seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46; 
+   aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
+   aatgattttgttataataattacttgtgtttagggctaa
 	
 The run of :doc:`obiuniq <scripts/obiuniq>` has added two key=values entries in the header of the fasta sequence :
   - :py:mod:`merged_sample={'29a_F260619': 1}` : this sequence have been found once in a single sample
@ -232,46 +229,37 @@ To keep only these two ``key=value`` informations, we can use the :doc:`obiannot
 .. code-block:: bash

   > obiannotate -k count -k merged_sample \
-     wolf.ali.ngs.uniq.fasta > $$ ; mv $$ wolf.ali.ngs.uniq.fasta
+     wolf.ali.assigned.uniq.fasta > $$ ; mv $$ wolf.ali.assigned.uniq.fasta


-The first five sequence records of ``wolf.ali.ngs.uniq.fasta`` becomes:
+The first five sequence records of ``wolf.ali.assigned.uniq.fasta`` becomes:

 .. code-block:: bash

-   >HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB 
-   merged_sample={'29a_F260619': 1}; count=1; 
-   ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
-   ggttaaaactcaaaggacttggcggtgctttataccctt
-   >HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB 
-   merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9; 
-   ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaat
-   agcttaaaactcaaaggacttggcggtgctttataccctt
-   >HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB 
-   merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9; 
-   ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaag
-   agcttaaaactcaaaggacttggcggtgctttataccctt
-   >HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP 
-   merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335; 
-   ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
-   agcttaaaactcaaaggacttggcggtgctttataccctt
-   >HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB 
-   merged_sample={'26a_F040644': 10490}; count=10490; 
-   ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
-   gcctgaaactcaaaggacttggcggtgctttacatccct
-
-
-
+   >HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 1}; count=1; 
+   aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
+   aatgattttgttataataattacttgtgtttagggctaa
+   >HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9; 
+   aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
+   gaacaattttgttatattaattacttgtgtttagggctaa
+   >HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9; 
+   aagggtataaagcaccgccaagtcctttgagttttaagctcttgccggtagtactctggc
+   gaataattttgttatattaattacttgtgtttagggctaa
+   >HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335; 
+   aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
+   gaataattttgttatattaattacttgtgtttagggctaa
+   >HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB_CMP merged_sample={'26a_F040644': 10490}; count=10490; 
+   agggatgtaaagcaccgccaagtcctttgagtttcaggctgttgctagtagtactctggc
+   gaacattcttgtttattgaatgtttatgtttagggctaa


 Denoising the sequence dataset
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 To have a set of sequences assigned to their original samples does not mean that all sequences 
-are *biologically* meaningful i.e. some of these sequences can contains 
-PCR and/or sequencing errors, or chimeras. 
-To remove such sequences as much as possible, we first remove rare sequences and then remove
-sequences variants that likely correspond to artifacts from the original dataset.
+are *biologically* meaningful i.e. some of these sequences can contains PCR and/or sequencing 
+errors, or chimeras. To remove such sequences as much as possible, we first remove rare sequences 
+and then remove sequences variants that likely correspond to artifacts.



@ -279,11 +267,11 @@ Get the counts statistics
 ~~~~~~~~~~~~~~~~~~~~~~~~~

 In that case, we use :doc:`obistat <scripts/obistat>` to get the counting statistics on the 'count' attribute (the count attribute has been set by the :doc:`obiuniq <scripts/obiuniq>` command). By piping 
-the result in the unix commands ``sort`` and ``head`` we keep only the counting statistics for the 20 lowest values of the 'count' attributes.
+the result in the *Unix* commands ``sort`` and ``head`` we keep only the counting statistics for the 20 lowest values of the 'count' attributes.

 .. code-block:: bash

-   > obistat -c count wolf.ali.ngs.uniq.fasta |  \  
+   > obistat -c count wolf.ali.assigned.uniq.fasta |  \  
     sort -nk1 | head -20

 This print the output:
@ -318,107 +306,162 @@ The dataset contains 3504 sequences occurring only once.
 Keep only the sequences having a count greater or equal to 10 and a length shorter than 80 bp
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Based on the previous observation, we set the cut-off for keeping sequences for further analysis to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
-The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must be evaluated to :py:mod:`True` for each sequence to be kept. We also remove 
-sequences with a length shorter than 80 bp (option -l).
+Based on the previous observation, we set the cut-off for keeping sequences for further analysis 
+to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
+The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must be 
+evaluated to :py:mod:`True` for each sequence to be kept. Based on previous knowledge we also remove 
+sequences with a length shorter than 80 bp (option -l) as we know that the amplified 12S-V5 barcode 
+for vertebrate must have a length arround 100bp.

 .. code-block:: bash

-   > obigrep -l 80 -p 'count>=10' wolf.ali.ngs.uniq.fasta \
-   	 > wolf.ali.ngs.uniq.c10.l80.fasta
+   > obigrep -l 80 -p 'count>=10' wolf.ali.assigned.uniq.fasta \
+   	 > wolf.ali.assigned.uniq.c10.l80.fasta
   	 
   	 
-The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.fasta`` is:
+The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.fasta`` is:

 .. code-block:: bash	 

-   >HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP count=12335; 
-   merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; 
-   ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
-   agcttaaaactcaaaggacttggcggtgctttataccctt
-
+   >HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB count=12335; merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; 
+   aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
+   gaataattttgttatattaattacttgtgtttagggctaa
 	

 Clean the sequences for PCR/sequencing errors (sequence variants)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-As a final step of denoising, using the :doc:`obiclean <scripts/obiclean>` we keep the `Head` sequences (``-H`` option) that are sequences with no variants with greater count or 
-sequences with no variants with 20-fold greater (``-r 0.05`` option).
+As a final step of denoising, using the :doc:`obiclean <scripts/obiclean>` we keep the `Head` sequences 
+(``-H`` option) that are sequences with no variants with greater count or sequences with no variants 
+with 20-fold greater (``-r 0.05`` option).

 .. code-block:: bash

   > obiclean -s merged_sample -r 0.05 -H \
-     wolf.ali.ngs.uniq.c10.l80.fasta > wolf.ali.ngs.uniq.c10.l80.clean.fasta 
+     wolf.ali.assigned.uniq.c10.l80.fasta > wolf.ali.assigned.uniq.c10.l80.clean.fasta 

-The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
+The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.fasta`` is:

 .. code-block:: bash

-   >HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP 
+   >HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB 
   merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; 
   obiclean_count={'29a_F260619': 5438, '15a_F730814': 8642}; obiclean_head=True; 
-   obiclean_cluster={'29a_F260619': 
-   'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP', 
-   '15a_F730814': 
-   'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP'}; count=12335; 
-   obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814': 
-   'h'}; obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0; 
-   ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
-   agcttaaaactcaaaggacttggcggtgctttataccctt
-
+   obiclean_cluster={'29a_F260619': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB', '15a_F730814': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB'}; 
+   count=12335; obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814': 'h'}; 
+   obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0; 
+   aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
+   gaataattttgttatattaattacttgtgtttagggctaa

 Taxonomic assignment of sequences
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The taxonomic assignement of sequences requires a reference database compiling all possible species to be identified in the sample. The assignment is then done 
-based on sequence comparisons between the sample sequences and the reference sequences.
+Once denoising has been done, the next step in diet analysis is to relate the barcodes to their respective 
+species in order to get the list of species associated to each sample.
+
+The taxonomic assignement of sequences requires a reference database compiling all possible species to be 
+identified in the sample. The assignment is then done based on sequence comparisons between the sample 
+sequences and the reference sequences.


 Build a reference database
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-To build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two
-primers (`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each read to its sample (file ``wolf_ngsfilter.txt``). Note that the primers must
-be in the same order both in ``wolf_ngsfilter.txt`` and in the :doc`ecoPCR <scripts/ecoPCR>` command.
+As a rough way to build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate 
+a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two primers 
+(`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each 
+read to its sample (file ``wolf_diet_ngsfilter.txt``). 
+
+The full list of steps to do in order to build this reference database would then be:
 
+1. Download the whole set of EMBL sequences (available from: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/)
+2. Download the NCBI taxonomy (available from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
+3. Format them into the ecoPCR format (set :doc:`obiconvert <scripts/obiconvert>` for how you can produce 
+   ecoPCR compatible files)
+4. Use ecoPCR to simulate amplification and build a reference database on the basis of putatively
+   amplified barcodes together with their recorded taxonomic information  
+
+As step 1 and step 3 can be really time consuming (about one day) we provide the reference database
+produce by the following commands so that you can skip the following steps. Note that as both the EMBL database
+and the taxonomic data can evolve dayly, if you run the following commands you may end up with quite different
+results.
+
+
+Note that any utility that allows downloading of files from a ftp site can be used. In the following commands,
+we use the commonly used ``wget`` *Unix* command.
+
+Download the sequences
+......................
+
+.. code-block:: bash
+
+   > mkdir EMBL
+   > cd EMBL
+   > wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
+   > cd ..
+
+Download the taxonomy
+.....................
+
+.. code-block:: bash
+
+   > mkdir TAXO
+   > cd TAXO
+   > wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
+   > tar -zxvf taxdump.tar.gz
+   > cd ..
+
+Format the data
+...............
+
+.. code-block:: bash
+
+   > obiconvert --embl -t ./TAXO --ecopcrDB-output=embl_last ./EMBL/*.dat
+

 Retrieve the sequences
 ......................

 .. code-block:: bash

-   > ecoPCR -d /Volumes/R0/Barcode-Leca/R117/embl_r117 -e 3 -l 50 -L 150 \ 
-     TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05_r117.ecopcr
-   
+   > ecoPCR -d ./ECODB/embl_last -e 3 -l 50 -L 150 \ 
+     TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
+
+
+Note that the primers must be in the same order both 
+in ``wolf_diet_ngsfilter.txt`` and in the :doc:`ecoPCR <scripts/ecoPCR>` command.
+
+
 Clean the database
 ..................

-
    1. filter the sequences so that they have a good taxonomic description at the species, 
-    genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
-
+       genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
    2. remove redundant sequences (:doc:`obiuniq <scripts/obiuniq>` command below).
-
    3. ensure that the dereplicated sequences have a taxid at the family level 
-    (:doc:`obigrep <scripts/obigrep>` command below).
-       
-    4. ensure that sequences each have a uniq identification (`awk` command below)
+       (:doc:`obigrep <scripts/obigrep>` command below).
+    4. ensure that sequences each have a uniq identification (:doc:`obiannotate <scripts/obiannotate>` 
+       command below)

 .. code-block:: bash

-   > obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=species \
-     --require-rank=genus --require-rank=family v05_r117.ecopcr > v05_r117_clean.fasta
+   > obigrep -d embl_last --require-rank=species \
+     --require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta
   
-   > obiuniq -d /Volumes/R0/Barcode-Leca/R117/embl_r117 \ 
-     v05_r117_clean.fasta > v05_r117_clean_uniq.fasta
+   > obiuniq -d embl_last \ 
+     v05_clean.fasta > v05_clean_uniq.fasta
   
-   > obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=family \ 
-     v05_r117_clean_uniq.fasta > v05_r117_clean_uniq_clean.fasta
+   > obigrep -d embl_last --require-rank=family \ 
+     v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta
   
-   > awk '/^>/ && ($1 in nb) {nb[$1]++;$1=$1"."nb[$1];print;next;}/^>/{nb[$1]=0;}1' \
-     v05_r117_clean_uniq_clean.fasta > db_v05_r117.fasta
+   > obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta


+.. warning::
+   From now, for the sake of clarity, the following commands will use the filenames of the provided data. 
+   If you decided to run the last steps and use the files you have produced, you'll have to use
+   ``db_v05.ecopcr`` instead of ``db_v05_r117.ecopcr`` and ``embl_last`` instead of ``embl_r117``
+

 Assign each sequence to a taxon
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -428,8 +471,8 @@ the :doc:`ecotag <scripts/ecotag>` command.

 .. code-block:: bash

-   > ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.ngs.uniq.c10.l80.clean.fasta > \
-     wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta
+   > ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.assigned.uniq.c10.l80.clean.fasta > \
+     wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta


 The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among them are :
@ -437,10 +480,10 @@ The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among th
 - best_match=ACCESSION where ACCESSION is the id of one the sequence in the reference database that best align to the query sequence;
 - best_identity=FLOAT where FLOAT*100 is the percentage identity between the best match sequence and the query sequence;
 - taxid=TAXID where TAXID is the final assignation of the sequence by :doc:`ecotag <scripts/ecotag>` 
-(may be different if the query sequence math to several sequences in the reference database); 
 - scientific_name=NAME where NAME is the scientific name of the assigned taxid.

-The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
+The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta`` is:
+

 .. code-block:: bash

@ -477,11 +520,11 @@ Some unuseful attributes can be removed.
     --delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \ 
     --delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount \ 
     --delete-tag=id_status --delete-tag=rank_by_db --delete-tag=order_name \ 
-     --delete-tag=order wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta > \ 
-     wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta
+     --delete-tag=order wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta > \ 
+     wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta


-The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta`` is:
+The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta`` is:

 .. code-block:: bash

@ -501,10 +544,10 @@ The sequences can be sorted in decreasing order of `count`.

 .. code-block:: bash

-   > obisort -k count -r wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta >  \ 
-     wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta 
+   > obisort -k count -r wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta >  \ 
+     wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta 
   
-The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:
+The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:

 .. code-block:: bash

@ -523,8 +566,8 @@ Finally, a tab delimited file that can be open by excel or R is generated.

 .. code-block:: bash

-   > obitab -o wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta > \ 
-     wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.tab
+   > obitab -o wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta > \ 
+     wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.tab
 
   
 This file contains 26 sequences. You can deduce the diet of each sample:
@ -545,6 +588,8 @@ References
 - Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E (2011) ecoPrimers: 
   inference of new DNA barcode markers from whole genome sequence analysis. Nucleic 
   Acids Research, 39, e145.
+ - Seguritan V, Rohwer F. (2001) FastGroup: a program to dereplicate libraries of 
+   16S rDNA sequences. BMC Bioinformatics. 2001;2:9. Epub 2001 Oct 16.
 

 Contact