Proofreading by Pierre + added construction of the reference database

Frédéric Boyer
2014-04-08 15:24:56 +00:00
parent 70449d4638
commit 21860789a9

@ -46,7 +46,7 @@ The data needed to run the tutorial are the following:
- the file describing the primers and tags used for all samples sequenced:
* ``wolf_ngsfilter.txt``
* ``wolf_diet_ngsfilter.txt``
The tags correspond to short and specific sequences added on the 5' end of each primer to distinguish the different samples
- the file containing the reference database in fasta format:
@ -91,7 +91,7 @@ In our case, the command is:
The :py:mod:`--score-min` option avoids returning badly aligned sequences. If the alignment score is below 40, the
forward and reverse reads are not aligned but concatenated, and the value of the :py:mod:`mode` attribute is set to :py:mod:`joined`
instead of :py:mod:`alignement`
instead of :py:mod:`alignment`
Remove not aligned sequence records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -133,27 +133,27 @@ Assign each sequence record to its sample
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Each sequence record is assigned to its original sample and to its experiment by using the information
provided in a text file (here ``wolf_ngsfilter.txt``). This text file contains one line per sample, with the name
provided in a text file (here ``wolf_diet_ngsfilter.txt``). This text file contains one line per sample, with the name
of the experiment (several experiments can be indicated in the same file), the name of the tags (for example: ``aattaac`` if the
same tag has been used on each extremity of the PCR products, or ``aattaac:gaagtag`` if the tags are different), the sequence of the
forward primer, the sequence of the reverse primer, the letter ``F`` or ``T`` for sample identification using the forward primer and tag
only or using both primers and both tags, respectively.
only or using both primers and both tags, respectively (see :doc:`ngsfilter <scripts/ngsfilter>` for details).
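As an illustration of the tag notation described above, here is a minimal Python sketch that splits a tag field into its forward and reverse tags. The helper name ``split_tags`` is hypothetical and is not part of OBITools; it only mirrors the convention that a single tag applies to both extremities of the PCR product.

```python
def split_tags(tag_field):
    """Return (forward_tag, reverse_tag) from a tag field.

    Hypothetical helper for illustration only; not an OBITools function.
    """
    forward, sep, reverse = tag_field.strip().lower().partition(":")
    if not sep:
        # A single tag means the same tag was used on both extremities.
        reverse = forward
    return forward, reverse


# The two notations mentioned in the text.
same_tag = split_tags("aattaac")
different_tags = split_tags("aattaac:gaagtag")
print(same_tag)
print(different_tags)
```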
.. code-block:: bash
> ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.ngs.fastq
> ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.assigned.fastq
This command creates two files:
- ``unidentified.fastq`` containing all the sequence records that were not assigned to a sample
- ``wolf.ali.ngs.fastq`` containing all the sequence records that were properly assigned to a sample
- ``wolf.ali.assigned.fastq`` containing all the sequence records that were properly assigned to a sample
Note that each sequence record of the ``wolf.ali.ngs.fastq`` file contains only the barcode sequence
Note that each sequence record of the ``wolf.ali.assigned.fastq`` file contains only the barcode sequence
as the sequences of primers and tags were removed. The information concerning the experiment, the sample,
primers and the tags are added as several attributes in the sequence heading.
The first sequence record of ``wolf.ali.ngs.fastq`` is:
The first sequence record of ``wolf.ali.assigned.fastq`` is:
.. code-block:: bash
@ -193,7 +193,7 @@ it is convenient to work with uniq *sequences* instead of *reads*. To *dereplica
| original dataset (in this way, all duplicated reads are |
| removed) |
| |
| Definition adapted from [#]_ |
| Definition adapted from Seguritan and Rohwer (2001) |
+-------------------------------------------------------------+
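A minimal Python sketch of what dereplication amounts to, assuming each read is reduced to a ``(sequence, sample)`` pair. The function name ``dereplicate`` and the toy reads are hypothetical; this is not how :doc:`obiuniq <scripts/obiuniq>` is implemented, only an illustration of how a total ``count`` and a per-sample ``merged_sample`` dictionary can be derived from identical reads.

```python
from collections import Counter, defaultdict


def dereplicate(reads):
    """Group identical sequences and count them per sample.

    reads: iterable of (sequence, sample) pairs (toy representation).
    """
    per_sample = defaultdict(Counter)
    for sequence, sample in reads:
        per_sample[sequence][sample] += 1
    return {
        sequence: {
            "count": sum(samples.values()),    # total number of reads
            "merged_sample": dict(samples),    # reads per sample
        }
        for sequence, samples in per_sample.items()
    }


# Toy reads, invented for the example.
reads = [("acgt", "s1"), ("acgt", "s1"), ("acgt", "s2"), ("ttga", "s2")]
unique = dereplicate(reads)
for sequence, info in unique.items():
    print(sequence, info)
```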
@ -201,26 +201,23 @@ We use the :doc:`obiuniq <scripts/obiuniq>` command with the `-m sample`. The `-
.. code-block:: bash
> obiuniq -m sample wolf.ali.ngs.fastq > wolf.ali.ngs.uniq.fasta
> obiuniq -m sample wolf.ali.assigned.fastq > wolf.ali.assigned.uniq.fasta
The first sequence record of ``wolf.ali.ngs.uniq.fasta`` is:
The first sequence record of ``wolf.ali.assigned.uniq.fasta`` is:
.. code-block:: bash
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB ali_length=61;
seq_ab_match=47; sminR=40.0; tail_quality=67.0;
reverse_match=tagaacaggctcctctag; seq_a_deletion=1;
forward_match=ttagataccccactatgc; forward_primer=ttagataccccactatgc;
reverse_primer=tagaacaggctcctctag; sminL=40.0; merged_sample={'29a_F260619': 1};
forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct; seq_b_mismatch=7;
score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634;
seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0;
direction=forward; seq_b_insertion=0; experiment=wolf_diet; seq_b_deletion=1;
seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct; count=1;
seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46;
ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
ggttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP ali_length=61; seq_ab_match=47;
sminR=40.0; tail_quality=67.0; reverse_match=ttagataccccactatgc; seq_a_deletion=1;
forward_match=tagaacaggctcctctag; forward_primer=tagaacaggctcctctag; reverse_primer=ttagataccccactatgc;
sminL=40.0; merged_sample={'29a_F260619': 1}; forward_score=72.0; seq_a_mismatch=7; forward_tag=gcctcct;
seq_b_mismatch=7; score=115.761290673; mid_quality=69.4210526316; avg_quality=69.1045751634;
seq_a_single=46; score_norm=1.89772607661; reverse_score=72.0; direction=reverse; seq_b_insertion=0;
experiment=wolf_diet; seq_b_deletion=1; seq_a_insertion=0; seq_length_ori=153; reverse_tag=gcctcct;
count=1; seq_length=99; status=full; mode=alignment; head_quality=67.0; seq_b_single=46;
aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
aatgattttgttataataattacttgtgtttagggctaa
Running :doc:`obiuniq <scripts/obiuniq>` has added two ``key=value`` entries to the header of the fasta sequence:
- :py:mod:`merged_sample={'29a_F260619': 1}` : this sequence has been found once in a single sample
@ -232,46 +229,37 @@ To keep only these two ``key=value`` informations, we can use the :doc:`obiannot
.. code-block:: bash
> obiannotate -k count -k merged_sample \
wolf.ali.ngs.uniq.fasta > $$ ; mv $$ wolf.ali.ngs.uniq.fasta
wolf.ali.assigned.uniq.fasta > $$ ; mv $$ wolf.ali.assigned.uniq.fasta
The first five sequence records of ``wolf.ali.ngs.uniq.fasta`` becomes:
The first five sequence records of ``wolf.ali.assigned.uniq.fasta`` become:
.. code-block:: bash
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB
merged_sample={'29a_F260619': 1}; count=1;
ttagccctaaacacaagtaattattataacaaaatcattcgccagagtgtagcgggagta
ggttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB
merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9;
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB
merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9;
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaag
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335;
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB
merged_sample={'26a_F040644': 10490}; count=10490;
ttagccctaaacataaacattcaataaacaagaatgttcgccagagtactactagcaaca
gcctgaaactcaaaggacttggcggtgctttacatccct
>HELIUM_000100422_612GNAAXX:7:119:14871:19157#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 1}; count=1;
aagggtataaagcaccgccaagtcctttgagttttaacctactcccgctacactctggcg
aatgattttgttataataattacttgtgtttagggctaa
>HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 7, '15a_F730814': 2}; count=9;
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
gaacaattttgttatattaattacttgtgtttagggctaa
>HELIUM_000100422_612GNAAXX:7:97:14311:19299#0/1_CONS_SUB_SUB_CMP merged_sample={'29a_F260619': 5, '15a_F730814': 4}; count=9;
aagggtataaagcaccgccaagtcctttgagttttaagctcttgccggtagtactctggc
gaataattttgttatattaattacttgtgtttagggctaa
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB merged_sample={'29a_F260619': 4697, '15a_F730814': 7638}; count=12335;
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
gaataattttgttatattaattacttgtgtttagggctaa
>HELIUM_000100422_612GNAAXX:7:57:18459:16145#0/1_CONS_SUB_SUB_CMP merged_sample={'26a_F040644': 10490}; count=10490;
agggatgtaaagcaccgccaagtcctttgagtttcaggctgttgctagtagtactctggc
gaacattcttgtttattgaatgtttatgtttagggctaa
Denoising the sequence dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To have a set of sequences assigned to their original samples does not mean that all sequences
are *biologically* meaningful i.e. some of these sequences can contains
PCR and/or sequencing errors, or chimeras.
To remove such sequences as much as possible, we first remove rare sequences and then remove
sequences variants that likely correspond to artifacts from the original dataset.
are *biologically* meaningful, i.e. some of these sequences can contain PCR and/or sequencing
errors, or chimeras. To remove such sequences as much as possible, we first remove rare sequences
and then remove sequence variants that likely correspond to artifacts.
@ -279,11 +267,11 @@ Get the counts statistics
~~~~~~~~~~~~~~~~~~~~~~~~~
In that case, we use :doc:`obistat <scripts/obistat>` to get the counting statistics on the 'count' attribute (the count attribute has been set by the :doc:`obiuniq <scripts/obiuniq>` command). By piping
the result in the unix commands ``sort`` and ``head`` we keep only the counting statistics for the 20 lowest values of the 'count' attributes.
the result in the *Unix* commands ``sort`` and ``head`` we keep only the counting statistics for the 20 lowest values of the 'count' attributes.
.. code-block:: bash
> obistat -c count wolf.ali.ngs.uniq.fasta | \
> obistat -c count wolf.ali.assigned.uniq.fasta | \
sort -nk1 | head -20
This prints the output:
@ -318,107 +306,162 @@ The dataset contains 3504 sequences occurring only once.
Keep only the sequences having a count greater or equal to 10 and a length of at least 80 bp
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Based on the previous observation, we set the cut-off for keeping sequences for further analysis to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must be evaluated to :py:mod:`True` for each sequence to be kept. We also remove
sequences with a length shorter than 80 bp (option -l).
Based on the previous observation, we set the cut-off for keeping sequences for further analysis
to a count of 10. To do this, we use the :doc:`obigrep <scripts/obigrep>` command.
The ``-p 'count>=10'`` option means that the ``python`` expression :py:mod:`count>=10` must
evaluate to :py:mod:`True` for each sequence to be kept. Based on previous knowledge, we also remove
sequences with a length shorter than 80 bp (option ``-l``), as we know that the amplified 12S-V5 barcode
for vertebrates must have a length of around 100 bp.
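The two criteria can be pictured with a short Python sketch. It assumes a record is a plain dictionary with a ``count`` and a ``sequence`` key; the function name ``keep`` and the toy records are hypothetical and only illustrate the filtering rule, not the internals of :doc:`obigrep <scripts/obigrep>`.

```python
def keep(record, min_count=10, min_length=80):
    """Return True if the record passes both the count and length filters.

    Hypothetical helper; the thresholds mirror the tutorial's choices.
    """
    return record["count"] >= min_count and len(record["sequence"]) >= min_length


# Toy records, invented for the example.
records = [
    {"id": "abundant_long", "count": 12335, "sequence": "a" * 100},
    {"id": "rare_long", "count": 1, "sequence": "a" * 99},
    {"id": "abundant_short", "count": 50, "sequence": "a" * 40},
]
kept = [record["id"] for record in records if keep(record)]
print(kept)
```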
.. code-block:: bash
> obigrep -l 80 -p 'count>=10' wolf.ali.ngs.uniq.fasta \
> wolf.ali.ngs.uniq.c10.l80.fasta
> obigrep -l 80 -p 'count>=10' wolf.ali.assigned.uniq.fasta \
> wolf.ali.assigned.uniq.c10.l80.fasta
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.fasta`` is:
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.fasta`` is:
.. code-block:: bash
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP count=12335;
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB count=12335; merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
gaataattttgttatattaattacttgtgtttagggctaa
Clean the sequences for PCR/sequencing errors (sequence variants)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As a final step of denoising, using the :doc:`obiclean <scripts/obiclean>` we keep the `Head` sequences (``-H`` option) that are sequences with no variants with greater count or
sequences with no variants with 20-fold greater (``-r 0.05`` option).
As a final denoising step, we use the :doc:`obiclean <scripts/obiclean>` command to keep the `Head` sequences
(``-H`` option), that is, sequences with no variants with a greater count or sequences with no variants
with a 20-fold greater count (``-r 0.05`` option).
.. code-block:: bash
> obiclean -s merged_sample -r 0.05 -H \
wolf.ali.ngs.uniq.c10.l80.fasta > wolf.ali.ngs.uniq.c10.l80.clean.fasta
wolf.ali.assigned.uniq.c10.l80.fasta > wolf.ali.assigned.uniq.c10.l80.clean.fasta
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.fasta`` is:
.. code-block:: bash
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP
>HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB
merged_sample={'29a_F260619': 4697, '15a_F730814': 7638};
obiclean_count={'29a_F260619': 5438, '15a_F730814': 8642}; obiclean_head=True;
obiclean_cluster={'29a_F260619':
'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP',
'15a_F730814':
'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB_CMP'}; count=12335;
obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814':
'h'}; obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0;
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
obiclean_cluster={'29a_F260619': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB', '15a_F730814': 'HELIUM_000100422_612GNAAXX:7:22:8540:14708#0/1_CONS_SUB_SUB'};
count=12335; obiclean_internalcount=0; obiclean_status={'29a_F260619': 'h', '15a_F730814': 'h'};
obiclean_samplecount=2; obiclean_headcount=2; obiclean_singletoncount=0;
aagggtataaagcaccgccaagtcctttgagttttaagctattgccggtagtactctggc
gaataattttgttatattaattacttgtgtttagggctaa
Taxonomic assignment of sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The taxonomic assignement of sequences requires a reference database compiling all possible species to be identified in the sample. The assignment is then done
based on sequence comparisons between the sample sequences and the reference sequences.
Once denoising has been done, the next step in diet analysis is to relate the barcodes to their respective
species in order to get the list of species associated with each sample.
The taxonomic assignment of sequences requires a reference database compiling all possible species to be
identified in the sample. The assignment is then done based on sequence comparisons between the sample
sequences and the reference sequences.
Build a reference database
~~~~~~~~~~~~~~~~~~~~~~~~~~
To build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two
primers (`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each read to its sample (file ``wolf_ngsfilter.txt``). Note that the primers must
be in the same order both in ``wolf_ngsfilter.txt`` and in the :doc`ecoPCR <scripts/ecoPCR>` command.
As a rough way to build the reference database, we use the :doc:`ecoPCR <scripts/ecoPCR>` program to simulate
a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two primers
(`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`) extracted from the samples description used to assign each
read to its sample (file ``wolf_diet_ngsfilter.txt``).
The full list of steps to do in order to build this reference database would then be:
1. Download the whole set of EMBL sequences (available from: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/)
2. Download the NCBI taxonomy (available from: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
3. Format them into the ecoPCR format (see :doc:`obiconvert <scripts/obiconvert>` for how you can produce
ecoPCR compatible files)
4. Use ecoPCR to simulate amplification and build a reference database on the basis of putatively
amplified barcodes together with their recorded taxonomic information
As steps 1 and 3 can be really time consuming (about one day), we provide the reference database
produced by the following commands so that you can skip the following steps. Note that as both the EMBL database
and the taxonomic data can change daily, if you run the following commands you may end up with quite different
results.
Note that any utility that allows downloading files from an FTP site can be used. In the following commands,
we use the commonly used ``wget`` *Unix* command.
Download the sequences
......................
.. code-block:: bash
> mkdir EMBL
> cd EMBL
> wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
> cd ..
Download the taxonomy
.....................
.. code-block:: bash
> mkdir TAXO
> cd TAXO
> wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
> tar -zxvf taxdump.tar.gz
> cd ..
Format the data
...............
.. code-block:: bash
> obiconvert --embl -t ./TAXO --ecopcrDB-output=embl_last ./EMBL/*.dat
Retrieve the sequences
......................
.. code-block:: bash
> ecoPCR -d /Volumes/R0/Barcode-Leca/R117/embl_r117 -e 3 -l 50 -L 150 \
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05_r117.ecopcr
> ecoPCR -d ./ECODB/embl_last -e 3 -l 50 -L 150 \
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
Note that the primers must be in the same order both
in ``wolf_diet_ngsfilter.txt`` and in the :doc:`ecoPCR <scripts/ecoPCR>` command.
Clean the database
..................
1. filter the sequences so that they have a good taxonomic description at the species,
genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
genus, and family levels (:doc:`obigrep <scripts/obigrep>` command below).
2. remove redundant sequences (:doc:`obiuniq <scripts/obiuniq>` command below).
3. ensure that the dereplicated sequences have a taxid at the family level
(:doc:`obigrep <scripts/obigrep>` command below).
4. ensure that sequences each have a uniq identification (`awk` command below)
(:doc:`obigrep <scripts/obigrep>` command below).
4. ensure that each sequence has a unique identifier (:doc:`obiannotate <scripts/obiannotate>`
command below)
.. code-block:: bash
> obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=species \
--require-rank=genus --require-rank=family v05_r117.ecopcr > v05_r117_clean.fasta
> obigrep -d embl_last --require-rank=species \
--require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta
> obiuniq -d /Volumes/R0/Barcode-Leca/R117/embl_r117 \
v05_r117_clean.fasta > v05_r117_clean_uniq.fasta
> obiuniq -d embl_last \
v05_clean.fasta > v05_clean_uniq.fasta
> obigrep -d /Volumes/R0/Barcode-Leca/R117/embl_r117 --require-rank=family \
v05_r117_clean_uniq.fasta > v05_r117_clean_uniq_clean.fasta
> obigrep -d embl_last --require-rank=family \
v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta
> awk '/^>/ && ($1 in nb) {nb[$1]++;$1=$1"."nb[$1];print;next;}/^>/{nb[$1]=0;}1' \
v05_r117_clean_uniq_clean.fasta > db_v05_r117.fasta
> obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta
.. warning::
From now on, for the sake of clarity, the following commands will use the filenames of the provided data.
If you decided to run the last steps and use the files you have produced, you'll have to use
``db_v05.fasta`` instead of ``db_v05_r117.fasta`` and ``embl_last`` instead of ``embl_r117``
Assign each sequence to a taxon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -428,8 +471,8 @@ the :doc:`ecotag <scripts/ecotag>` command.
.. code-block:: bash
> ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.ngs.uniq.c10.l80.clean.fasta > \
wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta
> ecotag -d embl_r117 -R db_v05_r117.fasta wolf.ali.assigned.uniq.c10.l80.clean.fasta > \
wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta
The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among them are :
@ -437,10 +480,10 @@ The :doc:`ecotag <scripts/ecotag>` adds several `key=value` attributes, among th
- best_match=ACCESSION where ACCESSION is the id of one of the sequences in the reference database that best aligns to the query sequence;
- best_identity=FLOAT where FLOAT*100 is the percentage identity between the best match sequence and the query sequence;
- taxid=TAXID where TAXID is the final assignment of the sequence by :doc:`ecotag <scripts/ecotag>`
(may be different if the query sequence matches several sequences in the reference database);
- scientific_name=NAME where NAME is the scientific name of the assigned taxid.
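To make the ``key=value;`` header layout concrete, here is a minimal Python sketch that reads such attributes back into a dictionary. The function name ``parse_header`` and the example header (identifier and values) are invented for illustration; this is an assumption about how one might post-process the files by hand, not the parser used by OBITools.

```python
import ast


def parse_header(header):
    """Split a fasta header into (identifier, attributes).

    Hypothetical helper: values that look like Python literals
    (numbers, dictionaries, True/False) are converted, others stay strings.
    """
    identifier, _, rest = header.lstrip(">").partition(" ")
    attributes = {}
    for field in rest.split(";"):
        key, sep, value = field.strip().partition("=")
        if not sep:
            continue  # skip empty or malformed fields
        try:
            attributes[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            attributes[key] = value
    return identifier, attributes


# Invented header, only meant to show the layout.
header = ">seq_demo best_identity=0.98; taxid=9858; scientific_name=Capreolus capreolus; count=12;"
identifier, attributes = parse_header(header)
print(identifier, attributes)
```

Multiplying ``attributes["best_identity"]`` by 100 then gives the percentage identity described above.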
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.fasta`` is:
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta`` is:
.. code-block:: bash
@ -477,11 +520,11 @@ Some unuseful attributes can be removed.
--delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount \
--delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount \
--delete-tag=id_status --delete-tag=rank_by_db --delete-tag=order_name \
--delete-tag=order wolf.ali.ngs.uniq.c10.l80.clean.tag.fasta > \
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta
--delete-tag=order wolf.ali.assigned.uniq.c10.l80.clean.tag.fasta > \
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta`` is:
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta`` is:
.. code-block:: bash
@ -501,10 +544,10 @@ The sequences can be sorted in decreasing order of `count`.
.. code-block:: bash
> obisort -k count -r wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.fasta > \
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta
> obisort -k count -r wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.fasta > \
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta
The first sequence record of ``wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:
The first sequence record of ``wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta`` is:
.. code-block:: bash
@ -523,8 +566,8 @@ Finally, a tab delimited file that can be open by excel or R is generated.
.. code-block:: bash
> obitab -o wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.fasta > \
wolf.ali.ngs.uniq.c10.l80.clean.tag.ann.sort.tab
> obitab -o wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.fasta > \
wolf.ali.assigned.uniq.c10.l80.clean.tag.ann.sort.tab
This file contains 26 sequences. You can deduce the diet of each sample:
@ -545,6 +588,8 @@ References
- Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E (2011) ecoPrimers:
inference of new DNA barcode markers from whole genome sequence analysis. Nucleic
Acids Research, 39, e145.
- Seguritan V, Rohwer F (2001) FastGroup: a program to dereplicate libraries of
16S rDNA sequences. BMC Bioinformatics, 2, 9.
Contact