mercier created page: wolf_tutorial

Celine Mercier
2019-03-30 22:37:15 +01:00
commit 2d50a33e84

wolf_tutorial.md Normal file

@ -0,0 +1,180 @@
# Wolf tutorial with the OBITools3
A (cooler) remake of the infamous [wolf tutorial](https://pythonhosted.org/OBITools/wolves.html). And a work in progress.
### 0.1 Before starting: the OBITools3 data structure
The OBITools3 rely on an ad hoc database system, inside which all the data that a DNA metabarcoding experiment must consider is stored: the sequences, the metadata (describing for instance the samples), the database containing the reference sequences used for the taxonomic annotation, as well as the taxonomic databases.
DNA metabarcoding data can easily be represented in the form of tables, and each command can be regarded as an operation transforming one or several 'input' tables into one or several 'output' tables, which can be used by the next command.
The new database system used by the OBITools3 (called **DMS** for Data Management System) relies on **column**-oriented storage. Each column contains a data element, and several columns can be assembled in **views** representing the data tables (equivalent to a fasta file).
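To make the relationship between columns and views concrete, here is a minimal Python sketch. It only illustrates column-oriented storage in general and is not the actual DMS implementation; the column names and the helpers `make_view` and `view_rows` are made up for this example.

```python
# Illustrative sketch only: NOT the OBITools3 DMS implementation.
# Each column stores one data element for every record.
columns = {
    "ID": ["seq1", "seq2", "seq3"],
    "SEQUENCE": ["acgt", "ttga", "ccat"],
    "COUNT": [12, 3, 7],
}

def make_view(column_names):
    """A view is a selection of existing columns (hypothetical helper)."""
    return {name: columns[name] for name in column_names}

def view_rows(view):
    """Read a view record by record, like reading a table row by row."""
    names = list(view)
    for values in zip(*(view[name] for name in names)):
        yield dict(zip(names, values))

reads = make_view(["ID", "SEQUENCE"])
rows = list(view_rows(reads))
print(rows[0])
```

Note that the view does not copy the data: it refers to the same column objects, which is one reason why several views can share columns cheaply.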
### 0.2 Before starting: the OBITools3 syntax
* Basic syntax:
obi command [options] input_URI output_URI
* A URI (Uniform Resource Identifier) is a simple way to identify the input and output of a command. For a file, it's simply the path to the file. For a view, it can be the path to the view file in the DMS:
path_to_my_dms.obidms/VIEWS/my_view.obiview
Or a simplified version:
path_to_my_dms/my_view
Any hybrid of those 2 works too.
**Note:** View names must be unique within a DMS. In other words, views cannot be overwritten.
* If the output DMS is not given, the input DMS is used.
* For a taxonomy, both of the following forms work, as well as any hybrid of them:
path_to_my_dms.obidms/TAXONOMY/my_taxonomy
path_to_my_dms/taxonomy/my_taxonomy
* `obi -h` gives a list of all the commands.
* `obi command -h` prints the help of the command.
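The equivalence between the full, simplified, and hybrid view URIs can be sketched as follows. This is a hypothetical helper written for this tutorial, not the OBITools3 URI parser, and it only handles relative paths to views.

```python
# Hypothetical helper, NOT the OBITools3 URI parser.
# It normalises the view URI forms described above (relative paths only).
def split_view_uri(uri):
    """Return (dms_name, view_name) for full, simplified, or hybrid view URIs."""
    parts = [p for p in uri.split("/") if p]
    view = parts[-1]
    if view.endswith(".obiview"):
        view = view[: -len(".obiview")]
    rest = parts[:-1]
    # Drop the optional VIEWS directory used by the full form.
    if rest and rest[-1] == "VIEWS":
        rest = rest[:-1]
    dms = "/".join(rest)
    if dms.endswith(".obidms"):
        dms = dms[: -len(".obidms")]
    return dms, view

print(split_view_uri("path_to_my_dms.obidms/VIEWS/my_view.obiview"))
print(split_view_uri("path_to_my_dms/my_view"))
```

Both calls describe the same view, which is why the two notations can be mixed freely on the command line.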
### 1. Import the sequencing data in a DMS
1. Import the first set of reads:
obi import --quality-solexa wolf_tutorial/wolf_F.fastq wolf/reads1
`--quality-solexa` is the appropriate fastq quality option because it's an old dataset, `wolf_tutorial/wolf_F.fastq` is the path to the file to import, `wolf` is the path to the DMS that will be automatically created, and `reads1` is the name of the view into which the file will be imported.
2. Import the second set of reads:
obi import --quality-solexa wolf_tutorial/wolf_R.fastq wolf/reads2
3. Import the [ngsfilter file](https://pythonhosted.org/OBITools/scripts/ngsfilter.html) describing the primers and tags used for each sample:
obi import --ngsfilter wolf_tutorial/wolf_diet_ngsfilter.txt wolf/ngsfile
4. Check what is in the DMS:
obi ls wolf
You can also check just one view:
obi ls wolf/reads1
Or one column:
obi ls wolf/ngsfile/sample
To print the sequences, use the less command:
obi less wolf/reads1
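A note on the `--quality-solexa` option used in steps 1 and 2: older Solexa/Illumina pipelines wrote fastq quality characters with an ASCII offset of 64, whereas the Sanger convention uses an offset of 33. The sketch below only shows why the offset matters. It assumes that the offset is what the option selects, and it ignores the difference between the Solexa and Phred scales; `decode_quality` is a hypothetical helper, not an OBITools3 function.

```python
# Sketch under assumptions: only the ASCII offset is modelled here.
def decode_quality(quality_string, offset):
    """Turn a fastq quality string into a list of integer scores."""
    return [ord(char) - offset for char in quality_string]

old_style = "hhh@"   # made-up quality string with offset 64
print(decode_quality(old_style, 64))   # sensible scores
print(decode_quality(old_style, 33))   # same string read with the wrong offset
```

Reading an offset-64 file with offset 33 inflates every score by 31, which is why the quality option has to match the dataset.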
### 2. Assign each sequence record to the corresponding sample/marker combination
Unlike the OBITools1, the OBITools3 make it possible to run ngsfilter before aligning the paired-end reads:
obi ngsfilter -t wolf/ngsfile -u wolf/unidentified_sequences -R wolf/reads2 wolf/reads1 wolf/identified_sequences
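Conceptually, this step looks at the tag and primer at the start of each read and uses them to look up the sample. The sketch below is a strong simplification (exact matching, forward read only) and not the ngsfilter algorithm; the tags and sample names are made up, and only the forward primer is the V05 one used later in this tutorial.

```python
# Simplified sketch: exact matching only, forward read only.
# Tags and sample names are invented for illustration.
FORWARD_PRIMER = "ttagataccccactatgc"
TAG_TO_SAMPLE = {"aattaac": "sample_A", "gaagtag": "sample_B"}
TAG_LENGTH = 7

def assign_sample(read):
    """Return (sample, trimmed_read), or (None, read) if nothing matches."""
    tag, rest = read[:TAG_LENGTH], read[TAG_LENGTH:]
    sample = TAG_TO_SAMPLE.get(tag)
    if sample is None or not rest.startswith(FORWARD_PRIMER):
        return None, read
    # Tag and primer are removed; only the barcode part is kept.
    return sample, rest[len(FORWARD_PRIMER):]

print(assign_sample("aattaac" + FORWARD_PRIMER + "acgtacgt"))
```

Reads for which no sample is found play the role of the `unidentified_sequences` view given with `-u`.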
### 3. Recover the full sequences from the partial forward and reverse reads
obi alignpairedend wolf/identified_sequences wolf/aligned_reads
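The idea behind this step is that the reverse read, once reverse-complemented, overlaps the end of the forward read. The following sketch merges a pair on the longest exact overlap. It is only an illustration: the real command performs an alignment, and `merge_pair` and its `min_overlap` parameter are hypothetical.

```python
# Illustration only: exact-overlap merging, not the alignment used by the tool.
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Reverse-complement a lowercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(forward, reverse, min_overlap=4):
    """Merge a read pair on the longest exact overlap; None if there is none."""
    rc = reverse_complement(reverse)
    for size in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-size:] == rc[:size]:
            return forward + rc[size:]
    return None

full = "acgtacggtcatgcatta"             # made-up amplicon
forward_read = full[:12]
reverse_read = reverse_complement(full[6:])
print(merge_pair(forward_read, reverse_read))
```

When no overlap can be found, this sketch returns `None`; such pairs are the ones the next step discards.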
### 4. Remove unaligned sequence records
obi grep -p "mode!=b'joined'" wolf/aligned_reads wolf/good_sequences
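The predicate given to `-p` is written as a Python expression, and a record is kept when the expression is true. The equivalent logic on plain Python dictionaries looks like the sketch below. The record layout and the `b"alignment"` value are assumptions made for the example.

```python
# Sketch of the predicate logic on plain dictionaries (assumed record layout).
records = [
    {"id": "r1", "mode": b"alignment"},
    {"id": "r2", "mode": b"joined"},
    {"id": "r3", "mode": b"alignment"},
]

# Keep every record whose mode is not b"joined".
kept = [rec for rec in records if rec["mode"] != b"joined"]
print([rec["id"] for rec in kept])

# In Python 3, bytes and str never compare equal, hence the b prefix.
print(b"joined" == "joined")
```

The last line shows why the `b` prefix matters when the stored value is a bytes object: without it, the comparison would never match and nothing would be filtered out.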
### 5. Dereplicate reads into unique sequences
obi uniq -m sample wolf/good_sequences wolf/dereplicated_sequences
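Dereplication groups identical sequences and keeps track of how many times each one was seen, in total and per sample. A minimal sketch of that idea, assuming records are dictionaries with `sequence` and `sample` keys (a made-up layout, not the DMS one):

```python
from collections import Counter, defaultdict

def dereplicate(records):
    """Group identical sequences; keep a total COUNT and a per-sample count."""
    merged = defaultdict(Counter)
    for rec in records:
        merged[rec["sequence"]][rec["sample"]] += 1
    return [
        {
            "sequence": seq,
            "COUNT": sum(by_sample.values()),
            "merged_sample": dict(by_sample),
        }
        for seq, by_sample in merged.items()
    ]

reads = [
    {"sequence": "acgt", "sample": "s1"},
    {"sequence": "acgt", "sample": "s2"},
    {"sequence": "acgt", "sample": "s1"},
    {"sequence": "ttga", "sample": "s2"},
]
unique = dereplicate(reads)
print(unique)
```

The per-sample dictionary corresponds to what `-m sample` asks for: the count of each unique sequence in each sample is preserved after merging.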
### 6. Denoise the sequence dataset
1. First let's clean the useless metadata and keep only the `COUNT` and `merged_sample` (count by sample) tags:
obi annotate -k COUNT -k merged_sample wolf/dereplicated_sequences wolf/cleaned_metadata_sequences
2. Keep only the sequences having a count greater than or equal to 10 and a length of at least 80 bp:
obi grep -p "len(sequence)>=80 and sequence['COUNT']>=10" wolf/cleaned_metadata_sequences wolf/denoised_sequences
3. Remove PCR/sequencing errors (sequence variants) from the dataset:
obi clean -s merged_sample -r 0.05 -H wolf/denoised_sequences wolf/cleaned_sequences
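The intuition behind step 3 is that a rare sequence differing by a single error from a much more abundant one is probably an artefact of it. The sketch below illustrates this with the 0.05 ratio from the command. It is a deliberately reduced model and not the `obi clean` algorithm: it handles a single sample, substitutions only, and the choice of `<=` for the ratio test is an assumption.

```python
# Reduced model of the idea, NOT the obi clean algorithm.
def one_substitution(a, b):
    """True if a and b have the same length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def likely_true_sequences(counts, ratio=0.05):
    """Drop sequences that look like rare one-error variants of abundant ones."""
    kept = []
    for seq, n in counts.items():
        is_variant = any(
            one_substitution(seq, other) and n / m <= ratio
            for other, m in counts.items()
            if other != seq
        )
        if not is_variant:
            kept.append(seq)
    return kept

counts = {"acgtacgt": 1000, "acgtacga": 20, "ttttcccc": 15}   # made-up counts
print(likely_true_sequences(counts))
```

Here `acgtacga` is dropped because it is one substitution away from a sequence that is 50 times more abundant, while `ttttcccc` is kept because it has no close, more abundant neighbour.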
### 7. Taxonomic assignment of the sequences
#### Build a reference database
1. Download the sequences, or rather, just the files with mammal sequences for this tutorial:
mkdir EMBL
cd EMBL
wget -nH --cut-dirs=5 -Arel_std_mam_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/std/
cd ..
2. Import the sequences in the DMS:
obi import --embl EMBL wolf/embl_refs
For EMBL files, you can give the path to a directory with several EMBL files.
3. Download the taxonomy:
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
4. Import the taxonomy in the DMS:
obi import --taxdump taxdump.tar.gz wolf/taxonomy/my_tax
5. Use ecoPCR to simulate an *in silico* PCR with the V05 primers:
obi ecopcr -e 3 -l 50 -L 150 -F TTAGATACCCCACTATGC -R TAGAACAGGCTCCTCTAG --taxonomy wolf/taxonomy/my_tax wolf/embl_refs wolf/v05_db
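To see what the *in silico* PCR of step 5 does, here is a small sketch: it looks for the forward primer, then for the reverse complement of the reverse primer, allows a maximum number of mismatches for each, and keeps the fragment in between if its length is in the accepted range. This is an illustration and not the ecoPCR algorithm: it scans one strand only, takes the first hit, ignores IUPAC codes, and assumes the length limits apply to the fragment without primers.

```python
# Illustration of the principle only, NOT the ecoPCR algorithm.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_primer(seq, primer, max_errors, start=0):
    """Position of the first site matching primer with <= max_errors mismatches."""
    for i in range(start, len(seq) - len(primer) + 1):
        site = seq[i:i + len(primer)]
        if sum(x != y for x, y in zip(site, primer)) <= max_errors:
            return i
    return -1

def in_silico_pcr(seq, forward, reverse, max_errors=3, min_len=50, max_len=150):
    """Return the fragment between the primers, or None if nothing is amplified."""
    first = find_primer(seq, forward, max_errors)
    if first < 0:
        return None
    begin = first + len(forward)
    end = find_primer(seq, reverse_complement(reverse), max_errors, begin)
    if end < 0:
        return None
    fragment = seq[begin:end]
    return fragment if min_len <= len(fragment) <= max_len else None

FORWARD = "TTAGATACCCCACTATGC"
REVERSE = "TAGAACAGGCTCCTCTAG"
barcode = "ACGT" * 15   # made-up 60 bp fragment
reference = "GGGG" + FORWARD + barcode + reverse_complement(REVERSE) + "CCCC"
print(in_silico_pcr(reference, FORWARD, REVERSE, max_errors=0) == barcode)
```

The keyword arguments mirror the `-e`, `-l`, and `-L` options of the command above.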
#### Clean the database
1. Filter sequences so that they have a good taxonomic description at the species, genus, and family levels:
obi grep --require-rank=species --require-rank=genus --require-rank=family --taxonomy wolf/taxonomy/my_tax wolf/v05_db wolf/v05_db_clean
2. Build the reference database specifically used by the OBITools3 to make ecotag efficient:
obi build_ref_db --taxonomy wolf/taxonomy/my_tax wolf/v05_db_clean wolf/v05_db_definitive
#### Assign each sequence to a taxon
Once the reference database is built, taxonomic assignment can be done using the `ecotag` command:
obi ecotag -m 0.95 --taxonomy wolf/taxonomy/my_tax -R wolf/v05_db_definitive wolf/cleaned_sequences wolf/assigned_sequences
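When a query sequence matches several reference sequences equally well, a natural choice is to assign it to the most precise taxon that all of those references share. The sketch below shows that "lowest common ancestor" idea on hand-written lineages. It is a simplified illustration and not the ecotag algorithm, and the way the matching references are selected (here, simply given as input) is left out.

```python
# Simplified illustration of a lowest-common-ancestor assignment.
def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all lineages (each listed from root to leaf)."""
    common = None
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        common = taxa[0]
    return common

# Hand-written lineages for the example (not read from a taxonomy database).
wolf = ["Mammalia", "Carnivora", "Canidae", "Canis", "Canis lupus"]
fox = ["Mammalia", "Carnivora", "Canidae", "Vulpes", "Vulpes vulpes"]
roe_deer = ["Mammalia", "Artiodactyla", "Cervidae", "Capreolus", "Capreolus capreolus"]

print(lowest_common_ancestor([wolf, fox]))
print(lowest_common_ancestor([wolf, fox, roe_deer]))
```

This is why some sequences end up assigned at the genus or family level rather than at the species level: the more references match equally well, the higher the common ancestor.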
### 8. After the taxonomic assignment
#### Take a look at the results
For example:
obi stats -c SCIENTIFIC_NAME wolf/assigned_sequences
#### Align the sequences
obi align -t 0.95 wolf/assigned_sequences wolf/aligned_assigned_sequences
#### Check the history of everything that was done
obi history
#### Export the results
Export in fasta format:
obi export --fasta-output wolf/assigned_sequences wolf_results.fasta
Export in csv-like format (very soon to be implemented :)):
obi export --tab-output wolf/assigned_sequences wolf_results.csv
### Contact
<celine.mercier@metabarcoding.org>