# OBITools V4 Tutorial

Here is a short tutorial on how to analyze DNA metabarcoding data produced on Illumina sequencers using:

- the OBITools
- some basic Unix commands

## Wolves’ diet based on DNA metabarcoding
The data used in this tutorial correspond to the analysis of four wolf scats, using the protocol published in @Shehzad2012-pn for assessing carnivore diet. After extracting DNA from the faeces, the DNA amplifications were carried out using the primers `TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`, which amplify the *12S-V5* region [@Riaz2011-gn], together with a wolf blocking oligonucleotide.

The complete data set can be downloaded here: [the tutorial dataset](wolf_diet.tgz)

Once the data file is downloaded, unarchive the data from the `tgz` file using a UNIX terminal.

```{bash untar_data}
#| output: false
tar zxvf wolf_diet.tgz
```

That command creates a new directory named `wolf_data` containing all the required data files:

- `fastq` files resulting from a GA IIx (Illumina) paired-end (2 x 108 bp)
  sequencing assay of DNA extracted and amplified from four wolf faeces:
  - `wolf_F.fastq`
  - `wolf_R.fastq`
- the file describing the primers and tags used for all samples sequenced:
  - `wolf_diet_ngsfilter.txt` The tags correspond to short and specific
    sequences added on the 5' end of each primer to distinguish the
    different samples
- the file containing the reference database in a fasta format:
  - `db_v05_r117.fasta` This reference database has been extracted from
    release 117 of EMBL using `obipcr`

```{bash true_mk_directory}
#| output: false
#| echo: false
#| error: true
#|
if [[ ! -d results ]] ; then
    mkdir results
fi
```

To avoid mixing raw data and processed data, a new directory called `results` is created:

```{bash mk_directory}
#| output: false
#| eval: false
mkdir results
```

## Step by step analysis

### Recover full sequence reads from forward and reverse partial reads
When using the result of a paired-end sequencing assay with supposedly
overlapping forward and reverse reads, the first step is to recover the
assembled sequence.

The forward and reverse reads of the same fragment are *at the same line
position* in the two fastq files obtained after sequencing. Based on
these two files, the assembly of the forward and reverse reads is done
with the `obipairing` utility, which aligns the two reads and returns the
reconstructed sequence.

In our case, the command is:

```{bash pairing}
#| output: false

obipairing --min-identity=0.8 \
           --min-overlap=10 \
           -F wolf_data/wolf_F.fastq \
           -R wolf_data/wolf_R.fastq \
           > results/wolf.fastq
```

The `--min-identity` and `--min-overlap` options allow discarding
sequences with low alignment quality. If, after the alignment, the
overlapping part of the reads is shorter than 10 base pairs, or the
similarity over the aligned region is below 80% identity, the forward
and reverse reads are not aligned but concatenated in the output file,
and the value of the `mode` attribute in the sequence header is set to
`joined` instead of `alignment`.

### Remove unaligned sequence records
Unaligned sequences (`mode=joined`) cannot be used. The following
command allows removing them from the dataset:

```{bash}
#| output: false

obigrep -p 'annotations.mode != "join"' \
        results/wolf.fastq > results/wolf.ali.fastq
```

The `-p` option requires a Go-like expression. `annotations.mode != "join"`
means that if the value of the `mode` annotation of a sequence is
different from `join`, the corresponding sequence record is kept.
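The same `-p` mechanism can be used for quick checks. For instance, the
sketch below, which only reuses constructs already shown in this tutorial
(the `annotations.mode` attribute and the `obicount` command used later
on), counts how many read pairs were concatenated rather than aligned:

``` bash
# Sketch: count the read pairs that obipairing could not align
obigrep -p 'annotations.mode == "join"' results/wolf.fastq | obicount
```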
The first sequence record of `wolf.ali.fastq` can be obtained using the
following command line:

```{bash}
#| eval: false
#| output: false

head -n 4 results/wolf.ali.fastq
```

The following piece of output appears in your terminal window:

```
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {"ali_dir":"left","ali_length":62,"mode":"alignment","pairing_mismatches":{"(T:26)->(G:13)":62,"(T:34)->(G:18)":48},"score":484,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg
+
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC
```

### Assign each sequence record to the corresponding sample/marker combination

Each sequence record is assigned to its corresponding sample and marker
using the data provided in a text file (here `wolf_diet_ngsfilter.txt`).
This text file contains one line per sample, with the name of the
experiment (several experiments can be included in the same file), the
name of the tags (for example: `aattaac` if the same tag has been used
on each extremity of the PCR products, or `aattaac:gaagtag` if the tags
were different), the sequence of the forward primer, the sequence of the
reverse primer, and the letter `T` or `F` for sample identification using
the forward primer and tag only, or using both primers and both tags,
respectively (see `obimultiplex` for details).
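As an illustration, a line of this file might look like the following
sketch. The layout is hypothetical (reconstructed from the sample name,
tag and primers visible in the sequence headers shown later); check
`wolf_data/wolf_diet_ngsfilter.txt` for the exact column order:

```
# hypothetical example line, not copied from the real file
wolf_diet   29a_F260619   gcctcct   TTAGATACCCCACTATGC   TAGAACAGGCTCCTCTAG   F
```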
```{bash}
#| output: false

obimultiplex -t wolf_data/wolf_diet_ngsfilter.txt \
             -u results/unidentified.fastq \
             results/wolf.ali.fastq \
             > results/wolf.ali.assigned.fastq
```

This command creates two files:

- `unidentified.fastq` containing all the sequence records that were
  not assigned to a sample/marker combination
- `wolf.ali.assigned.fastq` containing all the sequence records that
  were properly assigned to a sample/marker combination

Note that each sequence record of the `wolf.ali.assigned.fastq` file
contains only the barcode sequence, as the sequences of primers and tags
are removed by the `obimultiplex` program. Information concerning the
experiment, sample, primers and tags is added as attributes in the
sequence header.

For instance, the first sequence record of `wolf.ali.assigned.fastq` is:

```
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] {"ali_dir":"left","ali_length":62,"direction":"direct","experiment":"wolf_diet","forward_match":"ttagataccccactatgc","forward_mismatches":0,"forward_primer":"ttagataccccactatgc","forward_tag":"gcctcct","mode":"alignment","pairing_mismatches":{"(T:26)->(G:13)":35,"(T:34)->(G:18)":21},"reverse_match":"tagaacaggctcctctag","reverse_mismatches":0,"reverse_primer":"tagaacaggctcctctag","reverse_tag":"gcctcct","sample":"29a_F260619","score":484,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
+
CCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC
```

### Dereplicate reads into unique sequences

The same DNA molecule can be sequenced several times. In order to reduce
both file size and computation time, and to get results that are easier
to interpret, it is convenient to work with unique *sequences* instead of
*reads*. To *dereplicate* such *reads* into unique *sequences*, we use
the `obiuniq` command.

::: note
Definition: Dereplicate reads into unique sequences

1. compare all the reads in a data set to each other
2. group strictly identical reads together
3. output the sequence for each group and its count in the original
   dataset (in this way, all duplicated reads are removed)

Definition adapted from @Seguritan2001-tg
:::

For dereplication, we use the `obiuniq` command with the `-m sample`
option. The `-m sample` option is used to keep the information about the
sample of origin for each unique sequence.

```{bash}
#| output: false

obiuniq -m sample \
        results/wolf.ali.assigned.fastq \
        > results/wolf.ali.assigned.uniq.fasta
```

Note that `obiuniq` returns a fasta file.
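As a quick sanity check, `obicount` (used again below for statistics)
can be run on the dereplicated file; the number of reads should stay the
same as before dereplication, while the number of variants drops:

``` bash
# Sketch: report variants and reads after dereplication
obicount results/wolf.ali.assigned.uniq.fasta
```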
The first sequence record of `wolf.ali.assigned.uniq.fasta` is:

```
>HELIUM_000100422_612GNAAXX:7:93:6991:1942#0/1_sub[28..126] {"ali_dir":"left","ali_length":63,"count":1,"direction":"reverse","experiment":"wolf_diet","forward_match":"ttagataccccactatgc","forward_mismatches":0,"forward_primer":"ttagataccccactatgc","forward_tag":"gaatatc","merged_sample":{"26a_F040644":1},"mode":"alignment","pairing_mismatches":{"(A:10)->(G:34)":76,"(C:06)->(A:34)":58},"reverse_match":"tagaacaggctcctctag","reverse_mismatches":0,"reverse_primer":"tagaacaggctcctctag","reverse_tag":"gaatatc","score":730,"score_norm":0.968,"seq_a_single":45,"seq_ab_match":61,"seq_b_single":45}
ttagccctaaacataaacattcaataaacaagaatgttcgccagagaactactagcaaca
gcctgaaactcaaaggacttggcggtgctttatatccct
```

The run of `obiuniq` has added two key=value entries in the header of
the fasta sequence:

- `"merged_sample":{"29a_F260619":1}`: this sequence has been found once
  in a single sample called **29a_F260619**
- `"count":1`: the total count for this sequence is $1$

To keep only these two attributes, we can use the `obiannotate` command:

```{bash}
#| output: false

obiannotate -k count -k merged_sample \
            results/wolf.ali.assigned.uniq.fasta \
            > results/wolf.ali.assigned.simple.fasta
```

The first five sequence records of `wolf.ali.assigned.simple.fasta`
become:

```
>HELIUM_000100422_612GNAAXX:7:26:18930:11105#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
ttagccctaaacacaagtaattaatataacaaaatwattcgcyagagtactacmggcaat
agctyaaarctcamagrwcttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:58:5711:11399#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtwctaccgssaat
agcttaaaactcaaaggactgggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:100:15836:9304#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
ttagccctaaacatagataattacacaaacaaaattgttcaccagagtactagcggcaac
agcttaaaactcaaaggacttggcggtgctttataccctt
>HELIUM_000100422_612GNAAXX:7:55:13242:9085#0/1_sub[28..126] {"count":4,"merged_sample":{"26a_F040644":4}}
ttagccctaaacataaacattcaataaacaagagtgttcgccagagtactactagcaaca
gcctgaaactcaaaggacttggcggtgctttacatccct
>HELIUM_000100422_612GNAAXX:7:86:8429:13723#0/1_sub[28..127] {"count":7,"merged_sample":{"15a_F730814":5,"29a_F260619":2}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggactcggcggtgctttataccctt
```

### Denoise the sequence dataset

Having a set of sequences assigned to their corresponding samples does
not mean that all sequences are *biologically* meaningful, i.e. some of
these sequences can contain PCR and/or sequencing errors, or chimeras.

#### Tag the sequences for PCR errors (sequence variants) {.unnumbered}

The `obiclean` program tags sequence variants as potential errors
generated during PCR amplification. We ask it to keep the *head*
sequences (`-H` option), that is, sequences which are not variants of
another sequence with a count greater than 5% of their own count
(`-r 0.05` option).

```{bash}
#| output: false

obiclean -s sample -r 0.05 -H \
         results/wolf.ali.assigned.simple.fasta \
         > results/wolf.ali.assigned.simple.clean.fasta
```

One of the sequence records of `wolf.ali.assigned.simple.clean.fasta`
is:

```
>HELIUM_000100422_612GNAAXX:7:66:4039:8016#0/1_sub[28..127] {"count":17,"merged_sample":{"13a_F730603":17},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"13a_F730603":"h"},"obiclean_weight":{"13a_F730603":25}}
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaac
agcccaaaactcaaaggacttggcggtgcttcacaccctt
```

To remove such erroneous sequences as much as possible, we first discard
rare sequences and then sequence variants that likely correspond to
artifacts.

#### Get some statistics about sequence counts {.unnumbered}

```{bash}
obicount results/wolf.ali.assigned.simple.clean.fasta
```

The dataset contains $4313$ sequence variants corresponding to 42452
sequence reads. Most of the variants occur only a single time in the
complete dataset and are usually named *singletons*.

```{bash}
obigrep -p 'sequence.Count() == 1' results/wolf.ali.assigned.simple.clean.fasta \
  | obicount
```

In that dataset, singletons correspond to $3511$ variants.
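The complementary count, i.e. the number of variants supported by more
than one read, can be obtained with the same kind of filter (a small
sketch mirroring the command above):

``` bash
# Sketch: variants observed at least twice
obigrep -p 'sequence.Count() > 1' results/wolf.ali.assigned.simple.clean.fasta \
  | obicount
```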
Using *R* and the `ROBIFastread` package, which is able to read the
headers of the fasta files produced by *OBITools*, we can get more
complete statistics on the distribution of occurrences.

```{r}
#| warning: false
library(ROBIFastread)
library(ggplot2)

seqs <- read_obifasta("results/wolf.ali.assigned.simple.clean.fasta",keys="count")

ggplot(data = seqs, mapping=aes(x = count)) +
  geom_histogram(bins=100) +
  scale_y_sqrt() +
  scale_x_sqrt() +
  geom_vline(xintercept = 10, col="red", lty=2) +
  xlab("number of occurrences of a variant")
```

In a similar way it is also possible to plot the distribution of the
sequence lengths.

```{r}
#| warning: false
ggplot(data = seqs, mapping=aes(x = nchar(sequence))) +
  geom_histogram() +
  scale_y_log10() +
  geom_vline(xintercept = 80, col="red", lty=2) +
  xlab("sequence length in base pairs")
```

#### Keep only the sequences having a count greater than or equal to 10 and a length of at least 80 bp {.unnumbered}
Based on the previous observations, we set the cut-off for keeping
sequences for further analysis to a count of 10. To do this, we use the
`obigrep` command. The `-p 'sequence.Count() >= 10'` option means that
the Go-like expression `sequence.Count() >= 10` must evaluate to `true`
for a sequence to be kept. Based on previous knowledge we also remove
sequences with a length shorter than 80 bp (option `-l`), as we know
that the amplified 12S-V5 barcode for vertebrates must have a length
around 100 bp.

```{bash}
#| output: false

obigrep -l 80 -p 'sequence.Count() >= 10' results/wolf.ali.assigned.simple.clean.fasta \
    > results/wolf.ali.assigned.simple.clean.c10.l80.fasta
```

The first sequence record of `results/wolf.ali.assigned.simple.clean.c10.l80.fasta` is:

```
>HELIUM_000100422_612GNAAXX:7:22:2603:18023#0/1_sub[28..127] {"count":12182,"merged_sample":{"15a_F730814":7559,"29a_F260619":4623},"obiclean_head":true,"obiclean_headcount":2,"obiclean_internalcount":0,"obiclean_samplecount":2,"obiclean_singletoncount":0,"obiclean_status":{"15a_F730814":"h","29a_F260619":"h"},"obiclean_weight":{"15a_F730814":9165,"29a_F260619":6275}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agcttaaaactcaaaggacttggcggtgctttataccctt
```

At this point in the data cleaning, we have retained:

```{bash}
obicount results/wolf.ali.assigned.simple.clean.c10.l80.fasta
```

### Taxonomic assignment of sequences

Once denoising has been done, the next step in diet analysis is to
assign the barcodes to the corresponding species in order to get the
complete list of species associated with each sample.

Taxonomic assignment of sequences requires a reference database
compiling all possible species to be identified in the sample.
Assignment is then done based on sequence comparison between sample
sequences and reference sequences.

#### Download the taxonomy {.unnumbered}

```{bash}
#| output: false
mkdir TAXO
cd TAXO
curl http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \
  | tar -zxvf -
cd ..
```

#### Build a reference database {.unnumbered}
One way to build the reference database is to use the `obipcr` program
to simulate a PCR and to extract from EMBL all the sequences that may be
amplified *in silico* by the two primers (`TTAGATACCCCACTATGC` and
`TAGAACAGGCTCCTCTAG`) used for PCR amplification.

The full list of steps for building this reference database would then
be:

1. Download the whole set of EMBL sequences (available from:
   <ftp://ftp.ebi.ac.uk/pub/databases/embl/release/>)
2. Download the NCBI taxonomy (available from:
   <ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz>)
3. Format them into the ecoPCR format (see `obiconvert` for how you can
   produce ecoPCR compatible files)
4. Use `obipcr` to simulate amplification and build a reference database
   based on putatively amplified barcodes together with their recorded
   taxonomic information

As step 1 and step 3 can be really time-consuming (about one day), we
already provide the reference database produced by the following
commands so that you can skip its construction. Note that as the EMBL
database and taxonomic data can evolve daily, if you run the following
commands you may end up with quite different results.

Any utility allowing file download from an FTP site can be used. In the
following commands, we use the commonly used `wget` *Unix* command.

##### Download the sequences {.unnumbered}
``` bash
> mkdir EMBL
> cd EMBL
> wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
> cd ..
```

##### Download the taxonomy {.unnumbered}

``` bash
> mkdir TAXO
> cd TAXO
> wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
> tar -zxvf taxdump.tar.gz
> cd ..
```

##### Use obipcr to simulate an *in silico* PCR {.unnumbered}
``` bash
> obipcr -d ./ECODB/embl_last -e 3 -l 50 -L 150 \
         TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
```

Note that the primers must be in the same order both in
`wolf_diet_ngsfilter.txt` and in the `obipcr` command.

##### Clean the database {.unnumbered}

1. filter sequences so that they have a good taxonomic description at
   the species, genus, and family levels (first `obigrep` command below).
2. remove redundant sequences (`obiuniq` command below).
3. ensure that the dereplicated sequences have a taxid at the family
   level (second `obigrep` command below).
4. ensure that each sequence has a unique identifier (`obiannotate`
   command below).

``` bash
> obigrep -d embl_last --require-rank=species \
          --require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta

> obiuniq -d embl_last \
          v05_clean.fasta > v05_clean_uniq.fasta

> obigrep -d embl_last --require-rank=family \
          v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta

> obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta
```
The reference database is then indexed with `obirefidx` so that it can
be used by `obitag`:

``` bash
obirefidx -t TAXO wolf_data/db_v05_r117.fasta > results/db_v05_r117.indexed.fasta
```

::: warning
From now on, for the sake of clarity, the following commands will use
the filenames of the files provided with the tutorial. If you decided to
run the last steps and use the files you have produced, you'll have to
use `db_v05.fasta` instead of `db_v05_r117.fasta` and `embl_last`
instead of `embl_r117`.
:::

### Assign each sequence to a taxon
Once the reference database is built, taxonomic assignment can be
carried out using the `obitag` command.

```{bash}
#| output: false
obitag -t TAXO -R wolf_data/db_v05_r117.indexed.fasta \
       results/wolf.ali.assigned.simple.clean.c10.l80.fasta \
       > results/wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta
```

The `obitag` command adds several attributes to the sequence record
header, among them:

- `obitag_bestmatch=ACCESSION` where ACCESSION is the id of the sequence
  in the reference database that best aligns to the query sequence;
- `obitag_bestid=FLOAT` where FLOAT\*100 is the percentage of identity
  between the best match sequence and the query sequence;
- `taxid=TAXID` where TAXID is the final assignment of the sequence by
  `obitag`;
- `scientific_name=NAME` where NAME is the scientific name of the
  assigned taxid.
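These attributes can also be used for downstream filtering. The sketch
below is an assumption extrapolating the `annotations.<key>` pattern
used with `obigrep` earlier in this tutorial; it would count the
variants whose assignment reached the species rank (the `obitag_rank`
attribute visible in the record shown below):

``` bash
# Hypothetical filter: variants assigned at the species rank by obitag
obigrep -p 'annotations.obitag_rank == "species"' \
        results/wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta \
  | obicount
```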
The first sequence record of `wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta` is:

``` bash
>HELIUM_000100422_612GNAAXX:7:81:18704:12346#0/1_sub[28..126] {"count":88,"merged_sample":{"26a_F040644":88},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"26a_F040644":"h"},"obiclean_weight":{"26a_F040644":208},"obitag_bestid":0.9207920792079208,"obitag_bestmatch":"AY769263","obitag_difference":8,"obitag_match_count":1,"obitag_rank":"clade","scientific_name":"Boreoeutheria","taxid":1437010}
ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata
gcttaaaactcaaaggacttggcggtgctttatatccct
```

### Generate the final result table

Some attributes that are no longer useful can be removed at this stage:

- obiclean_head
- obiclean_headcount
- obiclean_internalcount
- obiclean_samplecount
- obiclean_singletoncount

```{bash}
#| output: false
obiannotate --delete-tag=obiclean_head \
            --delete-tag=obiclean_headcount \
            --delete-tag=obiclean_internalcount \
            --delete-tag=obiclean_samplecount \
            --delete-tag=obiclean_singletoncount \
            results/wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta \
            > results/wolf.ali.assigned.simple.clean.c10.l80.taxo.ann.fasta
```

The first sequence record of
`wolf.ali.assigned.simple.clean.c10.l80.taxo.ann.fasta` is then:

```
>HELIUM_000100422_612GNAAXX:7:84:16335:5083#0/1_sub[28..126] {"count":96,"merged_sample":{"26a_F040644":11,"29a_F260619":85},"obiclean_status":{"26a_F040644":"s","29a_F260619":"h"},"obiclean_weight":{"26a_F040644":14,"29a_F260619":110},"obitag_bestid":0.9595959595959596,"obitag_bestmatch":"AC187326","obitag_difference":4,"obitag_match_count":1,"obitag_rank":"subspecies","scientific_name":"Canis lupus familiaris","taxid":9615}
ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca
gattaaacctcaaaggacttggcagtgctttatacccct
```

This file contains 26 sequences. You can deduce the diet of each sample:

- 13a_F730603: Cervus elaphus
- 15a_F730814: Capreolus capreolus
- 26a_F040644: Marmota sp. (according to the location, it is Marmota marmota)
- 29a_F260619: Capreolus capreolus

Note that we also obtained a few wolf sequences although a wolf-blocking
oligonucleotide was used.
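One way to look at them (a sketch reusing the same filtering pattern as
above, and the *Canis lupus familiaris* assignment visible in the record
shown earlier; the attribute access is an assumption) is:

``` bash
# Sketch: count the variants assigned to the dog/wolf lineage
obigrep -p 'annotations.scientific_name == "Canis lupus familiaris"' \
        results/wolf.ali.assigned.simple.clean.c10.l80.taxo.ann.fasta \
  | obicount
```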
|