mirror of
https://github.com/metabarcoding/obitools4.git
synced 2025-06-29 16:20:46 +00:00
564 lines
22 KiB
Plaintext
564 lines
22 KiB
Plaintext
# OBITools V4 Tutorial
|
||
|
||
Here is a short tutorial on how to analyze DNA metabarcoding data produced on Illumina sequencers using:
|
||
|
||
- the OBITools
|
||
- some basic Unix commands
|
||
|
||
## Wolves’ diet based on DNA metabarcoding
|
||
|
||
The data used in this tutorial correspond to the analysis of four wolf scats, using the protocol published in @Shehzad2012-pn for assessing carnivore diet. After extracting DNA from the faeces, the DNA amplifications were carried out using the primers `TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG` amplifiying the *12S-V5* region [@Riaz2011-gn], together with a wolf blocking oligonucleotide.
|
||
|
||
The complete data set can be downloaded here: [the tutorial dataset](wolf_diet.tgz)
|
||
|
||
Once the data file is downloaded, using a UNIX terminal unarchive the data from the `tgz` file.
|
||
|
||
```{bash untar_data}
|
||
#| output: false
|
||
tar zxvf wolf_diet.tgz
|
||
```
|
||
|
||
That command create a new directory named `wolf_data` containing every required data files:
|
||
|
||
- `fastq <fastq>` files resulting of aGA IIx (Illumina) paired-end (2 x 108 bp)
|
||
sequencing assay of DNA extracted and amplified from four wolf faeces:
|
||
|
||
- `wolf_F.fastq`
|
||
- `wolf_R.fastq`
|
||
|
||
- the file describing the primers and tags used for all samples
|
||
sequenced:
|
||
|
||
- `wolf_diet_ngsfilter.txt` The tags correspond to short and
|
||
specific sequences added on the 5\' end of each primer to
|
||
distinguish the different samples
|
||
|
||
- the file containing the reference database in a fasta format:
|
||
|
||
- `db_v05_r117.fasta` This reference database has been extracted
|
||
from the release 117 of EMBL using `obipcr`
|
||
|
||
```{bash true_mk_directory}
|
||
#| output: false
|
||
#| echo: false
|
||
#| error: true
|
||
#|
|
||
if [[ ! -d results ]] ; then
|
||
mkdir results
|
||
fi
|
||
```
|
||
|
||
To not mix raw data and processed data a new directory called `results` is created.
|
||
|
||
```{bash mk_directory}
|
||
#| output: false
|
||
#| eval: false
|
||
mkdir results
|
||
```
|
||
|
||
## Step by step analysis
|
||
|
||
### Recover full sequence reads from forward and reverse partial reads
|
||
|
||
When using the result of a paired-end sequencing assay with supposedly
|
||
overlapping forward and reverse reads, the first step is to recover the
|
||
assembled sequence.
|
||
|
||
The forward and reverse reads of the same fragment are *at the same line
|
||
position* in the two fastq files obtained after sequencing. Based on
|
||
these two files, the assembly of the forward and reverse reads is done
|
||
with the `obipairing` utility that aligns the two reads and returns the
|
||
reconstructed sequence.
|
||
|
||
In our case, the command is:
|
||
|
||
```{bash pairing}
|
||
#| output: false
|
||
|
||
obipairing --min-identity=0.8 \
|
||
--min-overlap=10 \
|
||
-F wolf_data/wolf_F.fastq \
|
||
-R wolf_data/wolf_R.fastq \
|
||
> results/wolf.fastq
|
||
```
|
||
|
||
The `--min-identity` and `--min-overlap` options allow
|
||
discarding sequences with low alignment quality. If after the aligment,
|
||
the overlaping parts of the reads is shorter than 10 base pairs or the
|
||
similarity over this aligned region is below 80% of identity, in the output file,
|
||
the forward and reverse reads are not aligned but concatenated, and the value of
|
||
the `mode` attribute in the sequence header is set to `joined` instead of `alignment`.
|
||
|
||
### Remove unaligned sequence records
|
||
|
||
Unaligned sequences (:py`mode=joined`{.interpreted-text role="mod"})
|
||
cannot be used. The following command allows removing them from the
|
||
dataset:
|
||
|
||
```{bash}
|
||
#| output: false
|
||
|
||
obigrep -p 'annotations.mode != "join"' \
|
||
results/wolf.fastq > results/wolf.ali.fastq
|
||
```
|
||
|
||
The `-p` requires a go like expression. `annotations.mode != "join"` means that
|
||
if the value of the `mode` annotation of a sequence is
|
||
different from `join`, the corresponding sequence record will be kept.
|
||
|
||
The first sequence record of `wolf.ali.fastq` can be obtained using the
|
||
following command line:
|
||
|
||
```{bash}
|
||
#| eval: false
|
||
#| output: false
|
||
|
||
head -n 4 results/wolf.ali.fastq
|
||
```
|
||
|
||
The folling piece of code appears on thew window of tour terminal.
|
||
|
||
```
|
||
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {"ali_dir":"left","ali_length":62,"mode":"alignment","pairing_mismatches":{"(T:26)->(G:13)":62,"(T:34)->(G:18)":48},"score":484,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
|
||
ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg
|
||
+
|
||
CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC
|
||
```
|
||
|
||
### Assign each sequence record to the corresponding sample/marker combination
|
||
|
||
Each sequence record is assigned to its corresponding sample and marker
|
||
using the data provided in a text file (here `wolf_diet_ngsfilter.txt`).
|
||
This text file contains one line per sample, with the name of the
|
||
experiment (several experiments can be included in the same file), the
|
||
name of the tags (for example: `aattaac` if the same tag has been used
|
||
on each extremity of the PCR products, or `aattaac:gaagtag` if the tags
|
||
were different), the sequence of the forward primer, the sequence of the
|
||
reverse primer, the letter `T` or `F` for sample identification using
|
||
the forward primer and tag only or using both primers and both tags,
|
||
respectively (see `obimultiplex` for details).
|
||
|
||
```{bash}
|
||
#| output: false
|
||
|
||
obimultiplex -t wolf_data/wolf_diet_ngsfilter.txt \
|
||
-u results/unidentified.fastq \
|
||
results/wolf.ali.fastq \
|
||
> results/wolf.ali.assigned.fastq
|
||
```
|
||
|
||
This command creates two files:
|
||
|
||
- `unidentified.fastq` containing all the sequence records that were
|
||
not assigned to a sample/marker combination
|
||
- `wolf.ali.assigned.fastq` containing all the sequence records that
|
||
were properly assigned to a sample/marker combination
|
||
|
||
Note that each sequence record of the `wolf.ali.assigned.fastq` file
|
||
contains only the barcode sequence as the sequences of primers and tags
|
||
are removed by the `obimultiplex ` program. Information concerning the
|
||
experiment, sample, primers and tags is added as attributes in the
|
||
sequence header.
|
||
|
||
For instance, the first sequence record of `wolf.ali.assigned.fastq` is:
|
||
|
||
```
|
||
@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] {"ali_dir":"left","ali_length":62,"direction":"direct","experiment":"wolf_diet","forward_match":"ttagataccccactatgc","forward_mismatches":0,"forward_primer":"ttagataccccactatgc","forward_tag":"gcctcct","mode":"alignment","pairing_mismatches":{"(T:26)->(G:13)":35,"(T:34)->(G:18)":21},"reverse_match":"tagaacaggctcctctag","reverse_mismatches":0,"reverse_primer":"tagaacaggctcctctag","reverse_tag":"gcctcct","sample":"29a_F260619","score":484,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
|
||
ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
|
||
+
|
||
CCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC
|
||
```
|
||
|
||
### Dereplicate reads into uniq sequences
|
||
|
||
The same DNA molecule can be sequenced several times. In order to reduce
|
||
both file size and computations time, and to get easier interpretable
|
||
results, it is convenient to work with unique *sequences* instead of
|
||
*reads*. To *dereplicate* such *reads* into unique *sequences*, we use
|
||
the `obiuniq` command.
|
||
|
||
+-------------------------------------------------------------+
|
||
| Definition: Dereplicate reads into unique sequences |
|
||
+-------------------------------------------------------------+
|
||
| 1. compare all the reads in a data set to each other |
|
||
| 2. group strictly identical reads together |
|
||
| 3. output the sequence for each group and its count in the |
|
||
| original dataset (in this way, all duplicated reads are |
|
||
| removed) |
|
||
| |
|
||
| Definition adapted from @Seguritan2001-tg |
|
||
+-------------------------------------------------------------+
|
||
|
||
For dereplication, we use the `obiuniq ` command with the `-m sample`. The `-m sample` option is used
|
||
to keep the information of the samples of origin for each uniquesequence.
|
||
|
||
```{bash}
|
||
#| output: false
|
||
|
||
obiuniq -m sample \
|
||
results/wolf.ali.assigned.fastq \
|
||
> results/wolf.ali.assigned.uniq.fasta
|
||
```
|
||
|
||
Note that `obiuniq` returns a fasta file.
|
||
|
||
The first sequence record of `wolf.ali.assigned.uniq.fasta` is:
|
||
|
||
```
|
||
>HELIUM_000100422_612GNAAXX:7:93:6991:1942#0/1_sub[28..126] {"ali_dir":"left","ali_length":63,"count":1,"direction":"reverse","experiment":"wolf_diet","forward_match":"ttagataccccactatgc","forward_mismatches":0,"forward_primer":"ttagataccccactatgc","forward_tag":"gaatatc","merged_sample":{"26a_F040644":1},"mode":"alignment","pairing_mismatches":{"(A:10)->(G:34)":76,"(C:06)->(A:34)":58},"reverse_match":"tagaacaggctcctctag","reverse_mismatches":0,"reverse_primer":"tagaacaggctcctctag","reverse_tag":"gaatatc","score":730,"score_norm":0.968,"seq_a_single":45,"seq_ab_match":61,"seq_b_single":45}
|
||
ttagccctaaacataaacattcaataaacaagaatgttcgccagagaactactagcaaca
|
||
gcctgaaactcaaaggacttggcggtgctttatatccct
|
||
```
|
||
|
||
The run of `obiuniq` has
|
||
added two key=values entries in the header of the fasta sequence:
|
||
|
||
- `"merged_sample":{"29a_F260619":1}`{.interpreted-text
|
||
role="mod"}: this sequence have been found once in a single sample
|
||
called **29a_F260619**
|
||
- `"count":1` : the total count for this sequence is $1$
|
||
|
||
To keep only these two attributes, we can use the `obiannotate` command:
|
||
|
||
```{bash}
|
||
#| output: false
|
||
|
||
obiannotate -k count -k merged_sample \
|
||
results/wolf.ali.assigned.uniq.fasta \
|
||
> results/wolf.ali.assigned.simple.fasta
|
||
```
|
||
|
||
The first five sequence records of `wolf.ali.assigned.simple.fasta`
|
||
become:
|
||
|
||
```
|
||
>HELIUM_000100422_612GNAAXX:7:26:18930:11105#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
|
||
ttagccctaaacacaagtaattaatataacaaaatwattcgcyagagtactacmggcaat
|
||
agctyaaarctcamagrwcttggcggtgctttataccctt
|
||
>HELIUM_000100422_612GNAAXX:7:58:5711:11399#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
|
||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtwctaccgssaat
|
||
agcttaaaactcaaaggactgggcggtgctttataccctt
|
||
>HELIUM_000100422_612GNAAXX:7:100:15836:9304#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
|
||
ttagccctaaacatagataattacacaaacaaaattgttcaccagagtactagcggcaac
|
||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
||
>HELIUM_000100422_612GNAAXX:7:55:13242:9085#0/1_sub[28..126] {"count":4,"merged_sample":{"26a_F040644":4}}
|
||
ttagccctaaacataaacattcaataaacaagagtgttcgccagagtactactagcaaca
|
||
gcctgaaactcaaaggacttggcggtgctttacatccct
|
||
>HELIUM_000100422_612GNAAXX:7:86:8429:13723#0/1_sub[28..127] {"count":7,"merged_sample":{"15a_F730814":5,"29a_F260619":2}}
|
||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
||
agcttaaaactcaaaggactcggcggtgctttataccctt
|
||
```
|
||
|
||
### Denoise the sequence dataset
|
||
|
||
To have a set of sequences assigned to their corresponding samples does
|
||
not mean that all sequences are *biologically* meaningful i.e. some of
|
||
these sequences can contains PCR and/or sequencing errors, or chimeras.
|
||
|
||
#### Tag the sequences for PCR errors (sequence variants) {.unnumbered}
|
||
|
||
The `obiclean` program tags sequence variants as potential error generated during
|
||
PCR amplification. We ask it to keep the [head]{.title-ref} sequences (`-H` option)
|
||
that are sequences which are not variants of another sequence with a count greater than 5% of their own count
|
||
(`-r 0.05` option).
|
||
|
||
```{bash}
|
||
#| output: false
|
||
|
||
obiclean -s sample -r 0.05 -H \
|
||
results/wolf.ali.assigned.simple.fasta \
|
||
> results/wolf.ali.assigned.simple.clean.fasta
|
||
```
|
||
|
||
One of the sequence records of
|
||
`wolf.ali.assigned.simple.clean.fasta` is:
|
||
|
||
```
|
||
>HELIUM_000100422_612GNAAXX:7:66:4039:8016#0/1_sub[28..127] {"count":17,"merged_sample":{"13a_F730603":17},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obi
|
||
clean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"13a_F730603":"h"},"obiclean_weight":{"13a_F730603":25}}
|
||
ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaac
|
||
agcccaaaactcaaaggacttggcggtgcttcacaccctt
|
||
```
|
||
|
||
To remove such sequences as much as possible, we first discard rare
|
||
sequences and then rsequence variants that likely correspond to
|
||
artifacts.
|
||
|
||
|
||
|
||
#### Get some statistics about sequence counts {.unnumbered}
|
||
|
||
```{bash}
|
||
obicount results/wolf.ali.assigned.simple.clean.fasta
|
||
```
|
||
|
||
The dataset contains $4313$ sequences variant corresponding to 42452 sequence reads.
|
||
Most of the variants occur only a single time in the complete dataset and are usualy
|
||
named *singletons*
|
||
|
||
```{bash}
|
||
obigrep -p 'sequence.Count() == 1' results/wolf.ali.assigned.simple.clean.fasta \
|
||
| obicount
|
||
```
|
||
|
||
In that dataset sigletons corresponds to $3511$ variants.
|
||
|
||
Using *R* and the `ROBIFastread` package able to read headers of the fasta files produced by *OBITools*,
|
||
we can get more complete statistics on the distribution of occurrencies.
|
||
|
||
```{r}
|
||
#| warning: false
|
||
library(ROBIFastread)
|
||
library(ggplot2)
|
||
|
||
seqs <- read_obifasta("results/wolf.ali.assigned.simple.clean.fasta",keys="count")
|
||
|
||
ggplot(data = seqs, mapping=aes(x = count)) +
|
||
geom_histogram(bins=100) +
|
||
scale_y_sqrt() +
|
||
scale_x_sqrt() +
|
||
geom_vline(xintercept = 10, col="red", lty=2) +
|
||
xlab("number of occurrencies of a variant")
|
||
```
|
||
|
||
In a similar way it is also possible to plot the distribution of the sequence length.
|
||
|
||
```{r}
|
||
#| warning: false
|
||
ggplot(data = seqs, mapping=aes(x = nchar(sequence))) +
|
||
geom_histogram() +
|
||
scale_y_log10() +
|
||
geom_vline(xintercept = 80, col="red", lty=2) +
|
||
xlab("sequence lengths in base pair")
|
||
```
|
||
|
||
|
||
#### Keep only the sequences having a count greater or equal to 10 and a length shorter than 80 bp {.unnumbered}
|
||
|
||
Based on the previous observation, we set the cut-off for keeping
|
||
sequences for further analysis to a count of 10. To do this, we use the
|
||
`obigrep <scripts/obigrep>`{.interpreted-text role="doc"} command. The
|
||
`-p 'count>=10'` option means that the `python` expression
|
||
:py`count>=10`{.interpreted-text role="mod"} must be evaluated to
|
||
:py`True`{.interpreted-text role="mod"} for each sequence to be kept.
|
||
Based on previous knowledge we also remove sequences with a length
|
||
shorter than 80 bp (option -l) as we know that the amplified 12S-V5
|
||
barcode for vertebrates must have a length around 100bp.
|
||
|
||
```{bash}
|
||
#| output: false
|
||
|
||
obigrep -l 80 -p 'sequence.Count() >= 10' results/wolf.ali.assigned.simple.clean.fasta \
|
||
> results/wolf.ali.assigned.simple.clean.c10.l80.fasta
|
||
```
|
||
|
||
The first sequence record of `results/wolf.ali.assigned.simple.clean.c10.l80.fasta` is:
|
||
|
||
```
|
||
>HELIUM_000100422_612GNAAXX:7:22:2603:18023#0/1_sub[28..127] {"count":12182,"merged_sample":{"15a_F730814":7559,"29a_F260619":4623},"obiclean_head":true,"obiclean_headcount":2,"obiclean_internalcount":0,"obiclean_samplecount":2,"obiclean_singletoncount":0,"obiclean_status":{"15a_F730814":"h","29a_F260619":"h"},"obiclean_weight":{"15a_F730814":9165,"29a_F260619":6275}}
|
||
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
|
||
agcttaaaactcaaaggacttggcggtgctttataccctt
|
||
```
|
||
|
||
At that time in the data cleanning we have conserved :
|
||
|
||
```{bash}
|
||
obicount results/wolf.ali.assigned.simple.clean.c10.l80.fasta
|
||
```
|
||
|
||
### Taxonomic assignment of sequences
|
||
|
||
Once denoising has been done, the next step in diet analysis is to
|
||
assign the barcodes to the corresponding species in order to get the
|
||
complete list of species associated to each sample.
|
||
|
||
Taxonomic assignment of sequences requires a reference database
|
||
compiling all possible species to be identified in the sample.
|
||
Assignment is then done based on sequence comparison between sample
|
||
sequences and reference sequences.
|
||
|
||
#### Download the taxonomy {.unnumbered}
|
||
|
||
```{bash}
|
||
#| output: false
|
||
mkdir TAXO
|
||
cd TAXO
|
||
curl http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \
|
||
| tar -zxvf -
|
||
cd ..
|
||
```
|
||
|
||
#### Build a reference database {.unnumbered}
|
||
|
||
One way to build the reference database is to use the
|
||
`ecoPCR <scripts/ecoPCR>`{.interpreted-text role="doc"} program to
|
||
simulate a PCR and to extract all sequences from the EMBL that may be
|
||
amplified [in silico]{.title-ref} by the two primers
|
||
([TTAGATACCCCACTATGC]{.title-ref} and [TAGAACAGGCTCCTCTAG]{.title-ref})
|
||
used for PCR amplification.
|
||
|
||
The full list of steps for building this reference database would then
|
||
be:
|
||
|
||
1. Download the whole set of EMBL sequences (available from:
|
||
<ftp://ftp.ebi.ac.uk/pub/databases/embl/release/>)
|
||
2. Download the NCBI taxonomy (available from:
|
||
<ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz>)
|
||
3. Format them into the ecoPCR format (see
|
||
`obiconvert <scripts/obiconvert>`{.interpreted-text role="doc"} for
|
||
how you can produce ecoPCR compatible files)
|
||
4. Use ecoPCR to simulate amplification and build a reference database
|
||
based on putatively amplified barcodes together with their recorded
|
||
taxonomic information
|
||
|
||
As step 1 and step 3 can be really time-consuming (about one day), we
|
||
alredy provide the reference database produced by the following commands
|
||
so that you can skip its construction. Note that as the EMBL database
|
||
and taxonomic data can evolve daily, if you run the following commands
|
||
you may end up with quite different results.
|
||
|
||
Any utility allowing file downloading from a ftp site can be used. In
|
||
the following commands, we use the commonly used `wget` *Unix* command.
|
||
|
||
##### Download the sequences {.unnumbered}
|
||
|
||
``` bash
|
||
> mkdir EMBL
|
||
> cd EMBL
|
||
> wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
|
||
> cd ..
|
||
```
|
||
|
||
##### Download the taxonomy {.unnumbered}
|
||
|
||
``` bash
|
||
> mkdir TAXO
|
||
> cd TAXO
|
||
> wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
|
||
> tar -zxvf taxdump.tar.gz
|
||
> cd ..
|
||
```
|
||
|
||
|
||
##### Use obipcr to simulate an in silico\` PCR {.unnumbered}
|
||
|
||
``` bash
|
||
> obipcr -d ./ECODB/embl_last -e 3 -l 50 -L 150 \
|
||
TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
|
||
```
|
||
|
||
Note that the primers must be in the same order both in
|
||
`wolf_diet_ngsfilter.txt` and in the `obipcr` command.
|
||
|
||
##### Clean the database {.unnumbered}
|
||
|
||
1. filter sequences so that they have a good taxonomic description at
|
||
the species, genus, and family levels
|
||
(`obigrep` command command below).
|
||
2. remove redundant sequences (`obiuniq` command below).
|
||
3. ensure that the dereplicated sequences have a taxid at the family
|
||
level (`obigrep` command below).
|
||
4. ensure that sequences each have a unique identification
|
||
(`obiannotate` command below)
|
||
|
||
``` bash
|
||
> obigrep -d embl_last --require-rank=species \
|
||
--require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta
|
||
|
||
> obiuniq -d embl_last \
|
||
v05_clean.fasta > v05_clean_uniq.fasta
|
||
|
||
> obigrep -d embl_last --require-rank=family \
|
||
v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta
|
||
|
||
> obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta
|
||
```
|
||
|
||
obirefidx -t TAXO wolf_data/db_v05_r117.fasta > results/db_v05_r117.indexed.fasta
|
||
|
||
|
||
::: warning
|
||
::: title
|
||
Warning
|
||
:::
|
||
|
||
From now on, for the sake of clarity, the following commands will use
|
||
the filenames of the files provided with the tutorial. If you decided to
|
||
run the last steps and use the files you have produced, you\'ll have to
|
||
use `db_v05.fasta` instead of `db_v05_r117.fasta` and `embl_last`
|
||
instead of `embl_r117`
|
||
:::
|
||
|
||
### Assign each sequence to a taxon
|
||
|
||
Once the reference database is built, taxonomic assignment can be
|
||
carried out using the `obitag` command.
|
||
|
||
```{bash}
|
||
#| output: false
|
||
obitag -t TAXO -R wolf_data/db_v05_r117.indexed.fasta \
|
||
results/wolf.ali.assigned.simple.clean.c10.l80.fasta \
|
||
> results/wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta
|
||
```
|
||
|
||
The `obitag` adds several attributes in the sequence record header, among
|
||
them:
|
||
|
||
- obitag_bestmatch=ACCESSION where ACCESSION is the id of hte sequence in
|
||
the reference database that best aligns to the query sequence;
|
||
- obitag_bestid=FLOAT where FLOAT\*100 is the percentage of identity
|
||
between the best match sequence and the query sequence;
|
||
- taxid=TAXID where TAXID is the final assignation of the sequence by
|
||
`obitag`
|
||
- scientific_name=NAME where NAME is the scientific name of the
|
||
assigned taxid.
|
||
|
||
The first sequence record of `wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta` is:
|
||
|
||
``` bash
|
||
>HELIUM_000100422_612GNAAXX:7:81:18704:12346#0/1_sub[28..126] {"count":88,"merged_sample":{"26a_F040644":88},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"26a_F040644":"h"},"obiclean_weight":{"26a_F040644":208},"obitag_bestid":0.9207920792079208,"obitag_bestmatch":"AY769263","obitag_difference":8,"obitag_match_count":1,"obitag_rank":"clade","scientific_name":"Boreoeutheria","taxid":1437010}
|
||
ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata
|
||
gcttaaaactcaaaggacttggcggtgctttatatccct
|
||
```
|
||
|
||
### Generate the final result table
|
||
|
||
Some unuseful attributes can be removed at this stage.
|
||
|
||
- obiclean_head
|
||
- obiclean_headcount
|
||
- obiclean_internalcount
|
||
- obiclean_samplecount
|
||
- obiclean_singletoncount
|
||
|
||
```{bash}
|
||
#| output: false
|
||
obiannotate --delete-tag=obiclean_head \
|
||
--delete-tag=obiclean_headcount \
|
||
--delete-tag=obiclean_internalcount \
|
||
--delete-tag=obiclean_samplecount \
|
||
--delete-tag=obiclean_singletoncount \
|
||
results/wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta \
|
||
> results/wolf.ali.assigned.simple.clean.c10.l80.taxo.ann.fasta
|
||
```
|
||
|
||
The first sequence record of
|
||
`wolf.ali.assigned.simple.c10.l80.clean.taxo.ann.fasta` is then:
|
||
|
||
```
|
||
>HELIUM_000100422_612GNAAXX:7:84:16335:5083#0/1_sub[28..126] {"count":96,"merged_sample":{"26a_F040644":11,"29a_F260619":85},"obiclean_status":{"26a_F040644":"s","29a_F260619":"h"},"obiclean_weight":{"26a_F040644":14,"29a_F260619":110},"obitag_bestid":0.9595959595959596,"obitag_bestmatch":"AC187326","obitag_difference":4,"obitag_match_count":1,"obitag_rank":"subspecies","scientific_name":"Canis lupus familiaris","taxid":9615}
|
||
ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca
|
||
gattaaacctcaaaggacttggcagtgctttatacccct
|
||
```
|
||
|
||
This file contains 26 sequences. You can deduce the diet of each sample:
|
||
|
||
: - 13a_F730603: Cervus elaphus
|
||
- 15a_F730814: Capreolus capreolus
|
||
- 26a_F040644: Marmota sp. (according to the location, it is
|
||
Marmota marmota)
|
||
- 29a_F260619: Capreolus capreolus
|
||
|
||
Note that we also obtained a few wolf sequences although a wolf-blocking
|
||
oligonucleotide was used.
|