Latest version of the doc in HTML

2025-06-29 16:20:46 +00:00 · 2023-01-31 23:04:28 +01:00
parent fa6853fca1
commit 871f22b83d
15 changed files with 1775 additions and 240 deletions
--- a/doc/tutorial.qmd
+++ b/doc/tutorial.qmd
@@ -0,0 +1,563 @@
+# OBITools V4 Tutorial
+
+Here is a short tutorial on how to analyze DNA metabarcoding data produced on Illumina sequencers using:
+
+- the OBITools
+- some basic Unix commands
+
+## Wolves’ diet based on DNA metabarcoding
+
+The data used in this tutorial correspond to the analysis of four wolf scats, using the protocol published in @Shehzad2012-pn for assessing carnivore diet. After extracting DNA from the faeces, the DNA amplifications were carried out using the primers `TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`, amplifying the *12S-V5* region [@Riaz2011-gn], together with a wolf blocking oligonucleotide.
+
+The complete data set can be downloaded here: [the tutorial dataset](wolf_diet.tgz) 
+
+Once the data file is downloaded, unpack the `tgz` archive from a UNIX terminal.
+
+```{bash untar_data}
+#| output: false
+tar zxvf wolf_diet.tgz
+```
+
+That command creates a new directory named `wolf_data` containing all the required data files:
+
+-   `fastq` files resulting from a GA IIx (Illumina) paired-end (2 x 108 bp)
+    sequencing assay of DNA extracted and amplified from four wolf faeces:
+
+    -   `wolf_F.fastq`
+    -   `wolf_R.fastq`
+
+-   the file describing the primers and tags used for all samples
+    sequenced:
+
+    -   `wolf_diet_ngsfilter.txt` The tags correspond to short and
+        specific sequences added on the 5\' end of each primer to
+        distinguish the different samples
+
+-   the file containing the reference database in a fasta format:
+
+    -   `db_v05_r117.fasta` This reference database has been extracted
+        from the release 117 of EMBL using `obipcr`
+
+```{bash true_mk_directory}
+#| output: false
+#| echo: false
+#| error: true
+#| 
+if [[ ! -d results ]] ; then
+  mkdir results
+fi
+```
+
+To keep raw data and processed data separate, a new directory called `results` is created.
+
+```{bash mk_directory}
+#| output: false
+#| eval: false
+mkdir results
+```
+ 
+## Step by step analysis
+
+### Recover full sequence reads from forward and reverse partial reads
+
+When using the result of a paired-end sequencing assay with supposedly
+overlapping forward and reverse reads, the first step is to recover the
+assembled sequence.
+
+The forward and reverse reads of the same fragment are *at the same line
+position* in the two fastq files obtained after sequencing. Based on
+these two files, the assembly of the forward and reverse reads is done
+with the `obipairing` utility that aligns the two reads and returns the
+reconstructed sequence.
+
+In our case, the command is:
+
+```{bash pairing}
+#| output: false
+
+obipairing --min-identity=0.8 \
+           --min-overlap=10 \
+           -F wolf_data/wolf_F.fastq \
+           -R wolf_data/wolf_R.fastq \
+           > results/wolf.fastq 
+```
+
+The `--min-identity` and `--min-overlap` options are used to identify
+read pairs with a low-quality alignment. If, after the alignment,
+the overlapping part of the reads is shorter than 10 base pairs, or the
+similarity over this aligned region is below 80% identity, the forward and
+reverse reads are not aligned but concatenated in the output file, and the value of
+the `mode` attribute in the sequence header is set to `join` instead of `alignment`.
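
To see how many read pairs fall into each category, the `mode` attribute can be tallied with basic Unix commands. The following is a minimal sketch, not an OBITools feature: `mode_tally` is a hypothetical helper name, the three records are made-up toy data, and the code assumes strict four-line FASTQ records whose headers carry JSON-style annotations like those shown in this tutorial.

```bash
# Hypothetical helper (not an OBITools command): tally the values of the
# "mode" attribute found in the header lines of a FASTQ stream on stdin.
# Assumes strict four-line FASTQ records.
mode_tally() {
  awk 'NR % 4 == 1' \
    | sed -n 's/.*"mode":"\([^"]*\)".*/\1/p' \
    | LC_ALL=C sort | uniq -c \
    | awk '{ print $2 "=" $1 }'
}

# Made-up toy records, for illustration only
tally=$(mode_tally <<'EOF'
@read1 {"mode":"alignment","score":484}
acgtacgt
+
CCCCCCCC
@read2 {"mode":"join","score":10}
acgtacgtacgt
+
CCCCCCCCCCCC
@read3 {"mode":"alignment","score":300}
ttga
+
@CCC
EOF
)
printf '%s\n' "$tally"
```

Only header lines are inspected (`NR % 4 == 1`), so a quality line that happens to start with `@` is not mistaken for a header. On the tutorial data, the same function could be applied as `mode_tally < results/wolf.fastq`.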
+
+### Remove unaligned sequence records
+
+Unaligned sequences (`mode=join`)
+cannot be used. The following command removes them from the
+dataset:
+
+```{bash}
+#| output: false
+
+obigrep -p 'annotations.mode != "join"' \
+        results/wolf.fastq > results/wolf.ali.fastq
+```
+
+The `-p` option requires a Go-like expression. `annotations.mode != "join"` means that
+a sequence record is kept only if the value of its `mode` annotation is
+different from `join`.
+
+The first sequence record of `wolf.ali.fastq` can be obtained using the
+following command line:
+
+```{bash}
+#| eval: false
+#| output: false
+
+head -n 4 results/wolf.ali.fastq
+```
+
+The following text appears in your terminal window:
+
+```
+@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1 {"ali_dir":"left","ali_length":62,"mode":"alignment","pairing_mismatches":{"(T:26)->(G:13)":62,"(T:34)->(G:18)":48},"score":484,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
+ccgcctcctttagataccccactatgcttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttatacccttctagaggagcctgttctaaggaggcgg
+
+CCCCCCCBCCCCCCCCCCCCCCCCCCCCCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCCCCCCACCCCCACCCCCCCCCCCCCCCC
+```
+
+### Assign each sequence record to the corresponding sample/marker combination
+
+Each sequence record is assigned to its corresponding sample and marker
+using the data provided in a text file (here `wolf_diet_ngsfilter.txt`).
+This text file contains one line per sample, with the name of the
+experiment (several experiments can be included in the same file), the
+name of the tags (for example: `aattaac` if the same tag has been used
+on each extremity of the PCR products, or `aattaac:gaagtag` if the tags
+were different), the sequence of the forward primer, the sequence of the
+reverse primer, the letter `T` or `F` for sample identification using
+the forward primer and tag only or using both primers and both tags,
+respectively (see `obimultiplex` for details).
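
To make the layout of such a file more concrete, here is a hedged sketch that lists the tags declared for each sample. The column order (experiment, sample, tags, forward primer, reverse primer, `T`/`F` flag) is an assumption made for this illustration, `list_sample_tags` is a hypothetical helper, and the `demo_sample` line is invented; only the tag `gcctcct` for sample `29a_F260619` is taken from the records shown in this tutorial.

```bash
# Hypothetical helper: list sample, forward tag and reverse tag from an
# ngsfilter-like file read on stdin. ASSUMED column order:
# experiment, sample, tags, forward primer, reverse primer, T/F flag.
list_sample_tags() {
  awk '!/^#/ && NF >= 6 {
         n = split($3, tag, ":")
         fwd = tag[1]
         rev = (n == 2) ? tag[2] : tag[1]   # a single tag is used on both ends
         print $2, fwd, rev
       }'
}

# Illustrative lines only; "demo_sample" and its tag pair are made up
tags=$(list_sample_tags <<'EOF'
# experiment sample tags forward_primer reverse_primer flag
wolf_diet 29a_F260619 gcctcct TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG F
wolf_diet demo_sample aattaac:gaagtag TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG F
EOF
)
printf '%s\n' "$tags"
```

A listing like this is a quick way to spot a tag pair declared twice before running the demultiplexing step.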
+
+```{bash}
+#| output: false
+
+obimultiplex -t wolf_data/wolf_diet_ngsfilter.txt \
+             -u results/unidentified.fastq \
+             results/wolf.ali.fastq \
+             > results/wolf.ali.assigned.fastq
+```
+
+This command creates two files:
+
+-   `unidentified.fastq` containing all the sequence records that were
+    not assigned to a sample/marker combination
+-   `wolf.ali.assigned.fastq` containing all the sequence records that
+    were properly assigned to a sample/marker combination
+
+Note that each sequence record of the `wolf.ali.assigned.fastq` file
+contains only the barcode sequence as the sequences of primers and tags
+are removed by the `obimultiplex` program. Information concerning the 
+experiment, sample, primers and tags is added as attributes in the 
+sequence header.
+
+For instance, the first sequence record of `wolf.ali.assigned.fastq` is:
+
+```
+@HELIUM_000100422_612GNAAXX:7:108:5640:3823#0/1_sub[28..127] {"ali_dir":"left","ali_length":62,"direction":"direct","experiment":"wolf_diet","forward_match":"ttagataccccactatgc","forward_mismatches":0,"forward_primer":"ttagataccccactatgc","forward_tag":"gcctcct","mode":"alignment","pairing_mismatches":{"(T:26)->(G:13)":35,"(T:34)->(G:18)":21},"reverse_match":"tagaacaggctcctctag","reverse_mismatches":0,"reverse_primer":"tagaacaggctcctctag","reverse_tag":"gcctcct","sample":"29a_F260619","score":484,"score_norm":0.968,"seq_a_single":46,"seq_ab_match":60,"seq_b_single":46}
+ttagccctaaacacaagtaattaatataacaaaattgttcgccagagtactaccggcaatagcttaaaactcaaaggacttggcggtgctttataccctt
+
+CCCBCCCCCBCCCCCCC<CcCccbe[`F`accXV<TA\RYU\\ee_e[XZ[XEEEEEEEEEE?EEEEEEEEEEDEEEEEEECCCCCCCCCCCCCCCCCCC
+```
+
+### Dereplicate reads into unique sequences
+
+The same DNA molecule can be sequenced several times. To reduce
+both file size and computation time, and to obtain results that are easier
+to interpret, it is convenient to work with unique *sequences* instead of
+*reads*. To *dereplicate* such *reads* into unique *sequences*, we use
+the `obiuniq` command.
+
++-------------------------------------------------------------+
+| Definition: Dereplicate reads into unique sequences         |
++-------------------------------------------------------------+
+| 1.  compare all the reads in a data set to each other       |
+| 2.  group strictly identical reads together                 |
+| 3.  output the sequence for each group and its count in the |
+|     original dataset (in this way, all duplicated reads are |
+|     removed)                                                |
+|                                                             |
+| Definition adapted from @Seguritan2001-tg                   |
++-------------------------------------------------------------+
+
+For dereplication, we use the `obiuniq` command with the `-m sample` option, which
+keeps track of the samples of origin of each unique sequence.
+
+```{bash}
+#| output: false
+
+obiuniq -m sample \
+        results/wolf.ali.assigned.fastq \
+        > results/wolf.ali.assigned.uniq.fasta
+```
+
+Note that `obiuniq` returns a fasta file.
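
The dereplication principle defined above can be sketched with basic Unix commands. This is a toy illustration under stated assumptions, not how `obiuniq` is implemented: `dereplicate` is a hypothetical helper, the input is a simplified made-up format with one read per line (`sample sequence`), and the output mimics the idea of a total count plus per-sample counts.

```bash
# Toy illustration of dereplication (this is NOT how obiuniq is implemented).
# Input on stdin: one read per line as "sample sequence".
# Output: one line per unique sequence and sample, with the total count of
# the sequence and its count in that sample.
dereplicate() {
  awk '{ total[$2]++; per_sample[$2 " " $1]++ }
       END {
         for (key in per_sample) {
           split(key, part, " ")
           print part[1], "count=" total[part[1]], part[2] "=" per_sample[key]
         }
       }' | LC_ALL=C sort
}

# Made-up reads: "acgt" is seen three times across two samples
uniq_table=$(dereplicate <<'EOF'
s1 acgt
s2 acgt
s1 acgt
s2 ttga
EOF
)
printf '%s\n' "$uniq_table"
```

Grouping is done on strictly identical sequences, as in step 2 of the definition, so two reads differing by a single base end up in different groups.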
+
+The first sequence record of `wolf.ali.assigned.uniq.fasta` is:
+
+```
+>HELIUM_000100422_612GNAAXX:7:93:6991:1942#0/1_sub[28..126] {"ali_dir":"left","ali_length":63,"count":1,"direction":"reverse","experiment":"wolf_diet","forward_match":"ttagataccccactatgc","forward_mismatches":0,"forward_primer":"ttagataccccactatgc","forward_tag":"gaatatc","merged_sample":{"26a_F040644":1},"mode":"alignment","pairing_mismatches":{"(A:10)->(G:34)":76,"(C:06)->(A:34)":58},"reverse_match":"tagaacaggctcctctag","reverse_mismatches":0,"reverse_primer":"tagaacaggctcctctag","reverse_tag":"gaatatc","score":730,"score_norm":0.968,"seq_a_single":45,"seq_ab_match":61,"seq_b_single":45}
+ttagccctaaacataaacattcaataaacaagaatgttcgccagagaactactagcaaca
+gcctgaaactcaaaggacttggcggtgctttatatccct
+```
+
+The run of `obiuniq` has
+added two key/value entries to the header of the fasta sequence:
+
+-   `"merged_sample":{"26a_F040644":1}`: this sequence has been found
+    once in a single sample
+    called **26a_F040644**
+-   `"count":1`: the total count for this sequence is $1$
+
+To keep only these two attributes, we can use the `obiannotate` command:
+
+```{bash}
+#| output: false
+
+obiannotate -k count -k merged_sample \
+  results/wolf.ali.assigned.uniq.fasta \
+  > results/wolf.ali.assigned.simple.fasta
+```
+
+The first five sequence records of `wolf.ali.assigned.simple.fasta`
+become:
+
+```
+>HELIUM_000100422_612GNAAXX:7:26:18930:11105#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
+ttagccctaaacacaagtaattaatataacaaaatwattcgcyagagtactacmggcaat
+agctyaaarctcamagrwcttggcggtgctttataccctt
+>HELIUM_000100422_612GNAAXX:7:58:5711:11399#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
+ttagccctaaacacaagtaattaatataacaaaattattcgccagagtwctaccgssaat
+agcttaaaactcaaaggactgggcggtgctttataccctt
+>HELIUM_000100422_612GNAAXX:7:100:15836:9304#0/1_sub[28..127] {"count":1,"merged_sample":{"29a_F260619":1}}
+ttagccctaaacatagataattacacaaacaaaattgttcaccagagtactagcggcaac
+agcttaaaactcaaaggacttggcggtgctttataccctt
+>HELIUM_000100422_612GNAAXX:7:55:13242:9085#0/1_sub[28..126] {"count":4,"merged_sample":{"26a_F040644":4}}
+ttagccctaaacataaacattcaataaacaagagtgttcgccagagtactactagcaaca
+gcctgaaactcaaaggacttggcggtgctttacatccct
+>HELIUM_000100422_612GNAAXX:7:86:8429:13723#0/1_sub[28..127] {"count":7,"merged_sample":{"15a_F730814":5,"29a_F260619":2}}
+ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
+agcttaaaactcaaaggactcggcggtgctttataccctt
+```
+
+### Denoise the sequence dataset
+
+Having a set of sequences assigned to their corresponding samples does
+not mean that all sequences are *biologically* meaningful, i.e. some of
+these sequences can contain PCR and/or sequencing errors, or chimeras.
+
+#### Tag the sequences for PCR errors (sequence variants) {.unnumbered}
+
+The `obiclean` program tags sequence variants as potential errors generated during
+PCR amplification. We ask it to keep only the *head* sequences (`-H` option),
+i.e. sequences that are not variants of another sequence with a count greater than 5% of their own count
+(`-r 0.05` option).
+
+```{bash}
+#| output: false
+
+obiclean -s sample -r 0.05 -H \
+  results/wolf.ali.assigned.simple.fasta \
+      > results/wolf.ali.assigned.simple.clean.fasta 
+```
+
+One of the sequence records of
+`wolf.ali.assigned.simple.clean.fasta` is:
+
+```
+>HELIUM_000100422_612GNAAXX:7:66:4039:8016#0/1_sub[28..127] {"count":17,"merged_sample":{"13a_F730603":17},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"13a_F730603":"h"},"obiclean_weight":{"13a_F730603":25}}
+ctagccttaaacacaaatagttatgcaaacaaaactattcgccagagtactaccggcaac
+agcccaaaactcaaaggacttggcggtgcttcacaccctt
+```
+
+To remove such sequences as much as possible, we first discard rare
+sequences and then sequence variants that likely correspond to
+artifacts.
+
+
+
+#### Get some statistics about sequence counts  {.unnumbered}
+
+```{bash}
+obicount results/wolf.ali.assigned.simple.clean.fasta
+```
+
+The dataset contains $4313$ sequence variants corresponding to $42452$ sequence reads.
+Most of the variants occur only once in the complete dataset and are usually
+called *singletons*.
+
+```{bash}
+obigrep -p 'sequence.Count() == 1' results/wolf.ali.assigned.simple.clean.fasta \
+    | obicount
+```
+
+In this dataset, singletons account for $3511$ variants.
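
The same kind of figures can be cross-checked with basic Unix commands by reading the `count` attribute directly from the fasta headers. The sketch below is only an illustration: `count_stats` is a hypothetical helper, the three records are shortened toy records whose counts (1, 4 and 7) mirror some of the headers shown earlier, and the code assumes each header carries a single `"count":N` entry.

```bash
# Hypothetical helper: summarize the "count" attribute of the fasta headers
# read on stdin (number of variants, total reads, number of singletons).
count_stats() {
  grep '^>' \
    | sed -n 's/.*"count":\([0-9][0-9]*\).*/\1/p' \
    | awk '{ variants++; reads += $1; if ($1 == 1) singletons++ }
           END { printf "variants=%d reads=%d singletons=%d\n", variants, reads, singletons }'
}

# Shortened toy records, for illustration only
stats=$(count_stats <<'EOF'
>seq1 {"count":1,"merged_sample":{"29a_F260619":1}}
ttagccctaaac
>seq2 {"count":4,"merged_sample":{"26a_F040644":4}}
ttagccctaaac
>seq3 {"count":7,"merged_sample":{"15a_F730814":5,"29a_F260619":2}}
ttagccctaaac
EOF
)
printf '%s\n' "$stats"
```

The pattern requires a double quote right before `count`, so attributes such as `obiclean_headcount` are not picked up by mistake.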
+
+Using *R* and the `ROBIFastread` package, which can read the headers of the fasta files produced by *OBITools*,
+we can get more complete statistics on the distribution of occurrences.
+ 
+```{r}
+#| warning: false
+library(ROBIFastread)
+library(ggplot2)
+
+seqs <- read_obifasta("results/wolf.ali.assigned.simple.clean.fasta",keys="count")
+
+ggplot(data = seqs,  mapping=aes(x = count)) +
+  geom_histogram(bins=100) +
+  scale_y_sqrt() +
+  scale_x_sqrt() +
+  geom_vline(xintercept = 10, col="red", lty=2) +
+  xlab("number of occurrences of a variant") 
+```
+
+In a similar way it is also possible to plot the distribution of the sequence length.
+
+```{r}
+#| warning: false
+ggplot(data = seqs,  mapping=aes(x = nchar(sequence))) +
+  geom_histogram() +
+  scale_y_log10() +
+  geom_vline(xintercept = 80, col="red", lty=2) +
+  xlab("sequence lengths in base pair")
+```
+
+
+#### Keep only the sequences having a count greater or equal to 10 and a length of at least 80 bp  {.unnumbered}
+
+Based on the previous observation, we set the cut-off for keeping
+sequences for further analysis to a count of 10. To do this, we use the
+`obigrep` command. The
+`-p 'sequence.Count() >= 10'` option means that the Go-like expression
+`sequence.Count() >= 10` must evaluate to
+true for a sequence to be kept.
+Based on previous knowledge, we also remove sequences with a length
+shorter than 80 bp (`-l 80` option), as we know that the amplified 12S-V5
+barcode for vertebrates must have a length of around 100 bp.
+
+```{bash}
+#| output: false
+
+obigrep -l 80 -p 'sequence.Count() >= 10' results/wolf.ali.assigned.simple.clean.fasta \
+    > results/wolf.ali.assigned.simple.clean.c10.l80.fasta
+```
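
The logic of this double filter can be sketched in plain `awk`, which may help to check a result independently. This is an illustration under assumptions, not a replacement for `obigrep`: `filter_fasta` is a hypothetical helper, the thresholds used in the demo are deliberately small, and the toy records are made up. Sequences split over several lines are joined before their length is measured.

```bash
# Hypothetical helper: keep fasta records read on stdin whose "count"
# attribute is >= MIN_COUNT and whose sequence length is >= MIN_LENGTH.
# Usage: filter_fasta MIN_COUNT MIN_LENGTH < file.fasta
filter_fasta() {
  awk -v min_count="$1" -v min_len="$2" '
    function emit() {
      if (header != "" && count + 0 >= min_count && length(seq) >= min_len)
        print header "\n" seq
    }
    /^>/ {
      emit()                                  # output the previous record
      header = $0; seq = ""; count = 0
      if (match($0, /"count":[0-9]+/))        # the key and colon span 8 characters
        count = substr($0, RSTART + 8, RLENGTH - 8)
      next
    }
    { seq = seq $0 }                          # join wrapped sequence lines
    END { emit() }
  '
}

# Made-up records: only "a" passes both filters (count 12, length 12)
kept=$(filter_fasta 10 8 <<'EOF'
>a {"count":12}
acgtacgt
acgt
>b {"count":3}
acgtacgtacgt
>c {"count":50}
acgt
EOF
)
printf '%s\n' "$kept"
```

With the thresholds of this tutorial, the call would be `filter_fasta 10 80`.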
+
+The first sequence record of `results/wolf.ali.assigned.simple.clean.c10.l80.fasta` is:
+ 
+```
+>HELIUM_000100422_612GNAAXX:7:22:2603:18023#0/1_sub[28..127] {"count":12182,"merged_sample":{"15a_F730814":7559,"29a_F260619":4623},"obiclean_head":true,"obiclean_headcount":2,"obiclean_internalcount":0,"obiclean_samplecount":2,"obiclean_singletoncount":0,"obiclean_status":{"15a_F730814":"h","29a_F260619":"h"},"obiclean_weight":{"15a_F730814":9165,"29a_F260619":6275}}
+ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
+agcttaaaactcaaaggacttggcggtgctttataccctt
+```
+
+At this stage of the data cleaning, the dataset contains:
+
+```{bash}
+obicount results/wolf.ali.assigned.simple.clean.c10.l80.fasta
+```
+
+### Taxonomic assignment of sequences
+
+Once denoising has been done, the next step in diet analysis is to
+assign the barcodes to the corresponding species in order to get the
+complete list of species associated with each sample.
+
+Taxonomic assignment of sequences requires a reference database
+compiling all possible species to be identified in the sample.
+Assignment is then done based on sequence comparison between sample
+sequences and reference sequences.
+
+#### Download the taxonomy  {.unnumbered}
+
+```{bash}
+#| output: false
+mkdir TAXO
+cd TAXO
+curl http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \
+   | tar -zxvf -
+cd ..
+```
+ 
+#### Build a reference database  {.unnumbered}
+
+One way to build the reference database is to use the
+`ecoPCR` program to
+simulate a PCR and to extract all sequences from the EMBL that may be
+amplified *in silico* by the two primers
+(`TTAGATACCCCACTATGC` and `TAGAACAGGCTCCTCTAG`)
+used for PCR amplification.
+
+The full list of steps for building this reference database would then
+be:
+
+1.  Download the whole set of EMBL sequences (available from:
+    <ftp://ftp.ebi.ac.uk/pub/databases/embl/release/>)
+2.  Download the NCBI taxonomy (available from:
+    <ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz>)
+3.  Format them into the ecoPCR format (see
+    `obiconvert` for
+    how you can produce ecoPCR compatible files)
+4.  Use ecoPCR to simulate amplification and build a reference database
+    based on putatively amplified barcodes together with their recorded
+    taxonomic information
+
+As step 1 and step 3 can be really time-consuming (about one day), we
+already provide the reference database produced by the following commands
+so that you can skip its construction. Note that as the EMBL database
+and taxonomic data can evolve daily, if you run the following commands
+you may end up with quite different results.
+
+Any utility allowing file downloading from a ftp site can be used. In
+the following commands, we use the commonly used `wget` *Unix* command.
+
+##### Download the sequences  {.unnumbered}
+
+``` bash
+> mkdir EMBL
+> cd EMBL
+> wget -nH --cut-dirs=4 -Arel_std_\*.dat.gz -m ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
+> cd ..
+```
+
+##### Download the taxonomy  {.unnumbered}
+
+``` bash
+> mkdir TAXO
+> cd TAXO
+> wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
+> tar -zxvf taxdump.tar.gz
+> cd ..
+```
+
+
+##### Use obipcr to simulate an *in silico* PCR  {.unnumbered}
+
+``` bash
+> obipcr -d ./ECODB/embl_last -e 3 -l 50 -L 150 \
+  TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG > v05.ecopcr
+```
+
+Note that the primers must be in the same order both in
+`wolf_diet_ngsfilter.txt` and in the `obipcr` command.
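
The principle of an *in silico* PCR can be illustrated with a few lines of shell. The sketch below is a strong simplification and makes several assumptions: `revcomp` and `in_silico_amplicon` are hypothetical helpers, only exact primer matches on one strand are considered (whereas the `obipcr` command above tolerates mismatches through `-e 3`), and the template is a made-up sequence in which a 10 bp insert is flanked by the two primers of this tutorial. The reverse primer is searched as its reverse complement.

```bash
# Hypothetical helper: reverse complement of a DNA string (A, C, G, T only)
revcomp() {
  printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' \
    | awk '{ out = ""
             for (i = length($0); i > 0; i--) out = out substr($0, i, 1)
             print out }'
}

# Hypothetical helper: print the region located between the forward primer
# and the reverse complement of the reverse primer (exact matches only).
# Usage: in_silico_amplicon TEMPLATE FORWARD_PRIMER REVERSE_PRIMER
in_silico_amplicon() {
  printf '%s\n' "$1" | awk -v f="$2" -v r="$(revcomp "$3")" '{
    s = tolower($0); f = tolower(f); r = tolower(r)
    start = index(s, f)
    if (start == 0) exit                 # forward primer not found
    rest = substr(s, start + length(f))
    stop = index(rest, r)
    if (stop == 0) exit                  # reverse primer not found
    print substr(rest, 1, stop - 1)
  }'
}

# Made-up template: cc + forward primer + acgtacgtac + revcomp(reverse primer) + gg
template=ccttagataccccactatgcacgtacgtacctagaggagcctgttctagg
rc_reverse=$(revcomp TAGAACAGGCTCCTCTAG)
amplicon=$(in_silico_amplicon "$template" TTAGATACCCCACTATGC TAGAACAGGCTCCTCTAG)
printf '%s\n%s\n' "$rc_reverse" "$amplicon"
```

Swapping the two primer arguments makes the search fail on this template, which illustrates why the primer order matters.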
+
+##### Clean the database  {.unnumbered}
+
+1.  filter sequences so that they have a good taxonomic description at
+    the species, genus, and family levels
+    (`obigrep` command below).
+2.  remove redundant sequences  (`obiuniq` command below).
+3.  ensure that the dereplicated sequences have a taxid at the family
+    level (`obigrep` command below).
+4.  ensure that sequences each have a unique identification
+    (`obiannotate` command below)
+
+``` bash
+> obigrep -d embl_last --require-rank=species \
+  --require-rank=genus --require-rank=family v05.ecopcr > v05_clean.fasta
+
+> obiuniq -d embl_last \
+  v05_clean.fasta > v05_clean_uniq.fasta
+
+> obigrep -d embl_last --require-rank=family \
+  v05_clean_uniq.fasta > v05_clean_uniq_clean.fasta
+
+> obiannotate --uniq-id v05_clean_uniq_clean.fasta > db_v05.fasta
+```
+
+The reference database provided with the tutorial is then indexed with `obirefidx`: `obirefidx -t TAXO wolf_data/db_v05_r117.fasta > results/db_v05_r117.indexed.fasta`
+
+
+::: warning
+::: title
+Warning
+:::
+
+From now on, for the sake of clarity, the following commands will use
+the filenames of the files provided with the tutorial. If you decided to
+run the last steps and use the files you have produced, you\'ll have to
+use `db_v05.fasta` instead of `db_v05_r117.fasta` and `embl_last`
+instead of `embl_r117`
+:::
+
+### Assign each sequence to a taxon
+
+Once the reference database is built, taxonomic assignment can be
+carried out using the `obitag` command.
+
+```{bash}
+#| output: false
+obitag -t TAXO -R results/db_v05_r117.indexed.fasta \
+       results/wolf.ali.assigned.simple.clean.c10.l80.fasta \
+       > results/wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta
+```
+
+The `obitag` command adds several attributes to the sequence record header, among
+them:
+
+-   obitag_bestmatch=ACCESSION where ACCESSION is the id of the sequence in
+    the reference database that best aligns to the query sequence;
+-   obitag_bestid=FLOAT where FLOAT\*100 is the percentage of identity
+    between the best match sequence and the query sequence;
+-   taxid=TAXID where TAXID is the final assignment of the sequence by
+    `obitag`
+-   scientific_name=NAME where NAME is the scientific name of the
+    assigned taxid.
+
+The first sequence record of `wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta` is:
+
+``` bash
+>HELIUM_000100422_612GNAAXX:7:81:18704:12346#0/1_sub[28..126] {"count":88,"merged_sample":{"26a_F040644":88},"obiclean_head":true,"obiclean_headcount":1,"obiclean_internalcount":0,"obiclean_samplecount":1,"obiclean_singletoncount":0,"obiclean_status":{"26a_F040644":"h"},"obiclean_weight":{"26a_F040644":208},"obitag_bestid":0.9207920792079208,"obitag_bestmatch":"AY769263","obitag_difference":8,"obitag_match_count":1,"obitag_rank":"clade","scientific_name":"Boreoeutheria","taxid":1437010}
+ttagccctaaacataaacattcaataaacaagaatgttcgccagaggactactagcaata
+gcttaaaactcaaaggacttggcggtgctttatatccct
+```
+
+### Generate the final result table
+
+Some attributes that are no longer useful can be removed at this stage.
+
+- obiclean_head
+- obiclean_headcount
+- obiclean_internalcount
+- obiclean_samplecount
+- obiclean_singletoncount
+
+```{bash}
+#| output: false
+obiannotate  --delete-tag=obiclean_head \
+             --delete-tag=obiclean_headcount \
+             --delete-tag=obiclean_internalcount \
+             --delete-tag=obiclean_samplecount \
+             --delete-tag=obiclean_singletoncount \
+  results/wolf.ali.assigned.simple.clean.c10.l80.taxo.fasta \
+  > results/wolf.ali.assigned.simple.clean.c10.l80.taxo.ann.fasta
+```
+
+The first sequence record of
+`wolf.ali.assigned.simple.clean.c10.l80.taxo.ann.fasta` is then:
+ 
+``` 
+>HELIUM_000100422_612GNAAXX:7:84:16335:5083#0/1_sub[28..126] {"count":96,"merged_sample":{"26a_F040644":11,"29a_F260619":85},"obiclean_status":{"26a_F040644":"s","29a_F260619":"h"},"obiclean_weight":{"26a_F040644":14,"29a_F260619":110},"obitag_bestid":0.9595959595959596,"obitag_bestmatch":"AC187326","obitag_difference":4,"obitag_match_count":1,"obitag_rank":"subspecies","scientific_name":"Canis lupus familiaris","taxid":9615}
+ttagccctaaacataagctattccataacaaaataattcgccagagaactactagcaaca
+gattaaacctcaaaggacttggcagtgctttatacccct
+```
+
+This file contains 26 sequences. You can deduce the diet of each sample:
+
+-   13a_F730603: Cervus elaphus
+-   15a_F730814: Capreolus capreolus
+-   26a_F040644: Marmota sp. (according to the location, it is
+    Marmota marmota)
+-   29a_F260619: Capreolus capreolus
+
+Note that we also obtained a few wolf sequences although a wolf-blocking
+oligonucleotide was used.
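
A per-sample summary of this kind can be derived from the annotated headers with basic Unix commands. The following sketch is only one possible approach and rests on assumptions: `sample_taxon_table` is a hypothetical helper, the two records are shortened versions of headers shown above, and the parsing relies on the flat `"merged_sample":{...}` and `"scientific_name":"..."` layout seen in this tutorial rather than on a real JSON parser.

```bash
# Hypothetical helper: print "sample<TAB>scientific name<TAB>reads" from the
# fasta headers read on stdin, summing reads per sample and taxon.
sample_taxon_table() {
  awk '/^>/ {
         name = "NA"
         if (match($0, /"scientific_name":"[^"]*"/))
           name = substr($0, RSTART + 19, RLENGTH - 20)
         if (match($0, /"merged_sample":[{][^}]*[}]/)) {
           body = substr($0, RSTART + 17, RLENGTH - 18)
           n = split(body, pair, ",")
           for (i = 1; i <= n; i++) {
             split(pair[i], kv, ":")
             gsub(/"/, "", kv[1])               # strip quotes around the sample name
             reads[kv[1] "\t" name] += kv[2]
           }
         }
       }
       END { for (k in reads) print k "\t" reads[k] }' | LC_ALL=C sort
}

# Shortened headers adapted from the records shown in this tutorial
table=$(sample_taxon_table <<'EOF'
>seq1 {"count":88,"merged_sample":{"26a_F040644":88},"scientific_name":"Boreoeutheria","taxid":1437010}
ttagccctaaac
>seq2 {"count":96,"merged_sample":{"26a_F040644":11,"29a_F260619":85},"scientific_name":"Canis lupus familiaris","taxid":9615}
ttagccctaaac
EOF
)
printf '%s\n' "$table"
```

Tabs are used as separators because scientific names contain spaces. Applied to the final annotated file, such a table lists, for each sample, the taxa detected and the number of reads supporting them.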