mirror of
https://github.com/metabarcoding/obitools4.git
synced 2025-12-09 01:00:26 +00:00
Complement to the documentation
Former-commit-id: 89952a6f3bb261a6aaec24430906e635914ffce4
@@ -213,7 +213,7 @@
|
||||
|
||||
\begin{document}
|
||||
\maketitle
|
||||
\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[frame hidden, borderline west={3pt}{0pt}{shadecolor}, sharp corners, boxrule=0pt, enhanced, interior hidden, breakable]}{\end{tcolorbox}}\fi
|
||||
\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[breakable, interior hidden, enhanced, boxrule=0pt, borderline west={3pt}{0pt}{shadecolor}, sharp corners, frame hidden]}{\end{tcolorbox}}\fi
|
||||
|
||||
\renewcommand*\contentsname{Table of contents}
|
||||
{
|
||||
@@ -368,12 +368,12 @@ take into account the taxonomic annotations, ultimately allowing sorting
|
||||
and filtering of sequence records based on the taxonomy.
|
||||
|
||||
\hypertarget{installation-of-the-obitools}{%
|
||||
\chapter{Installation of the
|
||||
obitools}\label{installation-of-the-obitools}}
|
||||
\chapter{\texorpdfstring{Installation of the
|
||||
\emph{OBITools}}{Installation of the OBITools}}\label{installation-of-the-obitools}}
|
||||
|
||||
\hypertarget{availability-of-the-obitools}{%
|
||||
\section{Availability of the
|
||||
OBITools}\label{availability-of-the-obitools}}
|
||||
\section{\texorpdfstring{Availability of the
|
||||
\emph{OBITools}}{Availability of the OBITools}}\label{availability-of-the-obitools}}
|
||||
|
||||
The \emph{OBITools} are open source and protected by the
|
||||
\href{http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html}{CeCILL
|
||||
@@ -389,7 +389,7 @@ downloaded from the metabarcoding git server
|
||||
|
||||
The \emph{OBITools4} are developed using the \href{https://go.dev/}{Go
|
||||
programming language}. We stick to the latest version of the language,
|
||||
today the \(1.19.5\). If you want to download and compile the sources
|
||||
currently \(1.21.4\). If you want to download and compile the sources
|
||||
yourself, you first need to install the corresponding compiler on your
|
||||
system. Some parts of the software are also written in C, therefore a recent
|
||||
C compiler is also required: GCC on Linux or Windows, the Developer
|
||||
@@ -402,6 +402,47 @@ C compiler is available on your system.
|
||||
\section{Installation with the install
|
||||
script}\label{installation-with-the-install-script}}
|
||||
|
||||
An installation script that compiles the new \emph{OBITools} on your
|
||||
Unix-like system is available online. The easiest way to run it is to
|
||||
copy and paste the following command into your terminal
|
||||
|
||||
\begin{Shaded}
|
||||
\begin{Highlighting}[]
|
||||
\ExtensionTok{curl} \AttributeTok{{-}L}\NormalTok{ https://metabarcoding.org/obitools4/install.sh }\KeywordTok{|} \FunctionTok{bash}
|
||||
\end{Highlighting}
|
||||
\end{Shaded}
|
||||
|
||||
By default, the script installs the \emph{OBITools} commands and other
|
||||
associated files into the \texttt{/usr/local} directory. The names of
|
||||
the commands in the new \emph{OBITools4} are mostly identical to those
|
||||
in \emph{OBITools2}. Therefore, installing the new \emph{OBITools} may
|
||||
hide or delete the old ones. If you want both versions to be available
|
||||
on your system, the installation script offers two options:
|
||||
|
||||
\begin{quote}
|
||||
-i, --install-dir Directory where \emph{OBITools} are installed (for
|
||||
example, use \texttt{/usr/local}, not \texttt{/usr/local/bin}).
|
||||
|
||||
-p, --obitools-prefix Prefix added to the \emph{OBITools} command names
|
||||
if you want to have several versions of obitools at the same time on
|
||||
your system (for example, \texttt{-p\ g} will produce a \texttt{gobigrep}
|
||||
command instead of \texttt{obigrep}).
|
||||
\end{quote}
|
||||
|
||||
You can pass these options to the installation script as follows:
|
||||
|
||||
\begin{Shaded}
|
||||
\begin{Highlighting}[]
|
||||
\ExtensionTok{curl} \AttributeTok{{-}L}\NormalTok{ https://metabarcoding.org/obitools4/install.sh }\KeywordTok{|} \DataTypeTok{\textbackslash{}}
|
||||
\FunctionTok{bash} \AttributeTok{{-}s} \AttributeTok{{-}{-}} \AttributeTok{{-}{-}install{-}dir}\NormalTok{ test\_install }\AttributeTok{{-}{-}obitools{-}prefix}\NormalTok{ k}
|
||||
\end{Highlighting}
|
||||
\end{Shaded}
|
||||
|
||||
In this case, the binaries will be installed in the
|
||||
\texttt{test\_install} directory and all command names will be prefixed
|
||||
with the letter \texttt{k}. Thus \texttt{obigrep} will be named
|
||||
\texttt{kobigrep}.
|
||||
|
||||
\hypertarget{compilation-from-sources}{%
|
||||
\section{Compilation from sources}\label{compilation-from-sources}}
|
||||
|
||||
@@ -409,25 +450,27 @@ script}\label{installation-with-the-install-script}}
|
||||
\chapter{\texorpdfstring{File formats usable with
|
||||
\emph{OBITools}}{File formats usable with OBITools}}\label{file-formats-usable-with-obitools}}
|
||||
|
||||
OBITools manipulate have to manipulate DNA sequence data and taxonomical
|
||||
data. They can use some supplentary metadata describing the experiment
|
||||
and produce some stats about the processed DNA data. All the manipulated
|
||||
data are stored in text files, following standard data format.
|
||||
\emph{OBITools} have to manipulate DNA sequence data and
|
||||
taxonomic data. They can use some supplementary metadata describing the
|
||||
experiment and produce some statistics about the processed DNA data. All the
|
||||
manipulated data are stored in text files, following standard data
|
||||
formats.
|
||||
|
||||
\hypertarget{the-dna-sequence-data}{%
|
||||
\chapter{The DNA sequence data}\label{the-dna-sequence-data}}
|
||||
\section{The DNA sequence data}\label{the-dna-sequence-data}}
|
||||
|
||||
Sequences can be stored following various format. OBITools knows some of
|
||||
them. The central formats for sequence files manipulated by OBITools
|
||||
scripts are the
|
||||
\protect\hyperlink{the-fasta-sequence-format}{\texttt{fasta}} and
|
||||
\protect\hyperlink{the-fastq-sequence-format}{\texttt{fastq}} format.
|
||||
OBITools extends the both these formats by specifying a syntax to
|
||||
include in the definition line data qualifying the sequence. All file
|
||||
formats use the \texttt{IUPAC} code for encoding nucleotides.
|
||||
Sequences can be stored in various formats. \emph{OBITools} know
|
||||
some of them. The central formats for sequence files manipulated by
|
||||
\emph{OBITools} scripts are the
|
||||
\protect\hyperlink{sec-fasta}{\emph{FASTA}} and
|
||||
\protect\hyperlink{sec-fastq}{\emph{FASTQ}} formats. \emph{OBITools}
|
||||
extend both of these formats by specifying a syntax to include, in the
|
||||
definition line, data qualifying the sequence. All file formats use the
|
||||
\protect\hyperlink{sec-iupac}{\texttt{IUPAC}} code for encoding
|
||||
nucleotides.
|
||||
|
||||
In addition to these two formats, which can be used as input and output formats,
|
||||
\textbf{OBITools4} can read the following format :
|
||||
\emph{OBITools4} can read the following formats:
|
||||
|
||||
\begin{itemize}
|
||||
\tightlist
|
||||
@@ -442,11 +485,11 @@ Moreover these two formats that can be used as input and output formats,
|
||||
output files}
|
||||
\end{itemize}
|
||||
|
||||
\hypertarget{the-iupac-code}{%
|
||||
\section{The IUPAC Code}\label{the-iupac-code}}
|
||||
\hypertarget{sec-iupac}{%
|
||||
\subsection{The IUPAC Code}\label{sec-iupac}}
|
||||
|
||||
The International Union of Pure and Applied Chemistry (IUPAC\_) defined
|
||||
the standard code for representing protein or DNA sequences.
|
||||
The International Union of Pure and Applied Chemistry (\href{}{IUPAC})
|
||||
defined the standard code for representing protein or DNA sequences.
|
||||
|
||||
\begin{longtable}[]{@{}ll@{}}
|
||||
\toprule()
|
||||
@@ -473,22 +516,24 @@ N & Any base (A, C, G, T, or U) \\
|
||||
\end{longtable}
|
||||
|
||||
\hypertarget{sec-fasta}{%
|
||||
\section{\texorpdfstring{The \emph{fasta} sequence
|
||||
format}{The fasta sequence format}}\label{sec-fasta}}
|
||||
\subsection{\texorpdfstring{The \emph{FASTA} sequence
|
||||
format}{The FASTA sequence format}}\label{sec-fasta}}
|
||||
|
||||
The \textbf{fasta format} is certainly the most widely used sequence
|
||||
file format. This is certainly due to its great simplicity. It was
|
||||
originally created for the Lipman and Pearson
|
||||
\href{http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation}{FASTA
|
||||
program}. OBITools use in more of the classical \texttt{fasta} format an
|
||||
The \protect\hyperlink{sec-fasta}{\emph{FASTA}} format is certainly the
|
||||
most widely used sequence file format. This is certainly due to its
|
||||
great simplicity. It was originally created for the Lipman and Pearson
|
||||
\href{http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation}{\texttt{FASTA}
|
||||
program}. In addition to the classical
|
||||
\protect\hyperlink{sec-fasta}{\emph{FASTA}} format, \emph{OBITools} use an
|
||||
\texttt{extended\ version} of this format where structured data are
|
||||
included in the title line.
|
||||
|
||||
In \emph{fasta} format a sequence is represented by a title line
|
||||
beginning with a \textbf{\texttt{\textgreater{}}} character and the
|
||||
sequences by itself following the :doc:\texttt{iupac} code. The sequence
|
||||
is usually split other severals lines of the same length (expect for the
|
||||
last one)
|
||||
In \protect\hyperlink{sec-fasta}{\emph{FASTA}} format a sequence is
|
||||
represented by a title line beginning with a
|
||||
\textbf{\texttt{\textgreater{}}} character, followed by the sequence itself
|
||||
written with the \protect\hyperlink{sec-iupac}{\texttt{IUPAC}} code. The
|
||||
sequence is usually split over several lines of the same length
|
||||
(except for the last one).
|
||||
|
||||
\begin{verbatim}
|
||||
>my_sequence this is my pretty sequence
|
||||
@@ -520,34 +565,45 @@ GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
|
||||
AACGACGTTGCAGTACGTTGCAGT
|
||||
\end{verbatim}
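
The layout described above (a title line starting with \texttt{\textgreater{}}, then the sequence split over one or more lines) can be read with a few lines of code. The following Python sketch is only an illustration and is not part of \emph{OBITools}: the function names \texttt{split\_title} and \texttt{read\_fasta} are hypothetical, and the extended annotations of the title line are not interpreted, they are simply kept in the description.

```python
import io
from typing import Iterator, TextIO, Tuple


def split_title(title: str) -> Tuple[str, str]:
    """Split a title line (without '>') into identifier and description."""
    parts = title.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                yield split_title(title) + ("".join(chunks),)
            title = line[1:]
            chunks = []
        elif line and title is not None:
            # Sequence lines are concatenated, whatever their length.
            chunks.append(line)
    if title is not None:
        yield split_title(title) + ("".join(chunks),)


if __name__ == "__main__":
    example = ">my_sequence this is my pretty sequence\nACGTTGCAGT\nACGTTGCA\n"
    for identifier, description, sequence in read_fasta(io.StringIO(example)):
        print(identifier, len(sequence))
```

Because sequence lines are concatenated until the next title line, the sketch accepts files whatever the line length used to wrap the sequence.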
|
||||
|
||||
\hypertarget{file-extensions}{%
|
||||
\subsubsection{File extensions}\label{file-extensions}}
|
||||
|
||||
There is no standard file extension for a
|
||||
\protect\hyperlink{sec-fasta}{\emph{FASTA}} file, but \texttt{.fa} and
|
||||
\texttt{.fasta} are commonly used.
|
||||
|
||||
\hypertarget{sec-fastq}{%
|
||||
\section[The \emph{fastq} sequence format]{\texorpdfstring{The
|
||||
\emph{fastq} sequence
|
||||
\subsection[The \emph{FASTQ} sequence format]{\texorpdfstring{The
|
||||
\emph{FASTQ} sequence
|
||||
format\footnote{This article uses material from the Wikipedia article
|
||||
\href{http://en.wikipedia.org/wiki/FASTQ_format}{\texttt{FASTQ\ format}}
|
||||
which is released under the
|
||||
\texttt{Creative\ Commons\ Attribution-Share-Alike\ License\ 3.0}}}{The fastq sequence format}}\label{sec-fastq}}
|
||||
\texttt{Creative\ Commons\ Attribution-Share-Alike\ License\ 3.0}}}{The FASTQ sequence format}}\label{sec-fastq}}
|
||||
|
||||
The \textbf{FASTQ} format is a text file format for storing both
|
||||
biological sequences (only nucleic acid sequences) and the associated
|
||||
quality scores. The sequence and score are each encoded by a single
|
||||
ASCII character. This format was originally developed by the Wellcome
|
||||
Trust Sanger Institute to link a
|
||||
\protect\hyperlink{the-fasta-sequence-format}{FASTA} sequence file to
|
||||
the corresponding quality data, but has recently become the de facto
|
||||
The \protect\hyperlink{sec-fastq}{\emph{FASTQ}} format is a text file
|
||||
format for storing both biological sequences (only nucleic acid
|
||||
sequences) and the associated sequence quality scores. Every nucleotide
|
||||
of the sequence and its associated quality score are each encoded by a
|
||||
single ASCII character. This format was originally developed by the
|
||||
Wellcome Trust Sanger Institute to link a
|
||||
\protect\hyperlink{sec-fasta}{\emph{FASTA}} sequence file to the
|
||||
corresponding quality data, but has now become the \emph{de facto}
|
||||
standard for storing results from high-throughput sequencers (Cock et
|
||||
al. 2010).
|
||||
|
||||
A fastq file normally uses four lines per sequence.
|
||||
\emph{OBITools} considers that a
|
||||
\protect\hyperlink{sec-fastq}{\emph{FASTQ}} file uses four lines to
|
||||
encode a sequence record.
|
||||
|
||||
\begin{itemize}
|
||||
\tightlist
|
||||
\item
|
||||
Line 1 begins with a `@' character and is followed by a sequence
|
||||
identifier and an \emph{optional} description (like a
|
||||
:ref:\texttt{fasta} title line).
|
||||
\protect\hyperlink{sec-fasta}{\emph{FASTA}} title line).
|
||||
\item
|
||||
Line 2 is the raw sequence letters.
|
||||
Line 2 is the sequence letters, in upper or lower case, but
|
||||
\emph{OBITools} only write sequences in lower case.
|
||||
\item
|
||||
Line 3 begins with a `+' character and is \emph{optionally} followed
|
||||
by the same sequence identifier (and any description) again.
|
||||
@@ -556,7 +612,7 @@ A fastq file normally uses four lines per sequence.
|
||||
contain the same number of symbols as letters in the sequence.
|
||||
\end{itemize}
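
The four-line structure listed above can be sketched as a small reader. This Python example is an illustration under stated assumptions, not the \emph{OBITools} parser: the function name \texttt{read\_fastq} is hypothetical, records are assumed to use exactly four lines, and the sequence is converted to lower case only to mirror the way \emph{OBITools} write sequences.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality), four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("record does not follow the four-line layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        parts = header[1:].split(None, 1)
        identifier = parts[0] if parts else ""
        yield identifier, sequence.lower(), quality


if __name__ == "__main__":
    example = "@SEQ_ID some description\nGATTTGGGG\n+\n!''*((((*\n"
    for identifier, sequence, quality in read_fastq(io.StringIO(example)):
        print(identifier, sequence, quality)
```

Reading a fixed number of lines per record avoids having to guess whether a line starting with \texttt{@} or \texttt{+} is a marker or a quality string.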
|
||||
|
||||
A fastq file containing a single sequence might look like this:
|
||||
A \protect\hyperlink{sec-fastq}{\emph{FASTQ}} file looks like this:
|
||||
|
||||
\begin{verbatim}
|
||||
@SEQ_ID
|
||||
@@ -575,21 +631,21 @@ characters in left-to-right increasing order of quality
|
||||
^_`abcdefghijklmnopqrstuvwxyz{|}~
|
||||
\end{verbatim}
|
||||
|
||||
The original Sanger FASTQ files also allowed the sequence and quality
|
||||
strings to be wrapped (split over multiple lines), but this is generally
|
||||
discouraged as it can make parsing complicated due to the unfortunate
|
||||
choice of ``@'' and ``+'' as markers (these characters can also occur in
|
||||
the quality string).
|
||||
Although the original Sanger \protect\hyperlink{sec-fastq}{\emph{FASTQ}} files
|
||||
allowed the sequence and quality strings to be wrapped (split over
|
||||
multiple lines), this is not supported by \emph{OBITools} as it makes
|
||||
parsing complicated due to the unfortunate choice of ``@'' and ``+'' as
|
||||
markers (these characters can also occur in the quality string).
|
||||
|
||||
\hypertarget{sequence-quality-scores}{%
|
||||
\subsection*{Sequence quality scores}\label{sequence-quality-scores}}
|
||||
\addcontentsline{toc}{subsection}{Sequence quality scores}
|
||||
\subsubsection*{Sequence quality scores}\label{sequence-quality-scores}}
|
||||
\addcontentsline{toc}{subsubsection}{Sequence quality scores}
|
||||
|
||||
The Phred quality value \emph{Q} is an integer mapping of \emph{p}
|
||||
(i.e., the probability that the corresponding base call is incorrect).
|
||||
Two different equations have been in use. The first is the standard
|
||||
Sanger variant to assess reliability of a base call, otherwise known as
|
||||
Phred quality score:
|
||||
(\emph{i.e.}, the probability that the corresponding base call is
|
||||
incorrect). Two different equations have been in use. The first is the
|
||||
standard Sanger variant to assess reliability of a base call, otherwise
|
||||
known as Phred quality score:
|
||||
|
||||
\[
|
||||
Q_\text{sanger} = -10 \, \log_{10} p
|
||||
@@ -621,12 +677,12 @@ equivalently, \(Q = 13\).}
|
||||
\end{figure}
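
The Sanger relation \(Q_\text{sanger} = -10 \, \log_{10} p\) given above can be checked numerically. The short Python sketch below only covers this Sanger variant; the function names \texttt{phred\_quality} and \texttt{error\_probability} are chosen for the illustration and do not refer to any \emph{OBITools} function.

```python
import math


def phred_quality(p: float) -> float:
    """Sanger Phred quality for an error probability p (0 < p <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def error_probability(q: float) -> float:
    """Inverse relation: error probability for a Sanger Phred quality q."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(p, round(phred_quality(p), 1))
```

Each increase of 10 in the quality value divides the error probability by 10: an error probability of \(0.01\) corresponds to \(Q = 20\), and \(0.001\) to \(Q = 30\).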
|
||||
|
||||
\hypertarget{encoding}{%
|
||||
\subsubsection*{Encoding}\label{encoding}}
|
||||
\addcontentsline{toc}{subsubsection}{Encoding}
|
||||
\paragraph*{Encoding}\label{encoding}}
|
||||
\addcontentsline{toc}{paragraph}{Encoding}
|
||||
|
||||
The \emph{fastq} format had differente way of encoding the Phred quality
|
||||
score along the time. Here a breif history of these changes is
|
||||
presented.
|
||||
The \protect\hyperlink{sec-fastq}{\emph{FASTQ}} format has used different
|
||||
ways of encoding the Phred quality score over time. A brief
|
||||
history of these changes is presented here.
|
||||
|
||||
\begin{itemize}
|
||||
\tightlist
|
||||
@@ -684,15 +740,65 @@ given run.
|
||||
to the use of the Sanger format (Phred+33).
|
||||
\end{itemize}
|
||||
|
||||
OBItools follows the Sanger format. Nevertheless, It is possible to read
|
||||
files encoded following the Solexa/Illumina format by applying a shift
|
||||
of 62 (see the option \textbf{-\/-solexa} of the OBITools commands).
|
||||
\emph{OBITools} follows the Sanger format. Nevertheless, it is possible
|
||||
to read files encoded following the Solexa/Illumina format by applying a
|
||||
shift of 62 (see the option \textbf{--solexa} of the \emph{OBITools}
|
||||
commands).
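
To make the Sanger encoding (Phred+33) concrete, the following Python sketch converts a quality string into integer scores by subtracting an offset from the ASCII code of each character. The function name \texttt{decode\_quality} and its \texttt{offset} parameter are assumptions made for this illustration; they do not describe how the \emph{OBITools} commands or their options are implemented.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string into Phred scores using an ASCII offset."""
    scores = [ord(character) - offset for character in quality]
    if any(score < 0 for score in scores):
        raise ValueError("a character is below the chosen offset")
    return scores


if __name__ == "__main__":
    # '!' has ASCII code 33, so it encodes a score of 0 with Phred+33.
    print(decode_quality("!5I"))
```

With the default offset of 33, the characters \texttt{!}, \texttt{5} and \texttt{I} (ASCII codes 33, 53 and 73) decode to the scores 0, 20 and 40.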
|
||||
|
||||
\hypertarget{file-extension}{%
|
||||
\section{File extension}\label{file-extension}}
|
||||
\hypertarget{file-extensions-1}{%
|
||||
\subsubsection{File extensions}\label{file-extensions-1}}
|
||||
|
||||
There is no standard file extension for a FASTQ file, but .fq and
|
||||
.fastq, are commonly used.
|
||||
There is no standard file extension for a
|
||||
\protect\hyperlink{sec-fastq}{\emph{FASTQ}} file, but \texttt{.fq} and
|
||||
\texttt{.fastq} are commonly used.
|
||||
|
||||
\hypertarget{the-taxonomy-files}{%
|
||||
\section{The taxonomy files}\label{the-taxonomy-files}}
|
||||
|
||||
Many OBITools are able to take into account taxonomic data. This is done
|
||||
by specifying a directory containing all
|
||||
the NCBI taxonomy dump files (\texttt{taxdump}).
|
||||
|
||||
\hypertarget{the-sample-description-file}{%
|
||||
\section{The sample description
|
||||
file}\label{the-sample-description-file}}
|
||||
|
||||
A key file for \emph{OBITools4} is the file describing all samples (PCR)
|
||||
analyzed in the processed sequencing library file. This file, often
|
||||
called the \texttt{ngsfilter} file, is a tab separated values (TSV)
|
||||
file. The format of this file is exactly the same in
|
||||
\emph{OBITools2} and \emph{OBITools4}.
|
||||
|
||||
\begin{verbatim}
|
||||
wolf_diet    13a_F730603      aattaac  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
|
||||
wolf_diet    15a_F730814      gaagtag  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
|
||||
wolf_diet    26a_F040644      gaatatc  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
|
||||
wolf_diet    29a_F260619      gcctcct  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
|
||||
\end{verbatim}
|
||||
|
||||
At least six columns must be present in every line of the file.
|
||||
|
||||
\begin{itemize}
|
||||
\item
|
||||
The first column contains the name of the experiment:
|
||||
|
||||
An experiment name groups a set of samples together. Sequences
|
||||
belonging to the experiment are tagged with an attribute
|
||||
\texttt{experiment} containing the name of the experiment in their
|
||||
annotation.
|
||||
\item
|
||||
The second column contains the sample identifier in the experiment
|
||||
|
||||
The sample identifier must be unique in the experiment. The
|
||||
\texttt{obimultiplex} and \texttt{obitagpcr} commands add to all the
|
||||
sequences belonging to the same sample an attribute \texttt{sample}
|
||||
containing the sample identifier
|
||||
\item
|
||||
The third column contains the description of the tag used to identify
|
||||
sequences corresponding to this sample
|
||||
\item
|
||||
The fourth column contains the forward primer sequence
|
||||
\item
|
||||
The fifth column contains the reverse primer sequence
|
||||
\item
|
||||
The sixth column must always contain the character \texttt{F} (full
|
||||
length)
|
||||
\end{itemize}
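
The six mandatory columns listed above can be checked with a short script before running a demultiplexing command. The Python sketch below is an illustration only: the class \texttt{SampleLine}, its field names and the function \texttt{read\_sample\_description} are hypothetical, and the checks are limited to what is stated above (at least six tab-separated columns, a sample identifier unique within its experiment, and \texttt{F} in the sixth column).

```python
import csv
import io
from typing import List, NamedTuple, TextIO


class SampleLine(NamedTuple):
    experiment: str
    sample: str
    tag: str
    forward_primer: str
    reverse_primer: str
    full_length: str


def read_sample_description(handle: TextIO) -> List[SampleLine]:
    """Read a tab-separated sample description and check its six columns."""
    samples = []
    seen = set()
    for number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not any(field.strip() for field in row):
            continue  # skip empty lines
        if len(row) < 6:
            raise ValueError(f"line {number}: fewer than six columns")
        line = SampleLine(*(field.strip() for field in row[:6]))
        if line.full_length != "F":
            raise ValueError(f"line {number}: sixth column must be 'F'")
        key = (line.experiment, line.sample)
        if key in seen:
            raise ValueError(f"line {number}: duplicated sample identifier")
        seen.add(key)
        samples.append(line)
    return samples


if __name__ == "__main__":
    text = (
        "wolf_diet\t13a_F730603\taattaac\tTTAGATACCCCACTATGC\tTAGAACAGGCTCCTCTAG\tF\n"
        "wolf_diet\t15a_F730814\tgaagtag\tTTAGATACCCCACTATGC\tTAGAACAGGCTCCTCTAG\tF\n"
    )
    for line in read_sample_description(io.StringIO(text)):
        print(line.experiment, line.sample, line.tag)
```

Columns beyond the sixth are ignored by this sketch, since only the first six are described here.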
|
||||
|
||||
\hypertarget{obitools-v4-tutorial}{%
|
||||
\chapter{OBITools V4 Tutorial}\label{obitools-v4-tutorial}}
|
||||
|
||||