Complement to the documentation

Former-commit-id: 89952a6f3bb261a6aaec24430906e635914ffce4
2023-12-04 13:16:34 +01:00
parent eb351a7530
commit 03bef6461d
83 changed files with 65993 additions and 10547 deletions


@@ -213,7 +213,7 @@
\begin{document}
\maketitle
\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[frame hidden, borderline west={3pt}{0pt}{shadecolor}, sharp corners, boxrule=0pt, enhanced, interior hidden, breakable]}{\end{tcolorbox}}\fi
\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[breakable, interior hidden, enhanced, boxrule=0pt, borderline west={3pt}{0pt}{shadecolor}, sharp corners, frame hidden]}{\end{tcolorbox}}\fi
\renewcommand*\contentsname{Table of contents}
{
@@ -368,12 +368,12 @@ take into account the taxonomic annotations, ultimately allowing sorting
and filtering of sequence records based on the taxonomy.
\hypertarget{installation-of-the-obitools}{%
\chapter{Installation of the
obitools}\label{installation-of-the-obitools}}
\chapter{\texorpdfstring{Installation of the
\emph{OBITools}}{Installation of the OBITools}}\label{installation-of-the-obitools}}
\hypertarget{availability-of-the-obitools}{%
\section{Availability of the
OBITools}\label{availability-of-the-obitools}}
\section{\texorpdfstring{Availability of the
\emph{OBITools}}{Availability of the OBITools}}\label{availability-of-the-obitools}}
The \emph{OBITools} are open source and protected by the
\href{http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html}{CeCILL
@@ -389,7 +389,7 @@ downloaded from the metabarcoding git server
The \emph{OBITools4} are developed using the \href{https://go.dev/}{Go
programming language}; we stick to the latest version of the language,
today the \(1.19.5\). If you want to download and compile the sources
today the \(1.21.4\). If you want to download and compile the sources
yourself, you first need to install the corresponding compiler on your
system. Some parts of the software are also written in C; therefore a recent
C compiler is also required: GCC on Linux or Windows, the Developer
@@ -402,6 +402,47 @@ C compiler is available on your system.
\section{Installation with the install
script}\label{installation-with-the-install-script}}
An installation script that compiles the new \emph{OBITools} on your
Unix-like system is available online. The easiest way to run it is to
copy and paste the following command into your terminal
\begin{Shaded}
\begin{Highlighting}[]
\ExtensionTok{curl} \AttributeTok{{-}L}\NormalTok{ https://metabarcoding.org/obitools4/install.sh }\KeywordTok{|} \FunctionTok{bash}
\end{Highlighting}
\end{Shaded}
By default, the script installs the \emph{OBITools} commands and other
associated files into the \texttt{/usr/local} directory. The names of
the commands in the new \emph{OBITools4} are mostly identical to those
in \emph{OBITools2}. Therefore, installing the new \emph{OBITools} may
hide or delete the old ones. If you want both versions to be available
on your system, the installation script offers two options:
\begin{quote}
\texttt{-i}, \texttt{-\/-install-dir}: directory where the \emph{OBITools}
are installed (for example, use \texttt{/usr/local}, not
\texttt{/usr/local/bin}).

\texttt{-p}, \texttt{-\/-obitools-prefix}: prefix added to the
\emph{OBITools} command names if you want to have several versions of
the \emph{OBITools} on your system at the same time (for example,
\texttt{-p\ g} will produce the \texttt{gobigrep} command instead of
\texttt{obigrep}).
\end{quote}
You can pass these options to the installation script as follows:
\begin{Shaded}
\begin{Highlighting}[]
\ExtensionTok{curl} \AttributeTok{{-}L}\NormalTok{ https://metabarcoding.org/obitools4/install.sh }\KeywordTok{|} \DataTypeTok{\textbackslash{}}
\FunctionTok{bash} \AttributeTok{{-}s} \AttributeTok{{-}{-}} \AttributeTok{{-}{-}install{-}dir}\NormalTok{ test\_install }\AttributeTok{{-}{-}obitools{-}prefix}\NormalTok{ k}
\end{Highlighting}
\end{Shaded}
In this case, the binaries will be installed in the
\texttt{test\_install} directory and all command names will be prefixed
with the letter \texttt{k}. Thus \texttt{obigrep} will be named
\texttt{kobigrep}.
\hypertarget{compilation-from-sources}{%
\section{Compilation from sources}\label{compilation-from-sources}}
@@ -409,25 +450,27 @@ script}\label{installation-with-the-install-script}}
\chapter{\texorpdfstring{File formats usable with
\emph{OBITools}}{File formats usable with OBITools}}\label{file-formats-usable-with-obitools}}
OBITools manipulate have to manipulate DNA sequence data and taxonomical
data. They can use some supplentary metadata describing the experiment
and produce some stats about the processed DNA data. All the manipulated
data are stored in text files, following standard data format.
\emph{OBITools} have to manipulate DNA sequence data and taxonomic
data. They can use some supplementary metadata describing the
experiment and produce some statistics about the processed DNA data. All
the manipulated data are stored in text files, following standard data
formats.
\hypertarget{the-dna-sequence-data}{%
\chapter{The DNA sequence data}\label{the-dna-sequence-data}}
\section{The DNA sequence data}\label{the-dna-sequence-data}}
Sequences can be stored following various format. OBITools knows some of
them. The central formats for sequence files manipulated by OBITools
scripts are the
\protect\hyperlink{the-fasta-sequence-format}{\texttt{fasta}} and
\protect\hyperlink{the-fastq-sequence-format}{\texttt{fastq}} format.
OBITools extends the both these formats by specifying a syntax to
include in the definition line data qualifying the sequence. All file
formats use the \texttt{IUPAC} code for encoding nucleotides.
Sequences can be stored in various formats, and \emph{OBITools} know
some of them. The central formats for sequence files manipulated by
\emph{OBITools} scripts are the
\protect\hyperlink{sec-fasta}{\emph{FASTA}} and
\protect\hyperlink{sec-fastq}{\emph{FASTQ}} formats. \emph{OBITools}
extend both these formats by specifying a syntax to include data
qualifying the sequence in the definition line. All file formats use the
\protect\hyperlink{sec-iupac}{\texttt{IUPAC}} code for encoding
nucleotides.
In addition to these two formats, which can be used as input and output formats,
\textbf{OBITools4} can read the following format :
\emph{OBITools4} can read the following formats:
\begin{itemize}
\tightlist
@@ -442,11 +485,11 @@ Moreover these two formats that can be used as input and output formats,
output files}
\end{itemize}
\hypertarget{the-iupac-code}{%
\section{The IUPAC Code}\label{the-iupac-code}}
\hypertarget{sec-iupac}{%
\subsection{The IUPAC Code}\label{sec-iupac}}
The International Union of Pure and Applied Chemistry (IUPAC\_) defined
the standard code for representing protein or DNA sequences.
The International Union of Pure and Applied Chemistry (\href{}{IUPAC})
defined the standard code for representing protein or DNA sequences.
\begin{longtable}[]{@{}ll@{}}
\toprule()
@@ -473,22 +516,24 @@ N & Any base (A, C, G, T, or U) \\
\end{longtable}
\hypertarget{sec-fasta}{%
\section{\texorpdfstring{The \emph{fasta} sequence
format}{The fasta sequence format}}\label{sec-fasta}}
\subsection{\texorpdfstring{The \emph{FASTA} sequence
format}{The FASTA sequence format}}\label{sec-fasta}}
The \textbf{fasta format} is certainly the most widely used sequence
file format. This is certainly due to its great simplicity. It was
originally created for the Lipman and Pearson
\href{http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation}{FASTA
program}. OBITools use in more of the classical \texttt{fasta} format an
The \protect\hyperlink{sec-fasta}{\emph{FASTA}} format is certainly the
most widely used sequence file format, probably because of its great
simplicity. It was originally created for the Lipman and Pearson
\href{http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation}{\texttt{FASTA}
program}. In addition to the classical
\protect\hyperlink{sec-fasta}{\emph{FASTA}} format, \emph{OBITools} use an
\texttt{extended\ version} of this format in which structured data are
included in the title line.
In \emph{fasta} format a sequence is represented by a title line
beginning with a \textbf{\texttt{\textgreater{}}} character and the
sequences by itself following the :doc:\texttt{iupac} code. The sequence
is usually split other severals lines of the same length (expect for the
last one)
In \protect\hyperlink{sec-fasta}{\emph{FASTA}} format, a sequence is
represented by a title line beginning with a
\textbf{\texttt{\textgreater{}}} character, followed by the sequence
itself, written using the \protect\hyperlink{sec-iupac}{\texttt{IUPAC}}
code. The sequence is usually split over several lines of the same length
(except for the last one).
\begin{verbatim}
>my_sequence this is my pretty sequence
@@ -520,34 +565,45 @@ GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
AACGACGTTGCAGTACGTTGCAGT
\end{verbatim}
\hypertarget{file-extensions}{%
\subsubsection{File extensions}\label{file-extensions}}
There is no standard file extension for a
\protect\hyperlink{sec-fasta}{\emph{FASTA}} file, but \texttt{.fa} and
\texttt{.fasta} are commonly used.
\hypertarget{sec-fastq}{%
\section[The \emph{fastq} sequence format]{\texorpdfstring{The
\emph{fastq} sequence
\subsection[The \emph{FASTQ} sequence format]{\texorpdfstring{The
\emph{FASTQ} sequence
format\footnote{This article uses material from the Wikipedia article
\href{http://en.wikipedia.org/wiki/FASTQ_format}{\texttt{FASTQ\ format}}
which is released under the
\texttt{Creative\ Commons\ Attribution-Share-Alike\ License\ 3.0}}}{The fastq sequence format}}\label{sec-fastq}}
\texttt{Creative\ Commons\ Attribution-Share-Alike\ License\ 3.0}}}{The FASTQ sequence format}}\label{sec-fastq}}
The \textbf{FASTQ} format is a text file format for storing both
biological sequences (only nucleic acid sequences) and the associated
quality scores. The sequence and score are each encoded by a single
ASCII character. This format was originally developed by the Wellcome
Trust Sanger Institute to link a
\protect\hyperlink{the-fasta-sequence-format}{FASTA} sequence file to
the corresponding quality data, but has recently become the de facto
The \protect\hyperlink{sec-fastq}{\emph{FASTQ}} format is a text file
format for storing both biological sequences (only nucleic acid
sequences) and the associated sequence quality scores. Every nucleotide
of the sequence and its associated quality score are each encoded by a
single ASCII character. This format was originally developed by the
Wellcome Trust Sanger Institute to link a
\protect\hyperlink{sec-fasta}{\emph{FASTA}} sequence file to the
corresponding quality data, but it has now become the \emph{de facto}
standard for storing results from high-throughput sequencers (Cock et
al. 2010).
A fastq file normally uses four lines per sequence.
\emph{OBITools} consider that a
\protect\hyperlink{sec-fastq}{\emph{FASTQ}} file uses four lines to
encode a sequence record.
\begin{itemize}
\tightlist
\item
Line 1 begins with a `@' character and is followed by a sequence
identifier and an \emph{optional} description (like a
:ref:\texttt{fasta} title line).
\protect\hyperlink{sec-fasta}{\emph{FASTA}} title line).
\item
Line 2 is the raw sequence letters.
Line 2 contains the sequence letters, in upper or lower case;
\emph{OBITools} only write sequences in lower case.
\item
Line 3 begins with a `+' character and is \emph{optionally} followed
by the same sequence identifier (and any description) again.
@@ -556,7 +612,7 @@ A fastq file normally uses four lines per sequence.
contain the same number of symbols as letters in the sequence.
\end{itemize}
A fastq file containing a single sequence might look like this:
A \protect\hyperlink{sec-fastq}{\emph{FASTQ}} file looks like this:
\begin{verbatim}
@SEQ_ID
@@ -575,21 +631,21 @@ characters in left-to-right increasing order of quality
^_`abcdefghijklmnopqrstuvwxyz{|}~
\end{verbatim}
The original Sanger FASTQ files also allowed the sequence and quality
strings to be wrapped (split over multiple lines), but this is generally
discouraged as it can make parsing complicated due to the unfortunate
choice of ``@'' and ``+'' as markers (these characters can also occur in
the quality string).
Although the original Sanger \protect\hyperlink{sec-fastq}{\emph{FASTQ}}
files also allowed the sequence and quality strings to be wrapped (split
over multiple lines), this is not supported by \emph{OBITools}, as it makes
parsing complicated due to the unfortunate choice of ``@'' and ``+'' as
markers (these characters can also occur in the quality string).
\hypertarget{sequence-quality-scores}{%
\subsection*{Sequence quality scores}\label{sequence-quality-scores}}
\addcontentsline{toc}{subsection}{Sequence quality scores}
\subsubsection*{Sequence quality scores}\label{sequence-quality-scores}}
\addcontentsline{toc}{subsubsection}{Sequence quality scores}
The Phred quality value \emph{Q} is an integer mapping of \emph{p}
(i.e., the probability that the corresponding base call is incorrect).
Two different equations have been in use. The first is the standard
Sanger variant to assess reliability of a base call, otherwise known as
Phred quality score:
(\emph{i.e.}, the probability that the corresponding base call is
incorrect). Two different equations have been in use. The first is the
standard Sanger variant to assess reliability of a base call, otherwise
known as Phred quality score:
\[
Q_\text{sanger} = -10 \, \log_{10} p
@@ -621,12 +677,12 @@ equivalently, \(Q = 13\).}
\end{figure}
\hypertarget{encoding}{%
\subsubsection*{Encoding}\label{encoding}}
\addcontentsline{toc}{subsubsection}{Encoding}
\paragraph*{Encoding}\label{encoding}}
\addcontentsline{toc}{paragraph}{Encoding}
The \emph{fastq} format had differente way of encoding the Phred quality
score along the time. Here a breif history of these changes is
presented.
The \protect\hyperlink{sec-fastq}{\emph{FASTQ}} format has used different
ways of encoding the Phred quality score over time. A brief history of
these changes is presented here.
\begin{itemize}
\tightlist
@@ -684,15 +740,65 @@ given run.
to the use of the Sanger format (Phred+33).
\end{itemize}
OBItools follows the Sanger format. Nevertheless, It is possible to read
files encoded following the Solexa/Illumina format by applying a shift
of 62 (see the option \textbf{-\/-solexa} of the OBITools commands).
\emph{OBITools} follow the Sanger format. Nevertheless, it is possible
to read files encoded following the Solexa/Illumina format by applying a
shift of 62 (see the option \textbf{-\/-solexa} of the \emph{OBITools}
commands).
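
As an illustrative worked example (the quality character \texttt{I} is
chosen arbitrarily for this sketch and is not taken from the example file
above), under the Sanger encoding (Phred+33) the quality value is obtained
by subtracting 33 from the ASCII code of the character, and the error
probability follows by inverting the Phred equation:

\[
Q = \text{ASCII}(\texttt{I}) - 33 = 73 - 33 = 40,
\qquad
p = 10^{-Q/10} = 10^{-4}
\]

In other words, a base reported with the quality character \texttt{I} is
expected to be wrong about once in 10,000 calls.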
\hypertarget{file-extension}{%
\section{File extension}\label{file-extension}}
\hypertarget{file-extensions-1}{%
\subsubsection{File extensions}\label{file-extensions-1}}
There is no standard file extension for a FASTQ file, but .fq and
.fastq, are commonly used.
There is no standard file extension for a
\protect\hyperlink{sec-fastq}{\emph{FASTQ}} file, but \texttt{.fq} and
\texttt{.fastq} are commonly used.
\hypertarget{the-taxonomy-files}{%
\section{The taxonomy files}\label{the-taxonomy-files}}
Many \emph{OBITools} are able to take taxonomic data into account. This is
done by specifying a directory containing all the NCBI taxonomy dump files
(\texttt{taxdump}).
\hypertarget{the-sample-description-file}{%
\section{The sample description
file}\label{the-sample-description-file}}
A key file for \emph{OBITools4} is the file describing all samples (PCRs)
analyzed in the processed sequencing library file. This file, often
called the \texttt{ngsfilter} file, is a tab-separated values (TSV)
file. The format of this file is identical in \emph{OBITools2} and
\emph{OBITools4}.
\begin{verbatim}
wolf_diet    13a_F730603      aattaac  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
wolf_diet    15a_F730814      gaagtag  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
wolf_diet    26a_F040644      gaatatc  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
wolf_diet    29a_F260619      gcctcct  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
\end{verbatim}
At least six columns must be present in every line of the file.
\begin{itemize}
\item
The first column contains the name of the experiment.
An experiment name groups a set of samples together. Sequences
belonging to the experiment are tagged with an attribute
\texttt{experiment} containing the name of the experiment in their
annotation.
\item
The second column contains the sample identifier in the experiment.
The sample identifier must be unique in the experiment. The
\texttt{obimultiplex} and \texttt{obitagpcr} commands add to all the
sequences belonging to the same sample an attribute \texttt{sample}
containing the sample identifier.
\item
The third column contains the description of the tag used to identify
sequences corresponding to this sample.
\item
The fourth column contains the forward primer sequence
\item
The fifth column contains the reverse primer sequence
\item
The sixth column must always contain the character \texttt{F} (full
length)
\end{itemize}
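
As an illustration, the first line of the example file shown earlier in
this section can be read against the six columns as follows. The labels in
the middle column are descriptive only; they are an assumption made for
this sketch and are not header names expected by the \emph{OBITools}.

\begin{center}
\begin{tabular}{lll}
\hline
Column & Content & Value in the example \\
\hline
1 & experiment name & \texttt{wolf\_diet} \\
2 & sample identifier & \texttt{13a\_F730603} \\
3 & tag & \texttt{aattaac} \\
4 & forward primer & \texttt{TTAGATACCCCACTATGC} \\
5 & reverse primer & \texttt{TAGAACAGGCTCCTCTAG} \\
6 & full-length flag & \texttt{F} \\
\hline
\end{tabular}
\end{center}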
\hypertarget{obitools-v4-tutorial}{%
\chapter{OBITools V4 Tutorial}\label{obitools-v4-tutorial}}