Complement to the documentation

Former-commit-id: 89952a6f3bb261a6aaec24430906e635914ffce4
2026-02-03 14:50:33 +00:00 · 2023-12-04 13:16:34 +01:00
parent eb351a7530
commit 03bef6461d
83 changed files with 65993 additions and 10547 deletions
--- a/doc/book/OBITools-V4.tex
+++ b/doc/book/OBITools-V4.tex
@@ -213,7 +213,7 @@

 \begin{document}
 \maketitle
-\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[frame hidden, borderline west={3pt}{0pt}{shadecolor}, sharp corners, boxrule=0pt, enhanced, interior hidden, breakable]}{\end{tcolorbox}}\fi
+\ifdefined\Shaded\renewenvironment{Shaded}{\begin{tcolorbox}[breakable, interior hidden, enhanced, boxrule=0pt, borderline west={3pt}{0pt}{shadecolor}, sharp corners, frame hidden]}{\end{tcolorbox}}\fi

 \renewcommand*\contentsname{Table of contents}
 {
@@ -368,12 +368,12 @@ take into account the taxonomic annotations, ultimately allowing sorting
 and filtering of sequence records based on the taxonomy.

 \hypertarget{installation-of-the-obitools}{%
-\chapter{Installation of the
-obitools}\label{installation-of-the-obitools}}
+\chapter{\texorpdfstring{Installation of the
+\emph{OBITools}}{Installation of the OBITools}}\label{installation-of-the-obitools}}

 \hypertarget{availability-of-the-obitools}{%
-\section{Availability of the
-OBITools}\label{availability-of-the-obitools}}
+\section{\texorpdfstring{Availability of the
+\emph{OBITools}}{Availability of the OBITools}}\label{availability-of-the-obitools}}

 The \emph{OBITools} are open source and protected by the
 \href{http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html}{CeCILL
@@ -389,7 +389,7 @@ downloaded from the metabarcoding git server

 The \emph{OBITools4} are developed using the \href{https://go.dev/}{Go
 programming language}. We stick to the latest version of the language,
-today the \(1.19.5\). If you want to download and compile the sources
+currently \(1.21.4\). If you want to download and compile the sources
 yourself, you first need to install the corresponding compiler on your
 system. Some parts of the software are also written in C; therefore a recent
 C compiler is also required: GCC on Linux or Windows, the Developer
@@ -402,6 +402,47 @@ C compiler is available on your system.
 \section{Installation with the install
 script}\label{installation-with-the-install-script}}

+An installation script that compiles the new \emph{OBITools} on your
+Unix-like system is available online. The easiest way to run it is to
+copy and paste the following command into your terminal:
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\ExtensionTok{curl} \AttributeTok{{-}L}\NormalTok{ https://metabarcoding.org/obitools4/install.sh }\KeywordTok{|} \FunctionTok{bash}
+\end{Highlighting}
+\end{Shaded}
+
+By default, the script installs the \emph{OBITools} commands and other
+associated files into the \texttt{/usr/local} directory. The names of
+the commands in the new \emph{OBITools4} are mostly identical to those
+in \emph{OBITools2}. Therefore, installing the new \emph{OBITools} may
+hide or delete the old ones. If you want both versions to be available
+on your system, the installation script offers two options:
+
+\begin{quote}
+\texttt{-i}, \texttt{-\/-install-dir}: directory where the \emph{OBITools} are installed (for
+example, use \texttt{/usr/local}, not \texttt{/usr/local/bin}).
+
+\texttt{-p}, \texttt{-\/-obitools-prefix}: prefix added to the \emph{OBITools} command names
+if you want to have several versions of the \emph{OBITools} at the same time on
+your system (for example, \texttt{-p\ g} will produce the \texttt{gobigrep}
+command instead of \texttt{obigrep}).
+\end{quote}
+
+You can pass these options to the installation script as follows:
+
+\begin{Shaded}
+\begin{Highlighting}[]
+\ExtensionTok{curl} \AttributeTok{{-}L}\NormalTok{ https://metabarcoding.org/obitools4/install.sh }\KeywordTok{|} \DataTypeTok{\textbackslash{}}
+      \FunctionTok{bash} \AttributeTok{{-}s} \AttributeTok{{-}{-}} \AttributeTok{{-}{-}install{-}dir}\NormalTok{ test\_install }\AttributeTok{{-}{-}obitools{-}prefix}\NormalTok{ k}
+\end{Highlighting}
+\end{Shaded}
+
+In this case, the binaries will be installed in the
+\texttt{test\_install} directory and all command names will be prefixed
+with the letter \texttt{k}. Thus \texttt{obigrep} will be named
+\texttt{kobigrep}.
+
 \hypertarget{compilation-from-sources}{%
 \section{Compilation from sources}\label{compilation-from-sources}}

@@ -409,25 +450,27 @@ script}\label{installation-with-the-install-script}}
 \chapter{\texorpdfstring{File formats usable with
 \emph{OBITools}}{File formats usable with OBITools}}\label{file-formats-usable-with-obitools}}

-OBITools manipulate have to manipulate DNA sequence data and taxonomical
-data. They can use some supplentary metadata describing the experiment
-and produce some stats about the processed DNA data. All the manipulated
-data are stored in text files, following standard data format.
+\emph{OBITools} have to manipulate DNA sequence data and taxonomic
+data. They can use supplementary metadata describing the experiment
+and produce statistics about the processed DNA data. All the
+manipulated data are stored in text files, following standard data
+formats.

 \hypertarget{the-dna-sequence-data}{%
-\chapter{The DNA sequence data}\label{the-dna-sequence-data}}
+\section{The DNA sequence data}\label{the-dna-sequence-data}}

-Sequences can be stored following various format. OBITools knows some of
-them. The central formats for sequence files manipulated by OBITools
-scripts are the
-\protect\hyperlink{the-fasta-sequence-format}{\texttt{fasta}} and
-\protect\hyperlink{the-fastq-sequence-format}{\texttt{fastq}} format.
-OBITools extends the both these formats by specifying a syntax to
-include in the definition line data qualifying the sequence. All file
-formats use the \texttt{IUPAC} code for encoding nucleotides.
+Sequences can be stored in various formats, and \emph{OBITools} know
+some of them. The central formats for sequence files manipulated by
+\emph{OBITools} scripts are the
+\protect\hyperlink{sec-fasta}{\emph{FASTA}} and
+\protect\hyperlink{sec-fastq}{\emph{FASTQ}} formats. \emph{OBITools}
+extend both of these formats by specifying a syntax to include, in the
+definition line, data qualifying the sequence. All file formats use the
+\protect\hyperlink{sec-iupac}{\texttt{IUPAC}} code for encoding
+nucleotides.

 In addition to these two formats, which can be used as input and output formats,
-\textbf{OBITools4} can read the following format :
+\emph{OBITools4} can read the following formats:

 \begin{itemize}
 \tightlist
@@ -442,11 +485,11 @@ Moreover these two formats that can be used as input and output formats,
  output files}
 \end{itemize}

-\hypertarget{the-iupac-code}{%
-\section{The IUPAC Code}\label{the-iupac-code}}
+\hypertarget{sec-iupac}{%
+\subsection{The IUPAC Code}\label{sec-iupac}}

-The International Union of Pure and Applied Chemistry (IUPAC\_) defined
-the standard code for representing protein or DNA sequences.
+The International Union of Pure and Applied Chemistry (IUPAC)
+defined the standard code for representing protein or DNA sequences.

 \begin{longtable}[]{@{}ll@{}}
 \toprule()
@@ -473,22 +516,24 @@ N & Any base (A, C, G, T, or U) \\
 \end{longtable}

 \hypertarget{sec-fasta}{%
-\section{\texorpdfstring{The \emph{fasta} sequence
-format}{The fasta sequence format}}\label{sec-fasta}}
+\subsection{\texorpdfstring{The \emph{FASTA} sequence
+format}{The FASTA sequence format}}\label{sec-fasta}}

-The \textbf{fasta format} is certainly the most widely used sequence
-file format. This is certainly due to its great simplicity. It was
-originally created for the Lipman and Pearson
-\href{http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation}{FASTA
-program}. OBITools use in more of the classical \texttt{fasta} format an
+The \protect\hyperlink{sec-fasta}{\emph{FASTA}} format is certainly the
+most widely used sequence file format, most likely because of its
+great simplicity. It was originally created for the Lipman and Pearson
+\href{http://www.ncbi.nlm.nih.gov/pubmed/3162770?dopt=Citation}{\texttt{FASTA}
+program}. In addition to the classical
+\protect\hyperlink{sec-fasta}{\emph{FASTA}} format, \emph{OBITools} use an
 \texttt{extended\ version} of this format where structured data are
 included in the title line.

-In \emph{fasta} format a sequence is represented by a title line
-beginning with a \textbf{\texttt{\textgreater{}}} character and the
-sequences by itself following the :doc:\texttt{iupac} code. The sequence
-is usually split other severals lines of the same length (expect for the
-last one)
+In \protect\hyperlink{sec-fasta}{\emph{FASTA}} format, a sequence is
+represented by a title line beginning with a
+\textbf{\texttt{\textgreater{}}} character, followed by the sequence itself
+written with the \protect\hyperlink{sec-iupac}{\texttt{IUPAC}} code. The
+sequence is usually split over several lines of the same length
+(except for the last one):

 \begin{verbatim}
 >my_sequence this is my pretty sequence
@@ -520,34 +565,45 @@ GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
 AACGACGTTGCAGTACGTTGCAGT
 \end{verbatim}

+\hypertarget{file-extensions}{%
+\subsubsection{File extensions}\label{file-extensions}}
+
+There is no standard file extension for a
+\protect\hyperlink{sec-fasta}{\emph{FASTA}} file, but \texttt{.fa} and
+\texttt{.fasta} are commonly used.
+
 \hypertarget{sec-fastq}{%
-\section[The \emph{fastq} sequence format]{\texorpdfstring{The
-\emph{fastq} sequence
+\subsection[The \emph{FASTQ} sequence format]{\texorpdfstring{The
+\emph{FASTQ} sequence
 format\footnote{This article uses material from the Wikipedia article
  \href{http://en.wikipedia.org/wiki/FASTQ_format}{\texttt{FASTQ\ format}}
  which is released under the
-  \texttt{Creative\ Commons\ Attribution-Share-Alike\ License\ 3.0}}}{The fastq sequence format}}\label{sec-fastq}}
+  \texttt{Creative\ Commons\ Attribution-Share-Alike\ License\ 3.0}}}{The FASTQ sequence format}}\label{sec-fastq}}

-The \textbf{FASTQ} format is a text file format for storing both
-biological sequences (only nucleic acid sequences) and the associated
-quality scores. The sequence and score are each encoded by a single
-ASCII character. This format was originally developed by the Wellcome
-Trust Sanger Institute to link a
-\protect\hyperlink{the-fasta-sequence-format}{FASTA} sequence file to
-the corresponding quality data, but has recently become the de facto
+The \protect\hyperlink{sec-fastq}{\emph{FASTQ}} format is a text file
+format for storing both biological sequences (only nucleic acid
+sequences) and the associated sequence quality scores. Each nucleotide
+of the sequence and its associated quality score are each encoded by a
+single ASCII character. This format was originally developed by the
+Wellcome Trust Sanger Institute to link a
+\protect\hyperlink{sec-fasta}{\emph{FASTA}} sequence file to the
+corresponding quality data, but has now become the \emph{de facto}
 standard for storing results from high-throughput sequencers (Cock et
 al. 2010).

-A fastq file normally uses four lines per sequence.
+\emph{OBITools} expect a
+\protect\hyperlink{sec-fastq}{\emph{FASTQ}} file to use four lines to
+encode each sequence record.

 \begin{itemize}
 \tightlist
 \item
  Line 1 begins with a `@' character and is followed by a sequence
  identifier and an \emph{optional} description (like a
-  :ref:\texttt{fasta} title line).
+  \protect\hyperlink{sec-fasta}{\emph{FASTA}} title line).
 \item
-  Line 2 is the raw sequence letters.
+  Line 2 contains the sequence letters, in upper or lower case;
+  \emph{OBITools} only write sequences in lower case.
 \item
  Line 3 begins with a `+' character and is \emph{optionally} followed
  by the same sequence identifier (and any description) again.
@@ -556,7 +612,7 @@ A fastq file normally uses four lines per sequence.
  contain the same number of symbols as letters in the sequence.
 \end{itemize}

-A fastq file containing a single sequence might look like this:
+A \protect\hyperlink{sec-fastq}{\emph{FASTQ}} file looks like this:

 \begin{verbatim}
@SEQ_ID
@@ -575,21 +631,21 @@ characters in left-to-right increasing order of quality
 ^_`abcdefghijklmnopqrstuvwxyz{|}~
 \end{verbatim}

-The original Sanger FASTQ files also allowed the sequence and quality
-strings to be wrapped (split over multiple lines), but this is generally
-discouraged as it can make parsing complicated due to the unfortunate
-choice of ``@'' and ``+'' as markers (these characters can also occur in
-the quality string).
+Although the original Sanger \protect\hyperlink{sec-fastq}{\emph{FASTQ}} files
+also allowed the sequence and quality strings to be wrapped (split over
+multiple lines), this is not supported by \emph{OBITools}, as it makes
+parsing complicated due to the unfortunate choice of ``@'' and ``+'' as
+markers (these characters can also occur in the quality string).

 \hypertarget{sequence-quality-scores}{%
-\subsection*{Sequence quality scores}\label{sequence-quality-scores}}
-\addcontentsline{toc}{subsection}{Sequence quality scores}
+\subsubsection*{Sequence quality scores}\label{sequence-quality-scores}}
+\addcontentsline{toc}{subsubsection}{Sequence quality scores}

 The Phred quality value \emph{Q} is an integer mapping of \emph{p}
-(i.e., the probability that the corresponding base call is incorrect).
-Two different equations have been in use. The first is the standard
-Sanger variant to assess reliability of a base call, otherwise known as
-Phred quality score:
+(\emph{i.e.}, the probability that the corresponding base call is
+incorrect). Two different equations have been in use. The first is the
+standard Sanger variant to assess reliability of a base call, otherwise
+known as Phred quality score:

 \[
 Q_\text{sanger} = -10 \, \log_{10} p
@@ -621,12 +677,12 @@ equivalently, \(Q = 13\).}
 \end{figure}

 \hypertarget{encoding}{%
-\subsubsection*{Encoding}\label{encoding}}
-\addcontentsline{toc}{subsubsection}{Encoding}
+\paragraph*{Encoding}\label{encoding}}
+\addcontentsline{toc}{paragraph}{Encoding}

-The \emph{fastq} format had differente way of encoding the Phred quality
-score along the time. Here a breif history of these changes is
-presented.
+The \protect\hyperlink{sec-fastq}{\emph{FASTQ}} format has used different
+ways of encoding the Phred quality score over time. A brief
+history of these changes is presented here.

 \begin{itemize}
 \tightlist
@@ -684,15 +740,65 @@ given run.
  to the use of the Sanger format (Phred+33).
 \end{itemize}

-OBItools follows the Sanger format. Nevertheless, It is possible to read
-files encoded following the Solexa/Illumina format by applying a shift
-of 62 (see the option \textbf{-\/-solexa} of the OBITools commands).
+\emph{OBITools} follow the Sanger format. Nevertheless, it is possible
+to read files encoded following the Solexa/Illumina format by applying a
+shift of 62 (see the option \textbf{-\/-solexa} of the \emph{OBITools}
+commands).
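+
+As a worked illustration of the Sanger (Phred+33) encoding, assuming a
+quality character \texttt{I} (ASCII code 73, chosen here only as an
+example), the quality score and the corresponding error probability are
+recovered as:
+
+\[
+Q_\text{sanger} = 73 - 33 = 40, \qquad p = 10^{-Q_\text{sanger}/10} = 10^{-4}
+\]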

-\hypertarget{file-extension}{%
-\section{File extension}\label{file-extension}}
+\hypertarget{file-extensions-1}{%
+\subsubsection{File extensions}\label{file-extensions-1}}

-There is no standard file extension for a FASTQ file, but .fq and
-.fastq, are commonly used.
+There is no standard file extension for a
+\protect\hyperlink{sec-fastq}{\emph{FASTQ}} file, but \texttt{.fq} and
+\texttt{.fastq} are commonly used.
+
+\hypertarget{the-taxonomy-files}{%
+\section{The taxonomy files}\label{the-taxonomy-files}}
+
+Many \emph{OBITools} are able to take into account taxonomic data. This is done
+by specifying a directory containing all the
+NCBI taxonomy dump files.
+
+\hypertarget{the-sample-description-file}{%
+\section{The sample description
+file}\label{the-sample-description-file}}
+
+A key file for \emph{OBITools4} is the file describing all samples (PCRs)
+analyzed in the processed sequencing library file. This file, often
+called the \texttt{ngsfilter} file, is a tab-separated values (TSV)
+file. The format of this file is the same in
+\emph{OBITools2} and \emph{OBITools4}.
+
+\begin{verbatim}
+wolf_diet    13a_F730603      aattaac  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
+wolf_diet    15a_F730814      gaagtag  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
+wolf_diet    26a_F040644      gaatatc  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
+wolf_diet    29a_F260619      gcctcct  TTAGATACCCCACTATGC    TAGAACAGGCTCCTCTAG     F
+\end{verbatim}
+
+At least six columns must be present in every line of the file.
+
+\begin{itemize}
+\item
+  The first column contains the name of the experiment:
+
+  An experiment name groups a set of samples together. Sequences
+  belonging to the experiment are tagged with an attribute
+  \texttt{experiment} containing the name of the experiment in their
+  annotation.
+\item
+  The second column contains the sample identifier in the experiment:
+
+  The sample identifier must be unique in the experiment. The
+  \texttt{obimultiplex} and \texttt{obitagpcr} commands add to all the
+  sequences belonging to the same sample an attribute \texttt{sample}
+  containing the sample identifier.
+\item
+  The third column contains the description of the tag used to identify
+  sequences corresponding to this sample
+\item
+  The fourth column contains the forward primer sequence
+\item
+  The fifth column contains the reverse primer sequence
+\item
+  The sixth column must always contain the character \texttt{F} (full
+  length)
+\end{itemize}

 \hypertarget{obitools-v4-tutorial}{%
 \chapter{OBITools V4 Tutorial}\label{obitools-v4-tutorial}}