Complement on the doc

This commit is contained in:
2023-01-27 10:49:28 +01:00
parent cfddc78161
commit 39b47a32bf
12 changed files with 485 additions and 29 deletions

View File

@ -105,7 +105,7 @@ ul.task-list li input[type="checkbox"] {
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></a>
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></a>
</div>
</li>
<li class="sidebar-item">

View File

@ -7,7 +7,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title>OBITools V4 - 2&nbsp; The OBITools commands</title>
<title>OBITools V4 - 2&nbsp; The OBITools V4 commands</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
@ -20,6 +20,25 @@ ul.task-list li input[type="checkbox"] {
margin: 0 0.8em 0.2em -1.6em;
vertical-align: middle;
}
div.csl-bib-body { }
div.csl-entry {
clear: both;
}
.hanging div.csl-entry {
margin-left:2em;
text-indent:-2em;
}
div.csl-left-margin {
min-width:2em;
float:left;
}
div.csl-right-inline {
margin-left:2em;
padding-left:1em;
}
div.csl-indent {
margin-left: 2em;
}
</style>
@ -61,6 +80,7 @@ ul.task-list li input[type="checkbox"] {
}
}</script>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script>
</head>
@ -70,7 +90,7 @@ ul.task-list li input[type="checkbox"] {
<header id="quarto-header" class="headroom fixed-top">
<nav class="quarto-secondary-nav" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }">
<div class="container-fluid d-flex justify-content-between">
<h1 class="quarto-secondary-nav-title"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></h1>
<h1 class="quarto-secondary-nav-title"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></h1>
<button type="button" class="quarto-btn-toggle btn" aria-label="Show secondary navigation">
<i class="bi bi-chevron-right"></i>
</button>
@ -105,7 +125,7 @@ ul.task-list li input[type="checkbox"] {
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="./commands.html" class="sidebar-item-text sidebar-link active"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></a>
<a href="./commands.html" class="sidebar-item-text sidebar-link active"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></a>
</div>
</li>
<li class="sidebar-item">
@ -164,7 +184,7 @@ ul.task-list li input[type="checkbox"] {
<header id="title-block-header" class="quarto-title-block default">
<div class="quarto-title">
<h1 class="title d-none d-lg-block"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></h1>
<h1 class="title d-none d-lg-block"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></h1>
</div>
@ -296,20 +316,93 @@ ul.task-list li input[type="checkbox"] {
<blockquote class="blockquote">
<p>Replace the <code>illuminapairedends</code> original <em>OBITools</em></p>
</blockquote>
<section id="obimultiplex" class="level4" data-number="2.7.1.1">
<h4 data-number="2.7.1.1" class="anchored" data-anchor-id="obimultiplex"><span class="header-section-number">2.7.1.1</span> <code>obimultiplex</code></h4>
<section id="alignment-procedure" class="level4" data-number="2.7.1.1">
<h4 data-number="2.7.1.1" class="anchored" data-anchor-id="alignment-procedure"><span class="header-section-number">2.7.1.1</span> Alignment procedure</h4>
<p><code>obipairing</code> introduces a new alignment algorithm compared to the <code>illuminapairedend</code> command of the <code>OBITools V2</code>. Nevertheless, this new algorithm has been designed to produce the same results as the previous one, except in very few cases.</p>
<p>The new algorithm is a two-step procedure. First, a FASTN-type algorithm <span class="citation" data-cites="Lipman1985-hw">(<a href="references.html#ref-Lipman1985-hw" role="doc-biblioref">Lipman and Pearson 1985</a>)</span> identifies the best offset between the two paired reads. This offset defines the region of overlap.</p>
<p>In the second step, the matching regions of the two reads are extracted along with a flanking sequence of <span class="math inline">\(\Delta\)</span> base pairs. The two subsequences are then aligned using a “one side free end-gap” dynamic programming algorithm. This second step is only run if at least one mismatch is detected by the FASTN-type step.</p>
<p>Unless the similarity between the two reads at their overlap region is very low, the addition of the flanking regions in the second step of the alignment ensures the same alignment as if the dynamic programming alignment was performed on the full reads.</p>
</section>
<section id="the-scoring-system" class="level4" data-number="2.7.1.2">
<h4 data-number="2.7.1.2" class="anchored" data-anchor-id="the-scoring-system"><span class="header-section-number">2.7.1.2</span> The scoring system</h4>
<p>In the dynamic programming step, the match and mismatch scores take into account the quality scores of the two aligned nucleotides. By taking these into account, the probability of a true match can be calculated for each aligned base pair.</p>
<p>If we consider a nucleotide read with a quality score <span class="math inline">\(Q\)</span>, the probability of misreading this base (<span class="math inline">\(P_E\)</span>) is: <span class="math display">\[
P_E = 10^{-\frac{Q}{10}}
\]</span></p>
<p>Thus, when a given nucleotide <span class="math inline">\(X\)</span> is observed with the quality score <span class="math inline">\(Q\)</span>, the probability that <span class="math inline">\(X\)</span> is really an <span class="math inline">\(X\)</span> is:</p>
<p><span class="math display">\[
P(X=X) = 1 - P_E
\]</span></p>
<p>Otherwise, <span class="math inline">\(X\)</span> is actually one of the three other possible nucleotides (<span class="math inline">\(X_{E1}\)</span>, <span class="math inline">\(X_{E2}\)</span> or <span class="math inline">\(X_{E3}\)</span>). If we suppose that the three reading errors have the same probability:</p>
<p><span class="math display">\[
P(X=X_{E1}) = P(X=X_{E2}) = P(X=X_{E3}) = \frac{P_E}{3}
\]</span></p>
<p>At each position in an alignment where the two nucleotides <span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span> face each other (not a gapped position), the probability of a true match varies depending on whether <span class="math inline">\(X_1=X_2\)</span>, an observed match, or <span class="math inline">\(X_1 \neq X_2\)</span>, an observed mismatch.</p>
<p><strong>Probability of a true match when <span class="math inline">\(X_1=X_2\)</span></strong></p>
<p>That probability can be divided into two parts. First, both <span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span> have been correctly read. The corresponding probability is:</p>
<p><span class="math display">\[
\begin{aligned}
P_{TM} &amp;= (1- P_{E1})(1-P_{E2})\\
&amp;=(1 - 10^{-\frac{Q_1}{10} } )(1 - 10^{-\frac{Q_2}{10}} )
\end{aligned}
\]</span></p>
<p>Secondly, a match can occur if both bases have been misread, so that the true nucleotides are not the observed <span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span>, but are nevertheless identical to each other. For one given alternative nucleotide, and then summed over the three possible alternatives <span class="math inline">\(X_{Ex}\)</span>:</p>
<p><span class="math display">\[
\begin{aligned}
P(X_1=X_{E1} \cap X_2=X_{E1}) &amp;= \frac{P_{E1} P_{E2}}{9} \\
\sum_{x=1}^{3} P(X_1=X_{Ex} \cap X_2=X_{Ex}) &amp;= \frac{P_{E1} P_{E2}}{3}
\end{aligned}
\]</span></p>
<p>The probability of a true match between <span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span> given an observed match (<span class="math inline">\(X_1 = X_2\)</span>) is therefore:</p>
<p><span class="math display">\[
\begin{aligned}
P(MATCH | X_1 = X_2) = (1- P_{E1})(1-P_{E2}) + \frac{P_{E1} P_{E2}}{3}
\end{aligned}
\]</span></p>
<p><strong>Probability of a true match when <span class="math inline">\(X_1 \neq X_2\)</span></strong></p>
<p>That probability can be divided into three parts.</p>
<ol type="a">
<li><span class="math inline">\(X_1\)</span> has been correctly read and <span class="math inline">\(X_2\)</span> is a sequencing error and is actually equal to <span class="math inline">\(X_1\)</span>. <span class="math display">\[
P_a = (1-P_{E1})\frac{P_{E2}}{3}
\]</span></li>
<li><span class="math inline">\(X_2\)</span> has been correctly read and <span class="math inline">\(X_1\)</span> is a sequencing error and is actually equal to <span class="math inline">\(X_2\)</span>. <span class="math display">\[
P_b = (1-P_{E2})\frac{P_{E1}}{3}
\]</span></li>
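<li><span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span> both correspond to sequencing errors but are actually the same base <span class="math inline">\(X_{Ex}\)</span>, which can only be one of the two nucleotides differing from both <span class="math inline">\(X_1\)</span> and <span class="math inline">\(X_2\)</span>. <span class="math display">\[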
P_c = 2\frac{P_{E1} P_{E2}}{9}
\]</span></li>
</ol>
<p>Consequently: <span class="math display">\[
\begin{aligned}
P(MATCH | X_1 \neq X_2) = (1-P_{E1})\frac{P_{E2}}{3} + (1-P_{E2})\frac{P_{E1}}{3} + 2\frac{P_{E1} P_{E2}}{9}
\end{aligned}
\]</span></p>
<p><strong>Probability of a match under the random model</strong></p>
<div class="cell">
<div class="cell-output-display">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="commands_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid figure-img" width="672"></p>
<p></p><figcaption class="figure-caption">Evolution of the match and mismatch scores when the quality of one base is 20 while that of the second ranges from 10 to 40.</figcaption><p></p>
</figure>
</div>
</div>
</div>
</section>
<section id="obimultiplex" class="level4" data-number="2.7.1.3">
<h4 data-number="2.7.1.3" class="anchored" data-anchor-id="obimultiplex"><span class="header-section-number">2.7.1.3</span> <code>obimultiplex</code></h4>
<blockquote class="blockquote">
<p>Replace the <code>ngsfilter</code> original <em>OBITools</em></p>
</blockquote>
</section>
<section id="obicomplement" class="level4" data-number="2.7.1.2">
<h4 data-number="2.7.1.2" class="anchored" data-anchor-id="obicomplement"><span class="header-section-number">2.7.1.2</span> <code>obicomplement</code></h4>
<section id="obicomplement" class="level4" data-number="2.7.1.4">
<h4 data-number="2.7.1.4" class="anchored" data-anchor-id="obicomplement"><span class="header-section-number">2.7.1.4</span> <code>obicomplement</code></h4>
</section>
<section id="obiclean" class="level4" data-number="2.7.1.3">
<h4 data-number="2.7.1.3" class="anchored" data-anchor-id="obiclean"><span class="header-section-number">2.7.1.3</span> <code>obiclean</code></h4>
<section id="obiclean" class="level4" data-number="2.7.1.5">
<h4 data-number="2.7.1.5" class="anchored" data-anchor-id="obiclean"><span class="header-section-number">2.7.1.5</span> <code>obiclean</code></h4>
</section>
<section id="obiuniq" class="level4" data-number="2.7.1.4">
<h4 data-number="2.7.1.4" class="anchored" data-anchor-id="obiuniq"><span class="header-section-number">2.7.1.4</span> <code>obiuniq</code></h4>
<section id="obiuniq" class="level4" data-number="2.7.1.6">
<h4 data-number="2.7.1.6" class="anchored" data-anchor-id="obiuniq"><span class="header-section-number">2.7.1.6</span> <code>obiuniq</code></h4>
</section>
</section>
</section>
@ -333,6 +426,11 @@ ul.task-list li input[type="checkbox"] {
</blockquote>
<div id="refs" class="references csl-bib-body hanging-indent" role="doc-bibliography" style="display: none">
<div id="ref-Lipman1985-hw" class="csl-entry" role="doc-biblioentry">
Lipman, D J, and W R Pearson. 1985. <span>“<span class="nocase">Rapid and sensitive protein similarity searches</span>.”</span> <em>Science</em> 227 (4693): 1435–41. <a href="http://www.ncbi.nlm.nih.gov/pubmed/2983426">http://www.ncbi.nlm.nih.gov/pubmed/2983426</a>.
</div>
</div>
</section>
</section>
</section>

Binary file not shown.

View File

@ -22,6 +22,25 @@ ul.task-list li input[type="checkbox"] {
margin: 0 0.8em 0.2em -1.6em;
vertical-align: middle;
}
div.csl-bib-body { }
div.csl-entry {
clear: both;
}
.hanging div.csl-entry {
margin-left:2em;
text-indent:-2em;
}
div.csl-left-margin {
min-width:2em;
float:left;
}
div.csl-right-inline {
margin-left:2em;
padding-left:1em;
}
div.csl-indent {
margin-left: 2em;
}
</style>
@ -106,7 +125,7 @@ ul.task-list li input[type="checkbox"] {
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></a>
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></a>
</div>
</li>
<li class="sidebar-item">
@ -172,8 +191,16 @@ ul.task-list li input[type="checkbox"] {
<section id="preface" class="level1 unnumbered">
<h1 class="unnumbered">Preface</h1>
<p>Development of the first version of <em>OBITools</em> started in 2005. This was at the beginning of the DNA metabarcoding story at the Laboratoire d'Ecologie Alpine (LECA) in Grenoble. At that time, with Pierre Taberlet and François Pompanon, we were thinking about the potential of this new methodology under development. Pierre and François focused on the laboratory methods, while I was thinking more about the tools for analysing the sequences produced. Two ideas were behind this development. I wanted something modular, and something easy to extend. To achieve the first goal, I decided to implement obitools as a suite of Unix commands mimicking the classic Unix commands but dedicated to sequence files. The basic Unix commands are very useful for automatically manipulating, parsing and editing text files. They work as a stream, line by line on the input text. The result is a new text file that can be used as input for the next command. Such a design makes it possible to quickly develop a text processing pipeline by chaining simple elementary operations. The <em>OBITools</em> are the exact counterpart of these basic Unix commands, but the basic information they process is a sequence (potentially spanning several lines of text), not a single line of text. Most <em>OBITools</em> consume sequence files and produce sequence files. Thus, the principles of chaining and modularity are respected. In order to be able to easily extend the <em>OBITools</em> to keep up with our evolving ideas about processing DNA metabarcoding data, it was decided to develop them using an interpreted language: Python. Python 2, the version available at the time, allowed us to develop the <em>OBITools</em> efficiently. When parts of the algorithms were computationally demanding, they were implemented in C and linked to the Python code. 
Even though Python is not the most efficient language available, and even though computers were not as powerful as they are today, the size of the data we could produce using 454 sequencers or early Solexa machines was small enough to be processed in a reasonable time.</p>
<p>The first public version of obitools was <a href="https://metabarcoding.org/obitools"><em>OBITools2</em></a> <span class="citation" data-cites="Boyer2016-gq">(<a href="references.html#ref-Boyer2016-gq" role="doc-biblioref">Boyer et al. 2016</a>)</span>. This was actually a cleaned up and documented version of <em>OBITools</em> that had been running at LECA for years and was not really distributed except to a few collaborators. This is where <em>OBITools</em> started its public life. The DNA metabarcoding spring schools provided and still provide user training every year. But <em>OBITools2</em> soon suffered from two limitations: it was developed in Python2, which was increasingly abandoned in favour of Python3, and the data size kept increasing with the new Illumina machines. Python's intrinsic slowness coupled with the increasing size of the datasets made OBITools computation times increasingly long. The abandonment of all maintenance of Python2 by its developers also imposed the need for a new version of OBITools.</p>
<p><a href="https://metabarcoding.org/obitools3"><em>OBITools3</em></a> was the first response to this crisis. Developed and maintained by <a href="https://www.celine-mercier.info">Céline Mercier</a>, <em>OBITools3</em> attempted to address several limitations of <em>OBITools2</em>. It is a completely new code base, mainly developed in Python3, with most of the lower layer code written in C for efficiency. OBITools3 has also abandoned text files for binary files for the same reason of efficiency. They have been replaced by a database structure that keeps track of every operation performed on the data.</p>
<p>Here we present <em>OBITools4</em> which can be seen as a return to the origins of OBITools. While <em>OBITools3</em> offered traceability of analyses, which is in line with the concept of open science, and faster execution, <em>OBITools2</em> was more versatile and not only usable for the analysis of DNA metabarcoding data. <em>OBITools4</em> is the third full implementation of <em>OBITools</em>. The idea behind this new version is to go back to the original design of <em>OBITools</em> which ran on text files containing sequences, like the classic Unix commands, but running at least as fast as <em>OBITools3</em> and taking advantage of the multicore architecture of all modern laptops. For this, the idea of relying on an interpreted language was abandoned. The <em>OBITools4</em> are now fully implemented in the <a href="https://go.dev">GO</a> language with the exception of a few small pieces of specific code already implemented very efficiently in C. <em>OBITools4</em> also implements a new format for the annotations inserted in the header of every sequence. Rather than relying on a format specific to <em>OBITools</em>, by default <em>OBITools4</em> uses the <a href="https://www.json.org">JSON</a> format. This simplifies the writing of parsers in any language, and thus allows obitools to interact easily with other software.</p>
<div id="refs" class="references csl-bib-body hanging-indent" role="doc-bibliography" style="display: none">
<div id="ref-Boyer2016-gq" class="csl-entry" role="doc-biblioentry">
Boyer, Frédéric, Céline Mercier, Aurélie Bonin, Yvan Le Bras, Pierre Taberlet, and Eric Coissac. 2016. <span>“<span class="nocase">obitools: a unix-inspired software package for DNA metabarcoding</span>.”</span> <em>Molecular Ecology Resources</em> 16 (1): 176–82. <a href="https://doi.org/10.1111/1755-0998.12428">https://doi.org/10.1111/1755-0998.12428</a>.
</div>
</div>
</section>
</main> <!-- /main -->

View File

@ -125,7 +125,7 @@ div.csl-indent {
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></a>
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></a>
</div>
</li>
<li class="sidebar-item">
@ -525,7 +525,7 @@ window.document.addEventListener("DOMContentLoaded", function (event) {
</div>
<div class="nav-page nav-page-next">
<a href="./commands.html" class="pagination-link">
<span class="nav-page-text"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></span> <i class="bi bi-arrow-right-short"></i>
<span class="nav-page-text"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></span> <i class="bi bi-arrow-right-short"></i>
</a>
</div>
</nav>

View File

@ -168,7 +168,7 @@ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warni
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></a>
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></a>
</div>
</li>
<li class="sidebar-item">
@ -200,6 +200,13 @@ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warni
<li><a href="#creating-new-instances" id="toc-creating-new-instances" class="nav-link" data-scroll-target="#creating-new-instances"><span class="toc-section-number">3.1.1</span> Creating new instances</a></li>
<li><a href="#end-of-life-of-a-biosequence-instance" id="toc-end-of-life-of-a-biosequence-instance" class="nav-link" data-scroll-target="#end-of-life-of-a-biosequence-instance"><span class="toc-section-number">3.1.2</span> End of life of a <code>BioSequence</code> instance</a></li>
<li><a href="#accessing-to-the-elements-of-a-sequence" id="toc-accessing-to-the-elements-of-a-sequence" class="nav-link" data-scroll-target="#accessing-to-the-elements-of-a-sequence"><span class="toc-section-number">3.1.3</span> Accessing to the elements of a sequence</a></li>
<li><a href="#the-annotations-of-a-sequence" id="toc-the-annotations-of-a-sequence" class="nav-link" data-scroll-target="#the-annotations-of-a-sequence"><span class="toc-section-number">3.1.4</span> The annotations of a sequence</a></li>
</ul></li>
<li><a href="#the-sequence-iterator" id="toc-the-sequence-iterator" class="nav-link" data-scroll-target="#the-sequence-iterator"><span class="toc-section-number">3.2</span> The sequence iterator</a>
<ul class="collapse">
<li><a href="#basic-usage-of-a-sequence-iterator" id="toc-basic-usage-of-a-sequence-iterator" class="nav-link" data-scroll-target="#basic-usage-of-a-sequence-iterator"><span class="toc-section-number">3.2.1</span> Basic usage of a sequence iterator</a></li>
<li><a href="#the-pipable-functions" id="toc-the-pipable-functions" class="nav-link" data-scroll-target="#the-pipable-functions"><span class="toc-section-number">3.2.2</span> The <code>Pipable</code> functions</a></li>
<li><a href="#the-teeable-functions" id="toc-the-teeable-functions" class="nav-link" data-scroll-target="#the-teeable-functions"><span class="toc-section-number">3.2.3</span> The <code>Teeable</code> functions</a></li>
</ul></li>
</ul>
</nav>
@ -261,7 +268,7 @@ sequence</code></pre>
</section>
<section id="end-of-life-of-a-biosequence-instance" class="level3" data-number="3.1.2">
<h3 data-number="3.1.2" class="anchored" data-anchor-id="end-of-life-of-a-biosequence-instance"><span class="header-section-number">3.1.2</span> End of life of a <code>BioSequence</code> instance</h3>
<p>When a <code>BioSequence</code> instance is no more used, it is normally taken in charge by the GO garbage collector. You can if you want call the <code>Recycle</code> method on the instance to store the allocated memory element in a <code>pool</code> to limit allocation effort when many sequences are manipulated.</p>
<p>When an instance of <code>BioSequence</code> is no longer in use, it is normally taken over by the GO garbage collector. If you know that an instance will never be used again, you can, if you wish, call the <code>Recycle</code> method on it to store the allocated memory elements in a <code>pool</code> to limit the allocation effort when many sequences are being handled. Once the recycle method has been called on an instance, you must ensure that no other method is called on it.</p>
</section>
<section id="accessing-to-the-elements-of-a-sequence" class="level3" data-number="3.1.3">
<h3 data-number="3.1.3" class="anchored" data-anchor-id="accessing-to-the-elements-of-a-sequence"><span class="header-section-number">3.1.3</span> Accessing to the elements of a sequence</h3>
@ -330,9 +337,68 @@ sequence</code></pre>
<li><code>WriteByteQualities(data byte) error</code></li>
</ul>
<p>In a way analogous to the <code>Clear</code> method, <code>ClearQualities()</code> empties the sequence of quality scores.</p>
</section>
</section>
<section id="the-annotations-of-a-sequence" class="level3" data-number="3.1.4">
<h3 data-number="3.1.4" class="anchored" data-anchor-id="the-annotations-of-a-sequence"><span class="header-section-number">3.1.4</span> The annotations of a sequence</h3>
<p>A sequence can be annotated with attributes. Each attribute is associated with a value. An attribute is identified by its name. The name of an attribute consists of a character string containing no spaces or blank characters. Values can be of several types.</p>
<ul>
<li>Scalar types:
<ul>
<li>integer</li>
<li>numeric</li>
<li>character</li>
<li>boolean</li>
</ul></li>
<li>Container types:
<ul>
<li>vector</li>
<li>map</li>
</ul></li>
</ul>
<p>Vectors can contain any type of scalar. Maps are necessarily indexed by strings and can contain any scalar type. It is not possible to have nested container types.</p>
<p>Annotations are stored in an object of type <code>bioseq.Annotation</code> which is an alias of <code>map[string]interface{}</code>. This map can be retrieved using the <code>Annotations() Annotation</code> method. If no annotation has been defined for this sequence, the method returns an empty map. It is possible to test an instance of <code>BioSequence</code> using its <code>HasAnnotation() bool</code> method to see if it has any annotations associated with it.</p>
<ul>
<li><code>GetAttribute(key string) (interface{}, bool)</code></li>
</ul>
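<p>Because annotations are described as a <code>map[string]interface{}</code>, reading a value usually involves a type switch. The sketch below works on a locally defined <code>Annotation</code> type that mirrors this description. It does not import the <em>OBITools</em> packages, and the helper <code>describe</code> as well as the attribute names are invented for the example.</p>

```go
package main

import "fmt"

// Annotation mirrors the map type described in the text.
// It is declared locally for the purpose of this example.
type Annotation map[string]interface{}

// describe reports which kind of value an attribute holds.
func describe(v interface{}) string {
	switch x := v.(type) {
	case int:
		return fmt.Sprintf("integer %d", x)
	case float64:
		return fmt.Sprintf("numeric %g", x)
	case string:
		return fmt.Sprintf("character %q", x)
	case bool:
		return fmt.Sprintf("boolean %t", x)
	case []interface{}:
		return fmt.Sprintf("vector of %d values", len(x))
	case map[string]interface{}:
		return fmt.Sprintf("map with %d keys", len(x))
	default:
		return "unsupported type"
	}
}

func main() {
	annot := Annotation{
		"count":  12,
		"score":  0.98,
		"sample": "A01",
		"merged": map[string]interface{}{"A01": 7, "B03": 5},
	}

	// Looking up a key returns the value and a presence flag.
	if v, ok := annot["count"]; ok {
		fmt.Println("count is an", describe(v))
	}
	if _, ok := annot["taxid"]; !ok {
		fmt.Println("taxid is not defined")
	}
}
```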
</section>
</section>
<section id="the-sequence-iterator" class="level2" data-number="3.2">
<h2 data-number="3.2" class="anchored" data-anchor-id="the-sequence-iterator"><span class="header-section-number">3.2</span> The sequence iterator</h2>
<p>The package <em>obiiter</em> provides an iterator mechanism for manipulating sequences. The main class provided by this package is <code>obiiter.IBioSequence</code>. An <code>IBioSequence</code> iterator provides batches of sequences.</p>
<section id="basic-usage-of-a-sequence-iterator" class="level3" data-number="3.2.1">
<h3 data-number="3.2.1" class="anchored" data-anchor-id="basic-usage-of-a-sequence-iterator"><span class="header-section-number">3.2.1</span> Basic usage of a sequence iterator</h3>
<p>Many functions, among them functions reading sequences from a text file, return an <code>IBioSequence</code> iterator. The iterator class provides two main methods:</p>
<ul>
<li><code>Next() bool</code></li>
<li><code>Get() obiiter.BioSequenceBatch</code></li>
</ul>
<p>The <code>Next</code> method moves the iterator to the next value, while the <code>Get</code> method returns the currently pointed value. Using them, it is possible to loop over the data as in the following code chunk.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode go code-with-copy"><code class="sourceCode go"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="kw">import</span> <span class="op">(</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiformats"</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="op">)</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="kw">func</span> main<span class="op">()</span> <span class="op">{</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a> mydata <span class="op">:=</span> obiformats<span class="op">.</span>ReadFastSeqFromFile<span class="op">(</span><span class="st">"myfile.fasta"</span><span class="op">)</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> mydata<span class="op">.</span>Next<span class="op">()</span> <span class="op">{</span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a> data <span class="op">:=</span> mydata<span class="op">.</span>Get<span class="op">()</span></span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a> <span class="co">//</span></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a> <span class="co">// Whatever you want to do with the data chunk</span></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a> <span class="co">//</span></span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>An <code>obiseq.BioSequenceBatch</code> instance is a set of sequences stored in an <code>obiseq.BioSequenceSlice</code> and a sequence number. The number of sequences in a batch is not defined. A batch can even contain zero sequences, if for example all sequences initially included in the batch have been filtered out at some stage of their processing.</p>
</section>
<section id="the-pipable-functions" class="level3" data-number="3.2.2">
<h3 data-number="3.2.2" class="anchored" data-anchor-id="the-pipable-functions"><span class="header-section-number">3.2.2</span> The <code>Pipable</code> functions</h3>
<p>A function consuming an <code>obiiter.IBioSequence</code> and returning an <code>obiiter.IBioSequence</code> is of class <code>obiiter.Pipable</code>.</p>
</section>
<section id="the-teeable-functions" class="level3" data-number="3.2.3">
<h3 data-number="3.2.3" class="anchored" data-anchor-id="the-teeable-functions"><span class="header-section-number">3.2.3</span> The <code>Teeable</code> functions</h3>
<p>A function consuming an <code>obiiter.IBioSequence</code> and returning two <code>obiiter.IBioSequence</code> instances is of class <code>obiiter.Teeable</code>.</p>
</section>
</section>
@ -473,7 +539,7 @@ window.document.addEventListener("DOMContentLoaded", function (event) {
<nav class="page-navigation">
<div class="nav-page nav-page-previous">
<a href="./commands.html" class="pagination-link">
<i class="bi bi-arrow-left-short"></i> <span class="nav-page-text"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></span>
<i class="bi bi-arrow-left-short"></i> <span class="nav-page-text"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></span>
</a>
</div>
<div class="nav-page nav-page-next">

View File

@ -123,7 +123,7 @@ div.csl-indent {
</li>
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools</em> commands</span></a>
<a href="./commands.html" class="sidebar-item-text sidebar-link"><span class="chapter-number">2</span>&nbsp; <span class="chapter-title">The <em>OBITools V4</em> commands</span></a>
</div>
</li>
<li class="sidebar-item">
@ -169,12 +169,23 @@ div.csl-indent {
</header>
<div id="refs" class="references csl-bib-body hanging-indent" role="doc-bibliography">
<div id="ref-Boyer2016-gq" class="csl-entry" role="doc-biblioentry">
Boyer, Frédéric, Céline Mercier, Aurélie Bonin, Yvan Le Bras, Pierre
Taberlet, and Eric Coissac. 2016. <span>“<span class="nocase">obitools:
a unix-inspired software package for DNA metabarcoding</span>.”</span>
<em>Molecular Ecology Resources</em> 16 (1): 176–82. <a href="https://doi.org/10.1111/1755-0998.12428">https://doi.org/10.1111/1755-0998.12428</a>.
</div>
<div id="ref-cock2010sanger" class="csl-entry" role="doc-biblioentry">
Cock, Peter JA, Christopher J Fields, Naohisa Goto, Michael L Heuer, and
Peter M Rice. 2010. <span>“The Sanger FASTQ File Format for Sequences
with Quality Scores, and the Solexa/Illumina FASTQ Variants.”</span>
<em>Nucleic Acids Research</em> 38 (6): 1767–71.
</div>
<div id="ref-Lipman1985-hw" class="csl-entry" role="doc-biblioentry">
Lipman, D J, and W R Pearson. 1985. <span>“<span class="nocase">Rapid
and sensitive protein similarity searches</span>.”</span>
<em>Science</em> 227 (4693): 1435–41. <a href="http://www.ncbi.nlm.nih.gov/pubmed/2983426">http://www.ncbi.nlm.nih.gov/pubmed/2983426</a>.
</div>
</div>

View File

@ -8,3 +8,73 @@
year={2010},
publisher={Oxford University Press}
}
@ARTICLE{Boyer2016-gq,
title = "{obitools: a unix-inspired software package for DNA metabarcoding}",
author = "Boyer, Fr{\'e}d{\'e}ric and Mercier, C{\'e}line and Bonin,
Aur{\'e}lie and Le Bras, Yvan and Taberlet, Pierre and Coissac,
Eric",
abstract = "DNA metabarcoding offers new perspectives in biodiversity
research. This recently developed approach to ecosystem study
relies heavily on the use of next-generation sequencing (NGS)
and thus calls upon the ability to deal with huge sequence data
sets. The obitools package satisfies this requirement thanks to
a set of programs specifically designed for analysing NGS data
in a DNA metabarcoding context. Their capacity to filter and
edit sequences while taking into account taxonomic annotation
helps to set up tailor-made analysis pipelines for a broad range
of DNA metabarcoding applications, including biodiversity
surveys or diet analyses. The obitools package is distributed as
an open source software available on the following website:
http://metabarcoding.org/obitools. A Galaxy wrapper is available
on the GenOuest core facility toolshed:
http://toolshed.genouest.org.",
journal = "Molecular ecology resources",
publisher = "Wiley Online Library",
volume = 16,
number = 1,
pages = "176--182",
month = jan,
year = 2016,
url = "http://dx.doi.org/10.1111/1755-0998.12428",
keywords = "PCR errors; biodiversity; next-generation sequencing; sequence
analysis; taxonomic annotation",
language = "en",
issn = "1755-098X, 1755-0998",
pmid = "25959493",
doi = "10.1111/1755-0998.12428"
}
@article{Lipman1985-hw,
abstract = {An algorithm was developed which facilitates the search for
similarities between newly determined amino acid sequences and
sequences already available in databases. Because of the
algorithm's efficiency on many microcomputers, sensitive protein
database searches may now become a routine procedure for
molecular biologists. The method efficiently identifies regions
of similar sequence and then scores the aligned identical and
differing residues in those regions by means of an amino acid
replacability matrix. This matrix increases sensitivity by giving
high scores to those amino acid replacements which occur
frequently in evolution. The algorithm has been implemented in a
computer program designed to search protein databases very
rapidly. For example, comparison of a 200-amino-acid sequence to
the 500,000 residues in the National Biomedical Research
Foundation library would take less than 2 minutes on a
minicomputer, and less than 10 minutes on a microcomputer (IBM
PC).},
author = {Lipman, D J and Pearson, W R},
date-added = {2023-01-26 15:17:10 +0100},
date-modified = {2023-01-26 15:17:10 +0100},
issn = {0036-8075},
journal = {Science},
month = mar,
number = 4693,
pages = {1435--1441},
pmid = {2983426},
title = {{Rapid and sensitive protein similarity searches}},
url = {http://www.ncbi.nlm.nih.gov/pubmed/2983426},
volume = 227,
year = 1985,
bdsk-url-1 = {http://www.ncbi.nlm.nih.gov/pubmed/2983426}}

View File

@ -1,4 +1,4 @@
# The *OBITools* commands
# The *OBITools V4* commands
## Specifying the input files to *OBITools* commands
@ -96,6 +96,131 @@ sequence is the sequence object on which the expression is evaluated
> Replaces the `illuminapairedend` command of the original *OBITools*
#### Alignment procedure
`obipairing` introduces a new alignment algorithm compared to the `illuminapairedend` command of *OBITools V2*.
Nevertheless, this new algorithm has been designed to produce the same results as the previous one, except in very few cases.
The new algorithm is a two-step procedure. First, a FASTN-type algorithm [@Lipman1985-hw] identifies the best offset between the two paired reads. This offset defines the region of overlap.
In the second step, the matching regions of the two reads are extracted along with a flanking sequence of $\Delta$ base pairs. The two subsequences are then aligned using a "one side free end-gap" dynamic programming algorithm. This second step is only run if at least one mismatch is detected by the first step.
Unless the similarity between the two reads in their overlap region is very low, adding the flanking regions in the second step ensures the same alignment as if the dynamic programming alignment had been performed on the full reads.
#### The scoring system
In the dynamic programming step, the match and mismatch scores take into account the quality scores of the two aligned nucleotides. By taking these into account, the probability of a true match can be calculated for each aligned base pair.
If we consider a nucleotide read with a quality score $Q$, the probability of misreading this base ($P_E$) is:
$$
P_E = 10^{-\frac{Q}{10}}
$$
Thus, when a given nucleotide $X$ is observed with the quality score $Q$, the probability that $X$ is really an $X$ is:
$$
P(X=X) = 1 - P_E
$$
Otherwise, $X$ is actually one of the three other possible nucleotides ($X_{E1}$, $X_{E2}$ or $X_{E3}$). If we suppose that the three reading errors have the same probability:
$$
P(X=X_{E1}) = P(X=X_{E2}) = P(X=X_{E3}) = \frac{P_E}{3}
$$
At each position in an alignment where the two nucleotides $X_1$ and $X_2$ face each other (not a gapped position), the probability of a true match varies depending on whether $X_1=X_2$, an observed match, or $X_1 \neq X_2$, an observed mismatch.
**Probability of a true match when $X_1=X_2$**
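That probability can be divided into two parts. In what follows, $P_{E1}$ and $P_{E2}$ denote the misreading probabilities of $X_1$ and $X_2$, computed from their quality scores $Q_1$ and $Q_2$. First, $X_1$ and $X_2$ may both have been correctly read. The corresponding probability is:
$$
\begin{aligned}
P_{TM} &= (1- P_{E1})(1-P_{E2})\\
&=(1 - 10^{-\frac{Q_1}{10} } )(1 - 10^{-\frac{Q_2}{10}} )
\end{aligned}
$$
Secondly, a true match can also occur if both $X_1$ and $X_2$ are reading errors and the two true nucleotides are the same base $X_{Ex}$. The first line below gives the probability for one given $X_{Ex}$, the second line sums over the three possible $X_{Ex}$:
$$
\begin{aligned}
P(X_1=X_{Ex} \cap X_2=X_{Ex}) &= \frac{P_{E1}}{3} \cdot \frac{P_{E2}}{3} = \frac{P_{E1} P_{E2}}{9} \\
\sum_{x=1}^{3} P(X_1=X_{Ex} \cap X_2=X_{Ex}) &= \frac{P_{E1} P_{E2}}{3}
\end{aligned}
$$
The probability of a true match between $X_1$ and $X_2$ when $X_1 = X_2$, an observed match, is therefore:
$$
\begin{aligned}
P(MATCH | X_1 = X_2) = (1- P_{E1})(1-P_{E2}) + \frac{P_{E1} P_{E2}}{3}
\end{aligned}
$$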
**Probability of a true match when $X_1 \neq X_2$**
That probability can be divided into three parts.
a. $X_1$ has been correctly read and $X_2$ is a sequencing error and is actually equal to $X_1$.
$$
P_a = (1-P_{E1})\frac{P_{E2}}{3}
$$
b. $X_2$ has been correctly read and $X_1$ is a sequencing error and is actually equal to $X_2$.
$$
P_b = (1-P_{E2})\frac{P_{E1}}{3}
$$
c. $X_1$ and $X_2$ both correspond to sequencing errors but are actually the same base $X_{Ex}$. Since $X_{Ex}$ must differ from both $X_1$ and $X_2$, only two bases are possible.
$$
P_c = 2\frac{P_{E1} P_{E2}}{9}
$$
Consequently:
$$
\begin{aligned}
P(MATCH | X_1 \neq X_2) = (1-P_{E1})\frac{P_{E2}}{3} + (1-P_{E2})\frac{P_{E1}}{3} + 2\frac{P_{E1} P_{E2}}{9}
\end{aligned}
$$
**Probability of a match under the random model**
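```{r}
#| echo: false
#| warning: false
#| fig-cap: "Evolution of the match and mismatch scores when the quality of one base is 20 while that of the other ranges from 10 to 40."
library(ggplot2)
library(tidyverse)

# Score of an observed match (X1 == X2) for the quality scores Q1 and Q2.
Smatch <- function(Q1, Q2) {
  PE1 <- 10^(-Q1 / 10)  # misreading probability of the first base
  PE2 <- 10^(-Q2 / 10)  # misreading probability of the second base
  PT1 <- 1 - PE1
  PT2 <- 1 - PE2
  PM <- PT1 * PT2 + PE1 * PE2 / 3
  # log-odds against the random model (log(4) = -log(1/4))
  round((log(PM) + log(4)) * 10)
}

# Score of an observed mismatch (X1 != X2) for the quality scores Q1 and Q2.
Smismatch <- function(Q1, Q2) {
  PE1 <- 10^(-Q1 / 10)
  PE2 <- 10^(-Q2 / 10)
  PT1 <- 1 - PE1
  PT2 <- 1 - PE2
  # last term is 2/9, as in P(MATCH | X1 != X2) derived above
  PM <- PE1 * PT2 / 3 + PT1 * PE2 / 3 + 2 / 9 * PE1 * PE2
  round((log(PM) + log(4)) * 10)
}

tibble(Q = 10:40) %>%
  mutate(Match = mapply(Smatch, Q, 20),
         Mismatch = mapply(Smismatch, Q, 20)) %>%
  pivot_longer(cols = -Q, names_to = "Class", values_to = "Score") %>%
  ggplot(aes(x = Q, y = Score, col = Class)) +
  geom_line() +
  xlab("Q1 (Q2=20)")
```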
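The same computation can be sketched in Go, the implementation language of *OBITools4*. This is only an illustration of the formulas above, not the code used by `obipairing`: the function names are hypothetical, and the scaling (ten times the natural logarithm of the ratio between the true match probability and a random match probability of 1/4) is read from the R chunk above rather than from the `obipairing` source.

```go
package main

import (
	"fmt"
	"math"
)

// errorProb converts a quality score Q into the probability
// of a misread base: P_E = 10^(-Q/10).
func errorProb(q float64) float64 {
	return math.Pow(10, -q/10)
}

// score converts a true match probability into a rounded log-odds score
// against a random model where two bases match with probability 1/4
// (assumption taken from the log(4) term of the R chunk).
func score(p float64) int {
	return int(math.Round(10 * (math.Log(p) + math.Log(4))))
}

// matchScore is the score of an observed match (X1 == X2).
func matchScore(q1, q2 float64) int {
	pe1, pe2 := errorProb(q1), errorProb(q2)
	return score((1-pe1)*(1-pe2) + pe1*pe2/3)
}

// mismatchScore is the score of an observed mismatch (X1 != X2).
func mismatchScore(q1, q2 float64) int {
	pe1, pe2 := errorProb(q1), errorProb(q2)
	return score((1-pe1)*pe2/3 + (1-pe2)*pe1/3 + 2*pe1*pe2/9)
}

func main() {
	for _, q := range []float64{10, 20, 30, 40} {
		fmt.Printf("Q1=%2.0f Q2=20 match=%d mismatch=%d\n",
			q, matchScore(q, 20), mismatchScore(q, 20))
	}
}
```

With both quality scores equal to 20, this sketch gives a match score of 14 and a mismatch score of -36.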
#### `obimultiplex`
> Replaces the `ngsfilter` command of the original *OBITools*

View File

@ -2,4 +2,8 @@
The first version of *OBITools* started to be developed in 2005. This was at the beginning of the DNA metabarcoding story at the Laboratoire d'Ecologie Alpine (LECA) in Grenoble. At that time, with Pierre Taberlet and François Pompanon, we were thinking about the potential of this new methodology under development. Pierre and François focused more on the laboratory methods, while I was thinking more about the tools for analysing the sequences produced. Two ideas were behind this development. I wanted something modular, and something easy to extend. To achieve the first goal, I decided to implement *OBITools* as a suite of Unix commands mimicking the classic Unix commands but dedicated to sequence files. The basic Unix commands are very useful for automatically manipulating, parsing and editing text files. They work as a stream, line by line on the input text. The result is a new text file that can be used as input for the next command. Such a design makes it possible to quickly develop a text processing pipeline by chaining simple elementary operations. The *OBITools* are the exact counterpart of these basic Unix commands, but the basic unit of information they process is a sequence (potentially spanning several lines of text), not a single line of text. Most *OBITools* consume sequence files and produce sequence files. Thus, the principles of chaining and modularity are respected. In order to be able to easily extend the *OBITools* to keep up with our evolving ideas about processing DNA metabarcoding data, it was decided to develop them using an interpreted language: Python. Python 2, the version available at the time, allowed us to develop the *OBITools* efficiently. When parts of the algorithms were computationally demanding, they were implemented in C and linked to the Python code.
Even though Python is not the most efficient language available, and even though computers were not as powerful as they are today, the size of the data we could produce using 454 sequencers or early Solexa machines was small enough to be processed in a reasonable time.
The first public version of *OBITools* was [*OBITools2*](https://metabarcoding.org/obitools) [@Boyer2016-gq]. It was actually a cleaned up and documented version of the *OBITools* that had been running at LECA for years and had not really been distributed except to a few collaborators. This is where *OBITools* started its public life. The DNA metabarcoding spring schools provided and still provide user training every year. But *OBITools2* soon suffered from two limitations: it was developed in Python2, which was increasingly abandoned in favour of Python3, and the data size kept increasing with the new Illumina machines. Python's intrinsic slowness coupled with the increasing size of the datasets made *OBITools* computation times increasingly long. The abandonment of all maintenance of Python2 by its developers also imposed the need for a new version of *OBITools*.
[*OBITools3*](https://metabarcoding.org/obitools3) was the first response to this crisis. Developed and maintained by [Céline Mercier](https://www.celine-mercier.info), *OBITools3* attempted to address several limitations of *OBITools2*. It is a completely new code base, mainly developed in Python3, with most of the lower layer code written in C for efficiency. *OBITools3* has also abandoned text files for binary files for the same reason of efficiency. They have been replaced by a database structure that keeps track of every operation performed on the data.
Here we present *OBITools4*, which can be seen as a return to the origins of *OBITools*. While *OBITools3* offered traceability of analyses, which is in line with the concept of open science, and faster execution, *OBITools2* was more versatile and not only usable for the analysis of DNA metabarcoding data. *OBITools4* is the third full implementation of *OBITools*. The idea behind this new version is to go back to the original design of *OBITools*, which ran on text files containing sequences, like the classic Unix commands, while running at least as fast as *OBITools3* and taking advantage of the multicore architecture of all modern laptops. For this, the idea of relying on an interpreted language was abandoned. The *OBITools4* are now fully implemented in the [Go](https://go.dev) language, with the exception of a few small pieces of specific code already implemented very efficiently in C. *OBITools4* also implements a new format for the annotations inserted in the header of every sequence. Rather than relying on a format specific to *OBITools*, *OBITools4* uses the [JSON](https://www.json.org) format by default. This simplifies the writing of parsers in any language, and thus allows *OBITools* to interact easily with other software.

View File

@ -58,11 +58,8 @@ When formated as fasta the parameters correspond to the following schema
### End of life of a `BioSequence` instance
When a `BioSequence` instance is no more used, it is normally taken in
charge by the GO garbage collector. You can if you want call the
`Recycle` method on the instance to store the allocated memory element
in a `pool` to limit allocation effort when many sequences are
manipulated.
When an instance of `BioSequence` is no longer in use, it is normally reclaimed by the Go garbage collector. If you know that an instance will never be used again, you can, if you wish, call the `Recycle` method on it to store the allocated memory elements in a `pool`, which limits the allocation effort when many sequences are being handled. Once the `Recycle` method has been called on an instance, you must ensure that no other method is called on it.
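The recycling idea can be illustrated with Go's standard `sync.Pool`. The sketch below is generic and does not reproduce the actual `Recycle` implementation: the `seqBuffer` type and the `newBuffer` and `recycle` functions are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// seqBuffer is a hypothetical stand-in for the memory owned by a sequence.
type seqBuffer struct {
	data []byte
}

// bufferPool keeps released buffers so that they can be reused
// instead of being allocated again.
var bufferPool = sync.Pool{
	New: func() interface{} {
		return &seqBuffer{data: make([]byte, 0, 256)}
	},
}

// newBuffer takes a buffer from the pool (or allocates one) and fills it.
func newBuffer(seq string) *seqBuffer {
	b := bufferPool.Get().(*seqBuffer)
	b.data = append(b.data[:0], seq...)
	return b
}

// recycle hands the buffer back to the pool.
// The caller must not use b after this call.
func recycle(b *seqBuffer) {
	b.data = b.data[:0]
	bufferPool.Put(b)
}

func main() {
	b := newBuffer("acgtacgt")
	fmt.Println(string(b.data))
	recycle(b)
	// b must not be touched from here on: the next call may reuse its memory.

	c := newBuffer("ttgca")
	fmt.Println(string(c.data))
	recycle(c)
}
```

The constraint stated above follows directly from this design: once a buffer is back in the pool, another sequence may overwrite it at any time.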
### Accessing the elements of a sequence
@ -154,4 +151,62 @@ conterpart.
In a way analogous to the `Clear` method, `ClearQualities()` empties the
sequence of quality scores.
### The annotations of a sequence
A sequence can be annotated with attributes. Each attribute is associated with a value. An attribute is identified by its name.
The name of an attribute consists of a character string containing no spaces or blank characters. Values can be of several types.
- Scalar types:
  - integer
  - numeric
  - character
  - boolean
- Container types:
  - vector
  - map
Vectors can contain any type of scalar. Maps are always indexed by strings and can contain any scalar type. It is not possible to nest container types.
Annotations are stored in an object of type `bioseq.Annotation` which is an alias of `map[string]interface{}`. This map can be retrieved using the `Annotations() Annotation` method. If no annotation has been defined for this sequence, the method returns an empty map. It is possible to test an instance of `BioSequence` using its `HasAnnotation() bool` method to see if it has any annotations associated with it.
- `GetAttribute(key string) (interface{}, bool)`
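Because annotation values are stored as `interface{}`, a type assertion is needed to get a typed value back. The following sketch shows the pattern on a locally declared `Annotation` type, so that it compiles without the *OBITools* packages. The free function `getAttribute` is a hypothetical stand-in for the `GetAttribute` method, and the attribute names are invented for the example.

```go
package main

import "fmt"

// Annotation mirrors the alias described above; it is redeclared here
// so that the sketch is self-contained.
type Annotation map[string]interface{}

// getAttribute is a hypothetical stand-in for the GetAttribute method:
// it returns the value and a boolean telling whether the key exists.
func getAttribute(a Annotation, key string) (interface{}, bool) {
	v, ok := a[key]
	return v, ok
}

func main() {
	annot := Annotation{
		"count":         12,                                     // integer
		"score":         0.97,                                   // numeric
		"sample":        "soil_A",                               // character
		"chimera":       false,                                  // boolean
		"taxid_path":    []int{1, 2759, 33090},                  // vector
		"merged_sample": map[string]int{"soil_A": 10, "soil_B": 2}, // map
	}

	// The value comes back as interface{}: assert its type before use.
	if v, ok := getAttribute(annot, "count"); ok {
		if n, isInt := v.(int); isInt {
			fmt.Println("count =", n)
		}
	}

	// The boolean distinguishes a missing attribute from a zero value.
	if _, ok := getAttribute(annot, "missing"); !ok {
		fmt.Println("no attribute named \"missing\"")
	}
}
```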
## The sequence iterator
The package *obiiter* provides an iterator mechanism for manipulating sequences. The main class provided by this package is `obiiter.IBioSequence`. An `IBioSequence` iterator provides batches of sequences.
### Basic usage of a sequence iterator
Many functions, among them the functions reading sequences from a text file, return an `IBioSequence` iterator. The iterator class provides two main methods:
- `Next() bool`
- `Get() obiiter.BioSequenceBatch`
The `Next` method moves the iterator to the next value, while the `Get` method returns the currently pointed value. Using them, it is possible to loop over the data as in the following code chunk.
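``` go
package main

import (
	"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiformats"
)

func main() {
	// Reading a sequence file returns a sequence iterator.
	mydata := obiformats.ReadFastSeqFromFile("myfile.fasta")

	// Next moves to the next batch and reports whether one is available.
	for mydata.Next() {
		// Get returns the batch currently pointed to by the iterator.
		data := mydata.Get()
		//
		// Whatever you want to do with the data chunk
		//
		_ = data // keeps the compiler quiet while the chunk is unused
	}
}
```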
An `obiiter.BioSequenceBatch` instance is a set of sequences stored in an `obiseq.BioSequenceSlice` and a sequence number. The number of sequences in a batch is not fixed. A batch can even contain zero sequences, for example if all the sequences initially included in the batch have been filtered out at some stage of their processing.
### The `Pipable` functions
A function consuming an `obiiter.IBioSequence` and returning an `obiiter.IBioSequence` is of class `obiiter.Pipable`.
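The value of such a function type is that steps sharing the same signature can be chained freely. The sketch below illustrates this with a toy iterator over plain strings. Everything in it is an assumption made for the example: the local `IBioSequence` type only imitates the `Next`/`Get` interface described above, while the real iterator works on batches of `BioSequence`, and `pipe`, `upper` and `minLength` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// IBioSequence is a toy, slice-backed stand-in for obiiter.IBioSequence.
type IBioSequence struct {
	data []string
	pos  int
}

func NewIterator(seqs ...string) *IBioSequence {
	return &IBioSequence{data: seqs, pos: -1}
}

func (it *IBioSequence) Next() bool  { it.pos++; return it.pos < len(it.data) }
func (it *IBioSequence) Get() string { return it.data[it.pos] }

// Pipable consumes one iterator and returns one iterator.
type Pipable func(*IBioSequence) *IBioSequence

// upper converts every sequence to upper case.
func upper(in *IBioSequence) *IBioSequence {
	out := []string{}
	for in.Next() {
		out = append(out, strings.ToUpper(in.Get()))
	}
	return NewIterator(out...)
}

// minLength builds a Pipable keeping sequences of at least n bases.
func minLength(n int) Pipable {
	return func(in *IBioSequence) *IBioSequence {
		out := []string{}
		for in.Next() {
			if s := in.Get(); len(s) >= n {
				out = append(out, s)
			}
		}
		return NewIterator(out...)
	}
}

// pipe chains Pipable functions from left to right.
func pipe(steps ...Pipable) Pipable {
	return func(in *IBioSequence) *IBioSequence {
		for _, step := range steps {
			in = step(in)
		}
		return in
	}
}

// collect drains an iterator into a slice.
func collect(it *IBioSequence) []string {
	out := []string{}
	for it.Next() {
		out = append(out, it.Get())
	}
	return out
}

func main() {
	pipeline := pipe(minLength(4), upper)
	fmt.Println(collect(pipeline(NewIterator("acgt", "ac", "ttgca"))))
}
```

Because `pipe` itself returns a `Pipable`, a whole pipeline can be reused as a single step of a larger one.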
### The `Teeable` functions
A function consuming an `obiiter.IBioSequence` and returning two `obiiter.IBioSequence` instances is of class `obiiter.Teeable`.
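A typical use of such a function is to split a flow of sequences in two, for example the sequences that pass a test and those that fail it. The sketch below shows the signature on a toy iterator over plain strings. As before, the local `IBioSequence` type and the `splitByLength` and `collect` helpers are assumptions made for the example and not part of the `obiiter` package.

```go
package main

import "fmt"

// IBioSequence is a toy, slice-backed stand-in for obiiter.IBioSequence.
type IBioSequence struct {
	data []string
	pos  int
}

func NewIterator(seqs ...string) *IBioSequence {
	return &IBioSequence{data: seqs, pos: -1}
}

func (it *IBioSequence) Next() bool  { it.pos++; return it.pos < len(it.data) }
func (it *IBioSequence) Get() string { return it.data[it.pos] }

// Teeable consumes one iterator and returns two iterators.
type Teeable func(*IBioSequence) (*IBioSequence, *IBioSequence)

// splitByLength builds a Teeable sending the sequences of at least n bases
// to the first output and all the others to the second output.
func splitByLength(n int) Teeable {
	return func(in *IBioSequence) (*IBioSequence, *IBioSequence) {
		long, short := []string{}, []string{}
		for in.Next() {
			if s := in.Get(); len(s) >= n {
				long = append(long, s)
			} else {
				short = append(short, s)
			}
		}
		return NewIterator(long...), NewIterator(short...)
	}
}

// collect drains an iterator into a slice.
func collect(it *IBioSequence) []string {
	out := []string{}
	for it.Next() {
		out = append(out, it.Get())
	}
	return out
}

func main() {
	long, short := splitByLength(4)(NewIterator("acgt", "ac", "ttgca"))
	fmt.Println("kept:", collect(long))
	fmt.Println("rejected:", collect(short))
}
```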