<h1 class="quarto-secondary-nav-title"><span class="chapter-number">12</span> <span class="chapter-title">Sequence sampling and filtering</span></h1>
</div>
</li>
<liclass="sidebar-item">
<divclass="sidebar-item-container">
<li><a href="#obigrep-filters-sequence-files-according-to-numerous-conditions" id="toc-obigrep-filters-sequence-files-according-to-numerous-conditions" class="nav-link active" data-scroll-target="#obigrep-filters-sequence-files-according-to-numerous-conditions"><span class="toc-section-number">12.1</span> <code>obigrep</code> – filters sequence files according to numerous conditions</a>
<ul class="collapse">
<li><a href="#the-options-usable-with-obigrep" id="toc-the-options-usable-with-obigrep" class="nav-link" data-scroll-target="#the-options-usable-with-obigrep"><span class="toc-section-number">12.1.1</span> The options usable with <code>obigrep</code></a></li>
<h2 data-number="12.1" class="anchored" data-anchor-id="obigrep-filters-sequence-files-according-to-numerous-conditions"><span class="header-section-number">12.1</span> <code>obigrep</code> – filters sequence files according to numerous conditions</h2>
<p>The <code>obigrep</code> command is loosely analogous to the standard Unix <code>grep</code> command: it selects a subset of sequence records from a sequence file. A sequence record is a composite object consisting of an identifier, a set of attributes (each a named key associated with a value), a definition, and the sequence itself. Whereas Unix <code>grep</code> works line by line, <code>obigrep</code> makes its selection sequence record by sequence record. A large number of options let you refine the selection on any element of the record. Several conditions can be specified at once; each evaluates to <code>TRUE</code> or <code>FALSE</code>, and only the sequence records that meet all of them are selected. <code>obigrep</code> can also work on two paired read files. In that case the selection criteria apply to one or the other of the reads in each pair, depending on the mode chosen with the <strong>--paired-mode</strong> option. In all cases the selection is applied identically to both files, so they remain consistent with each other.</p>
<h3 data-number="12.1.1" class="anchored" data-anchor-id="the-options-usable-with-obigrep"><span class="header-section-number">12.1.1</span> The options usable with <code>obigrep</code></h3>
<h4 data-number="12.1.1.1" class="anchored" data-anchor-id="selecting-sequences-based-on-their-caracteristics"><span class="header-section-number">12.1.1.1</span> Selecting sequences based on their characteristics</h4>
<p>Sequences can be selected on several of their characteristics: their length, their id, or their sequence. Options specify the selection condition.</p>
<p><strong>Selection based on the sequence</strong></p>
<p>Sequence records can be selected according to whether or not they match a pattern. The simplest pattern is a short sequence (<em>e.g.</em> <code>AACCTT</code>), but regular expressions allow more complex patterns. For example, <code>A[TG]C+G</code> matches an <code>A</code>, followed by a <code>T</code> or a <code>G</code>, then one or more <code>C</code>, and finally a <code>G</code>.</p>
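<p>The behaviour of the example pattern can be checked outside of <code>obigrep</code> with a small shell sketch. It uses the standard <code>grep -E</code> command as a stand-in, which is an assumption: the extended regular expression dialect of <code>grep</code> is not identical to the one used by <code>obigrep</code>, but both agree on a simple pattern like this one. The <code>matches</code> helper and the test sequences are hypothetical and serve only as an illustration.</p>

```shell
# matches PATTERN SEQUENCE
# Succeeds (exit status 0) when SEQUENCE contains a match for PATTERN.
# -E: extended regular expressions, -i: case insensitive, -q: no output.
matches() {
  printf '%s\n' "$2" | grep -Eiq -- "$1"
}

# Pattern from the text: A, then T or G, then one or more C, then G.
pattern='A[TG]C+G'

# Hypothetical test sequences, not taken from any real data set.
for seq in ATCG AGCCCG ttatccgaa ACCG ATG; do
  if matches "$pattern" "$seq"; then
    echo "$seq: match"
  else
    echo "$seq: no match"
  fi
done
```

<p>Here <code>ACCG</code> fails because the <code>A</code> is not followed by a <code>T</code> or a <code>G</code>, and <code>ATG</code> fails because <code>C+</code> requires at least one <code>C</code>.</p>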
<p>Regular expression pattern to be tested against the sequence itself. The pattern is case insensitive. A complete description of the regular expression grammar is available <a href="https://yourbasic.org/golang/regexp-cheat-sheet/#cheat-sheet">here</a>.</p>
</dd>
<dt><em>Examples:</em></dt>
<dd>
<p>Selects only the sequence records that contain an <em>EcoRI</em> restriction site.</p>
</dd>
</dl>
<div class="sourceCode" id="cb1"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="ex">obigrep</span> <span class="at">-s</span> <span class="st">'GAATTC'</span> seq1.fasta <span class="op">&gt;</span> seq2.fasta</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Selects only the sequence records that contain a stretch of at least 10 <code>A</code>.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="ex">obigrep</span> <span class="at">-s</span> <span class="st">'A{10,}'</span> seq1.fasta <span class="op">&gt;</span> seq2.fasta</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Selects only the sequence records that do not contain ambiguous nucleotides.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="ex">obigrep</span> <span class="at">-s</span> <span class="st">'^[ACGT]+$'</span> seq1.fasta <span class="op">&gt;</span> seq2.fasta</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<p>Only sequences representing at least <em>COUNT</em> reads are selected. This option relies on the <code>count</code> attribute. If the <code>count</code> attribute is not defined for a sequence record, it is assumed to be equal to <span class="math inline">\(1\)</span>.</p>
<p>Only sequences representing no more than <em>COUNT</em> reads are selected. This option relies on the <code>count</code> attribute. If the <code>count</code> attribute is not defined for a sequence record, it is assumed to be equal to <span class="math inline">\(1\)</span>.</p>
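<p>The rule that a missing <code>count</code> attribute is treated as <span class="math inline">\(1\)</span> can be illustrated with a minimal shell sketch. It does not call <code>obigrep</code> and does not reproduce its implementation: the <code>keep_by_count</code> function, the tab-separated input layout (identifier, then an optional count) and the sample records are all assumptions made for the illustration.</p>

```shell
# keep_by_count MIN MAX
# Reads "id<TAB>count" lines on standard input (the count may be empty or
# absent) and prints the ids whose count lies between MIN and MAX inclusive.
keep_by_count() {
  awk -F '\t' -v min="$1" -v max="$2" '
    {
      # A record without a count is treated as representing one read.
      count = ($2 == "") ? 1 : $2 + 0
      if (count >= min && count <= max) print $1
    }'
}

# Hypothetical records: seq2 has no count and is therefore counted as 1.
printf 'seq1\t5\nseq2\t\nseq3\t12\n' | keep_by_count 2 10
```

<p>With a minimum of 2 and a maximum of 10, only <code>seq1</code> is kept: <code>seq2</code> falls below the minimum because its count defaults to 1, and <code>seq3</code> exceeds the maximum.</p>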