Files
obikmer/doc/implementation/chunkreader/index.html
T
Eric Coissac bb7adc1154 docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 22:59:41 +02:00

689 lines
22 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="../kmer/" rel="prev"/>
<link href="../pipeline/" rel="next"/>
<link href="../../assets/images/favicon.png" rel="icon"/>
<meta content="mkdocs-1.6.1, mkdocs-material-9.7.6" name="generator"/>
<title>Chunk reader - obikmer</title>
<link href="../../assets/stylesheets/main.484c7ddc.min.css" rel="stylesheet"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&amp;display=fallback" rel="stylesheet"/>
<style>:root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}</style>
<script>__md_scope=new URL("../..",location),__md_hash=e=>[...e].reduce(((e,_)=>(e<<5)-e+_.charCodeAt(0)),0),__md_get=(e,_=localStorage,t=__md_scope)=>JSON.parse(_.getItem(t.pathname+"."+e)),__md_set=(e,_,t=localStorage,a=__md_scope)=>{try{t.setItem(a.pathname+"."+e,JSON.stringify(_))}catch(e){}}</script>
</head>
<body dir="ltr">
<input autocomplete="off" class="md-toggle" data-md-toggle="drawer" id="__drawer" type="checkbox"/>
<input autocomplete="off" class="md-toggle" data-md-toggle="search" id="__search" type="checkbox"/>
<label class="md-overlay" for="__drawer"></label>
<div data-md-component="skip">
<a class="md-skip" href="#chunk-reader-implementation">
Skip to content
</a>
</div>
<div data-md-component="announce">
</div>
<header class="md-header md-header--shadow" data-md-component="header">
<nav aria-label="Header" class="md-header__inner md-grid">
<a aria-label="obikmer" class="md-header__button md-logo" data-md-component="logo" href="../.." title="obikmer">
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"></path></svg>
</a>
<label class="md-header__button md-icon" for="__drawer">
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M3 6h18v2H3zm0 5h18v2H3zm0 5h18v2H3z"></path></svg>
</label>
<div class="md-header__title" data-md-component="header-title">
<div class="md-header__ellipsis">
<div class="md-header__topic">
<span class="md-ellipsis">
obikmer
</span>
</div>
<div class="md-header__topic" data-md-component="header-topic">
<span class="md-ellipsis">
Chunk reader
</span>
</div>
</div>
</div>
<script>var palette=__md_get("__palette");if(palette&&palette.color){if("(prefers-color-scheme)"===palette.color.media){var media=matchMedia("(prefers-color-scheme: light)"),input=document.querySelector(media.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']");palette.color.media=input.getAttribute("data-md-color-media"),palette.color.scheme=input.getAttribute("data-md-color-scheme"),palette.color.primary=input.getAttribute("data-md-color-primary"),palette.color.accent=input.getAttribute("data-md-color-accent")}for(var[key,value]of Object.entries(palette.color))document.body.setAttribute("data-md-color-"+key,value)}</script>
</nav>
</header>
<div class="md-container" data-md-component="container">
<main class="md-main" data-md-component="main">
<div class="md-main__inner md-grid">
<div class="md-sidebar md-sidebar--primary" data-md-component="sidebar" data-md-type="navigation">
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav aria-label="Navigation" class="md-nav md-nav--primary" data-md-level="0">
<label class="md-nav__title" for="__drawer">
<a aria-label="obikmer" class="md-nav__button md-logo" data-md-component="logo" href="../.." title="obikmer">
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"></path></svg>
</a>
obikmer
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../..">
<span class="md-ellipsis">
Home
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle" id="__nav_2" type="checkbox"/>
<label class="md-nav__link" for="__nav_2" id="__nav_2_label" tabindex="0">
<span class="md-ellipsis">
Theory
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav aria-expanded="false" aria-labelledby="__nav_2_label" class="md-nav" data-md-level="1">
<label class="md-nav__title" for="__nav_2">
<span class="md-nav__icon md-icon"></span>
Theory
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../../kmers/">
<span class="md-ellipsis">
Kmers and super-kmers
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/encoding/">
<span class="md-ellipsis">
DNA encoding
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/entropy/">
<span class="md-ellipsis">
Entropy filter
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/minimizer/">
<span class="md-ellipsis">
Minimizer selection
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/indexing/">
<span class="md-ellipsis">
Partitioning architecture
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--active md-nav__item--nested">
<input checked="" class="md-nav__toggle md-toggle" id="__nav_3" type="checkbox"/>
<label class="md-nav__link" for="__nav_3" id="__nav_3_label" tabindex="0">
<span class="md-ellipsis">
Implementation
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav aria-expanded="true" aria-labelledby="__nav_3_label" class="md-nav" data-md-level="1">
<label class="md-nav__title" for="__nav_3">
<span class="md-nav__icon md-icon"></span>
Implementation
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../superkmer/">
<span class="md-ellipsis">
SuperKmer
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../kmer/">
<span class="md-ellipsis">
Kmer
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--active">
<input class="md-nav__toggle md-toggle" id="__toc" type="checkbox"/>
<label class="md-nav__link md-nav__link--active" for="__toc">
<span class="md-ellipsis">
Chunk reader
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<a class="md-nav__link md-nav__link--active" href="./">
<span class="md-ellipsis">
Chunk reader
</span>
</a>
<nav aria-label="Table of contents" class="md-nav md-nav--secondary">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="#two-reading-paths">
<span class="md-ellipsis">
Two reading paths
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#record-path-chunk-reader">
<span class="md-ellipsis">
Record path: chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: Rope
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#seqchunkiter">
<span class="md-ellipsis">
SeqChunkIter
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fasta">
<span class="md-ellipsis">
Boundary detection — FASTA
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fastq">
<span class="md-ellipsis">
Boundary detection — FASTQ
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../pipeline/">
<span class="md-ellipsis">
Construction pipeline
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../obipipeline/">
<span class="md-ellipsis">
obipipeline library
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../storage/">
<span class="md-ellipsis">
On-disk storage
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../mphf/">
<span class="md-ellipsis">
MPHF selection
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../unitig_evidence/">
<span class="md-ellipsis">
Unitig evidence encoding
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../evidence_elimination/">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../obilayeredmap/">
<span class="md-ellipsis">
obilayeredmap crate
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../persistent_compact_int_vec/">
<span class="md-ellipsis">
PersistentCompactIntVec
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../persistent_bit_vec/">
<span class="md-ellipsis">
PersistentBitVec
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../merge/">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../rebuild_filter/">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle" id="__nav_4" type="checkbox"/>
<label class="md-nav__link" for="__nav_4" id="__nav_4_label" tabindex="0">
<span class="md-ellipsis">
Architecture
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav aria-expanded="false" aria-labelledby="__nav_4_label" class="md-nav" data-md-level="1">
<label class="md-nav__title" for="__nav_4">
<span class="md-nav__icon md-icon"></span>
Architecture
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../../architecture/sequences/invariant/">
<span class="md-ellipsis">
Sequences
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../architecture/index_architecture/">
<span class="md-ellipsis">
Kmer index
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-sidebar md-sidebar--secondary" data-md-component="sidebar" data-md-type="toc">
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav aria-label="Table of contents" class="md-nav md-nav--secondary">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="#two-reading-paths">
<span class="md-ellipsis">
Two reading paths
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#record-path-chunk-reader">
<span class="md-ellipsis">
Record path: chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: Rope
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#seqchunkiter">
<span class="md-ellipsis">
SeqChunkIter
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fasta">
<span class="md-ellipsis">
Boundary detection — FASTA
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fastq">
<span class="md-ellipsis">
Boundary detection — FASTQ
</span>
</a>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-content" data-md-component="content">
<article class="md-content__inner md-typeset">
<h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
<p><code>obiread</code> exposes two distinct sequence reading paths, each optimised for a different use case.</p>
<h2 id="two-reading-paths">Two reading paths</h2>
<table>
<thead>
<tr>
<th>Path</th>
<th>API</th>
<th>Output unit</th>
<th>Per-record identity</th>
<th>Use case</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Record path</strong></td>
<td><code>read_sequence_chunks</code><code>parse_chunk</code></td>
<td><code>SeqRecord</code> (id + raw sequence + normalised rope)</td>
<td>yes</td>
<td><code>query</code> — must read complete records</td>
</tr>
<tr>
<td><strong>Stream path</strong></td>
<td><code>open_nuc_stream</code></td>
<td><code>NucPage</code> (flat normalised byte buffer)</td>
<td>no</td>
<td><code>index</code>, <code>superkmer</code> — bulk throughput</td>
</tr>
</tbody>
</table>
<p>The record path uses <code>Rope</code>-backed chunks and is described in detail below.
The stream path (<code>NucStream</code> / <code>NucPage</code>) is described in the scatter section of <a href="../pipeline/">pipeline</a>.</p>
<hr/>
<h2 id="record-path-chunk-reader">Record path: chunk reader</h2>
<p>The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. <code>parse_chunk</code> then converts each chunk into a <code>Vec&lt;SeqRecord&gt;</code>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.</p>
<p>This path is mandatory for <code>query</code>, where superkmers must be tracked back to their originating sequence (id, kmer offset) for output annotation.</p>
<h2 id="output-type-rope">Output type: Rope</h2>
<p>Each chunk is a <code>Rope</code> — a segmented byte sequence: a <code>Vec</code> of blocks, where each block is a <code>Vec&lt;Cell&lt;u8&gt;&gt;</code>. The consumer iterates over the blocks via a forward or backward cursor.</p>
<p><code>Rope::split_off(pos)</code> splits at an absolute byte offset in O(log n) (binary search over block-start index). If <code>pos</code> falls inside a block, that block is split in two via <code>Vec::split_off</code> — no <code>memcpy</code> in the common case.</p>
<h2 id="seqchunkiter">SeqChunkIter</h2>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>
<span class="k">impl</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="n">Rope</span><span class="o">&gt;</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
</code></pre></div>
<p><code>next()</code> loop:</p>
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto Rope
2. call splitter(rope) → Option&lt;abs_offset&gt;
if Some(pos):
tail = rope.split_off(pos) ← O(log n), may split one block
chunk = mem::replace(&amp;mut rope, tail)
return Some(Ok(chunk))
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
4. if EOF and rope empty: return None
</code></pre></div>
<p>The <code>Splitter</code> function signature is <code>fn(&amp;Rope) -&gt; Option&lt;usize&gt;</code>. It returns the absolute byte offset of the start of the last complete record, or <code>None</code> if no boundary was found in the accumulated rope (need more data).</p>
<h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
<p>Backward scan with a 2-state machine. Searches (right to left) for <code>&gt;</code> followed by <code>\n</code> or <code>\r</code> (i.e., a <code>&gt;</code> that is preceded by a newline in forward order):</p>
<pre class="mermaid"><code>stateDiagram-v2
direction LR
[*] --&gt; Scanning
Scanning --&gt; FoundGt : '&gt;'
FoundGt --&gt; Scanning : other
FoundGt --&gt; [*] : '\\n' / '\\r' ✓</code></pre>
<p>Returns the byte offset of the <code>&gt;</code> that starts the last complete record. Returns <code>None</code> if only one <code>&gt;</code> is found (cannot confirm there is a prior complete record).</p>
<h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
<p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
<p>7-state machine (states 06), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
<pre class="mermaid"><code>stateDiagram-v2
direction LR
[*] --&gt; Scanning
Scanning --&gt; FoundPlus : '+' (save restart)
FoundPlus --&gt; AfterNlPlus : '\\n' / '\\r'
FoundPlus --&gt; Scanning : other → backtrack
AfterNlPlus --&gt; AfterNlPlus : séparateur
AfterNlPlus --&gt; InSequence : lettre / - / . / [ / ]
AfterNlPlus --&gt; Scanning : other → backtrack
InSequence --&gt; AfterSequence : '\\n' / '\\r'
InSequence --&gt; InSequence : lettre / - / . / [ / ]
InSequence --&gt; Scanning : other → backtrack
AfterSequence --&gt; AfterSequence : '\\n' / '\\r'
AfterSequence --&gt; InHeader : other
InHeader --&gt; FoundAt : '@' (save cut)
InHeader --&gt; Scanning : '\\n' / '\\r' → backtrack
InHeader --&gt; InHeader : other
FoundAt --&gt; [*] : '\\n' / '\\r' ✓
FoundAt --&gt; InHeader : other</code></pre>
<p><code>restart</code> is updated each time a <code>+</code> is found. When any state fails its expected input, the scan jumps back to <code>restart</code> and continues from there — guaranteeing that a <code>@</code> in a quality line cannot be accepted as a record start, because the <code>\n+\n</code> structure immediately following it (going backward) will not be found.</p>
<p>Returns the byte offset of the <code>@</code> that starts the last complete record.</p>
</article>
</div>
<script>var target=document.getElementById(location.hash.slice(1));target&&target.name&&(target.checked=target.name.startsWith("__tabbed_"))</script>
</div>
</main>
<footer class="md-footer">
<div class="md-footer-meta md-typeset">
<div class="md-footer-meta__inner md-grid">
<div class="md-copyright">
Made with
<a href="https://squidfunk.github.io/mkdocs-material/" rel="noopener" target="_blank">
Material for MkDocs
</a>
</div>
</div>
</div>
</footer>
</div>
<div class="md-dialog" data-md-component="dialog">
<div class="md-dialog__inner md-typeset"></div>
</div>
<script id="__config" type="application/json">{"annotate": null, "base": "../..", "features": [], "search": "../../assets/javascripts/workers/search.2c215733.min.js", "tags": null, "translations": {"clipboard.copied": "Copied to clipboard", "clipboard.copy": "Copy to clipboard", "search.result.more.one": "1 more on this page", "search.result.more.other": "# more on this page", "search.result.none": "No matching documents", "search.result.one": "1 matching document", "search.result.other": "# matching documents", "search.result.placeholder": "Type to start searching", "search.result.term.missing": "Missing", "select.version": "Select version"}, "version": null}</script>
<script src="../../assets/javascripts/bundle.79ae519e.min.js"></script>
<script src="https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js"></script>
</body>
</html>