Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00


<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="../kmer/" rel="prev"/>
<link href="../pipeline/" rel="next"/>
<link href="../../assets/images/favicon.png" rel="icon"/>
<meta content="mkdocs-1.6.1, mkdocs-material-9.7.6" name="generator"/>
<title>Chunk reader - obikmer</title>
<link href="../../assets/stylesheets/main.484c7ddc.min.css" rel="stylesheet"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&amp;display=fallback" rel="stylesheet"/>
<style>:root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}</style>
<script>__md_scope=new URL("../..",location),__md_hash=e=>[...e].reduce(((e,_)=>(e<<5)-e+_.charCodeAt(0)),0),__md_get=(e,_=localStorage,t=__md_scope)=>JSON.parse(_.getItem(t.pathname+"."+e)),__md_set=(e,_,t=localStorage,a=__md_scope)=>{try{t.setItem(a.pathname+"."+e,JSON.stringify(_))}catch(e){}}</script>
</head>
<body dir="ltr">
<input autocomplete="off" class="md-toggle" data-md-toggle="drawer" id="__drawer" type="checkbox"/>
<input autocomplete="off" class="md-toggle" data-md-toggle="search" id="__search" type="checkbox"/>
<label class="md-overlay" for="__drawer"></label>
<div data-md-component="skip">
<a class="md-skip" href="#chunk-reader-implementation">
Skip to content
</a>
</div>
<div data-md-component="announce">
</div>
<header class="md-header md-header--shadow" data-md-component="header">
<nav aria-label="Header" class="md-header__inner md-grid">
<a aria-label="obikmer" class="md-header__button md-logo" data-md-component="logo" href="../.." title="obikmer">
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"></path></svg>
</a>
<label class="md-header__button md-icon" for="__drawer">
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M3 6h18v2H3zm0 5h18v2H3zm0 5h18v2H3z"></path></svg>
</label>
<div class="md-header__title" data-md-component="header-title">
<div class="md-header__ellipsis">
<div class="md-header__topic">
<span class="md-ellipsis">
obikmer
</span>
</div>
<div class="md-header__topic" data-md-component="header-topic">
<span class="md-ellipsis">
Chunk reader
</span>
</div>
</div>
</div>
<script>var palette=__md_get("__palette");if(palette&&palette.color){if("(prefers-color-scheme)"===palette.color.media){var media=matchMedia("(prefers-color-scheme: light)"),input=document.querySelector(media.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']");palette.color.media=input.getAttribute("data-md-color-media"),palette.color.scheme=input.getAttribute("data-md-color-scheme"),palette.color.primary=input.getAttribute("data-md-color-primary"),palette.color.accent=input.getAttribute("data-md-color-accent")}for(var[key,value]of Object.entries(palette.color))document.body.setAttribute("data-md-color-"+key,value)}</script>
</nav>
</header>
<div class="md-container" data-md-component="container">
<main class="md-main" data-md-component="main">
<div class="md-main__inner md-grid">
<div class="md-sidebar md-sidebar--primary" data-md-component="sidebar" data-md-type="navigation">
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav aria-label="Navigation" class="md-nav md-nav--primary" data-md-level="0">
<label class="md-nav__title" for="__drawer">
<a aria-label="obikmer" class="md-nav__button md-logo" data-md-component="logo" href="../.." title="obikmer">
<svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"></path></svg>
</a>
obikmer
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../..">
<span class="md-ellipsis">
Home
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle" id="__nav_2" type="checkbox"/>
<label class="md-nav__link" for="__nav_2" id="__nav_2_label" tabindex="0">
<span class="md-ellipsis">
Theory
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav aria-expanded="false" aria-labelledby="__nav_2_label" class="md-nav" data-md-level="1">
<label class="md-nav__title" for="__nav_2">
<span class="md-nav__icon md-icon"></span>
Theory
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../../kmers/">
<span class="md-ellipsis">
Kmers and super-kmers
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/encoding/">
<span class="md-ellipsis">
DNA encoding
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/entropy/">
<span class="md-ellipsis">
Entropy filter
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/minimizer/">
<span class="md-ellipsis">
Minimizer selection
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../../theory/indexing/">
<span class="md-ellipsis">
Partitioning architecture
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--active md-nav__item--nested">
<input checked="" class="md-nav__toggle md-toggle" id="__nav_3" type="checkbox"/>
<label class="md-nav__link" for="__nav_3" id="__nav_3_label" tabindex="0">
<span class="md-ellipsis">
Implementation
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav aria-expanded="true" aria-labelledby="__nav_3_label" class="md-nav" data-md-level="1">
<label class="md-nav__title" for="__nav_3">
<span class="md-nav__icon md-icon"></span>
Implementation
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../superkmer/">
<span class="md-ellipsis">
SuperKmer
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../kmer/">
<span class="md-ellipsis">
Kmer
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--active">
<input class="md-nav__toggle md-toggle" id="__toc" type="checkbox"/>
<label class="md-nav__link md-nav__link--active" for="__toc">
<span class="md-ellipsis">
Chunk reader
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<a class="md-nav__link md-nav__link--active" href="./">
<span class="md-ellipsis">
Chunk reader
</span>
</a>
<nav aria-label="Table of contents" class="md-nav md-nav--secondary">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: rope
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#allocation-policy">
<span class="md-ellipsis">
Allocation policy
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#seqchunkiter">
<span class="md-ellipsis">
SeqChunkIter
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fasta">
<span class="md-ellipsis">
Boundary detection — FASTA
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fastq">
<span class="md-ellipsis">
Boundary detection — FASTQ
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../pipeline/">
<span class="md-ellipsis">
Construction pipeline
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../obipipeline/">
<span class="md-ellipsis">
obipipeline library
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../storage/">
<span class="md-ellipsis">
On-disk storage
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../mphf/">
<span class="md-ellipsis">
MPHF selection
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../unitig_evidence/">
<span class="md-ellipsis">
Unitig evidence encoding
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle" id="__nav_4" type="checkbox"/>
<label class="md-nav__link" for="__nav_4" id="__nav_4_label" tabindex="0">
<span class="md-ellipsis">
Architecture
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav aria-expanded="false" aria-labelledby="__nav_4_label" class="md-nav" data-md-level="1">
<label class="md-nav__title" for="__nav_4">
<span class="md-nav__icon md-icon"></span>
Architecture
</label>
<ul class="md-nav__list" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="../../architecture/sequences/invariant/">
<span class="md-ellipsis">
Sequences
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-sidebar md-sidebar--secondary" data-md-component="sidebar" data-md-type="toc">
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav aria-label="Table of contents" class="md-nav md-nav--secondary">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: rope
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#allocation-policy">
<span class="md-ellipsis">
Allocation policy
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#seqchunkiter">
<span class="md-ellipsis">
SeqChunkIter
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fasta">
<span class="md-ellipsis">
Boundary detection — FASTA
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#boundary-detection-fastq">
<span class="md-ellipsis">
Boundary detection — FASTQ
</span>
</a>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-content" data-md-component="content">
<article class="md-content__inner md-typeset">
<h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
<p>The <code>obiread</code> crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.</p>
<h2 id="output-type-rope">Output type: rope</h2>
<p>Each chunk is a <code>Vec&lt;Bytes&gt;</code> — a <strong>rope</strong>: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.</p>
<p>Because the slices are <code>bytes::Bytes</code>, splitting at a record boundary is O(1): <code>Bytes::split_to(n)</code> increments a reference count and adjusts the bounds of the view without copying any bytes, so the common case involves no <code>memcpy</code>.</p>
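<p>As a rough illustration of why such a split is cheap, the sketch below builds a std-only stand-in for <code>bytes::Bytes</code>. The <code>SharedSlice</code> type and its fields are hypothetical names used only for this example; the real crate is more elaborate, but the idea of sharing one buffer and moving two indices is the same.</p>
<div class="highlight"><pre><span></span><code>use std::sync::Arc;

/// Std-only stand-in for `bytes::Bytes` (hypothetical type, for illustration):
/// a shared, reference-counted buffer plus a window into it.
#[derive(Clone)]
struct SharedSlice {
    buf: Arc&lt;[u8]&gt;,
    start: usize,
    end: usize,
}

impl SharedSlice {
    /// Copies `data` once into a shared buffer; later splits never copy.
    fn new(data: &amp;[u8]) -&gt; Self {
        SharedSlice { buf: Arc::from(data), start: 0, end: data.len() }
    }

    fn as_bytes(&amp;self) -&gt; &amp;[u8] {
        &amp;self.buf[self.start..self.end]
    }

    /// Returns the first `n` bytes as a new view; `self` keeps the rest.
    /// Only the reference count and two indices change.
    fn split_to(&amp;mut self, n: usize) -&gt; SharedSlice {
        assert!(n &lt;= self.end - self.start, "split point out of range");
        let head = SharedSlice {
            buf: Arc::clone(&amp;self.buf),
            start: self.start,
            end: self.start + n,
        };
        self.start += n;
        head
    }
}
</code></pre></div>
<p>Splitting an 8-byte head off a block leaves both views pointing at the same allocation; only the reference count goes from 1 to 2.</p>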
<h2 id="allocation-policy">Allocation policy</h2>
<table>
<thead>
<tr>
<th>Case</th>
<th>Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>Boundary found in the current block (common)</td>
<td>zero extra allocation — <code>split_to</code> only</td>
</tr>
<tr>
<td>Boundary straddles multiple blocks (sequence &gt; block size, rare)</td>
<td>one allocation to pack the rope into a flat buffer</td>
</tr>
<tr>
<td>EOF flush</td>
<td>zero extra allocation</td>
</tr>
</tbody>
</table>
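<p>The one-allocation case in the table can be pictured as follows. This is a minimal sketch that uses <code>Vec&lt;u8&gt;</code> slices in place of <code>Bytes</code>; <code>pack_rope</code> is a hypothetical helper name, not necessarily the function used in <code>obiread</code>.</p>
<div class="highlight"><pre><span></span><code>/// Concatenate the slices of a rope into one contiguous buffer.
/// The total length is computed first so that the buffer is allocated once.
fn pack_rope(rope: &amp;[Vec&lt;u8&gt;]) -&gt; Vec&lt;u8&gt; {
    let total: usize = rope.iter().map(|slice| slice.len()).sum();
    let mut flat = Vec::with_capacity(total);
    for slice in rope {
        flat.extend_from_slice(slice);
    }
    flat
}
</code></pre></div>
<p>Once the rope is flat, a record that spans several blocks can be searched for its boundary as a single slice.</p>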
<h2 id="seqchunkiter">SeqChunkIter</h2>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>
<span class="k">impl</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Bytes</span><span class="o">&gt;&gt;</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
</code></pre></div>
<p><code>next()</code> loop:</p>
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto rope
2. probe check: if the boundary marker ("\n&gt;" or "\n@") is absent from the
   last block, skip the splitter (avoids a full backward scan for nothing)
3. call splitter on last block
   if found at offset n:
       head = last_block.split_to(n)          ← O(1), zero copy; last_block keeps the remainder
       return std::mem::take(&amp;mut self.rope)  ← the chunk, ending with head; the remainder starts the next rope
4. if rope.len() &gt; 1 (multi-block accumulation):
       pack rope into one flat buffer         ← one alloc
       retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
</code></pre></div>
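<p>The loop above can be sketched with the standard library alone. The version below is a deliberately simplified assumption: it keeps pending bytes in a single <code>Vec&lt;u8&gt;</code> (so it copies where the real reader would not), omits the probe check, handles FASTA only, and treats <code>\n&gt;</code> as the boundary. <code>FastaChunker</code> and <code>last_record_start</code> are illustrative names.</p>
<div class="highlight"><pre><span></span><code>use std::io::{self, Read};

/// Simplified chunker: every yielded chunk ends right before a "\n&gt;" record start.
struct FastaChunker&lt;R: Read&gt; {
    source: R,
    block_size: usize,
    pending: Vec&lt;u8&gt;, // bytes read but not yet emitted
    eof: bool,
}

impl&lt;R: Read&gt; FastaChunker&lt;R&gt; {
    fn new(source: R, block_size: usize) -&gt; Self {
        FastaChunker { source, block_size, pending: Vec::new(), eof: false }
    }
}

/// Offset of the last '&gt;' that directly follows a '\n', if any.
fn last_record_start(buf: &amp;[u8]) -&gt; Option&lt;usize&gt; {
    (1..buf.len()).rev().find(|&amp;i| buf[i] == b'&gt;' &amp;&amp; buf[i - 1] == b'\n')
}

impl&lt;R: Read&gt; Iterator for FastaChunker&lt;R&gt; {
    type Item = io::Result&lt;Vec&lt;u8&gt;&gt;;

    fn next(&amp;mut self) -&gt; Option&lt;Self::Item&gt; {
        while !self.eof {
            // Step 1: read one block and append it to the pending bytes.
            let mut block = vec![0u8; self.block_size];
            let n = match self.source.read(&amp;mut block) {
                Ok(n) =&gt; n,
                Err(e) =&gt; return Some(Err(e)),
            };
            if n == 0 {
                self.eof = true;
                break;
            }
            self.pending.extend_from_slice(&amp;block[..n]);
            // Step 3: cut at the last record start, keep the tail for later.
            if let Some(cut) = last_record_start(&amp;self.pending) {
                let remainder = self.pending.split_off(cut);
                return Some(Ok(std::mem::replace(&amp;mut self.pending, remainder)));
            }
            // Step 4: no boundary yet, keep accumulating blocks.
        }
        // Step 5: at EOF, flush whatever is left as the final chunk.
        if self.pending.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&amp;mut self.pending)))
        }
    }
}
</code></pre></div>
<p>Fed a three-record FASTA input through a reader with a 10-byte block size, this sketch yields three chunks, each starting with <code>&gt;</code>, whose concatenation reproduces the input byte for byte.</p>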
<h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
<p>The splitter scans the block backward with a 2-state machine, looking for a <code>&gt;</code> immediately preceded by <code>\n</code> or <code>\r</code>:</p>
<pre class="mermaid"><code>stateDiagram-v2
direction LR
[*] --&gt; Scanning
Scanning --&gt; FoundGt : '&gt;'
FoundGt --&gt; Scanning : other
FoundGt --&gt; [*] : '\\n' / '\\r' ✓</code></pre>
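<p>Returns the byte offset of the last <code>&gt;</code> that opens a record. Everything before that offset is a run of complete records; the record starting there may be truncated and is left for the next chunk.</p>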
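<p>A minimal sketch of such a backward scanner, assuming the block is available as one contiguous byte slice. The function and enum names are illustrative. One small deviation from the diagram: a second <code>&gt;</code> seen while in <code>FoundGt</code> is kept as the new candidate instead of falling back to <code>Scanning</code>.</p>
<div class="highlight"><pre><span></span><code>#[derive(Clone, Copy)]
enum State {
    Scanning,
    FoundGt,
}

/// Scan `buf` from right to left and return the offset of the last '&gt;'
/// that is immediately preceded by '\n' or '\r'.
fn last_fasta_record_start(buf: &amp;[u8]) -&gt; Option&lt;usize&gt; {
    let mut state = State::Scanning;
    let mut candidate = 0; // offset of the most recent '&gt;' seen
    for i in (0..buf.len()).rev() {
        state = match (state, buf[i]) {
            // The byte to the left of a '&gt;' is a line ending: accept.
            (State::FoundGt, b'\n' | b'\r') =&gt; return Some(candidate),
            // Any '&gt;' becomes the current candidate.
            (_, b'&gt;') =&gt; {
                candidate = i;
                State::FoundGt
            }
            // Anything else: keep (or resume) scanning.
            _ =&gt; State::Scanning,
        };
    }
    None
}
</code></pre></div>
<p>A <code>&gt;</code> at offset 0 is never returned, since nothing precedes it; in that situation the caller has no earlier complete record to cut off anyway.</p>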
<h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
<p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
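<p>To make the problem concrete, the following sketch assumes the common Phred+33 quality encoding (an assumption: the offset is not stated above, but it is the one under which <code>@</code> maps to score 31) and shows a naive "newline followed by <code>@</code>" search reporting a quality line as a record start. Both function names are hypothetical.</p>
<div class="highlight"><pre><span></span><code>/// Phred+33 decoding: score = ASCII code minus 33 (None for bytes below 33).
fn phred33(c: u8) -&gt; Option&lt;u8&gt; {
    c.checked_sub(33)
}

/// Naive heuristic: every '@' that follows a '\n' is taken as a record start.
fn naive_record_starts(buf: &amp;[u8]) -&gt; Vec&lt;usize&gt; {
    (1..buf.len())
        .filter(|&amp;i| buf[i] == b'@' &amp;&amp; buf[i - 1] == b'\n')
        .collect()
}
</code></pre></div>
<p>For a first record whose quality line is <code>@III</code>, followed by a second record <code>@r2</code>, the naive search reports offsets 11 and 16. Offset 11 is the quality line of the first record, not a record start.</p>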
<p>The scanner is a 7-state machine (a port of the Go function <code>EndOfLastFastqEntry</code>) that reads the buffer from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
<pre class="mermaid"><code>stateDiagram-v2
direction LR
[*] --&gt; Scanning
Scanning --&gt; FoundPlus : '+' (save restart)
FoundPlus --&gt; AfterNlPlus : '\\n' / '\\r'
FoundPlus --&gt; Scanning : other → backtrack
AfterNlPlus --&gt; AfterNlPlus : separator
AfterNlPlus --&gt; InSequence : letter / - / . / [ / ]
AfterNlPlus --&gt; Scanning : other → backtrack
InSequence --&gt; AfterSequence : '\\n' / '\\r'
InSequence --&gt; InSequence : letter / - / . / [ / ]
InSequence --&gt; Scanning : other → backtrack
AfterSequence --&gt; AfterSequence : '\\n' / '\\r'
AfterSequence --&gt; InHeader : other
InHeader --&gt; FoundAt : '@' (save cut)
InHeader --&gt; Scanning : '\\n' / '\\r' → backtrack
InHeader --&gt; InHeader : other
FoundAt --&gt; [*] : '\\n' / '\\r' ✓
FoundAt --&gt; InHeader : other</code></pre>
<p>When a state does not receive its expected input, the scan jumps back to <code>restart</code> and resumes in <code>Scanning</code> from there. An <code>@</code> in a quality line therefore cannot be accepted as a record start: to be accepted it would have to be followed by a sequence line and then a <code>+</code> line, whereas a quality line is followed by the header of the next record.</p>
<p>Returns the byte offset of the last <code>@</code> that passes this check. Everything before that offset is a run of complete records; the record starting there may have a truncated quality line and is left for the next chunk.</p>
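<p>The diagram can be turned into code fairly directly. The sketch below is one possible reading of it, not the crate's actual source: the names are illustrative, "separator" is assumed to mean a line ending, space, or tab, and "letter" is assumed to mean an ASCII letter.</p>
<div class="highlight"><pre><span></span><code>#[derive(Clone, Copy)]
enum State { Scanning, FoundPlus, AfterNlPlus, InSequence, AfterSequence, InHeader, FoundAt }

fn is_eol(c: u8) -&gt; bool { c == b'\n' || c == b'\r' }
fn is_sep(c: u8) -&gt; bool { is_eol(c) || c == b' ' || c == b'\t' }
fn is_seq(c: u8) -&gt; bool { c.is_ascii_alphabetic() || matches!(c, b'-' | b'.' | b'[' | b']') }

/// Right-to-left scan. Returns the offset of the last '@' that is preceded by
/// a line ending and followed by a header, a sequence line and a '+' line.
fn last_fastq_record_start(buf: &amp;[u8]) -&gt; Option&lt;usize&gt; {
    let mut state = State::Scanning;
    let mut restart = buf.len(); // offset of the most recent '+' candidate
    let mut cut = 0;             // offset of the most recent '@' candidate
    let mut i = buf.len();
    while i &gt; 0 {
        i -= 1;
        let c = buf[i];
        // `None` means "mismatch": go back to `restart` and scan on from there.
        let next = match state {
            State::Scanning =&gt; {
                if c == b'+' { restart = i; Some(State::FoundPlus) } else { Some(State::Scanning) }
            }
            State::FoundPlus =&gt; {
                if is_eol(c) { Some(State::AfterNlPlus) } else { None }
            }
            State::AfterNlPlus =&gt; {
                if is_sep(c) { Some(State::AfterNlPlus) }
                else if is_seq(c) { Some(State::InSequence) }
                else { None }
            }
            State::InSequence =&gt; {
                if is_eol(c) { Some(State::AfterSequence) }
                else if is_seq(c) { Some(State::InSequence) }
                else { None }
            }
            State::AfterSequence =&gt; {
                if is_eol(c) { Some(State::AfterSequence) } else { Some(State::InHeader) }
            }
            State::InHeader =&gt; {
                if c == b'@' { cut = i; Some(State::FoundAt) }
                else if is_eol(c) { None }
                else { Some(State::InHeader) }
            }
            State::FoundAt =&gt; {
                if is_eol(c) { return Some(cut) } else { Some(State::InHeader) }
            }
        };
        match next {
            Some(s) =&gt; state = s,
            // The next iteration resumes at the byte just left of the '+'.
            None =&gt; { state = State::Scanning; i = restart; }
        }
    }
    None
}
</code></pre></div>
<p>For a buffer holding one full record followed by a second record whose quality line starts with <code>@</code> and is cut short, this function returns the offset of the second record's header, so the chunk ends right after the first record. When the second record is cut before its <code>+</code> line, it returns <code>None</code> and the caller has to read more data.</p>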
</article>
</div>
<script>var target=document.getElementById(location.hash.slice(1));target&&target.name&&(target.checked=target.name.startsWith("__tabbed_"))</script>
</div>
</main>
<footer class="md-footer">
<div class="md-footer-meta md-typeset">
<div class="md-footer-meta__inner md-grid">
<div class="md-copyright">
Made with
<a href="https://squidfunk.github.io/mkdocs-material/" rel="noopener" target="_blank">
Material for MkDocs
</a>
</div>
</div>
</div>
</footer>
</div>
<div class="md-dialog" data-md-component="dialog">
<div class="md-dialog__inner md-typeset"></div>
</div>
<script id="__config" type="application/json">{"annotate": null, "base": "../..", "features": [], "search": "../../assets/javascripts/workers/search.2c215733.min.js", "tags": null, "translations": {"clipboard.copied": "Copied to clipboard", "clipboard.copy": "Copy to clipboard", "search.result.more.one": "1 more on this page", "search.result.more.other": "# more on this page", "search.result.none": "No matching documents", "search.result.one": "1 matching document", "search.result.other": "# matching documents", "search.result.placeholder": "Type to start searching", "search.result.term.missing": "Missing", "select.version": "Select version"}, "version": null}</script>
<script src="../../assets/javascripts/bundle.79ae519e.min.js"></script>
<script src="https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js"></script>
</body>
</html>