docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
Eric Coissac
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
+101 -38
@@ -243,19 +243,28 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<a class="md-nav__link" href="#two-reading-paths">
<span class="md-ellipsis">
Output type: rope
Two reading paths
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#allocation-policy">
<a class="md-nav__link" href="#record-path-chunk-reader">
<span class="md-ellipsis">
Allocation policy
Record path: chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: Rope
</span>
</a>
@@ -347,6 +356,18 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../evidence_elimination/">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
@@ -383,6 +404,30 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../merge/">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="../rebuild_filter/">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
@@ -454,19 +499,28 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<a class="md-nav__link" href="#two-reading-paths">
<span class="md-ellipsis">
Output type: rope
Two reading paths
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#allocation-policy">
<a class="md-nav__link" href="#record-path-chunk-reader">
<span class="md-ellipsis">
Allocation policy
Record path: chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a class="md-nav__link" href="#output-type-rope">
<span class="md-ellipsis">
Output type: Rope
</span>
</a>
@@ -506,68 +560,77 @@
<div class="md-content" data-md-component="content">
<article class="md-content__inner md-typeset">
<h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
<p>The <code>obiread</code> crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.</p>
<h2 id="output-type-rope">Output type: rope</h2>
<p>Each chunk is a <code>Vec&lt;Bytes&gt;</code> — a <strong>rope</strong>: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.</p>
<p>Using <code>bytes::Bytes</code> means the split at the record boundary is O(1): <code>Bytes::split_to(n)</code> adjusts a reference counter, not memory. No <code>memcpy</code> in the common case.</p>
<h2 id="allocation-policy">Allocation policy</h2>
<p><code>obiread</code> exposes two distinct sequence reading paths, each optimised for a different use case.</p>
<h2 id="two-reading-paths">Two reading paths</h2>
<table>
<thead>
<tr>
<th>Case</th>
<th>Cost</th>
<th>Path</th>
<th>API</th>
<th>Output unit</th>
<th>Per-record identity</th>
<th>Use case</th>
</tr>
</thead>
<tbody>
<tr>
<td>Boundary found in the current block (common)</td>
<td>zero extra allocation — <code>split_to</code> only</td>
<td><strong>Record path</strong></td>
<td><code>read_sequence_chunks</code>, then <code>parse_chunk</code></td>
<td><code>SeqRecord</code> (id + raw sequence + normalised rope)</td>
<td>yes</td>
<td><code>query</code> — must read complete records</td>
</tr>
<tr>
<td>Boundary straddles multiple blocks (sequence &gt; block size, rare)</td>
<td>one allocation to pack the rope into a flat buffer</td>
</tr>
<tr>
<td>EOF flush</td>
<td>zero extra allocation</td>
<td><strong>Stream path</strong></td>
<td><code>open_nuc_stream</code></td>
<td><code>NucPage</code> (flat normalised byte buffer)</td>
<td>no</td>
<td><code>index</code>, <code>superkmer</code> — bulk throughput</td>
</tr>
</tbody>
</table>
<p>The record path uses <code>Rope</code>-backed chunks and is described in detail below.
The stream path (<code>NucStream</code> / <code>NucPage</code>) is described in the scatter section of <a href="../pipeline/">pipeline</a>.</p>
<hr/>
<h2 id="record-path-chunk-reader">Record path: chunk reader</h2>
<p>The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. <code>parse_chunk</code> then converts each chunk into a <code>Vec&lt;SeqRecord&gt;</code>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.</p>
<p>This path is mandatory for <code>query</code>, where each superkmer must be traced back to its originating sequence (id, kmer offset) for output annotation.</p>
<h2 id="output-type-rope">Output type: Rope</h2>
<p>Each chunk is a <code>Rope</code> — a segmented byte sequence: a <code>Vec</code> of blocks, where each block is a <code>Vec&lt;Cell&lt;u8&gt;&gt;</code>. The consumer iterates over the blocks via a forward or backward cursor.</p>
<p><code>Rope::split_off(pos)</code> splits at an absolute byte offset in O(log n) (binary search over block-start index). If <code>pos</code> falls inside a block, that block is split in two via <code>Vec::split_off</code> — no <code>memcpy</code> in the common case.</p>
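<p>The split described above can be illustrated with a minimal sketch. It assumes a simplified layout: <code>SketchRope</code>, <code>push_block</code> and <code>to_vec</code> are hypothetical names, blocks hold plain <code>u8</code> rather than <code>Cell&lt;u8&gt;</code>, and this is not the crate's actual <code>Rope</code>. Only the block lookup is logarithmic here; rebuilding the tail index is linear in the number of moved blocks.</p>

```rust
/// Hypothetical segmented byte sequence, for illustration only.
pub struct SketchRope {
    blocks: Vec<Vec<u8>>,
    /// starts[i] is the absolute offset of the first byte of blocks[i].
    starts: Vec<usize>,
    len: usize,
}

impl SketchRope {
    pub fn new() -> Self {
        SketchRope { blocks: Vec::new(), starts: Vec::new(), len: 0 }
    }

    /// Append one block; empty blocks are ignored so every start is unique.
    pub fn push_block(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Keep bytes [0, pos) in `self` and return bytes [pos, len) as a new rope.
    pub fn split_off(&mut self, pos: usize) -> SketchRope {
        assert!(pos <= self.len, "split position out of bounds");
        // Binary search: number of blocks whose start offset is <= pos.
        let idx = self.starts.partition_point(|&start| start <= pos);
        // Whole blocks that start after pos move to the tail untouched.
        let mut tail_blocks = self.blocks.split_off(idx);
        self.starts.truncate(idx);
        if idx > 0 {
            // The last kept block may contain pos: split that one block.
            let local = pos - self.starts[idx - 1];
            if local < self.blocks[idx - 1].len() {
                let right = self.blocks[idx - 1].split_off(local);
                tail_blocks.insert(0, right);
            }
            // A split at a block boundary leaves an empty block behind.
            if self.blocks[idx - 1].is_empty() {
                self.blocks.pop();
                self.starts.pop();
            }
        }
        self.len = pos;
        let mut tail = SketchRope::new();
        for block in tail_blocks {
            tail.push_block(block);
        }
        tail
    }

    /// Flatten the rope into one contiguous buffer (copies every byte).
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }
}
```

<p>Splitting a rope made of the blocks <code>ACGT</code> and <code>TTGA</code> at offset 6 touches only the second block and leaves <code>ACGTTT</code> in the original and <code>GA</code> in the returned tail.</p>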
<h2 id="seqchunkiter">SeqChunkIter</h2>
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>
<span class="k">impl</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Bytes</span><span class="o">&gt;&gt;</span><span class="p">;</span>
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="n">Rope</span><span class="o">&gt;</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
</code></pre></div>
<p><code>next()</code> loop:</p>
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto rope
2. probe check: if the boundary marker ("\n&gt;" or "\n@") is absent from the
last block, skip the splitter (avoids a full backward scan for nothing)
3. call splitter on last block
if found at offset n:
remainder = last_block.split_to(n) ← O(1), zero copy
return std::mem::take(&amp;mut self.rope) ← the chunk
4. if rope.len() &gt; 1 (multi-block accumulation):
pack rope into one flat buffer ← one alloc
retry splitter on flat buffer
5. if EOF: flush remaining rope as final chunk
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto Rope
2. call splitter(rope) → Option&lt;abs_offset&gt;
if Some(pos):
tail = rope.split_off(pos) ← O(log n), may split one block
chunk = mem::replace(&amp;mut rope, tail)
return Some(Ok(chunk))
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
4. if EOF and rope empty: return None
</code></pre></div>
<p>The <code>Splitter</code> function signature is <code>fn(&amp;Rope) -&gt; Option&lt;usize&gt;</code>. It returns the absolute byte offset of the start of the last complete record, or <code>None</code> if no boundary was found in the accumulated rope, in which case another block must be read before retrying.</p>
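<p>The <code>next()</code> loop above can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: a flat <code>Vec&lt;u8&gt;</code> stands in for <code>Rope</code>, and <code>SketchChunkIter</code>, <code>SketchSplitter</code> and <code>last_header_start</code> are hypothetical names. The toy splitter only looks for the last <code>"\n&gt;"</code> pair.</p>

```rust
use std::io::{self, Read};

/// Hypothetical splitter type: works on a flat buffer instead of a Rope.
pub type SketchSplitter = fn(&[u8]) -> Option<usize>;

/// Toy splitter: offset of the last '>' that directly follows a '\n'.
pub fn last_header_start(buf: &[u8]) -> Option<usize> {
    buf.windows(2)
        .rposition(|w| w[0] == b'\n' && w[1] == b'>')
        .map(|i| i + 1)
}

pub struct SketchChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>,
    block_size: usize,
    splitter: SketchSplitter,
    eof: bool,
}

impl<R: Read> SketchChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: SketchSplitter) -> Self {
        SketchChunkIter { source, buf: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for SketchChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if self.eof {
                // Steps 3 and 4: flush what is left, then stop.
                if self.buf.is_empty() {
                    return None;
                }
                return Some(Ok(std::mem::take(&mut self.buf)));
            }
            // Step 1: read up to one block and append it to the accumulator.
            let old = self.buf.len();
            self.buf.resize(old + self.block_size, 0);
            match self.source.read(&mut self.buf[old..]) {
                Ok(0) => {
                    self.buf.truncate(old);
                    self.eof = true;
                    continue;
                }
                Ok(n) => self.buf.truncate(old + n),
                Err(e) if e.kind() == io::ErrorKind::Interrupted => {
                    self.buf.truncate(old);
                    continue;
                }
                Err(e) => {
                    self.buf.truncate(old);
                    return Some(Err(e));
                }
            }
            // Step 2: ask the splitter for the start of the last record.
            if let Some(pos) = (self.splitter)(&self.buf).filter(|&p| p > 0) {
                let tail = self.buf.split_off(pos);
                let chunk = std::mem::replace(&mut self.buf, tail);
                return Some(Ok(chunk));
            }
            // No boundary yet: loop and read another block.
        }
    }
}
```

<p>With this sketch every yielded chunk starts on a record header, and concatenating the chunks reproduces the input byte for byte.</p>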
<h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
<p>Backward scan with a 2-state machine. Searches for <code>&gt;</code> immediately preceded by <code>\n</code> or <code>\r</code>:</p>
<p>Backward scan with a 2-state machine. Scanning right to left, it looks for a <code>&gt;</code> whose next scanned byte is <code>\n</code> or <code>\r</code>, i.e. a <code>&gt;</code> immediately preceded by a line break in forward order:</p>
<pre class="mermaid"><code>stateDiagram-v2
direction LR
[*] --&gt; Scanning
Scanning --&gt; FoundGt : '&gt;'
FoundGt --&gt; Scanning : other
FoundGt --&gt; [*] : '\\n' / '\\r' ✓</code></pre>
<p>Returns the byte offset of the <code>&gt;</code> that starts the last complete record.</p>
<p>Returns the byte offset of the <code>&gt;</code> that starts the last complete record. Returns <code>None</code> if only one <code>&gt;</code> is found (cannot confirm there is a prior complete record).</p>
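<p>A minimal sketch of this backward scan, assuming a flat byte slice instead of a <code>Rope</code>; <code>last_fasta_boundary</code> is a hypothetical name. One deliberate deviation from the diagram: a second <code>&gt;</code> seen while in <code>FoundGt</code> keeps the machine in <code>FoundGt</code> rather than returning to <code>Scanning</code>.</p>

```rust
/// Return the offset of the last '>' that is directly preceded by '\n' or
/// '\r', scanning from the end of the buffer toward its start.
pub fn last_fasta_boundary(buf: &[u8]) -> Option<usize> {
    // false = Scanning, true = FoundGt (the byte at i + 1 is a '>').
    let mut found_gt = false;
    for i in (0..buf.len()).rev() {
        let byte = buf[i];
        if found_gt {
            if byte == b'\n' || byte == b'\r' {
                // The '>' sits right after this line break.
                return Some(i + 1);
            }
            // Assumption of this sketch: stay in FoundGt on another '>',
            // otherwise fall back to Scanning.
            found_gt = byte == b'>';
        } else if byte == b'>' {
            found_gt = true;
        }
    }
    // A '>' at offset 0 has no preceding line break, so it never matches.
    None
}
```

<p>A buffer holding a single record whose <code>&gt;</code> sits at offset 0 therefore yields <code>None</code>, which matches the rule that one header alone cannot confirm a prior complete record.</p>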
<h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
<p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
<p>7-state machine (port of Go's <code>EndOfLastFastqEntry</code>), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
<p>7-state machine (states 0–6), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
<pre class="mermaid"><code>stateDiagram-v2
direction LR