docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
@@ -243,19 +243,28 @@
    </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
 <li class="md-nav__item">
-<a class="md-nav__link" href="#output-type-rope">
+<a class="md-nav__link" href="#two-reading-paths">
 <span class="md-ellipsis">
      
-        Output type: rope
+        Two reading paths
      
    </span>
 </a>
 </li>
 <li class="md-nav__item">
-<a class="md-nav__link" href="#allocation-policy">
+<a class="md-nav__link" href="#record-path-chunk-reader">
 <span class="md-ellipsis">
      
-        Allocation policy
+        Record path: chunk reader
+      
+    </span>
+</a>
+</li>
+<li class="md-nav__item">
+<a class="md-nav__link" href="#output-type-rope">
+<span class="md-ellipsis">
+      
+        Output type: Rope
      
    </span>
 </a>
@@ -347,6 +356,18 @@
  

    
+  </span>
+</a>
+</li>
+<li class="md-nav__item">
+<a class="md-nav__link" href="../evidence_elimination/">
+<span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
  </span>
 </a>
 </li>
@@ -383,6 +404,30 @@
  

    
+  </span>
+</a>
+</li>
+<li class="md-nav__item">
+<a class="md-nav__link" href="../merge/">
+<span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+</a>
+</li>
+<li class="md-nav__item">
+<a class="md-nav__link" href="../rebuild_filter/">
+<span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
  </span>
 </a>
 </li>
@@ -454,19 +499,28 @@
    </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
 <li class="md-nav__item">
-<a class="md-nav__link" href="#output-type-rope">
+<a class="md-nav__link" href="#two-reading-paths">
 <span class="md-ellipsis">
      
-        Output type: rope
+        Two reading paths
      
    </span>
 </a>
 </li>
 <li class="md-nav__item">
-<a class="md-nav__link" href="#allocation-policy">
+<a class="md-nav__link" href="#record-path-chunk-reader">
 <span class="md-ellipsis">
      
-        Allocation policy
+        Record path: chunk reader
+      
+    </span>
+</a>
+</li>
+<li class="md-nav__item">
+<a class="md-nav__link" href="#output-type-rope">
+<span class="md-ellipsis">
+      
+        Output type: Rope
      
    </span>
 </a>
@@ -506,68 +560,77 @@
 <div class="md-content" data-md-component="content">
 <article class="md-content__inner md-typeset">
 <h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
-<p>The <code>obiread</code> crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.</p>
-<h2 id="output-type-rope">Output type: rope</h2>
-<p>Each chunk is a <code>Vec&lt;Bytes&gt;</code> — a <strong>rope</strong>: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.</p>
-<p>Using <code>bytes::Bytes</code> means the split at the record boundary is O(1): <code>Bytes::split_to(n)</code> adjusts a reference counter, not memory. No <code>memcpy</code> in the common case.</p>
-<h2 id="allocation-policy">Allocation policy</h2>
+<p><code>obiread</code> exposes two distinct sequence reading paths, each optimised for a different use case.</p>
+<h2 id="two-reading-paths">Two reading paths</h2>
 <table>
 <thead>
 <tr>
-<th>Case</th>
-<th>Cost</th>
+<th>Path</th>
+<th>API</th>
+<th>Output unit</th>
+<th>Per-record identity</th>
+<th>Use case</th>
 </tr>
 </thead>
 <tbody>
 <tr>
-<td>Boundary found in the current block (common)</td>
-<td>zero extra allocation — <code>split_to</code> only</td>
+<td><strong>Record path</strong></td>
+<td><code>read_sequence_chunks</code> → <code>parse_chunk</code></td>
+<td><code>SeqRecord</code> (id + raw sequence + normalised rope)</td>
+<td>yes</td>
+<td><code>query</code> — must read complete records</td>
 </tr>
 <tr>
-<td>Boundary straddles multiple blocks (sequence &gt; block size, rare)</td>
-<td>one allocation to pack the rope into a flat buffer</td>
-</tr>
-<tr>
-<td>EOF flush</td>
-<td>zero extra allocation</td>
+<td><strong>Stream path</strong></td>
+<td><code>open_nuc_stream</code></td>
+<td><code>NucPage</code> (flat normalised byte buffer)</td>
+<td>no</td>
+<td><code>index</code>, <code>superkmer</code> — bulk throughput</td>
 </tr>
 </tbody>
 </table>
+<p>The record path uses <code>Rope</code>-backed chunks and is described in detail below.
+The stream path (<code>NucStream</code> / <code>NucPage</code>) is described in the scatter section of <a href="../pipeline/">pipeline</a>.</p>
+<hr/>
+<h2 id="record-path-chunk-reader">Record path: chunk reader</h2>
+<p>The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. <code>parse_chunk</code> then converts each chunk into a <code>Vec&lt;SeqRecord&gt;</code>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.</p>
+<p>This path is mandatory for <code>query</code>, where each superkmer must be traced back to its originating sequence (id, kmer offset) for output annotation.</p>
+<h2 id="output-type-rope">Output type: Rope</h2>
+<p>Each chunk is a <code>Rope</code> — a segmented byte sequence: a <code>Vec</code> of blocks, where each block is a <code>Vec&lt;Cell&lt;u8&gt;&gt;</code>. The consumer iterates over the blocks via a forward or backward cursor.</p>
+<p><code>Rope::split_off(pos)</code> splits at an absolute byte offset in O(log n) (binary search over block-start index). If <code>pos</code> falls inside a block, that block is split in two via <code>Vec::split_off</code> — no <code>memcpy</code> in the common case.</p>
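<p>To make the split concrete, here is a minimal sketch of a segmented buffer with a binary-searched <code>split_off</code>. It is not the crate's <code>Rope</code>: the name <code>SketchRope</code>, its fields, and the helper methods are assumptions made for illustration, and blocks are plain <code>Vec&lt;u8&gt;</code> rather than <code>Vec&lt;Cell&lt;u8&gt;&gt;</code>.</p>

```rust
/// Hypothetical stand-in for the documented `Rope` (illustration only).
/// `starts[i]` is the absolute offset of the first byte of `blocks[i]`.
#[derive(Debug, Default)]
pub struct SketchRope {
    blocks: Vec<Vec<u8>>,
    starts: Vec<usize>,
    len: usize,
}

impl SketchRope {
    /// Appends a block at the end; empty blocks are ignored.
    pub fn push_block(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Flattens the rope; used here only to inspect the result.
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }

    /// Keeps `[0, pos)` in `self` and returns `[pos, len)`.
    pub fn split_off(&mut self, pos: usize) -> SketchRope {
        assert!(pos <= self.len, "split position out of bounds");
        // Binary search: index of the first block that starts at or after `pos`.
        let idx = self.starts.partition_point(|&s| s < pos);
        let mut tail_blocks = self.blocks.split_off(idx);
        self.starts.truncate(idx);
        // If `pos` falls strictly inside the last kept block, split that block.
        if let Some(last) = self.blocks.last_mut() {
            let inner = pos - self.starts[idx - 1];
            if inner < last.len() {
                tail_blocks.insert(0, last.split_off(inner));
            }
        }
        self.len = pos;
        // Rebuild the tail so that its offsets start again at zero.
        let mut tail = SketchRope::default();
        for block in tail_blocks {
            tail.push_block(block);
        }
        tail
    }
}
```

<p>In this sketch a split that lands exactly on a block boundary moves no sequence bytes. A split inside a block moves that block's tail into a new <code>Vec</code>, which is what <code>Vec::split_off</code> does.</p>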
 <h2 id="seqchunkiter">SeqChunkIter</h2>
 <div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>

 <span class="k">impl</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span>
-<span class="w">    </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Bytes</span><span class="o">&gt;&gt;</span><span class="p">;</span>
+<span class="w">    </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o">&lt;</span><span class="n">Rope</span><span class="o">&gt;</span><span class="p">;</span>
 <span class="p">}</span>

 <span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
 <span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o">&lt;</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">&gt;</span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-&gt;</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o">&lt;</span><span class="n">R</span><span class="o">&gt;</span>
 </code></pre></div>
 <p><code>next()</code> loop:</p>
-<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto rope
-2. probe check: if the boundary marker ("\n&gt;" or "\n@") is absent from the
-   last block, skip the splitter (avoids a full backward scan for nothing)
-3. call splitter on last block
-   if found at offset n:
-       remainder = last_block.split_to(n)    ← O(1), zero copy
-       return std::mem::take(&amp;mut self.rope)  ← the chunk
-4. if rope.len() &gt; 1 (multi-block accumulation):
-       pack rope into one flat buffer         ← one alloc
-       retry splitter on flat buffer
-5. if EOF: flush remaining rope as final chunk
+<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto Rope
+2. call splitter(rope) → Option&lt;abs_offset&gt;
+   if Some(pos):
+       tail = rope.split_off(pos)    ← O(log n), may split one block
+       chunk = mem::replace(&amp;mut rope, tail)
+       return Some(Ok(chunk))
+   if None: go back to step 1 (read another block)
+3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
+4. if EOF and rope empty: return None
 </code></pre></div>
+<p>A <code>Splitter</code> has the signature <code>fn(&amp;Rope) -&gt; Option&lt;usize&gt;</code>. It returns the absolute byte offset of the start of the last complete record, or <code>None</code> if no boundary was found in the accumulated rope, in which case the loop reads another block before trying again.</p>
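<p>The loop above can be sketched as a small iterator. This is an illustration under simplifying assumptions, not the crate's <code>SeqChunkIter</code>: a flat <code>Vec&lt;u8&gt;</code> stands in for the <code>Rope</code>, and <code>SketchChunkIter</code>, <code>SketchSplitter</code>, and the toy splitter <code>last_fasta_start</code> are hypothetical names.</p>

```rust
use std::io::{self, Read};

/// Hypothetical splitter type; the documented one takes `&Rope`.
pub type SketchSplitter = fn(&[u8]) -> Option<usize>;

/// Toy splitter: offset of the last '>' that directly follows a '\n'.
pub fn last_fasta_start(buf: &[u8]) -> Option<usize> {
    buf.windows(2)
        .rposition(|w| w[0] == b'\n' && w[1] == b'>')
        .map(|i| i + 1)
}

pub struct SketchChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>, // stand-in for the accumulated rope
    block_size: usize,
    splitter: SketchSplitter,
    eof: bool,
}

impl<R: Read> SketchChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: SketchSplitter) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { source, buf: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for SketchChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // Step 1: read one block and append it to the accumulated buffer.
            let mut block = vec![0u8; self.block_size];
            match self.source.read(&mut block) {
                Ok(0) => self.eof = true,
                Ok(n) => {
                    self.buf.extend_from_slice(&block[..n]);
                    // Step 2: ask the splitter for a cut position.
                    // The `pos > 0` guard is a defensive choice of this sketch:
                    // it avoids emitting an empty chunk.
                    if let Some(pos) = (self.splitter)(&self.buf) {
                        if pos > 0 {
                            let tail = self.buf.split_off(pos);
                            return Some(Ok(std::mem::replace(&mut self.buf, tail)));
                        }
                    }
                    // No boundary yet: loop and read another block.
                }
                Err(e) if e.kind() == io::ErrorKind::Interrupted => {}
                Err(e) => return Some(Err(e)),
            }
        }
        // Steps 3 and 4: flush what is left at EOF, then end the iteration.
        if self.buf.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&mut self.buf)))
        }
    }
}
```

<p>Concatenating the chunks yielded by this sketch gives back the input bytes unchanged, and every chunk except possibly the last ends just before a record header.</p>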
 <h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
-<p>Backward scan with a 2-state machine. Searches for <code>&gt;</code> immediately preceded by <code>\n</code> or <code>\r</code>:</p>
+<p>Backward scan with a 2-state machine. Scanning right to left, it looks for a <code>&gt;</code> and accepts it only if the next byte examined is <code>\n</code> or <code>\r</code>, that is, a <code>&gt;</code> that sits at the start of a line in forward order:</p>
 <pre class="mermaid"><code>stateDiagram-v2
    direction LR
    [*]      --&gt; Scanning
    Scanning --&gt; FoundGt  : '&gt;'
    FoundGt  --&gt; Scanning : other
    FoundGt  --&gt; [*]      : '\\n' / '\\r' ✓</code></pre>
-<p>Returns the byte offset of the <code>&gt;</code> that starts the last complete record.</p>
+<p>Returns the byte offset of the <code>&gt;</code> that starts the last complete record. Returns <code>None</code> if only one <code>&gt;</code> is found (cannot confirm there is a prior complete record).</p>
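<p>A minimal sketch of this backward scan over a flat byte slice follows. The function name <code>last_fasta_record_start</code> and the <code>State</code> enum are assumptions for illustration; the real splitter operates on a <code>Rope</code>. One deliberate deviation from the diagram: a <code>&gt;</code> seen while already in <code>FoundGt</code> becomes the new candidate instead of resetting to <code>Scanning</code>.</p>

```rust
/// States of the backward scan (hypothetical names mirroring the diagram).
#[derive(Clone, Copy)]
enum State {
    Scanning,
    /// A '>' was just seen at this offset; waiting for the byte before it.
    FoundGt(usize),
}

/// Scans `buf` from right to left and returns the offset of the last '>'
/// that is immediately preceded by '\n' or '\r'.
pub fn last_fasta_record_start(buf: &[u8]) -> Option<usize> {
    let mut state = State::Scanning;
    for (i, &b) in buf.iter().enumerate().rev() {
        state = match (state, b) {
            // The byte before the candidate '>' is a line ending: accept it.
            (State::FoundGt(gt), b'\n' | b'\r') => return Some(gt),
            // Any '>' becomes the current candidate.
            (_, b'>') => State::FoundGt(i),
            // Anything else discards the candidate.
            _ => State::Scanning,
        };
    }
    // A '>' at offset 0 has no preceding line ending, so it is not accepted.
    None
}
```

<p>A <code>&gt;</code> embedded in a header line is skipped because the byte before it is not a line ending, and a buffer holding a single record yields <code>None</code>.</p>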
 <h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
 <p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
-<p>7-state machine (port of Go's <code>EndOfLastFastqEntry</code>), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
+<p>7-state machine (states 0–6), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
 <pre class="mermaid"><code>stateDiagram-v2
    direction LR