docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
@@ -243,19 +243,28 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<a class="md-nav__link" href="#two-reading-paths">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: rope
|
||||
Two reading paths
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#allocation-policy">
|
||||
<a class="md-nav__link" href="#record-path-chunk-reader">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Allocation policy
|
||||
Record path: chunk reader
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: Rope
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -347,6 +356,18 @@
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../evidence_elimination/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
@@ -383,6 +404,30 @@
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../merge/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../rebuild_filter/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
@@ -454,19 +499,28 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<a class="md-nav__link" href="#two-reading-paths">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: rope
|
||||
Two reading paths
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#allocation-policy">
|
||||
<a class="md-nav__link" href="#record-path-chunk-reader">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Allocation policy
|
||||
Record path: chunk reader
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: Rope
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -506,68 +560,77 @@
|
||||
<div class="md-content" data-md-component="content">
|
||||
<article class="md-content__inner md-typeset">
|
||||
<h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
|
||||
<p>The <code>obiread</code> crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.</p>
|
||||
<h2 id="output-type-rope">Output type: rope</h2>
|
||||
<p>Each chunk is a <code>Vec<Bytes></code> — a <strong>rope</strong>: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.</p>
|
||||
<p>Using <code>bytes::Bytes</code> means the split at the record boundary is O(1): <code>Bytes::split_to(n)</code> adjusts a reference counter, not memory. No <code>memcpy</code> in the common case.</p>
|
||||
<h2 id="allocation-policy">Allocation policy</h2>
|
||||
<p><code>obiread</code> exposes two distinct sequence reading paths, each optimised for a different use case.</p>
|
||||
<h2 id="two-reading-paths">Two reading paths</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Case</th>
|
||||
<th>Cost</th>
|
||||
<th>Path</th>
|
||||
<th>API</th>
|
||||
<th>Output unit</th>
|
||||
<th>Per-record identity</th>
|
||||
<th>Use case</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>Boundary found in the current block (common)</td>
|
||||
<td>zero extra allocation — <code>split_to</code> only</td>
|
||||
<td><strong>Record path</strong></td>
|
||||
<td><code>read_sequence_chunks</code> → <code>parse_chunk</code></td>
|
||||
<td><code>SeqRecord</code> (id + raw sequence + normalised rope)</td>
|
||||
<td>yes</td>
|
||||
<td><code>query</code> — must read complete records</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Boundary straddles multiple blocks (sequence > block size, rare)</td>
|
||||
<td>one allocation to pack the rope into a flat buffer</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>EOF flush</td>
|
||||
<td>zero extra allocation</td>
|
||||
<td><strong>Stream path</strong></td>
|
||||
<td><code>open_nuc_stream</code></td>
|
||||
<td><code>NucPage</code> (flat normalised byte buffer)</td>
|
||||
<td>no</td>
|
||||
<td><code>index</code>, <code>superkmer</code> — bulk throughput</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The record path uses <code>Rope</code>-backed chunks and is described in detail below.
|
||||
The stream path (<code>NucStream</code> / <code>NucPage</code>) is described in the scatter section of <a href="../pipeline/">pipeline</a>.</p>
|
||||
<hr/>
|
||||
<h2 id="record-path-chunk-reader">Record path: chunk reader</h2>
|
||||
<p>The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. <code>parse_chunk</code> then converts each chunk into a <code>Vec<SeqRecord></code>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.</p>
|
||||
<p>This path is mandatory for <code>query</code>, where each superkmer must be traced back to its originating sequence (id, kmer offset) for output annotation.</p>
|
||||
<h2 id="output-type-rope">Output type: Rope</h2>
|
||||
<p>Each chunk is a <code>Rope</code> — a segmented byte sequence: a <code>Vec</code> of blocks, where each block is a <code>Vec<Cell<u8>></code>. The consumer iterates over the blocks via a forward or backward cursor.</p>
|
||||
<p><code>Rope::split_off(pos)</code> splits at an absolute byte offset in O(log n) (binary search over block-start index). If <code>pos</code> falls inside a block, that block is split in two via <code>Vec::split_off</code> — no <code>memcpy</code> in the common case.</p>
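<p>The split described above can be sketched as follows. This is a minimal illustration only: <code>MiniRope</code>, its fields, and its methods are hypothetical names, and blocks are plain <code>Vec<u8></code> rather than <code>Vec<Cell<u8>></code>. Only the block lookup is logarithmic here; moving the tail blocks is linear in their count.</p>

```rust
/// Hypothetical stand-in for the crate's `Rope`: a list of byte blocks
/// plus the absolute start offset of each block.
#[derive(Debug, Default)]
struct MiniRope {
    blocks: Vec<Vec<u8>>,
    starts: Vec<usize>, // starts[i] = absolute offset of blocks[i][0]
    len: usize,
}

impl MiniRope {
    fn push(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    /// Keep bytes [0, pos) in `self` and return bytes [pos, len).
    fn split_off(&mut self, pos: usize) -> MiniRope {
        assert!(pos <= self.len, "split position out of range");
        // Binary search: k = number of blocks starting strictly before pos.
        let k = self.starts.partition_point(|&s| s < pos);
        // Blocks k.. start at or after pos and move to the tail whole.
        let mut tail_blocks = self.blocks.split_off(k);
        self.starts.truncate(k);
        if k > 0 {
            let inner = pos - self.starts[k - 1];
            if inner < self.blocks[k - 1].len() {
                // pos falls inside block k-1: split that block in two.
                let right = self.blocks[k - 1].split_off(inner);
                tail_blocks.insert(0, right);
            }
        }
        self.len = pos;
        let mut tail = MiniRope::default();
        for b in tail_blocks {
            tail.push(b);
        }
        tail
    }

    fn to_vec(&self) -> Vec<u8> {
        self.blocks.iter().flatten().copied().collect()
    }
}

fn main() {
    let mut rope = MiniRope::default();
    for block in [&b"ACGT"[..], &b"TTGA"[..], &b"CC"[..]] {
        rope.push(block.to_vec());
    }
    let tail = rope.split_off(6); // offset 6 lies inside the second block
    println!("head = {:?}", String::from_utf8_lossy(&rope.to_vec()));
    println!("tail = {:?}", String::from_utf8_lossy(&tail.to_vec()));
}
```

<p>A split that lands exactly on a block boundary moves whole blocks and never touches block contents.</p>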
|
||||
<h2 id="seqchunkiter">SeqChunkIter</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>
|
||||
|
||||
<span class="k">impl</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="n">Bytes</span><span class="o">>></span><span class="p">;</span>
|
||||
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o"><</span><span class="n">Rope</span><span class="o">></span><span class="p">;</span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p><code>next()</code> loop:</p>
|
||||
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto rope
|
||||
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
|
||||
last block, skip the splitter (avoids a full backward scan for nothing)
|
||||
3. call splitter on last block
|
||||
if found at offset n:
|
||||
remainder = last_block.split_to(n) ← O(1), zero copy
|
||||
return std::mem::take(&mut self.rope) ← the chunk
|
||||
4. if rope.len() > 1 (multi-block accumulation):
|
||||
pack rope into one flat buffer ← one alloc
|
||||
retry splitter on flat buffer
|
||||
5. if EOF: flush remaining rope as final chunk
|
||||
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto Rope
|
||||
2. call splitter(rope) → Option<abs_offset>
|
||||
if Some(pos):
|
||||
tail = rope.split_off(pos) ← O(log n), may split one block
|
||||
chunk = mem::replace(&mut rope, tail)
|
||||
return Some(Ok(chunk))
|
||||
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
|
||||
4. if EOF and rope empty: return None
|
||||
</code></pre></div>
|
||||
<p>The <code>Splitter</code> function signature is <code>fn(&Rope) -> Option<usize></code>. It returns the absolute byte offset of the start of the last complete record, or <code>None</code> if no boundary was found in the accumulated rope, in which case the iterator reads another block and retries.</p>
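<p>As a rough sketch of the read, split, and flush loop, the iterator below follows the same steps over a flat <code>Vec<u8></code> instead of a <code>Rope</code>. <code>ChunkIter</code>, <code>last_record_start</code>, and the <code>fn(&[u8]) -> Option<usize></code> splitter type are hypothetical simplifications, not the crate's API. The guard against a zero offset is also an assumption, added so the sketch never yields an empty chunk.</p>

```rust
use std::io::{self, Read};

/// Simplified splitter type (the documented one takes `&Rope`).
type Splitter = fn(&[u8]) -> Option<usize>;

/// Hypothetical flat-buffer version of the chunk iterator.
struct ChunkIter<R: Read> {
    source: R,
    block_size: usize,
    buf: Vec<u8>,
    splitter: Splitter,
    eof: bool,
}

impl<R: Read> ChunkIter<R> {
    fn new(source: R, block_size: usize, splitter: Splitter) -> Self {
        ChunkIter { source, block_size, buf: Vec::new(), splitter, eof: false }
    }
}

impl<R: Read> Iterator for ChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if self.eof {
                // Steps 3 and 4: flush what is left, then stop.
                if self.buf.is_empty() {
                    return None;
                }
                return Some(Ok(std::mem::take(&mut self.buf)));
            }
            // Step 1: read one block and append it to the accumulated data.
            let mut block = vec![0u8; self.block_size];
            match self.source.read(&mut block) {
                Ok(0) => {
                    self.eof = true;
                    continue;
                }
                Ok(n) => self.buf.extend_from_slice(&block[..n]),
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
            // Step 2: ask the splitter for the start of the last record.
            if let Some(pos) = (self.splitter)(&self.buf) {
                if pos > 0 {
                    let tail = self.buf.split_off(pos);
                    return Some(Ok(std::mem::replace(&mut self.buf, tail)));
                }
            }
            // No boundary yet: loop and read another block.
        }
    }
}

/// Toy FASTA splitter: offset of the last '>' that follows a '\n'.
fn last_record_start(buf: &[u8]) -> Option<usize> {
    buf.windows(2).rposition(|w| w == b"\n>").map(|i| i + 1)
}

fn main() -> io::Result<()> {
    let input = b">a\nACGT\n>b\nTTGA\n>c\nCC\n";
    for chunk in ChunkIter::new(&input[..], 8, last_record_start) {
        println!("{:?}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```

<p>Whatever the block size, concatenating the yielded chunks reproduces the input, and every chunk starts at a record boundary.</p>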
|
||||
<h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
|
||||
<p>Backward scan with a 2-state machine. Searches for <code>></code> immediately preceded by <code>\n</code> or <code>\r</code>:</p>
|
||||
<p>Backward scan with a 2-state machine. Scanning right to left, it looks for a <code>></code> whose next byte in scan order is <code>\n</code> or <code>\r</code>, i.e. a <code>></code> immediately preceded by a line break in forward order:</p>
|
||||
<pre class="mermaid"><code>stateDiagram-v2
|
||||
direction LR
|
||||
[*] --> Scanning
|
||||
Scanning --> FoundGt : '>'
|
||||
FoundGt --> Scanning : other
|
||||
FoundGt --> [*] : '\\n' / '\\r' ✓</code></pre>
|
||||
<p>Returns the byte offset of the <code>></code> that starts the last complete record.</p>
|
||||
<p>Returns the byte offset of the <code>></code> that starts the last complete record. Returns <code>None</code> if only one <code>></code> is found, because a prior complete record cannot then be confirmed.</p>
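<p>A minimal sketch of this two-state backward scan, written over a flat byte slice rather than a <code>Rope</code>. The function name <code>last_fasta_record_start</code> is hypothetical. Treating a second <code>></code> seen while already in <code>FoundGt</code> as a new candidate is an assumption, since the diagram does not cover that case.</p>

```rust
/// Scan `buf` from right to left and return the offset of the last '>'
/// that is immediately preceded by '\n' or '\r'.
fn last_fasta_record_start(buf: &[u8]) -> Option<usize> {
    #[derive(Clone, Copy)]
    enum State {
        Scanning,
        FoundGt(usize), // offset of the candidate '>'
    }

    let mut state = State::Scanning;
    for (i, &b) in buf.iter().enumerate().rev() {
        state = match (state, b) {
            // Candidate '>' confirmed by a preceding line break.
            (State::FoundGt(gt), b'\n' | b'\r') => return Some(gt),
            // New candidate (assumption: also applies when already in FoundGt).
            (_, b'>') => State::FoundGt(i),
            // Anything else cancels the candidate.
            _ => State::Scanning,
        };
    }
    // A '>' at offset 0 has no preceding line break, so a buffer holding
    // a single record yields None.
    None
}

fn main() {
    println!("{:?}", last_fasta_record_start(b">a\nACGT\n>b\nTT"));
    println!("{:?}", last_fasta_record_start(b">a\nACGT\n"));
}
```

<p>A <code>></code> inside a header line is rejected because the byte before it is not a line break.</p>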
|
||||
<h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
|
||||
<p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, i.e. Phred score 31 under the common Phred+33 encoding) can legitimately appear in quality lines, including as their first character, so any forward heuristic is unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
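<p>To illustrate why a forward heuristic fails, the sketch below decodes <code>@</code> under Phred+33 and runs a naive "line starts with <code>@</code>" test on two records. Both function names and the sample data are made up for this example.</p>

```rust
/// Phred+33 decoding: score = ASCII code - 33.
/// Precondition: `q` is a printable quality byte (>= 33).
fn phred33(q: u8) -> u8 {
    q - 33
}

/// Naive forward heuristic: treat every line starting with '@' as a header.
/// Returns the 0-based indices of the lines it flags.
fn naive_header_lines(fastq: &str) -> Vec<usize> {
    fastq
        .lines()
        .enumerate()
        .filter(|(_, line)| line.starts_with('@'))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Two records; the first quality line happens to start with '@'.
    let fastq = "@read1\nACGT\n+\n@IIH\n@read2\nTTGA\n+\nIIII\n";
    println!("'@' decodes to Phred {}", phred33(b'@'));
    // Line 3 is a quality line, yet the heuristic flags it as a header.
    println!("flagged lines: {:?}", naive_header_lines(fastq));
}
```

<p>Only lines 0 and 4 are real headers; line 3 is a false positive that only the surrounding 4-line structure can rule out.</p>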
|
||||
<p>7-state machine (port of Go's <code>EndOfLastFastqEntry</code>), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
|
||||
<p>7-state machine (states 0–6), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
|
||||
<pre class="mermaid"><code>stateDiagram-v2
|
||||
direction LR
|
||||
|
||||
|
||||