docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
@@ -638,6 +638,34 @@
<li class="md-nav__item">
|
||||
<a href="/implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="/implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -716,6 +744,62 @@
<li class="md-nav__item">
|
||||
<a href="/implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="/implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -9,7 +9,7 @@
|
||||
|
||||
|
||||
|
||||
<link rel="prev" href="../../../implementation/persistent_bit_vec/">
|
||||
<link rel="prev" href="../../../implementation/rebuild_filter/">
|
||||
|
||||
|
||||
<link rel="next" href="../../index_architecture/">
|
||||
@@ -647,6 +647,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../../implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="../../../implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -725,6 +753,62 @@
<li class="md-nav__item">
|
||||
<a href="../../../implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="../../../implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -243,19 +243,28 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<a class="md-nav__link" href="#two-reading-paths">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: rope
|
||||
Two reading paths
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#allocation-policy">
|
||||
<a class="md-nav__link" href="#record-path-chunk-reader">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Allocation policy
|
||||
Record path: chunk reader
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: Rope
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -347,6 +356,18 @@
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../evidence_elimination/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
@@ -383,6 +404,30 @@
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../merge/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../rebuild_filter/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
@@ -454,19 +499,28 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<a class="md-nav__link" href="#two-reading-paths">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: rope
|
||||
Two reading paths
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#allocation-policy">
|
||||
<a class="md-nav__link" href="#record-path-chunk-reader">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Allocation policy
|
||||
Record path: chunk reader
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: Rope
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -506,68 +560,77 @@
|
||||
<div class="md-content" data-md-component="content">
|
||||
<article class="md-content__inner md-typeset">
|
||||
<h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
|
||||
<p>The <code>obiread</code> crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.</p>
|
||||
<h2 id="output-type-rope">Output type: rope</h2>
|
||||
<p>Each chunk is a <code>Vec<Bytes></code> — a <strong>rope</strong>: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.</p>
|
||||
<p>Using <code>bytes::Bytes</code> means the split at the record boundary is O(1): <code>Bytes::split_to(n)</code> adjusts a reference counter, not memory. No <code>memcpy</code> in the common case.</p>
|
||||
<h2 id="allocation-policy">Allocation policy</h2>
|
||||
<p><code>obiread</code> exposes two distinct sequence reading paths, each optimised for a different use case.</p>
|
||||
<h2 id="two-reading-paths">Two reading paths</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Case</th>
|
||||
<th>Cost</th>
|
||||
<th>Path</th>
|
||||
<th>API</th>
|
||||
<th>Output unit</th>
|
||||
<th>Per-record identity</th>
|
||||
<th>Use case</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>Boundary found in the current block (common)</td>
|
||||
<td>zero extra allocation — <code>split_to</code> only</td>
|
||||
<td><strong>Record path</strong></td>
|
||||
<td><code>read_sequence_chunks</code> → <code>parse_chunk</code></td>
|
||||
<td><code>SeqRecord</code> (id + raw sequence + normalised rope)</td>
|
||||
<td>yes</td>
|
||||
<td><code>query</code> — must read complete records</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Boundary straddles multiple blocks (sequence > block size, rare)</td>
|
||||
<td>one allocation to pack the rope into a flat buffer</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>EOF flush</td>
|
||||
<td>zero extra allocation</td>
|
||||
<td><strong>Stream path</strong></td>
|
||||
<td><code>open_nuc_stream</code></td>
|
||||
<td><code>NucPage</code> (flat normalised byte buffer)</td>
|
||||
<td>no</td>
|
||||
<td><code>index</code>, <code>superkmer</code> — bulk throughput</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The record path uses <code>Rope</code>-backed chunks and is described in detail below. The stream path (<code>NucStream</code> / <code>NucPage</code>) is described in the scatter section of <a href="../pipeline/">pipeline</a>.</p>
|
||||
<hr/>
|
||||
<h2 id="record-path-chunk-reader">Record path: chunk reader</h2>
|
||||
<p>The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. <code>parse_chunk</code> then converts each chunk into a <code>Vec<SeqRecord></code>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.</p>
|
||||
<p>This path is mandatory for <code>query</code>, where superkmers must be tracked back to their originating sequence (id, kmer offset) for output annotation.</p>
|
||||
<h2 id="output-type-rope">Output type: Rope</h2>
|
||||
<p>Each chunk is a <code>Rope</code> — a segmented byte sequence: a <code>Vec</code> of blocks, where each block is a <code>Vec<Cell<u8>></code>. The consumer iterates over the blocks via a forward or backward cursor.</p>
|
||||
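<p><code>Rope::split_off(pos)</code> splits at an absolute byte offset in O(log n) by binary search over the block-start index. If <code>pos</code> falls inside a block, only that block is split in two via <code>Vec::split_off</code>, which moves the tail of that one block into a new <code>Vec</code>; every other block changes owner without being copied, and a split that lands exactly on a block boundary copies nothing.</p>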
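<p>The split described above can be sketched as follows. This is a minimal illustration, not the crate's code: the type name <code>SketchRope</code>, its fields, and the use of plain <code>Vec&lt;u8&gt;</code> blocks (instead of <code>Vec&lt;Cell&lt;u8&gt;&gt;</code>) are assumptions made to keep the example self-contained.</p>

```rust
/// Hypothetical stand-in for `Rope`: byte blocks plus the absolute start
/// offset of each block. Blocks are `Vec<u8>` here for simplicity.
pub struct SketchRope {
    blocks: Vec<Vec<u8>>,
    starts: Vec<usize>, // starts[i] = absolute offset of blocks[i][0]
    len: usize,
}

impl SketchRope {
    pub fn new() -> Self {
        SketchRope { blocks: Vec::new(), starts: Vec::new(), len: 0 }
    }

    /// Appends a block; empty blocks are skipped so `starts` stays strictly increasing.
    pub fn push(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    /// Keeps bytes [0, pos) in `self` and returns a rope holding [pos, len).
    pub fn split_off(&mut self, pos: usize) -> SketchRope {
        assert!(pos <= self.len, "split position out of bounds");
        // Binary search: index of the first block whose start offset is >= pos.
        let idx = self.starts.partition_point(|&s| s < pos);
        let mut tail = SketchRope::new();
        // If pos lies strictly inside block idx-1, cut that single block in two.
        if idx > 0 {
            let inner = pos - self.starts[idx - 1];
            if inner < self.blocks[idx - 1].len() {
                let cut = self.blocks[idx - 1].split_off(inner);
                tail.push(cut);
            }
        }
        // Whole blocks after the split point move to the tail without copying bytes.
        for block in self.blocks.drain(idx..) {
            tail.push(block);
        }
        self.starts.truncate(idx);
        self.len = pos;
        tail
    }

    /// Flattens the rope; only used here to inspect the result.
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }
}
```

<p>Splitting a two-block rope at an offset inside the second block leaves the first block untouched and divides only the second one.</p>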
|
||||
<h2 id="seqchunkiter">SeqChunkIter</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>
|
||||
|
||||
<span class="k">impl</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="n">Bytes</span><span class="o">>></span><span class="p">;</span>
|
||||
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o"><</span><span class="n">Rope</span><span class="o">></span><span class="p">;</span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p><code>next()</code> loop:</p>
|
||||
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto rope
|
||||
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
|
||||
last block, skip the splitter (avoids a full backward scan for nothing)
|
||||
3. call splitter on last block
|
||||
if found at offset n:
|
||||
remainder = last_block.split_to(n) ← O(1), zero copy
|
||||
return std::mem::take(&mut self.rope) ← the chunk
|
||||
4. if rope.len() > 1 (multi-block accumulation):
|
||||
pack rope into one flat buffer ← one alloc
|
||||
retry splitter on flat buffer
|
||||
5. if EOF: flush remaining rope as final chunk
|
||||
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto Rope
|
||||
2. call splitter(rope) → Option<abs_offset>
|
||||
if Some(pos):
|
||||
tail = rope.split_off(pos) ← O(log n), may split one block
|
||||
chunk = mem::replace(&mut rope, tail)
|
||||
return Some(Ok(chunk))
|
||||
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
|
||||
4. if EOF and rope empty: return None
|
||||
</code></pre></div>
|
||||
<p>The <code>Splitter</code> function signature is <code>fn(&Rope) -> Option<usize></code>. It returns the absolute byte offset of the start of the last complete record, or <code>None</code> if no boundary was found in the accumulated rope (need more data).</p>
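<p>The loop and the splitter contract can be illustrated with a small sketch. It is written under simplifying assumptions: a flat <code>Vec&lt;u8&gt;</code> stands in for the <code>Rope</code>, and the names <code>SketchChunkIter</code>, <code>SketchSplitter</code> and <code>last_newline_gt</code> are hypothetical. The guard against a zero offset is also an addition of this sketch, so that an empty chunk is never emitted.</p>

```rust
use std::io::{self, Read};

/// Hypothetical splitter type over a flat buffer (the real one takes `&Rope`).
pub type SketchSplitter = fn(&[u8]) -> Option<usize>;

pub struct SketchChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>, // accumulated bytes not yet emitted
    block_size: usize,
    splitter: SketchSplitter,
    eof: bool,
}

impl<R: Read> SketchChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: SketchSplitter) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        SketchChunkIter { source, buf: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for SketchChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        while !self.eof {
            // 1. Read one block and append it to the accumulated buffer.
            let old = self.buf.len();
            self.buf.resize(old + self.block_size, 0);
            match self.source.read(&mut self.buf[old..]) {
                Ok(0) => {
                    self.buf.truncate(old);
                    self.eof = true;
                }
                Ok(n) => {
                    self.buf.truncate(old + n);
                    // 2. Ask the splitter for the offset of the trailing record.
                    if let Some(pos) = (self.splitter)(&self.buf).filter(|&p| p > 0) {
                        let tail = self.buf.split_off(pos);
                        let chunk = std::mem::replace(&mut self.buf, tail);
                        return Some(Ok(chunk));
                    }
                }
                Err(e) => {
                    self.buf.truncate(old);
                    return Some(Err(e));
                }
            }
        }
        // 3. and 4. At EOF, flush what is left, then stop.
        if self.buf.is_empty() {
            None
        } else {
            Some(Ok(std::mem::take(&mut self.buf)))
        }
    }
}

/// Toy splitter: offset of the last '>' that directly follows a '\n'.
pub fn last_newline_gt(buf: &[u8]) -> Option<usize> {
    buf.windows(2).rposition(|w| w == b"\n>").map(|i| i + 1)
}
```

<p>With an 8-byte block size, a three-record FASTA input comes out as two chunks: the first holds the two records that were complete when a boundary was found, the second is the EOF flush.</p>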
|
||||
<h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
|
||||
<p>Backward scan with a 2-state machine. Searches for <code>></code> immediately preceded by <code>\n</code> or <code>\r</code>:</p>
|
||||
<p>Backward scan with a 2-state machine. Scanning right to left, it looks for a <code>></code> and then checks that the next byte visited is <code>\n</code> or <code>\r</code>, i.e. that the <code>></code> is immediately preceded by a line ending in forward order:</p>
|
||||
<pre class="mermaid"><code>stateDiagram-v2
|
||||
direction LR
|
||||
[*] --> Scanning
|
||||
Scanning --> FoundGt : '>'
|
||||
FoundGt --> Scanning : other
|
||||
FoundGt --> [*] : '\\n' / '\\r' ✓</code></pre>
|
||||
<p>Returns the byte offset of the <code>></code> that starts the last complete record.</p>
|
||||
<p>Returns the byte offset of the <code>></code> that starts the last complete record. Returns <code>None</code> if only one <code>></code> is found (cannot confirm there is a prior complete record).</p>
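<p>A minimal sketch of this backward scan over a flat byte slice follows. The function name <code>last_fasta_record_start</code> is hypothetical, the real splitter operates on a <code>Rope</code>, and the handling of two consecutive <code>></code> characters (staying in the found state) is an assumption of the sketch.</p>

```rust
/// Scans `buf` from right to left and returns the offset of the last '>'
/// that is immediately preceded by '\n' or '\r'.
/// `gt_pos == None` is the Scanning state, `Some(p)` is the FoundGt state.
pub fn last_fasta_record_start(buf: &[u8]) -> Option<usize> {
    let mut gt_pos: Option<usize> = None;
    for i in (0..buf.len()).rev() {
        let b = buf[i];
        match gt_pos {
            // FoundGt and the byte before the '>' is a line ending: accept.
            Some(p) if b == b'\n' || b == b'\r' => return Some(p),
            // Otherwise enter FoundGt on '>' or fall back to Scanning.
            _ => gt_pos = if b == b'>' { Some(i) } else { None },
        }
    }
    // A '>' at offset 0 has no preceding line ending and is never accepted.
    None
}
```

<p>A <code>></code> inside a header line is ignored because the byte before it is not a line ending, and a buffer holding a single record yields <code>None</code>.</p>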
|
||||
<h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
|
||||
<p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, which encodes quality 31 under Phred+33) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
|
||||
<p>7-state machine (port of Go's <code>EndOfLastFastqEntry</code>), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
|
||||
<p>7-state machine (states 0–6), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
|
||||
<pre class="mermaid"><code>stateDiagram-v2
|
||||
direction LR
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -514,10 +514,21 @@
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#memory-layout" class="md-nav__link">
|
||||
<a href="#types-and-layout" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Memory layout
|
||||
Types and layout
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#global-parameters" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Global parameters
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -558,10 +569,32 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-form" class="md-nav__link">
|
||||
<a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical form
|
||||
Canonical form and CanonicalKmerOf
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#sliding-window-helpers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Sliding window helpers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#hashing" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Hashing
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -751,6 +784,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -829,6 +890,62 @@
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -973,10 +1090,21 @@
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#memory-layout" class="md-nav__link">
|
||||
<a href="#types-and-layout" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Memory layout
|
||||
Types and layout
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#global-parameters" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Global parameters
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1017,10 +1145,32 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-form" class="md-nav__link">
|
||||
<a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical form
|
||||
Canonical form and CanonicalKmerOf
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#sliding-window-helpers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Sliding window helpers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#hashing" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Hashing
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1045,12 +1195,43 @@
|
||||
|
||||
|
||||
<h1 id="kmer-implementation">Kmer — implementation</h1>
|
||||
<h2 id="memory-layout">Memory layout</h2>
|
||||
<p><code>Kmer</code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code>:</p>
|
||||
<h2 id="types-and-layout">Types and layout</h2>
|
||||
<p><code>KmerOf<L></code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code> parameterized by a <code>KmerLength</code> marker:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="cp">#[repr(transparent)]</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">Kmer</span><span class="p">(</span><span class="kt">u64</span><span class="p">);</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">KmerOf</span><span class="o"><</span><span class="n">L</span><span class="p">:</span><span class="w"> </span><span class="nc">KmerLength</span><span class="o">></span><span class="p">(</span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="n">PhantomData</span><span class="o"><</span><span class="n">L</span><span class="o">></span><span class="p">);</span>
|
||||
</code></pre></div>
|
||||
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is <strong>not stored</strong> — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.</p>
|
||||
<p>Three marker types implement <code>KmerLength</code>:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Marker</th>
|
||||
<th><code>len()</code> source</th>
|
||||
<th>Used for</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>KLen</code></td>
|
||||
<td><code>params::k()</code></td>
|
||||
<td>k-mers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>MLen</code></td>
|
||||
<td><code>params::m()</code></td>
|
||||
<td>minimizers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>ConstLen<N></code></td>
|
||||
<td>const generic <code>N</code></td>
|
||||
<td>tests</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>Public aliases:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Kmer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KmerOf</span><span class="o"><</span><span class="n">KLen</span><span class="o">></span><span class="p">;</span><span class="w"> </span><span class="c1">// k-mer, global k</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Minimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="o"><</span><span class="n">MLen</span><span class="o">></span><span class="p">;</span><span class="w"> </span><span class="c1">// canonical m-mer, global m</span>
|
||||
</code></pre></div>
|
||||
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is <strong>not stored</strong> — every operation reads it from <code>L::len()</code>.</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
@@ -1071,33 +1252,41 @@
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
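<p>The bit positions can be checked with two small helpers. This is a sketch only: the function names are hypothetical, and the 2-bit codes used in the example (A=0, C=1, G=2, T=3) are an assumption, since the <code>ENC</code> table is not shown here.</p>

```rust
/// Returns the 2-bit code of nucleotide `i` in a left-aligned packed word:
/// nucleotide i sits in bits 63-2i and 62-2i.
pub fn nucleotide_at(raw: u64, i: usize) -> u8 {
    debug_assert!(i < 32);
    ((raw >> (62 - 2 * i)) & 0b11) as u8
}

/// True when the low 64 - 2*len bits of `raw` are all zero.
pub fn is_left_aligned(raw: u64, len: usize) -> bool {
    // For len >= 32 there are no padding bits left to check.
    len >= 32 || (raw << (2 * len)) == 0
}
```

<p>For example, under the assumed codes the 4-mer ACGT packs to <code>0x1B</code> shifted left by 56 bits, and the low 56 bits stay zero.</p>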
|
||||
<h2 id="global-parameters">Global parameters</h2>
|
||||
<p><code>params::set_k(k)</code> / <code>params::k()</code> and <code>params::set_m(m)</code> / <code>params::m()</code> are backed by <code>OnceLock<usize></code> in production (write-once, panic on conflict) and by <code>thread_local! { Cell<usize> }</code> in test builds (per-thread, freely writable). <code>params::init(k, m)</code> sets both in one call.</p>
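<p>A write-once global of this kind can be sketched with <code>std::sync::OnceLock</code>. The sketch covers only the production variant for k. Treating a repeated call with the same value as harmless is an assumption, since the text only says that a conflict panics.</p>

```rust
use std::sync::OnceLock;

// Write-once storage for k (sketch of the production variant only).
static K: OnceLock<usize> = OnceLock::new();

/// Stores `k` on the first call; panics if a different value was already stored.
/// Assumption: setting the same value twice is accepted.
pub fn set_k(k: usize) {
    let stored = *K.get_or_init(|| k);
    assert_eq!(stored, k, "k is already set to a different value");
}

/// Reads k; panics if it was never set.
pub fn k() -> usize {
    *K.get().expect("k has not been initialised")
}
```

<p>Because the value can never change after the first write, readers need no locking beyond what <code>OnceLock</code> does internally.</p>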
|
||||
<h2 id="encoding">Encoding</h2>
|
||||
<p><code>Kmer::from_ascii(ascii, k)</code> encodes the first k bytes of an ASCII slice using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p>
|
||||
<p><code>KmerOf::<L>::from_ascii(ascii)</code> encodes the first <code>L::len()</code> bytes using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="o">..</span><span class="n">k</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">val</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">encode_base</span><span class="p">(</span><span class="n">ascii</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="p">;</span>
|
||||
<span class="p">}</span>
|
||||
<span class="n">Kmer</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span>
|
||||
<span class="n">KmerOf</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
</code></pre></div>
|
||||
<p>Zero allocation — result lives on the stack.</p>
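<p>Put together, the marker trait and the encoding loop might look like the sketch below. The trait body, the <code>raw()</code> accessor and the <code>encode_base</code> mapping (A=0, C=1, G=2, T=3, panicking on anything else) are assumptions made for illustration. The real code uses the shared <code>ENC</code> table.</p>

```rust
use std::marker::PhantomData;

/// Marker trait giving the k-mer length (assumed shape).
pub trait KmerLength {
    fn len() -> usize;
}

/// Compile-time length marker, as used for tests.
pub struct ConstLen<const N: usize>;

impl<const N: usize> KmerLength for ConstLen<N> {
    fn len() -> usize {
        N
    }
}

#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);

/// Hypothetical stand-in for the ENC table: A=0, C=1, G=2, T=3.
fn encode_base(b: u8) -> u64 {
    match b {
        b'A' | b'a' => 0,
        b'C' | b'c' => 1,
        b'G' | b'g' => 2,
        b'T' | b't' => 3,
        _ => panic!("unsupported base"),
    }
}

impl<L: KmerLength> KmerOf<L> {
    /// Encodes the first `L::len()` bytes, 2 bits each, left-aligned.
    pub fn from_ascii(ascii: &[u8]) -> Self {
        let k = L::len();
        assert!((1..=32).contains(&k) && ascii.len() >= k);
        let mut val = 0u64;
        for &b in &ascii[..k] {
            val = (val << 2) | encode_base(b);
        }
        KmerOf(val << (64 - 2 * k), PhantomData)
    }

    /// Raw packed word; accessor added only so the sketch can be inspected.
    pub fn raw(&self) -> u64 {
        self.0
    }
}
```

<p>The length never appears in the value itself: <code>KmerOf::&lt;ConstLen&lt;4&gt;&gt;::from_ascii(b"ACGT")</code> reads only four bytes even when the slice is longer.</p>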
|
||||
<h2 id="decoding">Decoding</h2>
|
||||
<p><code>write_ascii(k, buf)</code> appends k ASCII characters to a caller-supplied <code>Vec<u8></code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.</p>
|
||||
<p><code>to_ascii(k)</code> is a convenience wrapper that allocates and returns a <code>Vec<u8></code>; intended for tests and display only.</p>
|
||||
<p><code>write_ascii(writer)</code> writes <code>L::len()</code> ASCII characters to any <code>W: Write</code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.</p>
|
||||
<p><code>to_ascii()</code> is a convenience wrapper that allocates and returns a <code>Vec<u8></code>; intended for tests and display only.</p>
|
||||
<h2 id="reverse-complement">Reverse complement</h2>
|
||||
<p>Computed as pure arithmetic — no lookup table, no memory access:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">!</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="c1">// complement</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">swap_bytes</span><span class="p">();</span><span class="w"> </span><span class="c1">// reverse bytes</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">4</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap nibbles</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">2</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap 2-bit groups</span>
|
||||
<span class="n">Kmer</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span>
|
||||
<span class="n">KmerOf</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
</code></pre></div>
|
||||
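<p>After complementing, the bytes are reversed (<code>swap_bytes</code>), then the nibbles within each byte, then the 2-bit groups within each nibble. Together these three steps reverse the order of all 32 2-bit groups in the word, so the complemented nucleotides end up in reverse order in the low 2k bits. A final left shift by <code>64 - 2 * k</code> realigns the kmer to the MSB and shifts out the bits that came from the padding. No allocation is needed: the result lives on the stack.</p>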
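<p>The steps above can be sketched as a free function on a raw <code>u64</code>. This is a minimal sketch, not the crate's code: the <code>encode</code> helper, the free-function form, and the encoding A=00, C=01, G=10, T=11 (under which complement is a bitwise NOT) are assumptions made for illustration.</p>

```rust
/// Encode an ASCII DNA string (A/C/G/T, length 1..=32) as a left-aligned,
/// 2-bit-per-base u64. Hypothetical helper; assumed encoding A=00, C=01, G=10, T=11.
fn encode(seq: &str) -> u64 {
    let k = seq.len();
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    let mut x: u64 = 0;
    for b in seq.bytes() {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("unsupported base"),
        };
        x = (x << 2) | code;
    }
    // Move the 2k used bits to the top of the word.
    x << (64 - 2 * k)
}

/// Reverse complement of a left-aligned kmer of length k (1..=32).
fn revcomp(x: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    // Complement every base (this also flips the padding bits).
    let x = !x;
    // Reverse the order of the 32 two-bit groups: bytes, nibbles, then pairs.
    let x = x.swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The kmer now sits in the low 2k bits; realign it to the MSB,
    // which also shifts out the flipped padding.
    x << (64 - 2 * k)
}
```

<p>Under these assumptions, <code>revcomp(encode("AAAC"), 4)</code> equals <code>encode("GTTT")</code>, and applying <code>revcomp</code> twice returns the original value.</p>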
|
||||
<h2 id="canonical-form">Canonical form</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">Self</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">(</span><span class="n">k</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="o">*</span><span class="bp">self</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="p">}</span>
|
||||
<h2 id="canonical-form-and-canonicalkmerof">Canonical form and <code>CanonicalKmerOf</code></h2>
|
||||
<p><code>canonical()</code> returns a <code>CanonicalKmerOf<L></code> — a distinct newtype that carries the same <code>u64</code> layout but enforces the invariant that the stored value equals <code>min(kmer, revcomp)</code>:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">CanonicalKmerOf</span><span class="o"><</span><span class="n">L</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">();</span>
|
||||
<span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>The canonical form is the lexicographic minimum of the forward kmer and its reverse complement. Comparing the raw <code>u64</code> values directly is sufficient, because the left-aligned encoding makes integer order equivalent to nucleotide-wise order. No allocation is needed: the result lives on the stack.</p>
|
||||
<p><code>CanonicalKmerOf::from_raw_unchecked(raw)</code> is the only other public constructor, for trusted paths such as deserialisation.</p>
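<p>The newtype pattern can be illustrated as follows. This is a simplified sketch, not the crate's definition: it uses a const generic <code>K</code> instead of the <code>L</code> type parameter with <code>PhantomData</code>, the names <code>Kmer</code> and <code>CanonicalKmer</code> are hypothetical stand-ins, and the encoding A=00, C=01, G=10, T=11 is assumed.</p>

```rust
/// Left-aligned 2-bit kmer of length K (hypothetical stand-in for KmerOf<L>).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Kmer<const K: usize>(u64);

/// Same layout, but only constructible through `canonical()` or the
/// explicitly unchecked constructor, so the min(kmer, revcomp) invariant holds.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct CanonicalKmer<const K: usize>(u64);

impl<const K: usize> Kmer<K> {
    /// Bitwise reverse complement; assumes 1 <= K <= 32.
    fn revcomp(self) -> Self {
        let x = (!self.0).swap_bytes();
        let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
        let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
        Kmer(x << (64 - 2 * K))
    }

    /// Smaller of the forward and reverse-complement raw values.
    fn canonical(self) -> CanonicalKmer<K> {
        let rc = self.revcomp();
        CanonicalKmer(if self.0 <= rc.0 { self.0 } else { rc.0 })
    }
}

impl<const K: usize> CanonicalKmer<K> {
    /// Trusted-path constructor: the caller guarantees `raw` is already canonical.
    fn from_raw_unchecked(raw: u64) -> Self {
        CanonicalKmer(raw)
    }

    fn raw(self) -> u64 {
        self.0
    }
}
```

<p>Because a kmer and its reverse complement yield the same <code>CanonicalKmer</code>, downstream code that accepts only the canonical type cannot be handed a non-canonical value by accident, except through the unchecked constructor.</p>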
|
||||
<h2 id="sliding-window-helpers">Sliding window helpers</h2>
|
||||
<p><code>push_right(nuc)</code> / <code>push_left(nuc)</code> shift the window by one base in O(1). <code>is_overlapping(other)</code> checks whether the last k−1 nucleotides of <code>self</code> equal the first k−1 of <code>other</code>.</p>
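<p>With a left-aligned encoding, these helpers reduce to a shift, a mask, and an OR. The sketch below uses free functions with an explicit <code>k</code> and a hypothetical <code>encode</code> helper; the real methods take no <code>k</code> argument, and the encoding A=00, C=01, G=10, T=11 is an assumption.</p>

```rust
/// Hypothetical helper: left-aligned 2-bit encoding of an A/C/G/T string.
fn encode(seq: &str) -> u64 {
    let k = seq.len();
    assert!((1..=32).contains(&k));
    let x = seq.bytes().fold(0u64, |acc, b| {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("unsupported base"),
        };
        (acc << 2) | code
    });
    x << (64 - 2 * k)
}

/// Mask selecting the first n nucleotides (top 2n bits), for 1 <= n <= 32.
fn prefix_mask(n: usize) -> u64 {
    u64::MAX << (64 - 2 * n)
}

/// Drop the first base and append `nuc` (0..=3) as the last base.
fn push_right(x: u64, nuc: u64, k: usize) -> u64 {
    assert!(nuc < 4 && (1..=32).contains(&k));
    (x << 2) | (nuc << (64 - 2 * k))
}

/// Drop the last base and prepend `nuc` (0..=3) as the first base.
fn push_left(x: u64, nuc: u64, k: usize) -> u64 {
    assert!(nuc < 4 && (1..=32).contains(&k));
    ((x >> 2) & prefix_mask(k)) | (nuc << 62)
}

/// True if the last k-1 bases of `a` equal the first k-1 bases of `b`.
fn is_overlapping(a: u64, b: u64, k: usize) -> bool {
    assert!((1..=32).contains(&k));
    if k == 1 {
        return true; // an empty overlap always matches
    }
    (a << 2) == (b & prefix_mask(k - 1))
}
```

<p>Each operation touches a single machine word, which is where the O(1) cost comes from.</p>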
|
||||
<h2 id="hashing">Hashing</h2>
|
||||
<p><code>hash_kmer(raw: u64) -> u64</code> computes <code>mix64(raw ^ 0x9e3779b97f4a7c15)</code>, the seeded splitmix64 finalizer. <code>CanonicalKmerOf::seq_hash()</code> delegates to <code>hash_kmer</code>.</p>
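<p>For reference, a sketch of this hash is shown below. The body of <code>mix64</code> is an assumption: it uses the widely published splitmix64 finalizer constants, which may differ in detail from the crate's implementation.</p>

```rust
/// Seed XORed into the raw kmer before mixing (value taken from the text above).
const SEED: u64 = 0x9e37_79b9_7f4a_7c15;

/// splitmix64 finalizer with the commonly published constants (assumed here).
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Hash of a raw kmer value.
fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ SEED)
}
```

<p>Each step of the finalizer is invertible, so <code>hash_kmer</code> is a bijection on <code>u64</code>: distinct raw kmers never collide on the full 64-bit hash, although truncated values can.</p>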
|
||||
|
||||
|
||||
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@@ -757,6 +757,28 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-modes" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence modes
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-functions" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build functions
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -840,6 +862,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -918,6 +968,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1165,6 +1271,28 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-modes" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence modes
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-functions" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build functions
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1226,26 +1354,26 @@
|
||||
<h2 id="why-two-phases-are-needed">Why two phases are needed</h2>
|
||||
<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.</p>
|
||||
<h3 id="phase-1-provisional-mphf-kmer-spectrum">Phase 1 — provisional MPHF + kmer spectrum</h3>
|
||||
<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code>.</p>
|
||||
<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code> → <code>count_partition()</code>.</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1</strong>: read the dereplicated superkmer file; enumerate all unique canonical kmers into a <code>HashSet</code>. Exact count known after this pass.</li>
|
||||
<li><strong>Build a provisional MPHF</strong> (<code>GOFunction</code> from the <code>ph</code> crate) over the exact kmer set. Produces <code>mphf1.bin</code>.</li>
|
||||
<li><strong>Create <code>counts1.bin</code></strong>: one zero-initialised <code>u32</code> per MPHF slot (mmap'd).</li>
|
||||
<li><strong>Pass 2</strong>: re-read the dereplicated file; for each kmer, query <code>mphf1.get(kmer)</code> and atomically accumulate the superkmer count into <code>counts1[slot]</code>.</li>
|
||||
<li><strong>External sort</strong>: read the dereplicated superkmer file; extract the raw <code>u64</code> canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: <code>sorted_unique.bin</code> — a flat array of f0 distinct sorted <code>u64</code> values. Exact kmer count f0 is known at this point.</li>
|
||||
<li><strong>Build provisional MPHF</strong> (ptr_hash, same configuration as phase 2) over <code>sorted_unique.bin</code> using <code>new_from_par_iter</code>. Delete <code>sorted_unique.bin</code> immediately after. Persist to <code>mphf1.bin</code>.</li>
|
||||
<li><strong>Create <code>counts1.bin</code></strong>: <code>PersistentCompactIntVec</code> with f0 slots, zero-initialised.</li>
|
||||
<li><strong>Accumulation pass</strong>: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute <code>slot = mphf.index(kmer.raw())</code> and increment <code>counts1[slot]</code> by the superkmer's COUNT.</li>
|
||||
<li><strong>Build kmer frequency spectrum</strong> from <code>counts1</code>: histogram <code>{count → n_kmers}</code>, totals f0 (distinct kmers) and f1 (total abundance). Written to <code>kmer_spectrum_raw.json</code> per partition, then merged globally.</li>
|
||||
</ol>
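<p>The data flow of phase 1 can be sketched in memory as follows. This is an illustration under stated assumptions, not the pipeline's code: the <code>SuperKmer</code> and <code>Phase1</code> types are hypothetical, the external sort is replaced by an in-RAM sort, and a binary search over the sorted unique array stands in for the provisional MPHF.</p>

```rust
use std::collections::BTreeMap;

/// A dereplicated superkmer reduced to what phase 1 needs (hypothetical type):
/// the raw canonical kmer values it contains and its COUNT.
struct SuperKmer {
    kmers: Vec<u64>,
    count: u32,
}

/// Outputs of phase 1 (hypothetical type).
struct Phase1 {
    sorted_unique: Vec<u64>,
    counts: Vec<u32>,
    spectrum: BTreeMap<u32, u64>, // count -> number of distinct kmers
    f0: u64,                      // distinct kmers
    f1: u64,                      // total abundance
}

fn phase1(input: &[SuperKmer]) -> Phase1 {
    // Step 1: extract every kmer, sort, and deduplicate (in RAM here).
    let mut sorted_unique: Vec<u64> =
        input.iter().flat_map(|s| s.kmers.iter().copied()).collect();
    sorted_unique.sort_unstable();
    sorted_unique.dedup();

    // Steps 2-3: one zero-initialised counter per distinct kmer. The slot of
    // a kmer is its rank in `sorted_unique` (stand-in for the MPHF index).
    let mut counts = vec![0u32; sorted_unique.len()];

    // Step 4: accumulation pass, adding the superkmer COUNT to each kmer's slot.
    for s in input {
        for kmer in &s.kmers {
            let slot = sorted_unique
                .binary_search(kmer)
                .expect("every kmer was collected in step 1");
            counts[slot] = counts[slot].saturating_add(s.count);
        }
    }

    // Step 5: frequency spectrum and totals.
    let mut spectrum = BTreeMap::new();
    let mut f1 = 0u64;
    for &c in &counts {
        *spectrum.entry(c).or_insert(0u64) += 1;
        f1 += u64::from(c);
    }
    let f0 = counts.len() as u64;
    Phase1 { sorted_unique, counts, spectrum, f0, f1 }
}
```

<p>The point of the sketch is the ordering: the exact distinct count f0 is fixed after step 1, before any slot array is allocated.</p>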
|
||||
<p>Files produced per partition:</p>
|
||||
<div class="highlight"><pre><span></span><code>part_XXXXX/
|
||||
mphf1.bin — GOFunction (provisional MPHF, discarded after phase 2)
|
||||
counts1.bin — [u32; n_kmers] kmer counts, mmap'd
|
||||
mphf1.bin — ptr_hash provisional MPHF (discarded after phase 2)
|
||||
counts1.bin — PersistentCompactIntVec, f0 × u32 kmer counts
|
||||
kmer_spectrum_raw.json — local frequency spectrum
|
||||
</code></pre></div>
|
||||
<h3 id="phase-2-definitive-mphf">Phase 2 — definitive MPHF</h3>
|
||||
<p>After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see <a href="../pipeline/">Construction pipeline</a>), the exact filtered kmer set is available via <code>unitigs.bin</code>.</p>
|
||||
<p><code>MphfLayer::build</code> is called on the unitig file:</p>
|
||||
<p><code>MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)</code> is called on the unitig directory:</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1</strong>: iterate all canonical kmers from <code>unitigs.bin</code> in parallel, build and store <code>mphf.bin</code> (ptr_hash).</li>
|
||||
<li><strong>Pass 2</strong>: iterate sequentially, fill <code>evidence.bin</code>, call the mode-specific <code>fill_slot</code> callback.</li>
|
||||
<li><strong>Pass 1</strong> (parallel): a <code>CanonicalKmerIter</code> — clonable via <code>Arc<Mmap></code>, no file reopening — is passed directly to <code>new_from_par_iter</code> via <code>par_bridge()</code>. No <code>.idx</code> is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces <code>mphf.bin</code>.</li>
|
||||
<li><strong>Pass 2</strong> (sequential): iterate with <code>iter_indexed_canonical_kmers</code>; fill evidence files; call <code>fill_slot(slot, kmer)</code> callback per kmer. For Exact/Hybrid, <code>.idx</code> is written at the end of this pass — never earlier.</li>
|
||||
</ol>
|
||||
<p><code>mphf1.bin</code> and <code>counts1.bin</code> are no longer needed after phase 2 and can be deleted.</p>
|
||||
<hr />
|
||||
@@ -1265,13 +1393,11 @@
|
||||
<p><strong>FMPH/FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
|
||||
<ul>
|
||||
<li>~2.1 bits/key — most compact; good query speed; deterministic construction</li>
|
||||
<li>Works well from an exact or slightly overestimated count</li>
|
||||
<li><code>GOFunction</code> (group-oriented variant) is the specific type used</li>
|
||||
<li><code>GOFunction</code> (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well</li>
|
||||
</ul>
|
||||
<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
|
||||
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): <code>ph::fmph::GOFunction</code>. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of <code>count_kmer</code>.</p>
|
||||
<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.</p>
|
||||
<p>boomphf is eliminated: largest space overhead, streaming advantage does not apply.</p>
|
||||
<p><strong>Both phases</strong>: <strong>ptr_hash</strong>, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the <code>ph</code> crate dependency.</p>
|
||||
<p>boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.</p>
|
||||
<hr />
|
||||
<h2 id="space-at-scale">Space at scale</h2>
|
||||
<p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
|
||||
@@ -1320,9 +1446,12 @@
|
||||
<h3 id="layer-structure">Layer structure</h3>
|
||||
<p>Each layer is a self-contained unit. See <a href="../obilayeredmap/">obilayeredmap</a> for the full on-disk layout. The MPHF-relevant files are:</p>
|
||||
<div class="highlight"><pre><span></span><code>layer_i/
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence)
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence source)
|
||||
unitigs.bin.idx — random-access block index (block_bits controls granularity)
|
||||
mphf.bin — ptr_hash phase-2 MPHF
|
||||
evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
|
||||
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
|
||||
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
|
||||
[no layer_meta.json — mode stored once in partition-level meta.json]
|
||||
</code></pre></div>
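<p>The <code>(chunk_id: 25 bits | rank: 7 bits)</code> evidence entry fits in one <code>u32</code>. The sketch below assumes that <code>chunk_id</code> occupies the high 25 bits and <code>rank</code> the low 7 bits; the actual bit order and the function names are not stated in this section and are hypothetical.</p>

```rust
const RANK_BITS: u32 = 7;
const CHUNK_BITS: u32 = 25;
const RANK_MAX: u32 = (1 << RANK_BITS) - 1; // 127
const CHUNK_MAX: u32 = (1 << CHUNK_BITS) - 1; // 33_554_431

/// Pack (chunk_id, rank) into one u32; None if either field is out of range.
/// Assumed layout: chunk_id in the high 25 bits, rank in the low 7 bits.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > CHUNK_MAX || rank > RANK_MAX {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`.
fn unpack_evidence(entry: u32) -> (u32, u32) {
    (entry >> RANK_BITS, entry & RANK_MAX)
}
```

<p>With this split, one entry can address up to 2^25 chunks and up to 128 kmer positions within a chunk.</p>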
|
||||
<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:</p>
|
||||
<ol>
|
||||
@@ -1330,17 +1459,43 @@
|
||||
<li>Collect kmers of B not present in any layer → set <code>B \ A</code>.</li>
|
||||
<li>Build layer 1 from <code>B \ A</code> (dereplicate → count → De Bruijn → unitigs → <code>MphfLayer::build</code>).</li>
|
||||
</ol>
|
||||
<h3 id="evidence-modes">Evidence modes</h3>
|
||||
<p>Three evidence modes are supported via <code>IndexMode</code>, stored once in <code>PartitionMeta</code> at partition root. There is no <code>layer_meta.json</code>.</p>
|
||||
<p><strong>Exact</strong> (<code>IndexMode::Exact</code>): <code>evidence.bin</code> stores one <code>(chunk_id, rank)</code> pair per MPHF slot. Verification reconstructs the kmer from that pair and compares it to the query, so there are zero false positives. The <code>.idx</code> file is required at query time.</p>
|
||||
<p><strong>Approx</strong> (<code>IndexMode::Approx { b, z }</code>): <code>fingerprint.bin</code> stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No <code>.idx</code> written or needed.</p>
|
||||
<p><strong>Hybrid</strong> (<code>IndexMode::Hybrid { b, z }</code>): both <code>fingerprint.bin</code> and <code>evidence.bin</code> + <code>.idx</code>. <code>find()</code> uses the fingerprint (O(1)); <code>find_strict()</code> uses exact evidence (O(1)).</p>
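<p>To get a feel for the Approx parameters, the two rates quoted above can be computed directly. The function names are hypothetical, and the window rate is the approximation given in the text (z consecutive kmers treated as independent).</p>

```rust
/// Per-kmer false-positive rate for a b-bit fingerprint: 1 / 2^b.
fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Approximate window false-positive rate when z consecutive kmers
/// must all match: 1 / 2^(b * z).
fn window_fp_rate(b: u8, z: u8) -> f64 {
    0.5f64.powi(i32::from(b) * i32::from(z))
}
```

<p>For example, <code>b = 8</code> gives a per-kmer rate of 1/256, and <code>b = 8, z = 3</code> gives a window rate of about 1/2^24, roughly 6 × 10^-8.</p>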
|
||||
<h3 id="build-functions">Build functions</h3>
|
||||
<div class="highlight"><pre><span></span><code>MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
|
||||
Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
|
||||
Pass 2: sequential iter → fill evidence files + call fill_slot
|
||||
.idx written last for Exact/Hybrid (query-time only)
|
||||
|
||||
MphfLayer::build_exact_evidence(dir, block_bits)
|
||||
Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
|
||||
Uses open_sequential(); no .idx required on entry
|
||||
|
||||
MphfLayer::build_approx_evidence(dir, b, z)
|
||||
Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
|
||||
Uses open_sequential(); never writes .idx
|
||||
</code></pre></div>
|
||||
<p>There is no <code>build_evidence</code> dispatch wrapper. Callers choose the appropriate post-hoc build directly.</p>
|
||||
<p>In <code>obikpartitionner</code>, <code>build_index_layer</code> receives <code>block_bits: u8</code> from <code>IndexConfig::block_bits</code> and forwards it directly to <code>Layer::build</code> and <code>Layer::build_approx_evidence</code>.</p>
|
||||
<h3 id="membership-verification">Membership verification</h3>
|
||||
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(chunk_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
|
||||
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:</p>
|
||||
<ul>
|
||||
<li><strong>Exact</strong>: decode <code>(chunk_id, rank)</code> from <code>evidence.bin</code>; reconstruct the kmer via <code>unitigs.verify_canonical_kmer</code>; compare to query.</li>
|
||||
<li><strong>Approx</strong>: compare <code>kmer.seq_hash()</code> to the b-bit fingerprint stored at the slot.</li>
|
||||
</ul>
|
||||
<p>A mismatch in either mode means the kmer is absent from this layer; probe the next layer.</p>
|
||||
<h3 id="query-algorithm">Query algorithm</h3>
|
||||
<div class="highlight"><pre><span></span><code>fn query(kmer) → Option<(layer_index, slot)>:
|
||||
for (i, layer) in layers.iter().enumerate():
|
||||
slot = layer.mphf.index(kmer)
|
||||
if layer.evidence.decode(slot) matches kmer:
|
||||
if layer.evidence.matches(slot, kmer): // exact or approx dispatch
|
||||
return Some((i, slot))
|
||||
return None
|
||||
</code></pre></div>
|
||||
<p>Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.</p>
|
||||
<p><code>MphfLayer::find</code> dispatches on <code>LayerEvidence</code> in O(1); there are no separate <code>find_exact</code>/<code>find_approx</code> methods that could panic. <code>find_strict</code> always performs an exact check: O(1) for Exact/Hybrid, and an O(n) sequential scan for Approx. The expected probe depth is 1 for kmers in layer 0. Each probe costs one ptr_hash lookup (~10 ns) plus one evidence check.</p>
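<p>The probe loop can be made concrete with a toy layer. This is a sketch under stated assumptions: the <code>Layer</code> type here is hypothetical, a clamped binary search stands in for ptr_hash (like an MPHF, it maps every input, present or absent, to some valid slot), and the stored key plays the role of exact evidence.</p>

```rust
/// Toy layer: a sorted, deduplicated key array that serves both as the
/// slot mapping and as exact evidence (hypothetical stand-in for MphfLayer).
struct Layer {
    keys: Vec<u64>,
}

impl Layer {
    fn new(mut keys: Vec<u64>) -> Self {
        keys.sort_unstable();
        keys.dedup();
        Layer { keys }
    }

    /// Stand-in for the MPHF lookup: always returns a valid slot, even for
    /// absent keys. Must not be called on an empty layer.
    fn index(&self, kmer: u64) -> usize {
        match self.keys.binary_search(&kmer) {
            Ok(i) => i,
            Err(i) => i.min(self.keys.len() - 1),
        }
    }

    /// Evidence check: does the key stored at `slot` equal the query?
    fn matches(&self, slot: usize, kmer: u64) -> bool {
        self.keys[slot] == kmer
    }
}

/// Probe layers in order; return (layer_index, slot) of the first verified hit.
fn query(layers: &[Layer], kmer: u64) -> Option<(usize, usize)> {
    for (i, layer) in layers.iter().enumerate() {
        if layer.keys.is_empty() {
            continue; // nothing to probe in an empty layer
        }
        let slot = layer.index(kmer);
        if layer.matches(slot, kmer) {
            return Some((i, slot));
        }
    }
    None
}
```

<p>The evidence check is what turns "some slot" into a membership answer: without it, an absent kmer would be reported at whatever slot the lookup happened to return.</p>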
|
||||
<h3 id="merging-layers">Merging layers</h3>
|
||||
<p>Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>
|
||||
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@@ -9,7 +9,7 @@
|
||||
|
||||
|
||||
|
||||
<link rel="prev" href="../unitig_evidence/">
|
||||
<link rel="prev" href="../evidence_elimination/">
|
||||
|
||||
|
||||
<link rel="next" href="../persistent_compact_int_vec/">
|
||||
@@ -649,6 +649,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item md-nav__item--active">
|
||||
@@ -729,6 +757,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#index-mode-homogeneity-invariant" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Index mode (homogeneity invariant)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -740,6 +779,34 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="MphfLayer — autonomous kmer → slot mapping">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-api" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query API
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-surface" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build surface
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -751,6 +818,73 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer\<D: LayerData> — MPHF + payload">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-signatures" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build signatures
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintvec-and-fingerprintvecwriter" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
FingerprintVec and FingerprintVecWriter
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layeredmapd-collection-of-layers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
LayeredMap\<D> — collection of layers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="LayeredMap\<D> — collection of layers">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#common-methods" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Common methods
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#push_layer" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
push_layer
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -776,10 +910,10 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-encoding" class="md-nav__link">
|
||||
<a href="#evidence-encoding-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence encoding
|
||||
Evidence encoding (exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -798,14 +932,53 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-path" class="md-nav__link">
|
||||
<a href="#column-append-and-merge-support" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query path
|
||||
Column append and merge support
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Column append and merge support">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-level-genome-column-append" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer-level genome column append
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-matrix-initialisation" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Presence matrix initialisation
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#why-the-mphf-is-never-rebuilt" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Why the MPHF is never rebuilt
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -895,6 +1068,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1058,6 +1287,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#index-mode-homogeneity-invariant" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Index mode (homogeneity invariant)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1069,6 +1309,34 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="MphfLayer — autonomous kmer → slot mapping">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-api" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query API
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-surface" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build surface
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1080,6 +1348,73 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer\<D: LayerData> — MPHF + payload">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-signatures" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build signatures
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintvec-and-fingerprintvecwriter" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
FingerprintVec and FingerprintVecWriter
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layeredmapd-collection-of-layers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
LayeredMap\<D> — collection of layers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="LayeredMap\<D> — collection of layers">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#common-methods" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Common methods
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#push_layer" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
push_layer
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1105,10 +1440,10 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-encoding" class="md-nav__link">
|
||||
<a href="#evidence-encoding-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence encoding
|
||||
Evidence encoding (exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1127,14 +1462,53 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-path" class="md-nav__link">
|
||||
<a href="#column-append-and-merge-support" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query path
|
||||
Column append and merge support
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Column append and merge support">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-level-genome-column-append" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer-level genome column append
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-matrix-initialisation" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Presence matrix initialisation
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#why-the-mphf-is-never-rebuilt" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Why the MPHF is never rebuilt
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1178,7 +1552,7 @@
|
||||
|
||||
<h1 id="obilayeredmap-layered-kmer-index-crate">obilayeredmap — layered kmer index crate</h1>
|
||||
<h2 id="purpose">Purpose</h2>
|
||||
<p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. The index is organised in three levels: <strong>index root → partition → layer</strong>. Each layer covers a disjoint kmer set and wraps a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p>
|
||||
<p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p>
|
||||
<hr />
|
||||
<h2 id="three-usage-modes">Three usage modes</h2>
|
||||
<p>The MPHF + evidence infrastructure is the same for all modes. The <strong>payload</strong> varies.</p>
|
||||
@@ -1214,34 +1588,65 @@
|
||||
</table>
|
||||
<p>Both <code>PersistentCompactIntMatrix</code> and <code>PersistentBitMatrix</code> come from the <code>obicompactvec</code> crate.</p>
|
||||
<hr />
|
||||
<h2 id="index-mode-homogeneity-invariant">Index mode (homogeneity invariant)</h2>
|
||||
<p>A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at <code>LayeredMap::open()</code> from <code>PartitionMeta.mode</code> and passed to each <code>Layer::open()</code> — no per-layer file is read.</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="cp">#[derive(Serialize, Deserialize, Default)]</span>
|
||||
<span class="cp">#[serde(tag = </span><span class="s">"type"</span><span class="cp">, rename_all = </span><span class="s">"snake_case"</span><span class="cp">)]</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="nc">IndexMode</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="cp">#[default]</span>
|
||||
<span class="w"> </span><span class="n">Exact</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">Approx</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="n">Hybrid</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>IndexMode</code> is stored once in <code>PartitionMeta</code> (<code>meta.json</code> at partition root). There is no <code>layer_meta.json</code>.</p>
|
||||
<ul>
|
||||
<li><strong>Exact</strong>: writes <code>evidence.bin</code> + <code>unitigs.bin.idx</code>. Zero false positives.</li>
|
||||
<li><strong>Approx</strong>: writes <code>fingerprint.bin</code> only. The false-positive (FP) rate per kmer is 1/2^b. With the Findere z-parameter, z consecutive kmers must all match, so the effective window FP rate is ≈ 1/2^(b·z), assuming independent fingerprint collisions. No <code>.idx</code> is written or required.</li>
|
||||
<li><strong>Hybrid</strong>: writes both <code>fingerprint.bin</code> and <code>evidence.bin</code> + <code>.idx</code>. <code>find()</code> uses the fingerprint (fast, O(1)); <code>find_strict()</code> uses exact evidence.</li>
|
||||
</ul>
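<p>The false-positive arithmetic for Approx mode can be checked with a few lines of Rust. This is a minimal sketch: the function names <code>kmer_fp_rate</code> and <code>window_fp_rate</code> are hypothetical and not part of the crate, and the window formula assumes that fingerprint collisions on consecutive kmers are independent.</p>

```rust
/// Per-kmer false-positive probability for b-bit fingerprints: 1 / 2^b.
/// Hypothetical helper, not part of obilayeredmap.
fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(b as i32)
}

/// Approximate window-level false-positive probability when z consecutive
/// kmers must all match: 1 / 2^(b*z), assuming independent collisions.
fn window_fp_rate(b: u8, z: u8) -> f64 {
    kmer_fp_rate(b).powi(z as i32)
}
```

<p>For example, under these assumptions <code>b = 8</code> gives a per-kmer rate of 1/256, and adding <code>z = 3</code> lowers the window rate to 1/2^24.</p>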
|
||||
<hr />
|
||||
<h2 id="mphflayer-autonomous-kmer-slot-mapping">MphfLayer — autonomous kmer → slot mapping</h2>
|
||||
<p><code>MphfLayer</code> encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.</p>
|
||||
<p><code>MphfLayer</code> encapsulates the MPHF and evidence store for one layer. It is independent of any payload.</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">MphfLayer</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">Mphf</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="c1">// number of indexed kmers = number of MPHF slots</span>
|
||||
<span class="w"> </span><span class="n">ev</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerEvidence</span><span class="p">,</span><span class="w"> </span><span class="c1">// loaded at open() time</span>
|
||||
<span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Public API:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">MphfLayer</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span><span class="w"> </span><span class="c1">// Some(slot) or None</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">unitig_writer</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="n">UnitigFileWriter</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span>
|
||||
<span class="w"> </span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">fill_slot</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">mut</span><span class="w"> </span><span class="k">impl</span><span class="w"> </span><span class="nb">FnMut</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<p><code>LayerEvidence</code> is an internal enum, not public:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">enum</span><span class="w"> </span><span class="nc">LayerEvidence</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">Exact</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="n">Approx</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="nc">FingerprintVec</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs_path</span><span class="p">:</span><span class="w"> </span><span class="nc">PathBuf</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="n">Hybrid</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="nc">FingerprintVec</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>find</code> returns <code>Some(slot)</code> only after verifying via evidence that the kmer is actually indexed. It returns <code>None</code> for absent keys (ptr_hash maps any input to a valid slot; evidence verification is the only correct-membership test).</p>
|
||||
<p><code>build</code> runs two sequential passes over <code>unitigs.bin</code>:</p>
|
||||
<p><code>MphfLayer::open(dir, mode: &IndexMode)</code> receives the mode from <code>PartitionMeta</code> — no per-layer file is read.</p>
|
||||
<h3 id="query-api">Query API</h3>
|
||||
<p>Two public query methods, both returning <code>Option<usize></code> (slot index):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find_strict</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<ul>
|
||||
<li><code>find</code>: O(1) auto-dispatch. Exact → exact evidence check. Approx/Hybrid → fingerprint comparison.</li>
|
||||
<li><code>find_strict</code>: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no <code>.idx</code>).</li>
|
||||
</ul>
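<p>The dispatch rules for the two query methods can be summarised as a small lookup. The sketch below uses hypothetical stand-in types (<code>Mode</code>, <code>Check</code>) rather than the crate's own <code>IndexMode</code> and <code>LayerEvidence</code>, and it assumes that Hybrid <code>find</code> takes the fingerprint path, as stated in the index mode section.</p>

```rust
/// Stand-in for the three index modes (parameters b and z omitted).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Exact,
    Approx,
    Hybrid,
}

/// Which membership test a query ends up running (illustrative only).
#[derive(Debug, PartialEq)]
enum Check {
    Evidence,       // O(1) exact evidence check
    Fingerprint,    // O(1) b-bit fingerprint comparison
    SequentialScan, // O(n) scan, used when no .idx exists
}

/// Test used by `find`: exact evidence only in Exact mode.
fn find_check(mode: Mode) -> Check {
    match mode {
        Mode::Exact => Check::Evidence,
        Mode::Approx | Mode::Hybrid => Check::Fingerprint,
    }
}

/// Test used by `find_strict`: always exact, at O(n) cost in Approx mode.
fn find_strict_check(mode: Mode) -> Check {
    match mode {
        Mode::Exact | Mode::Hybrid => Check::Evidence,
        Mode::Approx => Check::SequentialScan,
    }
}
```

<p>Because every mode has an arm in both functions, no combination of mode and method needs to panic.</p>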
|
||||
<p>There are no <code>find_exact</code>/<code>find_approx</code> methods; panicking dispatch is eliminated.</p>
|
||||
<h3 id="build-surface">Build surface</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1">// Full MPHF + evidence build (two-pass)</span>
|
||||
<span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span><span class="w"> </span><span class="n">fill_slot</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
|
||||
<span class="c1">// Evidence-only post-hoc builds (MPHF already present)</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_exact_evidence</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_approx_evidence</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p><code>MphfLayer::build</code> runs two passes over <code>unitigs.bin</code>:</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1</strong>: iterate all canonical kmers in parallel via rayon, construct and store <code>mphf.bin</code>. <code>new_from_par_iter</code> avoids materialising a full key <code>Vec</code>.</li>
|
||||
<li><strong>Pass 2</strong>: iterate again sequentially, fill <code>evidence.bin</code>, call <code>fill_slot(slot, kmer)</code> once per kmer for payload population. A compact <code>n/8</code>-byte seen-bitset verifies MPHF injectivity inline.</li>
|
||||
<li><strong>Pass 1</strong> (parallel via rayon): a <code>CanonicalKmerIter</code> (clonable, <code>Arc<Mmap></code>, no file reopening) is passed to <code>new_from_par_iter</code> via <code>par_bridge()</code>. Produces <code>mphf.bin</code>. No <code>.idx</code> is read or created at this stage.</li>
|
||||
<li><strong>Pass 2</strong> (sequential): fill evidence files; call <code>fill_slot(slot, kmer)</code> per kmer. <code>.idx</code> is written last for Exact/Hybrid modes (query-time only).</li>
|
||||
</ol>
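<p>The inline injectivity check mentioned for pass 2 can be illustrated with a compact bitset of about <code>n/8</code> bytes. This is a sketch under assumptions: <code>verify_injective</code> is a hypothetical helper, and the slot sequence stands in for the values the MPHF would return for each kmer during the sequential pass.</p>

```rust
/// Checks that every slot is in range and visited at most once,
/// using one bit per slot. Returns the offending slot on failure.
/// Hypothetical helper, not the crate's implementation.
fn verify_injective(n: usize, slots: impl IntoIterator<Item = usize>) -> Result<(), usize> {
    // One bit per slot, rounded up to whole bytes.
    let mut seen = vec![0u8; (n + 7) / 8];
    for slot in slots {
        if slot >= n {
            return Err(slot); // the MPHF returned a slot outside 0..n
        }
        let (byte, bit) = (slot / 8, slot % 8);
        if seen[byte] & (1 << bit) != 0 {
            return Err(slot); // two kmers collided on the same slot
        }
        seen[byte] |= 1 << bit;
    }
    Ok(())
}
```

<p>A check of this kind costs one bit per kmer, so it can run alongside evidence filling without a separate pass.</p>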
|
||||
<p>For empty layers (n = 0), <code>build</code> returns <code>Ok(0)</code> immediately after creating empty <code>mphf.bin</code> and <code>evidence.bin</code>.</p>
|
||||
<p>There is no <code>build_evidence</code> dispatch wrapper — callers invoke <code>build_exact_evidence</code> or <code>build_approx_evidence</code> directly.</p>
|
||||
<p>For empty layers (n = 0), all build variants return <code>Ok(0)</code> immediately after creating empty output files.</p>
|
||||
<hr />
|
||||
<h2 id="layerd-layerdata-mphf-payload">Layer<D: LayerData> — MPHF + payload</h2>
|
||||
<p><code>Layer<D></code> pairs an <code>MphfLayer</code> with one payload store.</p>
|
||||
@@ -1261,7 +1666,7 @@
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">T</span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). Build signatures differ between modes and are not in the trait.</p>
|
||||
<p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). Build signatures differ between modes and are not part of the trait.</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
@@ -1288,28 +1693,89 @@
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p><strong>Build signatures:</strong></p>
|
||||
<h3 id="build-signatures">Build signatures</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="c1">// mode 2</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentCompactIntMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_from_map</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">HashMap</span><span class="o"><</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">></span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_from_map</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">HashMap</span><span class="o"><</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">></span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="c1">// mode 3</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentBitMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_presence</span><span class="p">(</span>
|
||||
<span class="w"> </span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_presence</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">n_genomes</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">present_in</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="n">present_in</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>All build impls delegate MPHF + evidence construction to <code>MphfLayer::build</code> via a mode-specific <code>fill_slot</code> callback. Mode 2 pre-reads <code>n_kmers</code> from <code>unitigs.bin</code> to size the <code>PersistentCompactIntMatrixBuilder</code> before calling <code>MphfLayer::build</code>. Mode 3 does the same for <code>PersistentBitMatrixBuilder</code>.</p>
|
||||
<p>All build impls delegate to <code>MphfLayer::build</code> via a mode-specific <code>fill_slot</code> callback. The <code>mode</code> parameter is forwarded directly — no <code>LayerMeta</code> is written.</p>
|
||||
<p>Evidence-only post-hoc builds are accessible directly on <code>Layer<D></code>:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="o"><</span><span class="n">D</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerData</span><span class="o">></span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_exact_evidence</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_approx_evidence</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>There is no <code>build_evidence</code> dispatch wrapper.</p>
|
||||
<hr />
|
||||
<h2 id="fingerprintvec-and-fingerprintvecwriter">FingerprintVec and FingerprintVecWriter</h2>
|
||||
<p>Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.</p>
|
||||
<div class="highlight"><pre><span></span><code>fingerprint.bin format:
|
||||
magic: b"FPVF" (4 bytes)
|
||||
b: u8 (bits per fingerprint, 1..=64)
|
||||
padding: [0u8; 3]
|
||||
n: u64 LE (number of slots)
|
||||
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
|
||||
</code></pre></div>
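<p>The header layout above is 16 bytes long (4 + 1 + 3 + 8), so it can be decoded with plain slice operations. The following sketch is illustrative only: <code>FpHeader</code> and <code>parse_fp_header</code> are hypothetical names, and the crate's own reader may validate the file differently.</p>

```rust
/// Size of the fixed header: magic (4) + b (1) + padding (3) + n (8).
const HEADER_LEN: usize = 16;

/// Decoded header fields plus the expected payload size in bytes.
#[derive(Debug, PartialEq)]
struct FpHeader {
    b: u8,
    n: u64,
    data_len: u64,
}

/// Parses the header of a fingerprint.bin buffer (hypothetical helper).
/// Returns None on a short buffer, bad magic, or b outside 1..=64.
fn parse_fp_header(bytes: &[u8]) -> Option<FpHeader> {
    if bytes.len() < HEADER_LEN || !bytes.starts_with(b"FPVF") {
        return None;
    }
    let b = bytes[4];
    if b == 0 || b > 64 {
        return None;
    }
    // Bytes 5..8 are padding; n is a little-endian u64 at offset 8.
    let mut n_bytes = [0u8; 8];
    n_bytes.copy_from_slice(&bytes[8..16]);
    let n = u64::from_le_bytes(n_bytes);
    // Payload size: ceil(n * b / 8), guarding against overflow.
    let data_len = n.checked_mul(b as u64)?.checked_add(7)? / 8;
    Some(FpHeader { b, n, data_len })
}
```

<p>With <code>b = 12</code> and <code>n = 10</code>, for instance, the payload holds 120 bits and therefore occupies 15 bytes after the header.</p>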
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">FingerprintVec</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">path</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">get</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u64</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">matches</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="kt">u64</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">b</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u8</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>matches(slot, hash)</code> extracts the b-bit fingerprint stored at <code>slot</code> and compares it to the low b bits of <code>hash</code>. It is the core operation of the fingerprint path of <code>find</code> (Approx and Hybrid modes).</p>
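<p>The comparison can be sketched over a raw packed byte slice. The code below is an assumption-laden illustration, not the crate's implementation: <code>get_fingerprint</code> and <code>fingerprint_matches</code> are hypothetical free functions, and the bit layout assumes that bit <code>j</code> of the fingerprint at <code>slot</code> is stream bit <code>slot*b + j</code>, with bits numbered least-significant first inside each byte (Lsb0).</p>

```rust
/// Reads the b-bit value stored at `slot` from an Lsb0-packed byte slice.
/// Assumes 1 <= b <= 64 and that `data` covers the requested slot.
fn get_fingerprint(data: &[u8], b: u8, slot: usize) -> u64 {
    let start = slot * b as usize;
    let mut value = 0u64;
    for j in 0..b as usize {
        let bit = start + j;
        // Extract stream bit `bit`, least-significant bit of each byte first.
        let set = (data[bit / 8] >> (bit % 8)) & 1;
        value |= (set as u64) << j;
    }
    value
}

/// Compares the stored fingerprint with the low b bits of `hash`.
fn fingerprint_matches(data: &[u8], b: u8, slot: usize, hash: u64) -> bool {
    // Shifting a u64 by 64 would overflow, so b = 64 keeps every bit.
    let mask = if b >= 64 { u64::MAX } else { (1u64 << b) - 1 };
    get_fingerprint(data, b, slot) == hash & mask
}
```

<p>The bit-by-bit loop favours clarity over speed; a production reader would more likely load whole words and shift.</p>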
|
||||
<hr />
|
||||
<h2 id="layeredmapd-collection-of-layers">LayeredMap<D> — collection of layers</h2>
|
||||
<p><code>LayeredMap<D></code> wraps <code>Vec<Layer<D>></code> for a single partition directory.</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">LayeredMap</span><span class="o"><</span><span class="n">D</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerData</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="nc">PathBuf</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">meta</span><span class="p">:</span><span class="w"> </span><span class="nc">PartitionMeta</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">layers</span><span class="p">:</span><span class="w"> </span><span class="nb">Vec</span><span class="o"><</span><span class="n">Layer</span><span class="o"><</span><span class="n">D</span><span class="o">>></span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>PartitionMeta</code> (<code>meta.json</code> at the partition root) stores <code>n_layers</code> and the index <code>mode</code> shared by all layers.</p>
|
||||
<h3 id="common-methods">Common methods</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">create</span><span class="p">(</span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="nc">IndexMode</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n_layers</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span>
|
||||
<span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">layer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kp">&</span><span class="nc">Layer</span><span class="o"><</span><span class="n">D</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">mode</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">Hit</span><span class="o"><</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">></span><span class="p">)</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">next_layer_writer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="n">UnitigFileWriter</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p><code>open</code> reads <code>PartitionMeta</code> once, extracts <code>mode</code>, and passes it to every <code>Layer::open</code> — no per-layer file is read. <code>create</code> stores the given mode in <code>PartitionMeta</code>.</p>
|
||||
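<p><code>query</code> probes layers in order and returns <code>(layer_index, Hit)</code> on the first match. A kmer stored in layer <code>i</code> costs <code>i + 1</code> probes, so kmers in layer 0 resolve in a single probe; a kmer absent from the index probes every layer.</p>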
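<p>The first-match rule can be sketched as follows. This is a simplified model, assuming each layer is a plain <code>HashMap</code> from a raw kmer value to a payload; the function name <code>query_layers</code> is hypothetical and the real layers use an MPHF rather than a hash map.</p>

```rust
use std::collections::HashMap;

/// Hypothetical simplification: each layer is a map from raw kmer to payload.
/// Returns the index of the first layer containing `kmer` and its payload.
pub fn query_layers(layers: &[HashMap<u64, u32>], kmer: u64) -> Option<(usize, u32)> {
    layers
        .iter()
        .enumerate()
        // Stop at the first layer that knows the kmer; later layers are never probed.
        .find_map(|(i, layer)| layer.get(&kmer).map(|&v| (i, v)))
}
```

<p>Since layers within a partition are disjoint, at most one layer can answer, so stopping at the first match loses nothing.</p>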
|
||||
<h3 id="push_layer">push_layer</h3>
|
||||
<p><code>push_layer</code> builds the next layer from a <code>unitigs.bin</code> already written via <code>next_layer_writer</code>, using <code>DEFAULT_BLOCK_BITS</code>:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">LayeredMap</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="c1">// mode 2</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">LayeredMap</span><span class="o"><</span><span class="n">PersistentCompactIntMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer_from_map</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">HashMap</span><span class="o"><</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">></span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Mode 3 (<code>PersistentBitMatrix</code>) has no <code>push_layer</code> on <code>LayeredMap</code>; callers build directly via <code>Layer<PersistentBitMatrix>::build_presence</code>.</p>
|
||||
<hr />
|
||||
<h2 id="layeredstores-and-aggregation-traits">LayeredStore<S> and aggregation traits</h2>
|
||||
<p><code>LayeredStore<S></code> is a generic aggregation wrapper over <code>Vec<S></code>. It propagates three traits from <code>obicompactvec::traits</code> up the hierarchy via blanket impls:</p>
|
||||
@@ -1320,11 +1786,6 @@
|
||||
<span class="k">impl</span><span class="o"><</span><span class="n">S</span><span class="p">:</span><span class="w"> </span><span class="nc">BitPartials</span><span class="o">></span><span class="w"> </span><span class="n">BitPartials</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">LayeredStore</span><span class="o"><</span><span class="n">S</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">…</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// element-wise Σ partials</span>
|
||||
</code></pre></div>
|
||||
<p>Because blanket impls compose, <code>LayeredStore<LayeredStore<S>></code> automatically inherits all three traits when <code>S</code> does — providing the partitioned level without a separate type.</p>
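<p>How a blanket impl composes across nesting levels can be shown with a reduced sketch. The trait <code>CountPartialsSketch</code>, its single method, and the types <code>Leaf</code> and <code>LayeredStoreSketch</code> are hypothetical stand-ins; the real traits in <code>obicompactvec::traits</code> have a different and richer API.</p>

```rust
/// Hypothetical one-method stand-in for an aggregation trait.
pub trait CountPartialsSketch {
    /// One partial sum per column.
    fn partial_sums(&self) -> Vec<u64>;
}

/// Hypothetical leaf: a list of columns, each a vector of counts.
pub struct Leaf(pub Vec<Vec<u32>>);

impl CountPartialsSketch for Leaf {
    fn partial_sums(&self) -> Vec<u64> {
        self.0
            .iter()
            .map(|col| col.iter().map(|&x| x as u64).sum::<u64>())
            .collect()
    }
}

/// Generic wrapper over a vector of stores.
pub struct LayeredStoreSketch<S>(pub Vec<S>);

/// Blanket impl: any wrapper of implementors is itself an implementor,
/// so a wrapper of wrappers needs no extra code.
impl<S: CountPartialsSketch> CountPartialsSketch for LayeredStoreSketch<S> {
    fn partial_sums(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for store in &self.0 {
            let partial = store.partial_sums();
            if total.len() < partial.len() {
                total.resize(partial.len(), 0);
            }
            // Element-wise sum of the children's partials.
            for (t, x) in total.iter_mut().zip(partial) {
                *t += x;
            }
        }
        total
    }
}
```

<p>With this shape, a value of type <code>LayeredStoreSketch&lt;LayeredStoreSketch&lt;Leaf&gt;&gt;</code> aggregates over partitions and layers through the same single impl.</p>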
|
||||
<p><strong>Aggregation hierarchy:</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>PersistentCompactIntMatrix implements CountPartials
|
||||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
|
||||
LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
|
||||
</code></pre></div>
|
||||
<p><strong>Leaf implementors</strong> (in <code>obicompactvec</code>):</p>
|
||||
<table>
|
||||
<thead>
|
||||
@@ -1344,69 +1805,77 @@ LayeredStore<LayeredStore<…>> via blanket impl
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p><code>PersistentCompactIntVec</code> and <code>PersistentBitVec</code> do not implement these traits — they are single-column primitives, not matrix-level aggregators.</p>
|
||||
<p>See <a href="../../architecture/index_architecture/">Kmer index architecture</a> for the full trait API and the two-pass normalised-metric pattern.</p>
|
||||
<hr />
|
||||
<h2 id="on-disk-structure">On-disk structure</h2>
|
||||
<div class="highlight"><pre><span></span><code>index_root/ ← LayeredMap (collection)
|
||||
meta.json
|
||||
part_00000/ ← Partition
|
||||
<div class="highlight"><pre><span></span><code>partition_root/ ← LayeredMap (one partition)
|
||||
meta.json — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
|
||||
layer_0/ ← Layer
|
||||
mphf.bin — ptr_hash MPHF (epserde format)
|
||||
unitigs.bin — packed 2-bit nucleotide sequences
|
||||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||||
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
|
||||
unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
|
||||
evidence.bin — [u32; n], LE (Exact/Hybrid only)
|
||||
fingerprint.bin — packed b-bit array (Approx/Hybrid only)
|
||||
counts/ [mode 2] PersistentCompactIntMatrix
|
||||
meta.json {"n": N, "n_cols": 1}
|
||||
meta.json
|
||||
col_000000.pciv
|
||||
presence/ [mode 3] PersistentBitMatrix
|
||||
meta.json {"n": N, "n_cols": G}
|
||||
col_000000.pbiv
|
||||
…
|
||||
meta.json
|
||||
col_000000.pbiv …
|
||||
layer_1/
|
||||
…
|
||||
part_00001/
|
||||
…
|
||||
</code></pre></div>
|
||||
<p><strong>Partition</strong> (<code>part_XXXXX/</code>): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.</p>
|
||||
<p><strong>Layer</strong> (<code>layer_N/</code>): one <code>MphfLayer</code> plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.</p>
|
||||
<p>There is no <code>layer_meta.json</code>. The mode is stored once in <code>PartitionMeta</code> and is valid for all layers. <code>unitigs.bin.idx</code> is built at the end of <code>build_exact_evidence</code> — never during MPHF construction — and is consumed at query time only.</p>
|
||||
<hr />
|
||||
<h2 id="evidence-encoding">Evidence encoding</h2>
|
||||
<h2 id="evidence-encoding-exact">Evidence encoding (exact)</h2>
|
||||
<p><code>evidence.bin</code> is a flat <code>[u32; n]</code> array with no header. Each u32 encodes one slot:</p>
|
||||
<div class="highlight"><pre><span></span><code>bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
|
||||
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||||
</code></pre></div>
|
||||
<p>Decoding: <code>chunk_id = raw >> 7</code>, <code>rank = raw & 0x7F</code>. Reconstructing the kmer: read k nucleotides at position <code>rank</code> within unitig <code>chunk_id</code>.</p>
|
||||
<p>For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.</p>
|
||||
<p><code>chunk_id = raw >> 7</code>, <code>rank = raw & 0x7F</code>. Reconstructing the kmer: read k nucleotides at position <code>rank</code> within unitig <code>chunk_id</code> (requires <code>unitigs.bin.idx</code> for random access).</p>
|
||||
<p>For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.</p>
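<p>The bit layout above translates directly into shift-and-mask code. The following is a minimal sketch; the function names are hypothetical, and reading from a byte slice stands in for whatever mmap-backed access the crate actually uses.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const CHUNK_BITS: u32 = 25;

/// Pack (chunk_id, rank) into one u32; None if either field overflows its width.
pub fn encode_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id >= (1 << CHUNK_BITS) || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Split a raw u32 into (chunk_id, rank).
pub fn decode_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Read slot `slot` from a headerless little-endian u32 array held in `bytes`.
pub fn read_slot(bytes: &[u8], slot: usize) -> Option<(u32, u32)> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let b = bytes.get(start..end)?;
    let raw = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    Some(decode_evidence(raw))
}
```

<p>For example, chunk 5 with rank 3 encodes to <code>(5 &lt;&lt; 7) | 3 = 643</code>.</p>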
|
||||
<hr />
|
||||
<h2 id="ptr_hash-configuration">ptr_hash configuration</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o"><</span>
|
||||
<span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="c1">// key type: canonical kmer raw encoding</span>
|
||||
<span class="w"> </span><span class="n">CubicEps</span><span class="p">,</span><span class="w"> </span><span class="c1">// bucket fn: 2.4 bits/key, λ=3.5, α=0.99</span>
|
||||
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="n">CachelineEf</span><span class="o">>></span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: 11.6 bits/entry (Elias-Fano)</span>
|
||||
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="n">CachelineEf</span><span class="o">>></span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: Elias-Fano</span>
|
||||
<span class="w"> </span><span class="n">Xx64</span><span class="p">,</span><span class="w"> </span><span class="c1">// hasher: XXH3-64 with seed</span>
|
||||
<span class="w"> </span><span class="nb">Vec</span><span class="o"><</span><span class="kt">u8</span><span class="o">></span><span class="p">,</span><span class="w"> </span><span class="c1">// pilots</span>
|
||||
<span class="o">></span><span class="p">;</span>
|
||||
</code></pre></div>
|
||||
<p><code>Xx64</code> is chosen over <code>FxHash</code> because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.</p>
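<p>The zero counts quoted above follow from 2 bits per nucleotide in a left-aligned 64-bit word: 64 − 2k low bits are always zero. The sketch below checks this arithmetic; the A=0, C=1, G=2, T=3 code assignment and both function names are assumptions made for illustration and may differ from the crate's encoding.</p>

```rust
/// Number of always-zero low bits for a left-aligned 2-bit kmer of length `k`.
pub fn structural_zero_bits(k: u32) -> u32 {
    assert!(k <= 32, "a u64 holds at most 32 nucleotides");
    64 - 2 * k
}

/// Pack an ACGT string into a left-aligned u64 (assumed code: A=0, C=1, G=2, T=3).
pub fn pack_left_aligned(seq: &str) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut v = 0u64;
    for c in seq.bytes() {
        let code = match c {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // reject ambiguous or lowercase symbols
        };
        v = (v << 2) | code;
    }
    let used = 2 * seq.len() as u32;
    // Shifting by 64 would panic, so the empty sequence is handled separately.
    Some(if used == 0 { 0 } else { v << (64 - used) })
}
```

<p>A hash that mixes only through a single multiplication carries these constant low bits into its output poorly, which is the motivation given for a stronger hasher.</p>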
|
||||
<p><code>CubicEps</code> with <code>PtrHashParams::<CubicEps>::default()</code> (λ=3.5) is a balanced tradeoff: 2× slower construction than <code>Linear/λ=3.0</code>, 20% less space.</p>
|
||||
<p><code>CubicEps</code> with <code>PtrHashParams::<CubicEps>::default()</code> (λ=3.5): 2× slower construction than <code>Linear/λ=3.0</code>, ~20% less space.</p>
|
||||
<hr />
|
||||
<h2 id="query-path">Query path</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="n">Hit</span><span class="o"><</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">>></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">mphf</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">kmer</span><span class="p">).</span><span class="n">map</span><span class="p">(</span><span class="o">|</span><span class="n">slot</span><span class="o">|</span><span class="w"> </span><span class="n">Hit</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">slot</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">self</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span><span class="w"> </span><span class="p">})</span>
|
||||
<h2 id="column-append-and-merge-support">Column append and merge support</h2>
|
||||
<p>These methods extend existing layers with new genome columns without touching the MPHF.</p>
|
||||
<h3 id="layer-level-genome-column-append">Layer-level genome column append</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentBitMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">append_genome_column</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">value_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentCompactIntMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">append_genome_column</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">value_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>MphfLayer::find</code> probes the MPHF, decodes evidence, and verifies the kmer — returning <code>Some(slot)</code> on match, <code>None</code> otherwise. <code>data.read(slot)</code> is called only on a confirmed hit.</p>
|
||||
<p>In <code>LayeredMap</code>, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.</p>
|
||||
<p>Both delegate to the corresponding <code>PersistentBitMatrix::append_column</code> / <code>PersistentCompactIntMatrix::append_column</code>. They write a new column file (<code>col_NNNNNN.pbiv</code> / <code>col_NNNNNN.pciv</code>) and update <code>meta.json</code> to increment <code>n_cols</code>. <code>value_of</code> is called once per slot (0..n).</p>
|
||||
<h3 id="presence-matrix-initialisation">Presence matrix initialisation</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">init_presence_matrix</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">n_kmers</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Called on the first merge of a Presence-mode index. Creates <code>presence/</code> with <code>meta.json {"n": n_kmers, "n_cols": 1}</code> and <code>col_000000.pbiv</code> set entirely to <code>true</code>. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.</p>
|
||||
<h3 id="why-the-mphf-is-never-rebuilt">Why the MPHF is never rebuilt</h3>
|
||||
<p>The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new <code>.pciv</code>/<code>.pbiv</code> file and a single <code>meta.json</code> update.</p>
|
||||
<hr />
|
||||
<h2 id="add-layer-algorithm">Add-layer algorithm</h2>
|
||||
<p>When adding dataset B to an existing index:</p>
|
||||
<ol>
|
||||
<li>For each partition, probe existing layers for kmers of B routed to that partition.</li>
|
||||
<li>Collect kmers absent from all layers → <code>B \ index</code>.</li>
|
||||
<li>Write <code>B \ index</code> to a new <code>unitigs.bin</code> via <code>MphfLayer::unitig_writer</code>.</li>
|
||||
<li>Call <code>Layer<D>::build</code> on the new directory.</li>
|
||||
<li>Update <code>meta.json</code>.</li>
|
||||
<li>Write <code>B \ index</code> to a new <code>unitigs.bin</code> via <code>next_layer_writer()</code>.</li>
|
||||
<li>Call <code>Layer<D>::build</code> (or <code>build_presence</code>) on the new layer directory.</li>
|
||||
<li>Call <code>push_layer</code> (or <code>append_layer</code>) to register the new layer in <code>meta.json</code>.</li>
|
||||
</ol>
|
||||
<p>Each partition's new layer is built independently; the operation is fully parallel across partitions.</p>
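<p>The steps above can be sketched for a single partition. This is a simplified model, assuming each layer is a plain <code>HashSet</code> of raw kmer values; the function <code>add_layer</code> is hypothetical, and skipping the new layer when nothing is novel is an assumption of this sketch rather than documented behaviour.</p>

```rust
use std::collections::HashSet;

/// Hypothetical model of one partition: every layer is a set of raw kmers.
/// Adds the kmers of `b` that no existing layer contains as a new layer and
/// returns its index, or None when `b` brings nothing new (assumed behaviour).
pub fn add_layer(layers: &mut Vec<HashSet<u64>>, b: &[u64]) -> Option<usize> {
    // Steps 1 and 2: probe existing layers, keep only kmers absent from all of them.
    let novel: HashSet<u64> = b
        .iter()
        .copied()
        .filter(|k| !layers.iter().any(|layer| layer.contains(k)))
        .collect();
    if novel.is_empty() {
        return None;
    }
    // Remaining steps: build the new layer from the novel kmers and register it.
    layers.push(novel);
    Some(layers.len() - 1)
}
```

<p>Because the new layer holds only kmers missing from every earlier layer, the disjointness of layers within a partition is preserved by construction.</p>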
|
||||
<hr />
|
||||
@@ -1433,11 +1902,15 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>memmap2 0.9</code></td>
|
||||
<td>mmap of evidence and payload files</td>
|
||||
<td>mmap of evidence and fingerprint files</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>bitvec</code></td>
|
||||
<td>packed b-bit fingerprint storage</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>obiskio</code></td>
|
||||
<td>unitig file writer/reader</td>
|
||||
<td>unitig file writer/reader + <code>.idx</code> build</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>obicompactvec</code></td>
|
||||
@@ -1448,8 +1921,8 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||||
<td>parallel MPHF construction pass</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>ndarray 0.16</code></td>
|
||||
<td>aggregation output arrays</td>
|
||||
<td><code>serde / serde_json</code></td>
|
||||
<td><code>PartitionMeta</code> serialisation</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
@@ -662,6 +662,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#make_pipe-dsl" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
make_pipe! DSL
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
@@ -801,6 +812,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -879,6 +918,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1087,6 +1182,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#make_pipe-dsl" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
make_pipe! DSL
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
@@ -1145,7 +1251,7 @@
|
||||
|
||||
|
||||
<h1 id="obipipeline-parallel-pipeline-library">obipipeline — parallel pipeline library</h1>
|
||||
<p><code>obipipeline</code> is a generic, multi-threaded data pipeline crate. It connects a <strong>source</strong>, a chain of <strong>transforms</strong>, and a <strong>sink</strong> via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.</p>
|
||||
<p><code>obipipeline</code> is a generic, multi-threaded data pipeline crate. It connects a <strong>source</strong>, a chain of <strong>stages</strong>, and a <strong>sink</strong> via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.</p>
|
||||
<h2 id="core-types">Core types</h2>
|
||||
<table>
|
||||
<thead>
|
||||
@@ -1158,22 +1264,33 @@
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>SourceFn<D></code></td>
|
||||
<td><code>Box<dyn FnMut() -> Result<D, PipelineError> + Send+Sync></code></td>
|
||||
<td><code>Box<dyn FnMut() -> Result<D, PipelineError> + Send></code></td>
|
||||
<td>Called repeatedly; <code>FnMut</code> because it holds iterator state</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>SharedFn<D></code></td>
|
||||
<td><code>Arc<dyn Fn(D) -> Result<D, PipelineError> + Send + Sync></code></td>
|
||||
<td>Shared across workers via <code>Arc::clone</code> (no copy of the closure)</td>
|
||||
<td>1→1 transform shared across workers via <code>Arc::clone</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>SharedFlatFn<D></code></td>
|
||||
<td><code>Arc<dyn Fn(D, &Sender<Result<D, _>>, &Sender<isize>) + Send + Sync></code></td>
|
||||
<td>1→N transform; pushes items into channel, sends delta</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>SinkFn<D></code></td>
|
||||
<td><code>Box<dyn Fn(D) -> Result<(), PipelineError> + Send+Sync></code></td>
|
||||
<td><code>Box<dyn Fn(D) -> Result<(), PipelineError> + Send></code></td>
|
||||
<td>Final consumer; returns <code>Result</code> so errors propagate back</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p><code>Pipeline<D></code> holds one <code>SourceFn</code>, a <code>Vec<SharedFn></code>, and one <code>SinkFn</code>.<br />
|
||||
<p>Stages come in two variants:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="nc">Stage</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">Transform</span><span class="p">(</span><span class="n">SharedFn</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="p">),</span><span class="w"> </span><span class="c1">// 1→1</span>
|
||||
<span class="w"> </span><span class="n">Flat</span><span class="p">(</span><span class="n">SharedFlatFn</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="p">),</span><span class="w"> </span><span class="c1">// 1→N</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>Pipeline<D></code> holds one <code>SourceFn</code>, a <code>Vec<Stage></code>, and one <code>SinkFn</code>.<br />
|
||||
<code>WorkerPool<D></code> wraps a <code>Pipeline</code> with <code>n_workers</code> and channel <code>capacity</code>.</p>
|
||||
<h2 id="workerpool">WorkerPool</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="n">WorkerPool</span><span class="p">::</span><span class="n">new</span><span class="p">(</span><span class="n">pipeline</span><span class="p">:</span><span class="w"> </span><span class="nc">Pipeline</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="p">,</span><span class="w"> </span><span class="n">n_workers</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">capacity</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">Self</span>
|
||||
@@ -1193,7 +1310,7 @@
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>capacity</code></td>
|
||||
<td>Bound on every crossbeam channel in the pipeline (source output, inter-stage channels, worker input, sink input, sink error). Controls memory and back-pressure: a full channel blocks the sender until a slot frees.</td>
|
||||
<td>Bound on every crossbeam channel in the pipeline. Controls memory and back-pressure: a full channel blocks the sender until a slot frees.</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
@@ -1208,7 +1325,7 @@
|
||||
</code></pre></div>
|
||||
<p>Each variant carries the concrete type for one stage's output. The macros pattern-match on this enum to route values between stages.</p>
|
||||
<h2 id="macros">Macros</h2>
|
||||
<p>Six low-level macros build individual stages; one high-level macro (<code>make_pipeline!</code>) composes them.</p>
|
||||
<p>Eight low-level macros build individual stages; one high-level macro (<code>make_pipeline!</code>) composes them.</p>
|
||||
<h3 id="low-level">Low-level</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="n">make_source</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields T</span>
|
||||
<span class="n">make_source_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields Result<T, E></span>
|
||||
@@ -1216,6 +1333,9 @@
|
||||
<span class="n">make_transform</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> U</span>
|
||||
<span class="n">make_transform_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> Result<U, E></span>
|
||||
|
||||
<span class="n">make_flat_transform</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> impl IntoIterator<Item=U></span>
|
||||
<span class="n">make_flat_transform_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> Result<impl IntoIterator<Item=U>, E></span>
|
||||
|
||||
<span class="n">make_sink</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> ()</span>
|
||||
<span class="n">make_sink_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> Result<(), E></span>
|
||||
</code></pre></div>
|
||||
@@ -1224,17 +1344,31 @@
|
||||
<div class="highlight"><pre><span></span><code>make_pipeline! {
|
||||
DataEnum,
|
||||
source iterator => OutputVariant, // or source? for fallible
|
||||
| func: In => Out, // non-fallible transform
|
||||
|? func: In => Out, // fallible transform
|
||||
| func: In => Out, // 1→1 non-fallible transform
|
||||
|? func: In => Out, // 1→1 fallible transform
|
||||
|| func: In => Out, // 1→N non-fallible flat transform
|
||||
||? func: In => Out, // 1→N fallible flat transform
|
||||
sink func @ InputVariant, // or sink? for fallible
|
||||
}
|
||||
</code></pre></div>
|
||||
<p><code>?</code> marks fallibility on source, individual transforms, or sink independently.<br />
|
||||
Implemented as a <strong>TT muncher</strong>: the internal rule <code>@build</code> recurses over transform tokens one at a time, accumulating them into a <code>vec![]</code>, then terminates on <code>sink</code>/<code>sink?</code>.</p>
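<p>The TT-muncher pattern can be illustrated with a much smaller macro. The sketch below is not the real <code>make_pipeline!</code>: <code>describe_pipeline!</code> is a hypothetical macro that only collects a textual description of each stage, and it handles just the <code>|</code>, <code>|?</code> and <code>sink</code> forms.</p>

```rust
/// Hypothetical miniature of the `@build` recursion. It accumulates one
/// string per stage instead of building real pipeline stages.
macro_rules! describe_pipeline {
    // Terminal rule: `sink` ends the recursion and emits the accumulated vec.
    (@build [$($acc:expr),*] sink $f:ident @ $i:ident $(,)?) => {
        vec![$($acc,)* format!("sink {} @ {}", stringify!($f), stringify!($i))]
    };
    // Fallible transform: consume one `|? func: In => Out,` and recurse.
    (@build [$($acc:expr),*] |? $f:ident : $i:ident => $o:ident, $($rest:tt)*) => {
        describe_pipeline!(@build
            [$($acc,)* format!("fallible {}: {} => {}",
                stringify!($f), stringify!($i), stringify!($o))]
            $($rest)*)
    };
    // Non-fallible transform: consume one `| func: In => Out,` and recurse.
    (@build [$($acc:expr),*] | $f:ident : $i:ident => $o:ident, $($rest:tt)*) => {
        describe_pipeline!(@build
            [$($acc,)* format!("transform {}: {} => {}",
                stringify!($f), stringify!($i), stringify!($o))]
            $($rest)*)
    };
    // Entry point: start with an empty accumulator.
    ($($tokens:tt)*) => {
        describe_pipeline!(@build [] $($tokens)*)
    };
}
```

<p>Each recursive call removes one stage from the front of the token stream and appends it to the bracketed accumulator, so the stage order of the DSL is preserved in the resulting vector.</p>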
|
||||
<h3 id="make_pipe-dsl">make_pipe! DSL</h3>
|
||||
<p><code>make_pipe!</code> builds a sourceless/sinkless <code>Pipe<D, In, Out></code> — a reusable, composable stage sequence:</p>
|
||||
<div class="highlight"><pre><span></span><code>make_pipe! {
|
||||
DataEnum : InType => OutType,
|
||||
| func: InVariant => OutVariant,
|
||||
|? func: InVariant => OutVariant,
|
||||
|| func: InVariant => OutVariant,
|
||||
||? func: InVariant => OutVariant,
|
||||
}
|
||||
</code></pre></div>
|
||||
<p>Two pipes compose with <code>.then(other)</code>. Apply to an iterator with <code>.apply(iter, n_workers, capacity)</code> to get a <code>PipeIter<Out></code> — an iterator over the pipeline output, backed by a background <code>WorkerPool</code>. The scatter step in <code>obikmer</code> uses <code>make_pipe!</code> and <code>.apply()</code> rather than the full <code>make_pipeline!</code> / <code>WorkerPool</code> pattern.</p>
|
||||
<h2 id="scheduler-architecture">Scheduler architecture</h2>
|
||||
<div class="highlight"><pre><span></span><code>Source thread ──► [source_rx] ──► Scheduler ──► [worker_tx] ──► Workers (×N)
|
||||
▲ │
|
||||
[stage_rxs] ────────┘◄──────────────────────────────┘
|
||||
[flat_delta_rx] ──► Scheduler (in_flight adjustment)
|
||||
│
|
||||
[sink_err_rx] ← errors from sink (highest priority)
|
||||
│
|
||||
@@ -1242,20 +1376,20 @@ Implemented as a <strong>TT muncher</strong>: the internal rule <code>@build</co
|
||||
</code></pre></div>
|
||||
<p>The scheduler is a single thread running a biased <code>Select</code> over all input channels. Priority order (highest first):</p>
|
||||
<div class="highlight"><pre><span></span><code>index 0 sink_err_rx abort on sink error
|
||||
index 1 stage_rxs[N-1] drain last stage first
|
||||
...
|
||||
index N stage_rxs[0]
|
||||
index N+1 source_rx pull new data last
|
||||
index 1 flat_delta_rx adjust in_flight before dispatching
|
||||
index 2..=n+1 stage_rxs[n-1..0] drain last stage first
|
||||
index n+2 source_rx pull new data last
|
||||
</code></pre></div>
|
||||
<p>This back-pressure-friendly ordering ensures downstream stages are drained before new items enter the pipeline.</p>
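<p>The effect of a biased selection can be sketched with standard-library channels. This is an approximation under stated assumptions: the scheduler described here uses a blocking crossbeam <code>Select</code>, whereas the hypothetical <code>poll_biased</code> below is a non-blocking poll over <code>std::sync::mpsc</code> receivers, shown only to make the priority order concrete.</p>

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical helper: polls receivers in slice order, so index 0 has the
/// highest priority. Returns the first ready message with its index, or
/// `None` when every channel is empty or disconnected.
pub fn poll_biased<T>(rxs: &[Receiver<T>]) -> Option<(usize, T)> {
    for (idx, rx) in rxs.iter().enumerate() {
        // `try_recv` never blocks; an error means "nothing to take here".
        if let Ok(msg) = rx.try_recv() {
            return Some((idx, msg));
        }
    }
    None
}
```

<p>With the receivers ordered as in the table above, a pending stage output is always taken before a pending source item, and a sink error is taken before either.</p>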
|
||||
<p><strong>Workers</strong> are generic: each receives <code>(data, SharedFn, result_tx)</code> and calls <code>f(data)</code>, sending the result to the provided channel. The scheduler decides which transform to apply and where to route the result.</p>
|
||||
<p><strong>Termination</strong> uses an <code>in_flight</code> counter:</p>
|
||||
<p><strong>Workers</strong> are generic: each receives a <code>WorkerTask</code> — either <code>Transform(data, stage_idx)</code> or <code>Flat(data, stage_idx)</code>. For <code>Transform</code>, the worker calls <code>f(data)</code> and sends the result to <code>stage_txs[stage_idx]</code>. For <code>Flat</code>, the worker calls <code>f(data, &push_tx, &delta_tx)</code>: the closure pushes N items into <code>push_tx</code> then sends <code>N-1</code> to <code>delta_tx</code>. The scheduler uses the delta to adjust <code>in_flight</code> without knowing N in advance.</p>
|
||||
<p><strong>Termination</strong> uses an <code>in_flight: isize</code> counter and a <code>flat_workers_active: usize</code> counter:</p>
|
||||
<ul>
|
||||
<li>incremented when an item is dispatched from source to workers</li>
|
||||
<li>decremented when the item exits the last stage</li>
|
||||
<li>the loop exits only when <code>source_done && in_flight == 0</code></li>
|
||||
<li><code>in_flight</code> incremented when an item is dispatched from source to workers</li>
|
||||
<li><code>in_flight</code> decremented when the item exits the last stage to the sink</li>
|
||||
<li><code>flat_workers_active</code> incremented when a <code>Flat</code> task is dispatched, decremented when the delta arrives</li>
|
||||
<li>the loop exits only when <code>source_done && in_flight == 0 && flat_workers_active == 0</code></li>
|
||||
</ul>
|
||||
<p>This guarantees all in-flight items complete before <code>join()</code>.</p>
|
||||
<p>This guarantees all in-flight items complete (including all N outputs of a flat stage) before <code>join()</code>.</p>
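<p>The counter bookkeeping can be modelled in isolation. The sketch below is a hypothetical <code>Termination</code> struct, not the scheduler's actual state; the method names are assumptions chosen to mirror the bullet points above.</p>

```rust
/// Hypothetical model of the scheduler's termination state.
#[derive(Debug, Default)]
pub struct Termination {
    source_done: bool,
    in_flight: isize,
    flat_workers_active: usize,
}

impl Termination {
    /// An item leaves the source and is handed to a worker.
    pub fn dispatch_from_source(&mut self) {
        self.in_flight += 1;
    }
    /// A `Flat` task is handed to a worker.
    pub fn dispatch_flat(&mut self) {
        self.flat_workers_active += 1;
    }
    /// The flat worker reports `delta = N - 1` after pushing N items
    /// (delta is -1 when the flat stage produced nothing).
    pub fn flat_delta(&mut self, delta: isize) {
        self.in_flight += delta;
        self.flat_workers_active -= 1;
    }
    /// An item exits the last stage to the sink.
    pub fn item_reached_sink(&mut self) {
        self.in_flight -= 1;
    }
    pub fn source_exhausted(&mut self) {
        self.source_done = true;
    }
    /// Exit condition of the scheduler loop.
    pub fn finished(&self) -> bool {
        self.source_done && self.in_flight == 0 && self.flat_workers_active == 0
    }
}
```

<p>Sending <code>N-1</code> rather than <code>N</code> works because the input item was already counted once when it was dispatched from the source.</p>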
|
||||
<h2 id="error-handling">Error handling</h2>
|
||||
<p><code>PipelineError</code> has four variants:</p>
|
||||
<table>
|
||||
@@ -1279,7 +1413,7 @@ index N+1 source_rx pull new data last
|
||||
<td>Internal routing error</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>StepError(Box<dyn Error>)</code></td>
|
||||
<td><code>StepError(Box<dyn Error + Send + Sync>)</code></td>
|
||||
<td>Error from user code (wrapped by <code>make_*_fallible!</code>)</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
|
||||
File diff suppressed because it is too large
@@ -12,7 +12,7 @@
|
||||
<link rel="prev" href="../persistent_compact_int_vec/">
|
||||
|
||||
|
||||
<link rel="next" href="../../architecture/sequences/invariant/">
|
||||
<link rel="next" href="../merge/">
|
||||
|
||||
|
||||
|
||||
@@ -649,6 +649,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -1002,6 +1030,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -649,6 +649,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -985,6 +1013,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -773,6 +773,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -851,6 +879,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1104,7 +1188,9 @@
|
||||
<li><strong>error valley</strong> → suggests min_count (typically the local minimum between the error peak and the coverage peak)</li>
|
||||
</ul>
|
||||
<h2 id="phase-1-scatter">Phase 1 — Scatter</h2>
|
||||
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:</p>
|
||||
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.</p>
|
||||
<p>Input files are read via <code>open_nuc_stream</code>, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of <code>NucPage</code> buffers. Each <code>NucPage</code> is a flat 64 KB buffer of normalised bytes (<code>ACGT</code> + <code>\x00</code> separators), carrying a k−1 byte overlap from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional — the scatter phase only needs normalised bases to produce superkmers.</p>
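<p>Why a k−1 byte overlap is sufficient can be checked with a small model. This is a sketch under simplifying assumptions: <code>pages_with_overlap</code> and <code>kmers</code> are hypothetical helpers, the page size is a free parameter rather than 64 KB, and the <code>\x00</code> record separators are ignored.</p>

```rust
/// Hypothetical helper: splits `seq` into pages of at most `page_size`
/// bytes, where every page after the first starts with the last k-1 bytes
/// of the previous page.
pub fn pages_with_overlap(seq: &[u8], page_size: usize, k: usize) -> Vec<Vec<u8>> {
    assert!(k >= 1 && page_size >= k, "a page must hold at least one k-mer");
    // Number of new bytes each page contributes beyond the overlap.
    let step = page_size - (k - 1);
    let mut pages = Vec::new();
    let mut start = 0;
    while start < seq.len() {
        let end = (start + page_size).min(seq.len());
        pages.push(seq[start..end].to_vec());
        if end == seq.len() {
            break;
        }
        start += step;
    }
    pages
}

/// All k-mers of a buffer, in order.
pub fn kmers(buf: &[u8], k: usize) -> Vec<&[u8]> {
    buf.windows(k).collect()
}
```

<p>In this model, concatenating the k-mers of every page yields exactly the k-mers of the unsplit sequence, in order and without duplicates: a k-mer that straddles a boundary is fully contained in the next page thanks to the overlap.</p>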
|
||||
<p>For each read fragment within a page:</p>
|
||||
<ol>
|
||||
<li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base; discard fragments shorter than k.</li>
|
||||
<li><strong>Entropy filter</strong>: scan each fragment with a sliding window of size k. When the kmer <span class="arithmatex">\(K_i = S[i \mathinner{..} i+k-1]\)</span> ended by nucleotide <span class="arithmatex">\(S[j]\)</span> (with <span class="arithmatex">\(j = i+k-1\)</span>) has entropy below threshold <span class="arithmatex">\(\theta\)</span>, emit the current segment and start a new one (see algorithm below). <span class="arithmatex">\(K_i\)</span> belongs to neither segment, and no valid kmer is lost.</li>
|
||||
@@ -1154,8 +1240,13 @@ B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
|
||||
for each kmer in sequence:
|
||||
kmer_counts[canonical(kmer)] += COUNT
|
||||
</code></pre></div>
|
||||
<p>Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.</p>
|
||||
<p>Abundance filter applied here: kmers with <code>total_count < q</code> are discarded. <code>q</code> is a collection parameter (0 = keep all, including singletons for ≤1x data).</p>
|
||||
<p>Implemented as a three-step pipeline in <code>count_partition()</code>:</p>
|
||||
<ol>
|
||||
<li><strong>External sort</strong> (<code>kmer_sort::sort_unique_kmers</code>): read dereplicated superkmers, extract canonical kmer raw <code>u64</code> values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → <code>sorted_unique.bin</code>. The number of distinct canonical kmers (f0) is now known exactly.</li>
|
||||
<li><strong>Provisional MPHF</strong> (ptr_hash): built from <code>sorted_unique.bin</code> via <code>new_from_par_iter(f0, ...)</code>. Stored to <code>mphf1.bin</code>; <code>sorted_unique.bin</code> deleted immediately.</li>
|
||||
<li><strong>Accumulation pass</strong>: re-read dereplicated superkmers; for each kmer, <code>slot = mphf.index(kmer.raw())</code>, increment <code>counts1[slot]</code> by the superkmer COUNT. Stored in a <code>PersistentCompactIntVec</code> (<code>counts1.bin</code>).</li>
|
||||
</ol>
|
||||
<p>At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (<code>spectrums/{label}.json</code>) is written to the index root.</p>
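<p>The three steps can be mimicked in memory to show the data flow. This is a sketch, not <code>count_partition()</code> itself: <code>count_partition_sketch</code> is a hypothetical function, a binary search over the sorted unique kmers stands in for the ptr_hash MPHF, and plain vectors replace the on-disk files.</p>

```rust
/// Hypothetical in-memory superkmer: its canonical kmer raw values and COUNT.
pub type SuperKmer = (Vec<u64>, u32);

/// Returns the sorted unique kmers and, at the same index, their total count.
pub fn count_partition_sketch(superkmers: &[SuperKmer]) -> (Vec<u64>, Vec<u32>) {
    // Step 1: sort and deduplicate all kmer values (in RAM here, external
    // sort in the real pipeline). The length of the result is f0.
    let mut sorted_unique: Vec<u64> = superkmers
        .iter()
        .flat_map(|(kmers, _)| kmers.iter().copied())
        .collect();
    sorted_unique.sort_unstable();
    sorted_unique.dedup();

    // Step 2 (stand-in): the position in `sorted_unique` plays the role of
    // the MPHF slot; both map each distinct kmer to a unique index in [0, f0).

    // Step 3: accumulation pass over the superkmers.
    let mut counts = vec![0u32; sorted_unique.len()];
    for (kmers, count) in superkmers {
        for kmer in kmers {
            let slot = sorted_unique
                .binary_search(kmer)
                .expect("every kmer was inserted in step 1");
            counts[slot] += *count;
        }
    }
    (sorted_unique, counts)
}
```

<p>A kmer that appears in several superkmers receives the sum of their COUNT values, which is why the superkmers have to be read a second time after the slot mapping exists.</p>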
|
||||
<p>No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.</p>
|
||||
<h2 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer compaction</h2>
|
||||
<p>The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:</p>
|
||||
@@ -1188,14 +1279,52 @@ branching / dead-end → unitig start or end
|
||||
<p>Output: <code>unitigs.bin</code> — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.</p>
|
||||
<p><strong>Scope of local unitigs:</strong> these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.</p>
|
||||
<h2 id="phase-6-mphf-construction-and-index-finalisation">Phase 6 — MPHF construction and index finalisation</h2>
|
||||
<p>Built once on the definitive kmer set (all kmers in all unitigs of the partition). See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for the current implementation.</p>
|
||||
<div class="highlight"><pre><span></span><code>kmers from unitigs → MPHF → mphf.bin
|
||||
→ evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
|
||||
→ payload : counts/ (mode 2) or presence/ (mode 3)
|
||||
<p><code>build_index_layer</code> is called per partition (in parallel via <code>build_layers</code>) with the following parameters sourced from <code>IndexConfig</code>:</p>
|
||||
<ul>
|
||||
<li><code>block_bits</code> — from <code>IndexConfig::block_bits</code>; controls the <code>.idx</code> block size (2^block_bits unitig chunks per block) for exact evidence</li>
|
||||
<li><code>evidence</code> — <code>EvidenceKind::Exact</code> or <code>EvidenceKind::Approx { b, z }</code>; propagated unchanged from <code>IndexConfig::evidence</code></li>
|
||||
<li><code>min_ab</code> / <code>max_ab</code> — abundance bounds applied before graph construction</li>
|
||||
<li><code>with_counts</code> — whether to store kmer counts alongside set membership</li>
|
||||
</ul>
|
||||
<p><strong>Abundance filtering:</strong> when <code>min_ab > 1</code> or <code>max_ab.is_some()</code>, the provisional <code>mphf1.bin</code> and <code>counts1.bin</code> produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in <code>counts1</code> satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).</p>
|
||||
<div class="highlight"><pre><span></span><code>for each kmer in dereplicated super-kmer:
|
||||
ab = counts1[mphf1.index(kmer.raw())]
|
||||
if ab < min_ab || ab > max_ab: skip
|
||||
graph.push(kmer)
|
||||
</code></pre></div>
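<p>The pseudocode above treats <code>max_ab</code> as a plain number, while the text implies it is optional (<code>max_ab.is_some()</code>). A sketch of the same test with an optional upper bound follows; <code>within_bounds</code> and <code>filter_by_abundance</code> are hypothetical names, and a <code>HashMap</code> stands in for the <code>mphf1</code> + <code>counts1</code> lookup.</p>

```rust
use std::collections::HashMap;

/// Hypothetical helper: true when `ab` satisfies the abundance bounds.
/// `None` for `max_ab` means "no upper bound".
pub fn within_bounds(ab: u32, min_ab: u32, max_ab: Option<u32>) -> bool {
    ab >= min_ab && max_ab.map_or(true, |max| ab <= max)
}

/// Keeps the kmers whose count satisfies the bounds. Kmers missing from
/// `counts` are treated as count 0 in this sketch.
pub fn filter_by_abundance(
    kmers: &[u64],
    counts: &HashMap<u64, u32>,
    min_ab: u32,
    max_ab: Option<u32>,
) -> Vec<u64> {
    kmers
        .iter()
        .copied()
        .filter(|kmer| {
            let ab = counts.get(kmer).copied().unwrap_or(0);
            within_bounds(ab, min_ab, max_ab)
        })
        .collect()
}
```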
|
||||
<p>The MPHF is built in two passes over <code>unitigs.bin</code>: parallel pass for <code>mphf.bin</code>, sequential pass for <code>evidence.bin</code> and payload. The exact kmer count is available from the unitig index (<code>unitigs.bin.idx</code>) before the passes begin.</p>
|
||||
<p><strong>Exact verification via unitig evidence:</strong></p>
|
||||
<p><code>unitigs.bin</code> serves as the evidence structure. The MPHF maps every input to <code>[0, N)</code> including absent kmers — the unitig read-back (via <code>evidence.bin</code>) is the only correct membership test.</p>
|
||||
<p><strong>Graph build and unitig write:</strong></p>
|
||||
<p>The surviving kmers are fed into <code>GraphDeBruijn</code>, which computes degrees and yields unitigs. Unitigs are written to <code>layer_0/unitigs.bin</code> via a <code>UnitigFileWriter</code>.</p>
|
||||
<p><strong>MPHF and evidence build:</strong></p>
|
||||
<p><code>Layer::build</code> (membership-only) or <code>Layer::<PersistentCompactIntMatrix>::build</code> (with counts) is called next. Internally, <code>MphfLayer::build</code> performs two passes:</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1 (parallel):</strong> build <code>unitigs.bin.idx</code> (block size = 2^<code>block_bits</code>) then construct the MPHF from all canonical kmers in <code>unitigs.bin</code>; store to <code>mphf.bin</code>.</li>
|
||||
<li><strong>Pass 2 (sequential):</strong> for each kmer in <code>unitigs.bin</code>, compute its slot and write <code>evidence.bin</code> (<code>chunk_id: 25 bits | rank: 7 bits</code> packed into a <code>u32</code>); also invoke the payload callback (<code>fill_slot</code>) to populate <code>counts/</code> if <code>with_counts</code>.</li>
|
||||
</ol>
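<p>The 25 + 7 bit packing of an evidence entry can be sketched as follows. The bit order is an assumption: this example places <code>chunk_id</code> in the high 25 bits and <code>rank</code> in the low 7 bits, which matches the order in which the fields are listed but is not confirmed by this page. The function names are hypothetical.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Packs (chunk_id, rank) into one u32, or returns None if either field
/// does not fit in its bit budget. Assumed layout: chunk_id high, rank low.
pub fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (chunk_id, rank).
pub fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

<p>Seven bits of rank address up to 128 kmer positions inside one chunk, and 25 bits address up to 2^25 chunks per layer.</p>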
|
||||
<p>After <code>Layer::build</code> completes, <code>layer_meta.json</code> records <code>EvidenceKind::Exact</code>.</p>
|
||||
<p><strong>Approximate evidence override:</strong></p>
|
||||
<p>If <code>evidence</code> is <code>EvidenceKind::Approx { b, z }</code>, <code>build_approx_evidence</code> is called immediately after <code>Layer::build</code>. It overwrites the exact evidence bundle with <code>fingerprint.bin</code> (b-bit hash per slot) and rewrites <code>layer_meta.json</code> with <code>EvidenceKind::Approx { b, z }</code>. No <code>.idx</code> file is needed at query time in this mode.</p>
|
||||
<div class="highlight"><pre><span></span><code>// Exact path → evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
|
||||
// Approx path → fingerprint.bin + layer_meta.json(Approx{b,z})
|
||||
// (evidence.bin left on disk but not used)
|
||||
</code></pre></div>
|
||||
<p><strong>Partition metadata:</strong></p>
|
||||
<p>After all layer files are written, <code>PartitionMeta { n_layers: 1 }</code> is serialised to <code>index/meta.json</code> inside the partition directory. This file is required by <code>LayeredMap::open</code> for subsequent merge operations.</p>
|
||||
<p><strong>File layout per partition after phase 6:</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>part_XXXXX/
|
||||
index/
|
||||
meta.json ← PartitionMeta { n_layers: 1 }
|
||||
layer_0/
|
||||
unitigs.bin ← permanent evidence (all modes)
|
||||
unitigs.bin.idx ← block index (exact mode only)
|
||||
mphf.bin ← MPHF
|
||||
evidence.bin ← exact evidence (exact mode)
|
||||
fingerprint.bin ← b-bit fingerprints (approx mode)
|
||||
layer_meta.json ← EvidenceKind tag
|
||||
counts/ ← PersistentCompactIntMatrix (with_counts only)
|
||||
</code></pre></div>
|
||||
<p><strong>Cleanup:</strong> unless <code>--keep-intermediate</code> is set, <code>remove_build_artifacts</code> deletes <code>dereplicated.skmer.zst</code>, <code>mphf1.bin</code>, and <code>counts1.bin</code> after all partitions are indexed.</p>
|
||||
<p>See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for data structure details.</p>
|
||||
<p><strong>Query path (exact evidence):</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>query kmer q
|
||||
→ canonical_minimizer(q) → hash → PART → part_XXXXX/
|
||||
→ MPHF(q) → slot s
|
||||
@@ -1204,7 +1333,13 @@ branching / dead-end → unitig start or end
|
||||
→ match : return payload[s] ← exact hit
|
||||
→ no match: kmer absent ← MPHF collision on absent kmer
|
||||
</code></pre></div>
|
||||
<p><code>superkmers.bin.gz</code> is no longer needed at this point and can be deleted.</p>
|
||||
<p><strong>Query path (approximate evidence):</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>query kmer q
|
||||
→ MPHF(q) → slot s
|
||||
→ fingerprint[s] matches seq_hash(q)?
|
||||
→ yes : probable hit (FP rate = 1/2^b per kmer, 1/2^(b·z) per z-window)
|
||||
→ no : kmer absent
|
||||
</code></pre></div>
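<p>A sketch of the fingerprint test and the quoted false-positive rates. Everything here is illustrative: <code>fingerprint</code> uses the standard library's <code>DefaultHasher</code> as a stand-in for <code>seq_hash</code>, the fingerprints live in a plain slice instead of <code>fingerprint.bin</code>, and the <code>1/2^(b·z)</code> figure assumes the z kmers of a window collide independently.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for `seq_hash`: the low `b` bits of a 64-bit hash of the kmer.
pub fn fingerprint(kmer: u64, b: u32) -> u64 {
    assert!((1..=63).contains(&b), "b must be between 1 and 63 in this sketch");
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish() & ((1u64 << b) - 1)
}

/// Membership test at a slot already computed by the MPHF.
/// `true` means "probably present"; `false` means "definitely absent".
pub fn probable_hit(fingerprints: &[u64], slot: usize, query: u64, b: u32) -> bool {
    fingerprints[slot] == fingerprint(query, b)
}

/// Expected false-positive rate for a window of `z` kmers, assuming
/// independent b-bit fingerprints: 1 / 2^(b*z).
pub fn false_positive_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

<p>For example, <code>b = 8</code> gives a per-kmer rate of 1/256, and requiring two consecutive matches (<code>z = 2</code>) lowers it to 1/65536 under the independence assumption.</p>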
|
||||
<div class="footnote">
|
||||
<hr />
|
||||
<ol>
|
||||
|
||||
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -64,7 +64,7 @@
|
||||
<div data-md-component="skip">
|
||||
|
||||
|
||||
<a href="#on-disk-collection-structure" class="md-skip">
|
||||
<a href="#on-disk-index-layout" class="md-skip">
|
||||
Skip to content
|
||||
</a>
|
||||
|
||||
@@ -575,6 +575,24 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__link md-nav__link--active" for="__toc">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
On-disk storage
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
</label>
|
||||
|
||||
<a href="./" class="md-nav__link md-nav__link--active">
|
||||
|
||||
|
||||
@@ -592,6 +610,174 @@
|
||||
|
||||
</a>
|
||||
|
||||
|
||||
|
||||
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#directory-tree" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Directory tree
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#state-machine-sentinels" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
State machine (sentinels)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexmeta-indexmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
index.meta (IndexMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-files" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer files
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer files">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin.idx (Exact only)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphfbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
mphf.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer_metajson-layermeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
layer_meta.json (LayerMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidencebin-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
evidence.bin (Exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintbin-approx" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
fingerprint.bin (Approx)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
counts/ (PersistentCompactIntMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-persistentbitmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
presence/ (PersistentBitMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#metajson-partitionmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
meta.json (PartitionMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -659,6 +845,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -737,6 +951,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -874,6 +1144,163 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#directory-tree" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Directory tree
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#state-machine-sentinels" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
State machine (sentinels)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexmeta-indexmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
index.meta (IndexMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-files" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer files
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer files">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin.idx (Exact only)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphfbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
mphf.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer_metajson-layermeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
layer_meta.json (LayerMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidencebin-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
evidence.bin (Exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintbin-approx" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
fingerprint.bin (Approx)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
counts/ (PersistentCompactIntMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-persistentbitmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
presence/ (PersistentBitMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#metajson-partitionmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
meta.json (PartitionMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
</div>
|
||||
</div>
|
||||
@@ -889,9 +1316,131 @@
|
||||
|
||||
|
||||
|
||||
<h1 id="on-disk-collection-structure">On-disk collection structure</h1>
|
||||
<p>See <a href="../obilayeredmap/">obilayeredmap crate</a> for the current on-disk layout.</p>
|
||||
<p>The index root contains one <code>part_XXXXX/</code> directory per partition, each holding one or more <code>layer_N/</code> directories. Each layer directory contains <code>mphf.bin</code>, <code>unitigs.bin</code>, <code>unitigs.bin.idx</code>, <code>evidence.bin</code>, and optionally a <code>counts/</code> or <code>presence/</code> payload directory.</p>
|
||||
<h1 id="on-disk-index-layout">On-disk index layout</h1>
|
||||
<h2 id="directory-tree">Directory tree</h2>
|
||||
<div class="highlight"><pre><span></span><code>&lt;index_root&gt;/
  index.meta                    ← JSON: IndexMeta
  scatter.done                  ← sentinel: scatter phase complete
  count.done                    ← sentinel: dereplicate + count complete
  index.done                    ← sentinel: MPHF index fully built
  spectrums/
    &lt;label&gt;.json                ← kmer frequency spectrum per genome
  partitions/
    part_00000/                 ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
      index/
        meta.json               ← PartitionMeta { n_layers }
        layer_0/
          unitigs.bin           ← binary unitig sequences (2-bit packed)
          unitigs.bin.idx       ← block-sampled offset index (exact evidence only)
          mphf.bin              ← serialised PtrHash MPHF
          layer_meta.json       ← LayerMeta { evidence: EvidenceKind }
          evidence.bin          ← chunk_id:rank per MPHF slot (Exact only)
          fingerprint.bin       ← b-bit fingerprints per MPHF slot (Approx only)
          counts/               ← PersistentCompactIntMatrix (if with_counts=true)
          presence/             ← PersistentBitMatrix (if presence mode, merge)
        layer_1/                ← added by merge; same structure as layer_0
        layer_2/ …
    part_00001/ …
</code></pre></div>
|
||||
<h2 id="state-machine-sentinels">State machine (sentinels)</h2>
|
||||
<p>The sentinels are touched atomically at the end of each pipeline stage.
|
||||
A partial run (e.g. scatter interrupted) leaves no sentinel; the state is
|
||||
detected as the most advanced sentinel present.</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>State</th>
|
||||
<th>Sentinel present</th>
|
||||
<th>Meaning</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>Empty</code></td>
|
||||
<td>—</td>
|
||||
<td><code>index.meta</code> exists; scatter not started or interrupted</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>Scattered</code></td>
|
||||
<td><code>scatter.done</code></td>
|
||||
<td>All super-kmers routed to partition files</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>Counted</code></td>
|
||||
<td><code>count.done</code></td>
|
||||
<td>Partitions dereplicated; <code>spectrums/</code> written</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>Indexed</code></td>
|
||||
<td><code>index.done</code></td>
|
||||
<td>All MPHF layers built; index ready for queries</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
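<p>The table above can be read as a lookup from sentinel files to a state. The following is a minimal sketch, not the crate's code: the names <code>IndexState</code> and <code>detect_state</code> are hypothetical, and it assumes that sentinels accumulate (an indexed directory still holds <code>scatter.done</code> and <code>count.done</code>), so the reported state is the last stage whose sentinel exists.</p>

```rust
use std::path::Path;

/// Pipeline state, ordered from least to most advanced (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum IndexState {
    Empty,
    Scattered,
    Counted,
    Indexed,
}

/// Returns the most advanced state whose sentinel file exists under `root`.
/// Sentinels are probed from the last pipeline stage back to the first.
pub fn detect_state(root: &Path) -> IndexState {
    if root.join("index.done").exists() {
        IndexState::Indexed
    } else if root.join("count.done").exists() {
        IndexState::Counted
    } else if root.join("scatter.done").exists() {
        IndexState::Scattered
    } else {
        // No sentinel: scatter not started, or interrupted before completion.
        IndexState::Empty
    }
}
```

<p>Because the enum derives <code>Ord</code>, a caller can write checks such as <code>detect_state(root) &gt;= IndexState::Counted</code> before reading <code>spectrums/</code>.</p>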
|
||||
<h2 id="indexmeta-indexmeta">index.meta (IndexMeta)</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span>
|
||||
<span class="w"> </span><span class="nt">"version"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"config"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="nt">"kmer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">31</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"minimizer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">11</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"n_bits"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"with_counts"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Exact"</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"block_bits"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span>
|
||||
<span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="nt">"genomes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
|
||||
<span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"label"</span><span class="p">:</span><span class="w"> </span><span class="s2">"genome_A"</span><span class="p">,</span><span class="w"> </span><span class="nt">"meta"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"species"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Homo sapiens"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="w"> </span><span class="p">]</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>n_bits</code> determines the partition count: <code>2^n_bits</code> directories under <code>partitions/</code>.</p>
|
||||
<p><code>evidence</code> is either the string <code>"Exact"</code> or <code>{"Approx": {"b": 8, "z": 1}}</code>.</p>
|
||||
<p><code>block_bits</code> controls the <code>.idx</code> granularity: one offset entry every <code>2^block_bits</code>
|
||||
chunks. <code>block_bits=0</code> stores one entry per chunk (O(1) random access, largest <code>.idx</code>).</p>
|
||||
<p><code>GenomeInfo.meta</code> is a free-form string→string map for categorical metadata (e.g.
|
||||
taxonomy, sample origin). It is optional; defaults to empty.</p>
|
||||
<h2 id="layer-files">Layer files</h2>
|
||||
<h3 id="unitigsbin">unitigs.bin</h3>
|
||||
<p>2-bit packed binary unitig sequences. Each record: 1 byte <code>seql_minus_k</code>
|
||||
(nucleotide length − k), followed by <code>ceil((seql_minus_k + k) / 4)</code> bytes of
|
||||
packed sequence. Long unitigs are transparently split into overlapping chunks
|
||||
(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.</p>
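<p>As a worked check of the record size formula, here is a small sketch. The function names <code>record_len</code> and <code>record_kmers</code> are hypothetical helpers introduced for illustration; only the arithmetic comes from the description above.</p>

```rust
/// Total size in bytes of one `unitigs.bin` record:
/// 1 length byte followed by ceil((seql_minus_k + k) / 4) packed bytes.
pub fn record_len(seql_minus_k: u8, k: usize) -> usize {
    let seql = seql_minus_k as usize + k; // nucleotide length of the chunk
    1 + (seql + 3) / 4 // integer ceiling of seql / 4
}

/// Number of k-mers held by the record: seql - k + 1.
pub fn record_kmers(seql_minus_k: u8) -> usize {
    seql_minus_k as usize + 1
}
```

<p>With k = 31, the shortest record (a single k-mer) takes 9 bytes, and the longest one that fits the 1-byte length field (286 nucleotides, 256 k-mers) takes 73 bytes.</p>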
|
||||
<h3 id="unitigsbinidx-exact-only">unitigs.bin.idx (Exact only)</h3>
|
||||
<p>Magic <code>UIX3</code>, little-endian header: <code>block_bits</code> (u32), <code>n_unitigs</code> (u32),
|
||||
<code>n_kmers</code> (u64), then <code>ceil(n_unitigs / 2^block_bits) + 1</code> byte-offset entries
|
||||
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.</p>
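<p>A reader for this header could look like the sketch below. It assumes the 4-byte magic sits at offset 0 and the three fields follow contiguously with no padding (20 bytes in total); the names <code>IdxHeader</code> and <code>parse_idx_header</code>, and the sanity check on <code>block_bits</code>, are illustrative and not taken from the crate.</p>

```rust
/// Header fields of `unitigs.bin.idx` (hypothetical struct for illustration).
#[derive(Debug, PartialEq, Eq)]
pub struct IdxHeader {
    pub block_bits: u32,
    pub n_unitigs: u32,
    pub n_kmers: u64,
}

impl IdxHeader {
    /// Number of u32 offset entries after the header, sentinel included:
    /// ceil(n_unitigs / 2^block_bits) + 1.
    pub fn n_offsets(&self) -> u64 {
        let block = 1u64 << self.block_bits;
        (self.n_unitigs as u64 + block - 1) / block + 1
    }
}

/// Parses the header from the start of `buf`; returns None on a short
/// buffer, a wrong magic, or an implausible `block_bits`.
pub fn parse_idx_header(buf: &[u8]) -> Option<IdxHeader> {
    if buf.len() < 20 || &buf[0..4] != b"UIX3" {
        return None;
    }
    let block_bits = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
    let n_unitigs = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&buf[12..20]);
    let n_kmers = u64::from_le_bytes(raw);
    // Sketch-only guard: keeps the shift in `n_offsets` well defined.
    if block_bits >= 32 {
        return None;
    }
    Some(IdxHeader { block_bits, n_unitigs, n_kmers })
}
```

<p>For example, 10 chunks with <code>block_bits = 2</code> give 3 sampled offsets plus the sentinel, so 4 entries.</p>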
|
||||
<h3 id="mphfbin">mphf.bin</h3>
|
||||
<p>PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
|
||||
2-bit) to a slot index in <code>[0, n_kmers)</code>.</p>
|
||||
<h3 id="layer_metajson-layermeta">layer_meta.json (LayerMeta)</h3>
|
||||
<p><div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"exact"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div>
|
||||
or
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"approx"</span><span class="p">,</span><span class="w"> </span><span class="nt">"b"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w"> </span><span class="nt">"z"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div></p>
|
||||
<h3 id="evidencebin-exact">evidence.bin (Exact)</h3>
|
||||
<p>One <code>(chunk_id: u32, rank: u8)</code> record per MPHF slot, packed. Used to verify
|
||||
that the kmer mapped to a slot is actually present: <code>unitigs.bin[chunk_id][rank]</code>
|
||||
is re-read and compared against the query.</p>
|
||||
<h3 id="fingerprintbin-approx">fingerprint.bin (Approx)</h3>
|
||||
<p><code>b</code>-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
|
||||
False-positive rate per query ≈ <code>1/2^b</code>. With Findere parameter <code>z ≥ 2</code>,
|
||||
<code>z</code> consecutive k-mers must all match, reducing the effective FP rate to
|
||||
approximately <code>W / 2^(b·z)</code> per read of length <code>L</code>
|
||||
(where <code>W = L − k − z + 2</code>).</p>
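<p>The approximation above is easy to evaluate numerically. The sketch below is only an illustration of the formula; <code>approx_fp_rate</code> is a hypothetical name, and the result is an upper-bound style estimate that can exceed 1 for very long reads or small <code>b·z</code>.</p>

```rust
/// Approximate false-positive rate per read: W / 2^(b*z),
/// with W = L - k - z + 2 windows of z consecutive k-mers.
pub fn approx_fp_rate(read_len: u32, k: u32, b: u32, z: u32) -> f64 {
    // A read needs at least k + z - 1 nucleotides to hold one window.
    let needed = k + z - 1;
    if z == 0 || read_len < needed {
        return 0.0;
    }
    let w = (read_len - needed + 1) as f64; // equals L - k - z + 2
    w / 2f64.powi((b * z) as i32)
}
```

<p>For a 100-nucleotide read with k = 31, b = 8 and z = 2, there are W = 69 windows and the estimate is 69 / 65536, roughly 0.1 %.</p>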
|
||||
<h3 id="counts-persistentcompactintmatrix">counts/ (PersistentCompactIntMatrix)</h3>
|
||||
<p>Present when <code>with_counts=true</code>. One column per genome; each row holds the
|
||||
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
|
||||
during indexing and merge.</p>
|
||||
<h3 id="presence-persistentbitmatrix">presence/ (PersistentBitMatrix)</h3>
|
||||
<p>Present when the layer was built in presence/absence mode (merge path).
|
||||
One bit per genome per MPHF slot. Written during merge; never present on a
|
||||
freshly indexed single-genome layer.</p>
|
||||
<h2 id="metajson-partitionmeta">meta.json (PartitionMeta)</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"n_layers"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Records how many <code>layer_N/</code> directories exist under <code>index/</code>. Incremented by
|
||||
each merge that adds a layer.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -751,6 +751,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -829,6 +857,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1046,61 +1130,49 @@
|
||||
|
||||
<h1 id="superkmer-implementation">SuperKmer — implementation</h1>
|
||||
<h2 id="memory-layout">Memory layout</h2>
|
||||
<p>A super-kmer is stored as a <strong>32-bit header</strong> followed by a <strong>byte-aligned nucleotide sequence</strong> (2 bits/base, nucleotide 0 at the MSB of the first byte):</p>
|
||||
<p><code>SuperKmer</code> holds two separate fields:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SuperKmer</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="n">count</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="n">inner</span><span class="p">:</span><span class="w"> </span><span class="nc">PackedSeq</span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>PackedSeq</code> stores a 2-bit packed DNA sequence as a heap-allocated <code>Box&lt;[u8]&gt;</code> plus a <code>tail: u8</code> field:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Field</th>
|
||||
<th>Bits</th>
|
||||
<th>Type</th>
|
||||
<th>Role</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>COUNT</td>
|
||||
<td>24</td>
|
||||
<td>Occurrence count (≤ 16 M)</td>
|
||||
<td><code>tail</code></td>
|
||||
<td><code>u8</code></td>
|
||||
<td>Number of valid nucleotides in the last byte: 0 encodes 4; 1–3 are stored as-is</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>NKMERS</td>
|
||||
<td>8</td>
|
||||
<td>Number of kmers (= seq_length − k + 1, range 1–255)</td>
|
||||
<td><code>seq</code></td>
|
||||
<td><code>Box&lt;[u8]&gt;</code></td>
|
||||
<td>2-bit packed bytes, nucleotide 0 at bits 7–6 of <code>seq[0]</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>Bit layout (MSB to LSB): <code>[31:8] COUNT [7:0] NKMERS</code></p>
|
||||
<p>NKMERS is stored as a raw <code>u8</code> in <strong>kmer units</strong>, not nucleotides. The nucleotide length is recovered as <code>NKMERS + k − 1</code>. This avoids the awkward wrapping convention (<code>0 = 256</code>) that would be needed if nucleotide length were stored directly, and gains k−1 = 30 units of headroom:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>unit</th>
|
||||
<th>u8 covers</th>
|
||||
<th>max nucleotides</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>nucleotides</td>
|
||||
<td>255 nt</td>
|
||||
<td>225 kmers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><strong>kmers</strong></td>
|
||||
<td><strong>255 kmers</strong></td>
|
||||
<td><strong>285 nt</strong></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The public accessors:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">n_kmers</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">K</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">count</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">increment</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">set_count</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">);</span><span class="w"> </span><span class="p">}</span>
|
||||
<p>Nucleotide length is recovered without storing it explicitly:</p>
|
||||
<div class="highlight"><pre><span></span><code>seql = (seq.len() - 1) * 4 + tail_count(tail)
|
||||
</code></pre></div>
|
||||
<p>There is no packed header word — <code>count</code> and the sequence live in separate fields.</p>
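<p>The length recovery can be sketched with a stand-in type. The struct below mirrors the two fields listed in the table, but <code>from_codes</code> and <code>nucleotide</code> are hypothetical helpers added for illustration, and the empty-sequence case is handled here by assumption.</p>

```rust
/// Stand-in for the `PackedSeq` described above (illustrative only).
pub struct PackedSeq {
    pub tail: u8,      // valid nucleotides in the last byte; 0 encodes 4
    pub seq: Box<[u8]>, // 2-bit packed, nucleotide 0 at bits 7-6 of seq[0]
}

/// Decodes the `tail` field into a nucleotide count.
fn tail_count(tail: u8) -> usize {
    if tail == 0 { 4 } else { tail as usize }
}

impl PackedSeq {
    /// Hypothetical constructor: packs 2-bit codes (0..=3), MSB first.
    pub fn from_codes(codes: &[u8]) -> Self {
        let mut seq = vec![0u8; (codes.len() + 3) / 4];
        for (i, &c) in codes.iter().enumerate() {
            seq[i / 4] |= (c & 0b11) << (6 - 2 * (i % 4));
        }
        PackedSeq { tail: (codes.len() % 4) as u8, seq: seq.into_boxed_slice() }
    }

    /// seql = (seq.len() - 1) * 4 + tail_count(tail); 0 for an empty buffer.
    pub fn seql(&self) -> usize {
        if self.seq.is_empty() {
            return 0;
        }
        (self.seq.len() - 1) * 4 + tail_count(self.tail)
    }

    /// Returns the 2-bit code of nucleotide `i`.
    pub fn nucleotide(&self, i: usize) -> u8 {
        (self.seq[i / 4] >> (6 - 2 * (i % 4))) & 0b11
    }
}
```

<p>Five nucleotides occupy two bytes with <code>tail = 1</code>; four nucleotides occupy one byte with <code>tail = 0</code>, which decodes back to 4.</p>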
|
||||
<p>The on-disk binary format (produced by <code>write_to_binary</code>) is:</p>
|
||||
<div class="highlight"><pre><span></span><code>[varint(count)] [u8: seql − k] [packed bytes…]
|
||||
</code></pre></div>
|
||||
<p><code>seql − k</code> fits in a <code>u8</code> when <code>n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256)</code>. If a super-kmer exceeds 256 kmers, <code>write_to_binary</code> splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by <code>read_from_binary</code>.</p>
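<p>The splitting rule can be sketched as a function that returns the nucleotide range of each chunk. This is an assumption-laden illustration, not the body of <code>write_to_binary</code>: <code>split_chunks</code> is a hypothetical name, and filling every chunk except the last to the full 256 k-mers is one possible policy among several.</p>

```rust
pub const MAX_KMERS_PER_CHUNK: usize = 256;

/// Splits a sequence of `seql` nucleotides into (start, length) chunks.
/// Consecutive chunks overlap by k - 1 nucleotides, so every k-mer falls
/// in exactly one chunk and `length - k` always fits in a u8.
pub fn split_chunks(seql: usize, k: usize) -> Vec<(usize, usize)> {
    let mut chunks = Vec::new();
    if k == 0 || seql < k {
        return chunks; // no complete k-mer to store
    }
    let mut remaining = seql - k + 1; // k-mers still to place
    let mut start = 0; // nucleotide offset of the next chunk
    while remaining > 0 {
        let n = remaining.min(MAX_KMERS_PER_CHUNK);
        chunks.push((start, n + k - 1));
        start += n; // advancing by n k-mers leaves a k - 1 overlap
        remaining -= n;
    }
    chunks
}
```

<p>With k = 31, a 286-nucleotide super-kmer (256 k-mers) stays in one record, while a 287-nucleotide one yields a second 31-nucleotide record holding the single remaining k-mer.</p>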
|
||||
<p>The public accessors operate on the struct fields directly:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">inner</span><span class="p">.</span><span class="n">seql</span><span class="p">()</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">count</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">increment</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">set_count</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers) — far from the 256-kmer cap. If a super-kmer ever exceeds 256 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).</p>
|
||||
<p>The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.</p>
|
||||
<h2 id="ascii-encoding-and-decoding">ASCII encoding and decoding</h2>
|
||||
<p>Two lookup tables handle ASCII ↔ 2-bit conversion:</p>
|
||||
<ul>
|
||||
@@ -1125,7 +1197,7 @@
|
||||
</code></pre></div>
|
||||
<p><code>REVCOMP4</code> is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.</p>
|
||||
<p><strong>Step 2 — realignment.</strong> After step 1, <code>padding = n × 8 − seql × 2</code> spurious bits (complements of the original padding A's, with <code>n</code> the number of bytes in the array) appear at the start of the array. They are flushed left using <code>BitSlice&lt;u8, Msb0&gt;::rotate_left(padding)</code> from the <code>bitvec</code> crate, which is SIMD-accelerated. The trailing <code>padding</code> bits are then zeroed:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">seql</span><span class="p">();</span>
|
||||
<span class="n">shift</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="c1">// number of padding bits</span>
|
||||
<span class="n">bits</span><span class="p">.</span><span class="n">rotate_left</span><span class="p">(</span><span class="n">shift</span><span class="p">)</span>
|
||||
<span class="n">bits</span><span class="p">[</span><span class="n">len</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">shift</span><span class="o">..</span><span class="p">].</span><span class="n">fill</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
|
||||
@@ -1143,7 +1215,7 @@
|
||||
</code></pre></div>
|
||||
</div>
|
||||
<h2 id="minimizer-sliding-window">Minimizer sliding window</h2>
|
||||
<p>Super-kmers are built by <code>SuperKmerIter</code> (crate <code>obiskbuilder</code>), which maintains the current minimizer with a <strong>monotonic deque</strong> over a sliding window of W = k − m + 1 m-mer positions.</p>
|
||||
<p>Super-kmers are built by <code>SuperKmerIter</code> (crate <code>obiskbuilder</code>), which tracks the current minimizer with a <strong>monotonic deque</strong> (<code>Ring&lt;MmerItem, 32&gt;</code>) inside <code>RollingStat</code>, a rolling-window entropy and minimizer tracker.</p>
|
||||
<p>Each deque entry stores:</p>
|
||||
<table>
|
||||
<thead>
|
||||
@@ -1167,20 +1239,11 @@
|
||||
<tr>
|
||||
<td><code>hash</code></td>
|
||||
<td>u64</td>
|
||||
<td><span class="arithmatex">\(H(\text{canonical})\)</span> — ordering key for random minimizer selection</td>
|
||||
<td><code>hash_kmer(canonical &lt;&lt; (64 − 2m))</code> — ordering key for random minimizer selection</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The hash <span class="arithmatex">\(H\)</span> is the seeded splitmix64 finalizer (see <a href="../../theory/minimizer/">Minimizer selection</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">hash_mmer</span><span class="p">(</span><span class="n">canonical</span><span class="p">:</span><span class="w"> </span><span class="kt">u64</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u64</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">canonical</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="mh">0x9e3779b97f4a7c15</span><span class="p">;</span><span class="w"> </span><span class="c1">// seed: eliminates fixed point at 0</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">30</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0xbf58476d1ce4e5b9</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">27</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0x94d049bb133111eb</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">31</span><span class="p">)</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>The hash uses the seeded splitmix64 finalizer (<code>mix64(raw ^ 0x9e3779b97f4a7c15)</code>), the same function as <code>kmer::hash_kmer</code>.</p>
|
||||
<p>On each new nucleotide, once the window is full, the deque is updated:</p>
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Algorithm — minimizer deque update</p>
|
||||
@@ -1196,17 +1259,21 @@
|
||||
</code></pre></div>
|
||||
</div>
|
||||
<p>The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.</p>
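<p>The deque invariant described above can be illustrated with a small stand-alone sketch. It is not the project's <code>SuperKmerIter</code>: the function name <code>window_minima</code>, the plain <code>u64</code> hash slice, and the window width <code>w</code> (which would correspond to k − m + 1 m-mers per k-mer) are assumptions made for illustration only.</p>

```rust
use std::collections::VecDeque;

/// Hypothetical helper: for every full window of `w` consecutive hashes,
/// return the position of the smallest hash in that window.
fn window_minima(hashes: &[u64], w: usize) -> Vec<usize> {
    let mut out = Vec::new();
    if w == 0 {
        return out;
    }
    // Positions whose hashes are kept in strictly increasing order.
    let mut deque: VecDeque<usize> = VecDeque::new();
    for (pos, &h) in hashes.iter().enumerate() {
        // Entries with a hash >= the new one can never be the minimum again.
        while let Some(&back) = deque.back() {
            if hashes[back] >= h {
                deque.pop_back();
            } else {
                break;
            }
        }
        deque.push_back(pos);
        // Evict entries that slid out of the window [pos - w + 1, pos].
        while let Some(&front) = deque.front() {
            if front + w <= pos {
                deque.pop_front();
            } else {
                break;
            }
        }
        // Once the window is full, the front is the current minimum.
        if pos + 1 >= w {
            out.push(*deque.front().expect("deque holds at least `pos`"));
        }
    }
    out
}
```

<p>Each position is pushed once and popped at most once, which is where the amortized O(1) cost per nucleotide comes from. In this sketch a boundary would be reported whenever the hash at the returned position differs from the previous one.</p>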
|
||||
<p>A super-kmer boundary is emitted when the minimizer changes: <code>deque.front.hash ≠ prev_hash</code>. The <code>canonical</code> field of the front entry is <strong>not</strong> used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key <span class="arithmatex">\(H(\text{canonical})\)</span> can be recomputed independently at routing time from the stored <code>minimizer_pos</code>, without inheriting the minimum-order-statistic bias (see <a href="../../theory/minimizer/#partition-key-independence">Minimizer selection — partition key independence</a>).</p>
|
||||
<p>A super-kmer boundary is emitted when the minimizer changes: <code>current_minimizer != prev_minimizer</code>. <code>SuperKmerIter</code> also emits a boundary when:</p>
|
||||
<ul>
|
||||
<li>entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k−1)</li>
|
||||
<li>super-kmer length reaches 256 nucleotides (cursor retreated by k)</li>
|
||||
</ul>
|
||||
<h2 id="kmer-extraction">Kmer extraction</h2>
|
||||
<p>A k-mer is extracted from a super-kmer with <code>SuperKmer::kmer(i, k)</code>, which returns a <code>Kmer</code> — a left-aligned <code>u64</code> newtype (see <a href="../kmer/">Kmer implementation</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">kmer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Result</span><span class="o"><</span><span class="n">Kmer</span><span class="p">,</span><span class="w"> </span><span class="n">KmerError</span><span class="o">></span>
|
||||
<p>A k-mer is extracted from a super-kmer with <code>SuperKmer::kmer(i)</code>, which delegates to <code>PackedSeq::extract::<KLen>(i)</code> and returns a <code>Kmer</code> — a left-aligned <code>u64</code> newtype (see <a href="../kmer/">Kmer implementation</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">kmer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Result</span><span class="o"><</span><span class="n">Kmer</span><span class="p">,</span><span class="w"> </span><span class="n">KmerError</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p>The bit slice <code>seq[i*2 .. (i+k)*2]</code> (Msb0 order) is loaded as a big-endian <code>u64</code> via <code>bitvec::load_be</code>, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.</p>
|
||||
<p>The bit slice <code>seq[i*2 .. (i+k)*2]</code> (Msb0 order) is loaded as a <code>u64</code> via <code>bitvec::load_be</code>, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.</p>
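<p>To make the bit layout concrete, here is a reference sketch that produces the same left-aligned value from a 2-bit packed byte slice. It deliberately uses a per-base loop for readability and does not use <code>bitvec</code>, so it is not the single-load implementation described above; the name <code>extract_kmer</code> and the plain <code>&amp;[u8]</code> input are assumptions.</p>

```rust
/// Hypothetical reference extraction: `packed` stores 4 bases per byte,
/// most significant bit pair first (Msb0 order).
/// Returns the k-mer starting at nucleotide `i`, left-aligned in a u64.
fn extract_kmer(packed: &[u8], i: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 {
        return None;
    }
    // The bit slice [i*2, (i+k)*2) must lie inside the packed buffer.
    let end_bit = (i + k) * 2;
    if end_bit > packed.len() * 8 {
        return None;
    }
    let mut acc: u64 = 0;
    for pos in i..i + k {
        let byte = packed[pos / 4];
        // Base 0 of a byte sits in bits 7..6, base 3 in bits 1..0.
        let shift = 6 - 2 * (pos % 4);
        let base = (byte >> shift) & 0b11;
        acc = (acc << 2) | u64::from(base);
    }
    // Move the 2k significant bits to the top of the word.
    Some(acc << (64 - 2 * k))
}
```

<p>For example, the byte <code>0b00011011</code> holds four bases with codes 0, 1, 2, 3; extracting three bases at offset 1 yields the codes 1, 2, 3 in the top six bits of the result.</p>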
|
||||
<hr />
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Algorithm — Super-kmer reverse complement</p>
|
||||
<div class="highlight"><pre><span></span><code>procedure SuperKmerRevcomp(seq, SEQL):
|
||||
seql ← NKMERS + k − 1 -- nucleotide length
|
||||
seql ← nucleotide length
|
||||
n ← ⌈seql / 4⌉ -- number of bytes
|
||||
shift ← n × 8 − seql × 2 -- padding bits to flush
|
||||
|
||||
|
||||
+294
-1
@@ -213,6 +213,17 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#subcommands" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Subcommands
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#constraints" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
@@ -222,6 +233,28 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Parameter constraints (enforced at CLI)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#genome-label-constraints" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Genome label constraints
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -714,6 +747,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -792,6 +853,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -935,6 +1052,17 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#subcommands" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Subcommands
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#constraints" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
@@ -944,6 +1072,28 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#parameter-constraints-enforced-at-cli" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Parameter constraints (enforced at CLI)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#genome-label-constraints" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Genome label constraints
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -976,12 +1126,155 @@
|
||||
|
||||
<h1 id="obikmer">obikmer</h1>
|
||||
<p><code>obikmer</code> is a Rust tool for manipulation, counting, indexing, and set operations on DNA sequences represented as kmer sets.</p>
|
||||
<h2 id="subcommands">Subcommands</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Subcommand</th>
|
||||
<th>Purpose</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>superkmer</code></td>
|
||||
<td>Extract super-kmers from a sequence file and write to stdout</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>index</code></td>
|
||||
<td>Build a complete genome index (scatter → dereplicate → count → layered MPHF)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>merge</code></td>
|
||||
<td>Merge multiple built indexes into one</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>rebuild</code></td>
|
||||
<td>Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>query</code></td>
|
||||
<td>Query an index with sequences and annotate matches</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>dump</code></td>
|
||||
<td>Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>annotate</code></td>
|
||||
<td>Add or update genome metadata from a CSV file; or dump metadata as CSV</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>distance</code></td>
|
||||
<td>Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>unitig</code></td>
|
||||
<td>Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as <code>rebuild</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>estimate</code></td>
|
||||
<td>Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>reindex</code></td>
|
||||
<td>Convert an index's evidence in-place: exact ↔ approx</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>utils</code></td>
|
||||
<td>Miscellaneous index utilities: <code>--new-label NEW=OLD</code> renames a genome label; <code>--upgrade-index</code> adds missing <code>layer_meta.json</code> to old indexes</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>pack</code></td>
|
||||
<td>Pack per-column matrix files into single-file format to reduce query I/O</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<h2 id="constraints">Constraints</h2>
|
||||
<ul>
|
||||
<li>Target scale: individual genome datasets, tens of Gbases</li>
|
||||
<li>Maximum efficiency in computation, memory, and disk usage</li>
|
||||
<li>Input formats: FASTA, FASTQ, gzip, streaming stdin</li>
|
||||
<li>k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)</li>
|
||||
<li>Canonical form: <code>min(kmer, revcomp(kmer))</code> reduces strand-symmetric space by half</li>
|
||||
<li>Input formats for <code>index</code>/<code>superkmer</code>: FASTA (<code>.fa</code>, <code>.fasta</code>), FASTQ (<code>.fq</code>, <code>.fastq</code>), GenBank flat file (<code>.gb</code>, <code>.gbk</code>, <code>.gbff</code>), all optionally gzip-compressed; directories expanded recursively; streaming stdin via <code>-</code></li>
|
||||
<li>Input formats for <code>query</code>: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via <code>-</code></li>
|
||||
</ul>
|
||||
<h2 id="parameter-constraints-enforced-at-cli">Parameter constraints (enforced at CLI)</h2>
|
||||
<p>All constraints below are checked by <code>CommonArgs::validate()</code> at the start of <code>superkmer</code> and <code>index</code>. An invalid value makes the command exit immediately with an error.</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Parameter</th>
|
||||
<th>Constraint</th>
|
||||
<th>Reason</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>k (<code>--kmer-size</code>)</td>
|
||||
<td>odd</td>
|
||||
<td>even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>k (<code>--kmer-size</code>)</td>
|
||||
<td>k ∈ [11, 31]</td>
|
||||
<td>k > 31 overflows u64 at 2 bits/base; k < 11 gives insufficient specificity</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>m (<code>--minimizer-size</code>)</td>
|
||||
<td>odd</td>
|
||||
<td>same palindrome argument as k</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>m (<code>--minimizer-size</code>)</td>
|
||||
<td>3 ≤ m ≤ k−1</td>
|
||||
<td>minimizer must be strictly shorter than the kmer</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>z (<code>-z</code>, Findere, <code>index --approx</code> only)</td>
|
||||
<td>z ≤ k−1</td>
|
||||
<td>effective indexed kmer size is k−z+1; z ≥ k would shrink it to a single nucleotide or less</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
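<p>The table can be read as a short sequence of checks. The sketch below is only an illustration of those rules, not the actual body of <code>CommonArgs::validate()</code>; the function name <code>validate_params</code>, its signature, and the error strings are assumptions.</p>

```rust
/// Hypothetical stand-alone check mirroring the constraint table.
/// `z` is `None` when the approximate (Findere) mode is not requested.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 {
        return Err(format!("k = {k} must be odd"));
    }
    if !(11..=31).contains(&k) {
        return Err(format!("k = {k} must lie in [11, 31]"));
    }
    if m % 2 == 0 {
        return Err(format!("m = {m} must be odd"));
    }
    // k >= 11 at this point, so k - 1 cannot underflow.
    if m < 3 || m > k - 1 {
        return Err(format!("m = {m} must satisfy 3 <= m <= k - 1"));
    }
    if let Some(z) = z {
        if z > k - 1 {
            return Err(format!("z = {z} must satisfy z <= k - 1"));
        }
    }
    Ok(())
}
```

<p>Checking k before m keeps the <code>k - 1</code> subtraction safe and reports the most fundamental violation first.</p>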
|
||||
<h2 id="genome-label-constraints">Genome label constraints</h2>
|
||||
<p>Genome labels are arbitrary Unicode strings with the following restrictions:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Character</th>
|
||||
<th>Forbidden</th>
|
||||
<th>Reason</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>/</code></td>
|
||||
<td>yes</td>
|
||||
<td>filesystem path separator</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>=</code></td>
|
||||
<td>yes</td>
|
||||
<td><code>--new-label</code> parser separator</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>\0</code></td>
|
||||
<td>yes</td>
|
||||
<td>null byte</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>\n</code> <code>\r</code> <code>\t</code></td>
|
||||
<td>yes</td>
|
||||
<td>break CSV output</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>spaces</td>
|
||||
<td><strong>allowed</strong></td>
|
||||
<td>use shell quoting: <code>--new-label 'new label=old label'</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>Empty labels are also rejected. Labels derived automatically from the index directory name (when <code>--label</code> is omitted) are not validated since they come from the filesystem and are already safe.</p>
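<p>A minimal sketch of these label rules follows. The function name <code>validate_label</code> and the error messages are hypothetical; only the set of forbidden characters and the empty-label rule come from the table above.</p>

```rust
/// Hypothetical label check: rejects empty labels and the characters
/// listed as forbidden; everything else (including spaces) is accepted.
fn validate_label(label: &str) -> Result<(), String> {
    if label.is_empty() {
        return Err("genome label must not be empty".to_string());
    }
    for c in label.chars() {
        match c {
            '/' | '=' | '\0' | '\n' | '\r' | '\t' => {
                return Err(format!("forbidden character {c:?} in genome label"));
            }
            _ => {}
        }
    }
    Ok(())
}
```

<p>Spaces and non-ASCII characters pass through unchanged, so a label such as <code>Betula nana</code> is valid as long as it is quoted on the command line.</p>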
|
||||
<h2 id="priority-operations">Priority operations</h2>
|
||||
<ul>
|
||||
<li>Kmer counting (frequencies)</li>
|
||||
|
||||
+87
-2
@@ -746,6 +746,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -824,6 +852,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1038,11 +1122,12 @@
|
||||
<h2 id="kmers">Kmers</h2>
|
||||
<p>A <strong>kmer</strong> is a DNA subsequence of fixed length k. Two constraints govern the choice of k:</p>
|
||||
<ul>
|
||||
<li><strong>k ∈ [11, 31]</strong>: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.</li>
|
||||
<li><strong>k ∈ [11, 31]</strong>: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word (u64 at 2 bits/base requires k ≤ 32; k < 11 yields insufficient specificity).</li>
|
||||
<li><strong>k is odd</strong>: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form <code>min(kmer, revcomp(kmer))</code> is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.</li>
|
||||
</ul>
|
||||
<p>Both constraints are <strong>enforced at CLI entry</strong> by <code>CommonArgs::validate()</code> in <code>superkmer</code> and <code>index</code>. Passing an invalid k makes the command exit immediately with an error message.</p>
|
||||
<h2 id="super-kmers">Super-kmers</h2>
|
||||
<p>A <strong>super-kmer</strong> is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides. Each kmer of the run carries the same <strong>canonical minimizer</strong>. The <strong>canonical minimizer</strong> of a kmer is the smallest value of <code>min(m-mer, revcomp(m-mer))</code> over all m-mers within the kmer (m < k, m odd), with the constraint that <strong>non-degenerate m-mers are always preferred</strong> over degenerate ones. A degenerate m-mer is one composed of a single repeated nucleotide (all-A, all-C, all-G, or all-T); such m-mers are selected only if no non-degenerate candidate exists in the window.</p>
|
||||
<p>A <strong>super-kmer</strong> is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides, sharing the same <strong>canonical minimizer</strong>. The <strong>canonical minimizer</strong> of a kmer is the m-mer (m < k) whose canonical hash <code>hash_kmer(min(m-mer, revcomp(m-mer)))</code> is smallest over all m-mers in the kmer window. The hash function is a <code>mix64</code>-based bijection; selection is purely hash-ordered with no degeneracy filter. A super-kmer is capped at 256 nucleotides; a longer run is split at that boundary.</p>
|
||||
<h3 id="canonical-super-kmers">Canonical super-kmers</h3>
|
||||
<p>A <strong>canonical super-kmer</strong> is the lexicographic minimum of a super-kmer and its reverse complement:</p>
|
||||
<div class="highlight"><pre><span></span><code>canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
|
||||
|
||||
Binary file not shown.
@@ -718,6 +718,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -796,6 +824,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1010,17 +1094,20 @@
|
||||
<p>The Watson-Crick complement of any base is its bitwise NOT on 2 bits: <code>complement(base) = ~base & 0b11</code>.</p>
|
||||
<h2 id="kmer-encoding">Kmer encoding</h2>
|
||||
<p>A kmer fits in a single <code>u64</code>. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): <code>(kmer >> (62 - 2*i)) & 0b11</code>.</p>
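<p>The layout can be sketched in a few lines of Rust. The base codes used here (A=0, C=1, G=2, T=3) are an assumption chosen to be consistent with the complement rule above, and the names <code>encode_kmer</code> and <code>nucleotide_at</code> are illustrative rather than the project's API.</p>

```rust
/// Hypothetical encoder: packs an ASCII DNA string into a left-aligned u64.
/// Assumed codes: A=0b00, C=0b01, G=0b10, T=0b11.
fn encode_kmer(seq: &str) -> Option<u64> {
    let k = seq.len();
    if k == 0 || k > 32 {
        return None;
    }
    let mut value: u64 = 0;
    for b in seq.bytes() {
        let code: u64 = match b {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None, // ambiguous or invalid base
        };
        value = (value << 2) | code;
    }
    // Nucleotide 0 ends up in bits 63..62; the low 64 - 2k bits are zero.
    Some(value << (64 - 2 * k))
}

/// Extracts the 2-bit code of nucleotide `i` (0 <= i < k).
fn nucleotide_at(kmer: u64, i: usize) -> u64 {
    (kmer >> (62 - 2 * i)) & 0b11
}
```

<p>Because the value is left-aligned, comparing two encoded kmers of the same length as integers matches lexicographic comparison of their sequences under the assumed code order.</p>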
|
||||
<p>Reverse complement is computed via a <strong>16-bit lookup table</strong> (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.</p>
|
||||
<p>Reverse complement is computed by <strong>bit manipulation</strong> (one complement, three reordering passes, and a final re-alignment shift), with no lookup table:</p>
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Algorithm — Kmer reverse complement</p>
|
||||
<div class="highlight"><pre><span></span><code>procedure KmerRevcomp(kmer, k):
|
||||
raw ← TABLE16[kmer & 0xFFFF] << 48
|
||||
| TABLE16[(kmer >> 16) & 0xFFFF] << 32
|
||||
| TABLE16[(kmer >> 32) & 0xFFFF] << 16
|
||||
| TABLE16[(kmer >> 48) & 0xFFFF]
|
||||
return raw << (64 - 2*k)
|
||||
x ← ~kmer -- complement all bases
|
||||
x ← swap_bytes(x) -- reverse byte order
|
||||
x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
|
||||
| ((x & 0x0F0F0F0F0F0F0F0F) << 4) -- swap nibbles within each byte
|
||||
x ← ((x >> 2) & 0x3333333333333333)
|
||||
| ((x & 0x3333333333333333) << 2) -- swap 2-bit pairs within each nibble
|
||||
return x << (64 - 2*k) -- re-align to MSB
|
||||
</code></pre></div>
|
||||
</div>
|
||||
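<p>The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). After the reversal, the complemented padding sits in the high 64−2k bits; the final left shift discards it and re-aligns the result to the MSB, leaving the low 64−2k bits zero.</p>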
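<p>The procedure translates almost line for line into Rust. The following is a sketch under the left-aligned layout described on this page; the free functions <code>revcomp</code> and <code>canonical</code> are illustrative names, not the methods of the project's kmer type.</p>

```rust
/// Reverse complement of a left-aligned 2-bit kmer of length `k` (1..=32).
fn revcomp(kmer: u64, k: u32) -> u64 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    // Complement every base (padding bits become 1s for now).
    let mut x = !kmer;
    // Reverse byte order.
    x = x.swap_bytes();
    // Swap the two nibbles inside each byte.
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    // Swap the two 2-bit pairs inside each nibble.
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The former padding is now in the high bits: shift it out and
    // re-align the reversed bases to the MSB.
    x << (64 - 2 * k)
}

/// Canonical form: the smaller of a kmer and its reverse complement.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

<p>With the codes 00, 01, 10, 11 standing for four bases whose complement is the bitwise NOT, the 3-mer with codes 00 01 10 maps to 01 10 11, and applying <code>revcomp</code> twice returns the original value.</p>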
|
||||
<p>The <strong>canonical form</strong> is the lexicographic minimum of the kmer and its reverse complement:</p>
|
||||
<div class="highlight"><pre><span></span><code>canonical(kmer) = min(kmer, revcomp(kmer))
|
||||
</code></pre></div>
|
||||
|
||||
@@ -773,6 +773,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -851,6 +879,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1109,7 +1193,7 @@
|
||||
<h2 id="final-score">Final score</h2>
|
||||
<p>The filter computes <span class="arithmatex">\(\hat{H}(ws)\)</span> for each word size ws from 1 to ws_max and returns the <strong>minimum</strong>:</p>
|
||||
<div class="arithmatex">\[\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)\]</div>
|
||||
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) \leq \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
|
||||
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) < \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
|
||||
<h2 id="interpretation-as-an-effective-number-of-classes">Interpretation as an effective number of classes</h2>
|
||||
<p><span class="arithmatex">\(H_{\text{corr}}\)</span> is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: <span class="arithmatex">\(N_{\text{eff}} = e^{H_{\text{corr}}}\)</span> is the number of equiprobable classes that would yield the same entropy.</p>
|
||||
<p>For the normalised score <span class="arithmatex">\(\hat{H}\)</span>, dividing by <span class="arithmatex">\(H_{\text{max}}\)</span> changes the logarithm base:</p>
|
||||
|
||||
@@ -718,6 +718,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -796,6 +824,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -762,6 +762,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -840,6 +868,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -1,9 +1,12 @@
|
||||
# Rebuild filters and ingroup/outgroup predicates
|
||||
# Kmer filtering and ingroup/outgroup predicates
|
||||
|
||||
The `rebuild` command compacts an existing index into a new single-layer index,
|
||||
optionally keeping only k-mers that satisfy a set of filters.
|
||||
Filters can operate on raw quorum counts over all genomes, or on pre-defined
|
||||
**ingroup** and **outgroup** genome sets derived from genome metadata.
|
||||
The `rebuild`, `dump`, and `unitig` commands all share the same filtering
|
||||
system. Filters can select k-mers based on per-genome quorum counts, optionally
|
||||
restricted to **ingroup** and **outgroup** genome sets derived from genome
|
||||
metadata.
|
||||
|
||||
`rebuild` additionally accepts `--min-total-count` / `--max-total-count` filters
|
||||
that operate on the sum of counts across all genomes.
|
||||
|
||||
## Predicate syntax
|
||||
|
||||
@@ -73,8 +76,8 @@ For each genome:
|
||||
| `--ingroup` | `--outgroup` | Effective behaviour |
|
||||
|-------------|--------------|---------------------|
|
||||
| not set | not set | all genomes form the ingroup |
|
||||
| set | not set | only `--min-count`/`--min-frac` apply to matched genomes |
|
||||
| not set | set | only `--max-count`/`--max-frac` apply to matched genomes |
|
||||
| set | not set | only ingroup quorum flags apply |
|
||||
| not set | set | only outgroup quorum flags apply |
|
||||
| set | set | both constraints apply simultaneously |
|
||||
|
||||
## Quorum flags
|
||||
@@ -89,10 +92,13 @@ For each genome:
| `--max-outgroup-count N` | outgroup | k-mer present in at most N outgroup genomes |
| `--min-outgroup-frac F` | outgroup | k-mer present in at least fraction F of outgroup genomes |
| `--max-outgroup-frac F` | outgroup | k-mer present in at most fraction F of outgroup genomes |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N |
| `--min-total-count N` | all genomes | sum of per-genome counts ≥ N (`rebuild` only) |
| `--max-total-count N` | all genomes | sum of per-genome counts ≤ N (`rebuild` only) |
| `--presence-threshold N` | all | per-genome count > N to be considered "present" (default 0) |

Defaults: mins = 0 (no lower bound), max counts = group size, max fracs = 1.0
(no upper bound). A filter with all defaults is a no-op.

Fractions are computed over the size of the classified group, not over total
genome count. An empty group (no genome classified as ingroup/outgroup) never
triggers a filter failure.
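
The defaults and the empty-group rule described above can be sketched in Rust as follows. This is only an illustration under stated assumptions: `GroupBounds` and `group_passes` are hypothetical names that do not come from the obikmer code base, and an unset max count is modelled here as `None`, standing for "group size".

```rust
/// Hypothetical bounds for one genome group (illustrative, not the crate's API).
#[derive(Debug, Clone, Copy)]
struct GroupBounds {
    min_count: usize,
    /// `None` means "group size", i.e. no upper bound (an assumption of this sketch).
    max_count: Option<usize>,
    min_frac: f64,
    max_frac: f64,
}

impl Default for GroupBounds {
    /// All-default bounds accept every row, so the filter is a no-op.
    fn default() -> Self {
        GroupBounds { min_count: 0, max_count: None, min_frac: 0.0, max_frac: 1.0 }
    }
}

/// Returns true when a k-mer row satisfies the bounds for one group.
/// `counts` holds per-genome counts; `group` lists the genome indices in the group;
/// a genome counts as "present" when its count is strictly above `threshold`.
fn group_passes(counts: &[u32], group: &[usize], threshold: u32, b: &GroupBounds) -> bool {
    // An empty group never triggers a filter failure.
    if group.is_empty() {
        return true;
    }
    let present = group.iter().filter(|&&g| counts[g] > threshold).count();
    let size = group.len();
    // Fractions are taken over the group size, not the total genome count.
    let frac = present as f64 / size as f64;
    present >= b.min_count
        && present <= b.max_count.unwrap_or(size)
        && frac >= b.min_frac
        && frac <= b.max_frac
}
```

With counts `[3, 0, 1, 0]`, ingroup `[0, 2]` and outgroup `[1, 3]`, a `min_count` of 2 on the ingroup together with a `max_count` of `Some(0)` on the outgroup keeps the row, which mirrors the "present in at least 2 ingroup genomes and absent from the outgroup" examples.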
@@ -107,17 +113,18 @@ obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 2 \
--max-count 0
--max-outgroup-count 0
```

Keep k-mers found in at least 2 *Betula nana* genomes and absent from all *Betula*:
Keep k-mers found in at least 2 *Betula nana* genomes and absent from all
other *Betula*:

```sh
obikmer rebuild src --output dst \
--ingroup "species=Betula_nana" \
--outgroup "genus=Betula" \
--min-count 2 \
--max-count 0
--max-outgroup-count 0
```

Use taxonomic paths — keep k-mers present in ≥ 50 % of the *Betula* clade
@@ -128,7 +135,7 @@ obikmer rebuild src --output dst \
--ingroup "taxon~/Betulaceae/Betula" \
--outgroup "taxon!~/Betulaceae" \
--min-frac 0.5 \
--max-frac 0.1
--max-outgroup-frac 0.1
```

Multiple outgroup predicates (OR): exclude k-mers present in *Alnus* or *Carpinus*:
@@ -138,7 +145,28 @@ obikmer rebuild src --output dst \
--ingroup "genus=Betula" \
--outgroup "genus=Alnus" \
--outgroup "genus=Carpinus" \
--max-count 0
--max-outgroup-count 0
```

The same flags work identically for `dump` and `unitig`. To dump only k-mers
specific to *Betula nana*:

```sh
obikmer dump myindex \
--ingroup "species=Betula_nana" \
--outgroup "*" \
--min-count 1 \
--max-outgroup-count 0
```

To enumerate unitigs of the *Betula*-specific subgraph:

```sh
obikmer unitig myindex \
--ingroup "genus=Betula" \
--outgroup "*" \
--min-count 2 \
--max-outgroup-count 0
```

## Implementation
@@ -146,9 +174,15 @@ obikmer rebuild src --output dst \
- **`obikpartitionner::filter::GroupQuorumFilter`** — implements `KmerFilter`
using pre-computed ingroup and outgroup index vectors. The heavy logic
(predicate parsing, three-value evaluation, genome classification) happens
once before the rebuild loop; each k-mer row evaluation is a simple index
once before any iteration; each k-mer row evaluation is a simple index
lookup and counter.

- **`obikmer::cmd::predicate`** — predicate parsing (`MetaPred`), path matching
(`path_matches`), three-value AND/OR evaluation, and `build_group_filter`
which returns a ready-to-use `GroupQuorumFilter`.
- **`obikmer::cmd::predicate::FilterArgs`** — shared `clap` argument group
embedded via `#[command(flatten)]` in `RebuildArgs`, `DumpArgs`, and
`UnitigArgs`. `FilterArgs::build_filters()` returns a ready-to-use filter
list.

- **`obikpartitionner::KmerPartition::iter_partition_kmers`** — accepts
`filters: &[Box<dyn KmerFilter>]` and applies them per-kmer before invoking
the callback. `rebuild`, `dump`, and `unitig` all go through this single
entry point.
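
To make the single-entry-point design above more concrete, here is a minimal Rust sketch of applying a boxed filter list per k-mer before a callback. The trait method `keep`, the `MinTotalCount` and `MaxTotalCount` types, and the `for_each_kept` function are all assumptions made for illustration; the real `KmerFilter` trait and `iter_partition_kmers` signature may well differ.

```rust
/// Hypothetical stand-in for the crate's `KmerFilter` trait; the method name is assumed.
trait KmerFilter {
    /// Returns true when the row of per-genome counts should be kept.
    fn keep(&self, counts: &[u32]) -> bool;
}

/// Illustrative filter: sum of per-genome counts must be at least the bound.
struct MinTotalCount(u64);

impl KmerFilter for MinTotalCount {
    fn keep(&self, counts: &[u32]) -> bool {
        counts.iter().map(|&c| u64::from(c)).sum::<u64>() >= self.0
    }
}

/// Illustrative filter: sum of per-genome counts must be at most the bound.
struct MaxTotalCount(u64);

impl KmerFilter for MaxTotalCount {
    fn keep(&self, counts: &[u32]) -> bool {
        counts.iter().map(|&c| u64::from(c)).sum::<u64>() <= self.0
    }
}

/// Hypothetical analogue of a shared iteration entry point: every filter must
/// accept a row before `callback` sees it. Rows are (encoded k-mer, counts) pairs.
fn for_each_kept<F>(rows: &[(u64, Vec<u32>)], filters: &[Box<dyn KmerFilter>], mut callback: F)
where
    F: FnMut(u64, &[u32]),
{
    for (kmer, counts) in rows {
        if filters.iter().all(|f| f.keep(counts.as_slice())) {
            callback(*kmer, counts.as_slice());
        }
    }
}
```

Because the filters are trait objects evaluated inside one loop, each command only has to supply its own callback (write a row, emit CSV, feed a graph builder), and an empty filter list lets every row through.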
+5
-4
@@ -9,15 +9,16 @@
| `superkmer` | Extract super-kmers from a sequence file and write to stdout |
| `index` | Build a complete genome index (scatter → dereplicate → count → layered MPHF) |
| `merge` | Merge multiple built indexes into one |
| `rebuild` | Filter and compact an existing index into a new single-layer index |
| `rebuild` | Filter and compact an existing index into a new single-layer index; supports ingroup/outgroup predicates on genome metadata |
| `query` | Query an index with sequences and annotate matches |
| `dump` | Dump all indexed kmers as CSV (kmer + per-genome counts or presence) |
| `dump` | Dump all indexed k-mers as CSV (kmer + per-genome counts or presence); supports the same ingroup/outgroup filtering as `rebuild` |
| `annotate` | Add or update genome metadata from a CSV file; or dump metadata as CSV |
| `distance` | Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees |
| `unitig` | Dump unitigs from a built index to stdout (debug) |
| `unitig` | Build a global de Bruijn graph across all partitions and enumerate its unitigs as FASTA; supports the same ingroup/outgroup filtering as `rebuild` |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label; `--upgrade-index` adds missing `layer_meta.json` to old indexes |
| `pack` | Pack per-column matrix files into single-file format to reduce query I/O |

## Constraints
+1
-1
@@ -49,7 +49,7 @@ nav:
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
- PersistentBitVec: implementation/persistent_bit_vec.md
- Merge command: implementation/merge.md
- Rebuild filters: implementation/rebuild_filter.md
- Kmer filtering (rebuild/dump/unitig): implementation/rebuild_filter.md
- Architecture:
- Sequences: architecture/sequences/invariant.md
- Kmer index: architecture/index_architecture.md