docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
@@ -243,19 +243,28 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<a class="md-nav__link" href="#two-reading-paths">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: rope
|
||||
Two reading paths
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#allocation-policy">
|
||||
<a class="md-nav__link" href="#record-path-chunk-reader">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Allocation policy
|
||||
Record path: chunk reader
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: Rope
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -347,6 +356,18 @@
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../evidence_elimination/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
@@ -383,6 +404,30 @@
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../merge/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="../rebuild_filter/">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
@@ -454,19 +499,28 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix="">
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<a class="md-nav__link" href="#two-reading-paths">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: rope
|
||||
Two reading paths
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#allocation-policy">
|
||||
<a class="md-nav__link" href="#record-path-chunk-reader">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Allocation policy
|
||||
Record path: chunk reader
|
||||
|
||||
</span>
|
||||
</a>
|
||||
</li>
|
||||
<li class="md-nav__item">
|
||||
<a class="md-nav__link" href="#output-type-rope">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Output type: Rope
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -506,68 +560,77 @@
|
||||
<div class="md-content" data-md-component="content">
|
||||
<article class="md-content__inner md-typeset">
|
||||
<h1 id="chunk-reader-implementation">Chunk reader — implementation</h1>
|
||||
<p>The <code>obiread</code> crate provides a streaming iterator that reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. Chunks are consumed in parallel by downstream workers.</p>
|
||||
<h2 id="output-type-rope">Output type: rope</h2>
|
||||
<p>Each chunk is a <code>Vec<Bytes></code> — a <strong>rope</strong>: a list of reference-counted byte slices that are not necessarily contiguous in memory. The consumer iterates over the slices in order.</p>
|
||||
<p>Using <code>bytes::Bytes</code> means the split at the record boundary is O(1): <code>Bytes::split_to(n)</code> adjusts a reference counter, not memory. No <code>memcpy</code> in the common case.</p>
|
||||
<h2 id="allocation-policy">Allocation policy</h2>
|
||||
<p><code>obiread</code> exposes two distinct sequence reading paths, each optimised for a different use case.</p>
|
||||
<h2 id="two-reading-paths">Two reading paths</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Case</th>
|
||||
<th>Cost</th>
|
||||
<th>Path</th>
|
||||
<th>API</th>
|
||||
<th>Output unit</th>
|
||||
<th>Per-record identity</th>
|
||||
<th>Use case</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>Boundary found in the current block (common)</td>
|
||||
<td>zero extra allocation — <code>split_to</code> only</td>
|
||||
<td><strong>Record path</strong></td>
|
||||
<td><code>read_sequence_chunks</code> → <code>parse_chunk</code></td>
|
||||
<td><code>SeqRecord</code> (id + raw sequence + normalised rope)</td>
|
||||
<td>yes</td>
|
||||
<td><code>query</code> — must read complete records</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Boundary straddles multiple blocks (sequence > block size, rare)</td>
|
||||
<td>one allocation to pack the rope into a flat buffer</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>EOF flush</td>
|
||||
<td>zero extra allocation</td>
|
||||
<td><strong>Stream path</strong></td>
|
||||
<td><code>open_nuc_stream</code></td>
|
||||
<td><code>NucPage</code> (flat normalised byte buffer)</td>
|
||||
<td>no</td>
|
||||
<td><code>index</code>, <code>superkmer</code> — bulk throughput</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The record path uses <code>Rope</code>-backed chunks and is described in detail below. The stream path (<code>NucStream</code> / <code>NucPage</code>) is described in the scatter section of <a href="../pipeline/">pipeline</a>.</p>
|
||||
<hr/>
|
||||
<h2 id="record-path-chunk-reader">Record path: chunk reader</h2>
|
||||
<p>The chunk reader reads FASTA or FASTQ files in fixed-size blocks and yields self-contained chunks, each ending on a complete sequence record boundary. <code>parse_chunk</code> then converts each chunk into a <code>Vec<SeqRecord></code>, where each record carries its identifier, raw sequence bytes, and a normalised rope ready for superkmer building.</p>
|
||||
<p>This path is mandatory for <code>query</code>, where superkmers must be tracked back to their originating sequence (id, kmer offset) for output annotation.</p>
|
||||
<h2 id="output-type-rope">Output type: Rope</h2>
|
||||
<p>Each chunk is a <code>Rope</code> — a segmented byte sequence: a <code>Vec</code> of blocks, where each block is a <code>Vec<Cell<u8>></code>. The consumer iterates over the blocks via a forward or backward cursor.</p>
|
||||
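<p><code>Rope::split_off(pos)</code> splits at an absolute byte offset in O(log n) (binary search over the block-start index). If <code>pos</code> falls on a block boundary, whole blocks are moved and nothing is copied. If <code>pos</code> falls inside a block, only that block is split in two via <code>Vec::split_off</code>, which copies the bytes after <code>pos</code> into a new <code>Vec</code>; every other block is moved without a <code>memcpy</code>.</p>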
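<p>The split can be sketched as follows. This is a minimal illustration, not the crate's code: it assumes plain <code>Vec<u8></code> blocks instead of <code>Vec<Cell<u8>></code>, and the field names <code>blocks</code>, <code>starts</code> and <code>len</code> are hypothetical. Rebuilding the tail index here is linear in the number of tail blocks; only the lookup is logarithmic.</p>

```rust
/// Simplified rope: a list of byte blocks plus the absolute start offset of
/// each block. Field names and the Vec<u8> block type are assumptions.
#[derive(Default)]
pub struct Rope {
    blocks: Vec<Vec<u8>>,
    starts: Vec<usize>,
    len: usize,
}

impl Rope {
    /// Append a block at the end of the rope (empty blocks are ignored).
    pub fn push(&mut self, block: Vec<u8>) {
        if block.is_empty() {
            return;
        }
        self.starts.push(self.len);
        self.len += block.len();
        self.blocks.push(block);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Keep bytes [0, pos) in `self` and return bytes [pos, len) as a new rope.
    pub fn split_off(&mut self, pos: usize) -> Rope {
        assert!(pos <= self.len, "split position out of bounds");
        // Binary search: number of blocks that start strictly before `pos`.
        let i = self.starts.partition_point(|&s| s < pos);
        // Blocks starting at or after `pos` move to the tail without copying bytes.
        let mut tail_blocks = self.blocks.split_off(i);
        self.starts.truncate(i);
        // The last remaining block may straddle `pos`; split only that one.
        if i > 0 {
            let inner = pos - self.starts[i - 1];
            if inner < self.blocks[i - 1].len() {
                let rest = self.blocks[i - 1].split_off(inner);
                tail_blocks.insert(0, rest);
            }
        }
        self.len = pos;
        let mut tail = Rope::default();
        for block in tail_blocks {
            tail.push(block);
        }
        tail
    }

    /// Flatten the rope into one contiguous buffer (for inspection only).
    pub fn to_vec(&self) -> Vec<u8> {
        self.blocks.concat()
    }
}
```

<p>Pushing <code>abcd</code> and <code>efgh</code> and then splitting at offset 6 leaves <code>abcdef</code> in the original rope and <code>gh</code> in the tail; only the second block is touched.</p>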
|
||||
<h2 id="seqchunkiter">SeqChunkIter</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="cm">/* private */</span><span class="w"> </span><span class="p">}</span>
|
||||
|
||||
<span class="k">impl</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="w"> </span><span class="nb">Iterator</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="n">Bytes</span><span class="o">>></span><span class="p">;</span>
|
||||
<span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">io</span><span class="p">::</span><span class="nb">Result</span><span class="o"><</span><span class="n">Rope</span><span class="o">></span><span class="p">;</span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fasta_chunks</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">fastq_chunks</span><span class="o"><</span><span class="n">R</span><span class="p">:</span><span class="w"> </span><span class="nc">Read</span><span class="o">></span><span class="p">(</span><span class="n">source</span><span class="p">:</span><span class="w"> </span><span class="nc">R</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">SeqChunkIter</span><span class="o"><</span><span class="n">R</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p><code>next()</code> loop:</p>
|
||||
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto rope
|
||||
2. probe check: if the boundary marker ("\n>" or "\n@") is absent from the
|
||||
last block, skip the splitter (avoids a full backward scan for nothing)
|
||||
3. call splitter on last block
|
||||
if found at offset n:
|
||||
remainder = last_block.split_to(n) ← O(1), zero copy
|
||||
return std::mem::take(&mut self.rope) ← the chunk
|
||||
4. if rope.len() > 1 (multi-block accumulation):
|
||||
pack rope into one flat buffer ← one alloc
|
||||
retry splitter on flat buffer
|
||||
5. if EOF: flush remaining rope as final chunk
|
||||
<div class="highlight"><pre><span></span><code>1. read one block of block_size bytes → push onto Rope
|
||||
2. call splitter(rope) → Option<abs_offset>
|
||||
if Some(pos):
|
||||
tail = rope.split_off(pos) ← O(log n), may split one block
|
||||
chunk = mem::replace(&mut rope, tail)
|
||||
return Some(Ok(chunk))
|
||||
3. if EOF and rope non-empty: return Some(Ok(rope)) as final chunk
|
||||
4. if EOF and rope empty: return None
|
||||
</code></pre></div>
|
||||
<p>The <code>Splitter</code> function signature is <code>fn(&Rope) -> Option<usize></code>. It returns the absolute byte offset of the start of the last complete record, or <code>None</code> if no boundary was found in the accumulated rope (need more data).</p>
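<p>The read/split/flush loop can be sketched over a flat buffer. This is a simplified stand-in, not the <code>obiread</code> implementation: it accumulates into a <code>Vec<u8></code> rather than a <code>Rope</code>, and the names <code>ChunkIter</code> and <code>split_at_last_header</code> are hypothetical.</p>

```rust
use std::io::{self, Read};

/// Hypothetical splitter: offset of the last `>` that directly follows a
/// newline, or None when the buffer holds no such boundary yet.
pub fn split_at_last_header(buf: &[u8]) -> Option<usize> {
    buf.windows(2).rposition(|w| w == b"\n>").map(|i| i + 1)
}

/// Simplified chunk iterator over a flat buffer (assumption: the real
/// iterator accumulates blocks in a Rope instead).
pub struct ChunkIter<R: Read> {
    source: R,
    buf: Vec<u8>,
    block_size: usize,
    splitter: fn(&[u8]) -> Option<usize>,
    eof: bool,
}

impl<R: Read> ChunkIter<R> {
    pub fn new(source: R, block_size: usize, splitter: fn(&[u8]) -> Option<usize>) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        ChunkIter { source, buf: Vec::new(), block_size, splitter, eof: false }
    }
}

impl<R: Read> Iterator for ChunkIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if self.eof {
                // Steps 3 and 4: flush what is left, then stop.
                if self.buf.is_empty() {
                    return None;
                }
                return Some(Ok(std::mem::take(&mut self.buf)));
            }
            // Step 1: read up to one block and append it to the accumulator.
            let old_len = self.buf.len();
            self.buf.resize(old_len + self.block_size, 0);
            match self.source.read(&mut self.buf[old_len..]) {
                Ok(0) => {
                    self.buf.truncate(old_len);
                    self.eof = true;
                }
                Ok(n) => {
                    self.buf.truncate(old_len + n);
                    // Step 2: split at the last record boundary, if any.
                    if let Some(pos) = (self.splitter)(&self.buf) {
                        let tail = self.buf.split_off(pos);
                        return Some(Ok(std::mem::replace(&mut self.buf, tail)));
                    }
                }
                Err(e) => {
                    self.buf.truncate(old_len);
                    return Some(Err(e));
                }
            }
        }
    }
}
```

<p>With a 4-byte block size and three short FASTA records, this sketch yields three chunks, each starting on a <code>></code> header, whose concatenation equals the input.</p>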
|
||||
<h2 id="boundary-detection-fasta">Boundary detection — FASTA</h2>
|
||||
<p>Backward scan with a 2-state machine. Searches for <code>></code> immediately preceded by <code>\n</code> or <code>\r</code>:</p>
|
||||
<p>Backward scan with a 2-state machine. Scanning right to left, it looks for a <code>></code> and accepts it only if the next byte examined (the byte that precedes it in the file) is <code>\n</code> or <code>\r</code>:</p>
|
||||
<pre class="mermaid"><code>stateDiagram-v2
|
||||
direction LR
|
||||
[*] --> Scanning
|
||||
Scanning --> FoundGt : '>'
|
||||
FoundGt --> Scanning : other
|
||||
FoundGt --> [*] : '\\n' / '\\r' ✓</code></pre>
|
||||
<p>Returns the byte offset of the <code>></code> that starts the last complete record.</p>
|
||||
<p>Returns the byte offset of the <code>></code> that starts the last complete record. Returns <code>None</code> if only one <code>></code> is found (cannot confirm there is a prior complete record).</p>
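<p>The two-state backward scan can be sketched on a flat byte slice. This is an illustration under assumptions: the real splitter takes a <code>&Rope</code>, and the function name <code>last_fasta_boundary</code> is hypothetical.</p>

```rust
/// Scan `buf` from right to left and return the offset of the last `>` that
/// is immediately preceded by `\n` or `\r`. A `>` at offset 0 has no
/// preceding byte, so a buffer holding a single record yields None.
pub fn last_fasta_boundary(buf: &[u8]) -> Option<usize> {
    // `found_gt` is the FoundGt state: the byte just examined was `>`.
    let mut found_gt = false;
    for i in (0..buf.len()).rev() {
        let b = buf[i];
        if found_gt && (b == b'\n' || b == b'\r') {
            // The `>` sits one byte to the right of this line terminator.
            return Some(i + 1);
        }
        // Any other byte returns to Scanning; a `>` enters FoundGt.
        found_gt = b == b'>';
    }
    None
}
```

<p>A <code>></code> inside a header line is ignored because the byte before it is not a line terminator.</p>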
|
||||
<h2 id="boundary-detection-fastq">Boundary detection — FASTQ</h2>
|
||||
<p>FASTQ records have a rigid 4-line structure (<code>@header</code>, sequence, <code>+</code>, quality). The <code>@</code> character (ASCII 64, Phred score 31) can appear legitimately in quality lines, making any forward heuristic unreliable. The backward scanner verifies the full structural context before accepting a candidate <code>@</code>.</p>
|
||||
<p>7-state machine (port of Go's <code>EndOfLastFastqEntry</code>), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
|
||||
<p>7-state machine (states 0–6), scanning from <strong>right to left</strong>. Each time a <code>+</code> is found, its position is saved as <code>restart</code>; any state mismatch resets the scan to that position.</p>
|
||||
<pre class="mermaid"><code>stateDiagram-v2
|
||||
direction LR
@@ -514,10 +514,21 @@
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#memory-layout" class="md-nav__link">
|
||||
<a href="#types-and-layout" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Memory layout
|
||||
Types and layout
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#global-parameters" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Global parameters
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -558,10 +569,32 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-form" class="md-nav__link">
|
||||
<a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical form
|
||||
Canonical form and CanonicalKmerOf
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#sliding-window-helpers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Sliding window helpers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#hashing" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Hashing
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -751,6 +784,34 @@
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -829,6 +890,62 @@
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -973,10 +1090,21 @@
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#memory-layout" class="md-nav__link">
|
||||
<a href="#types-and-layout" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Memory layout
|
||||
Types and layout
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#global-parameters" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Global parameters
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1017,10 +1145,32 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-form" class="md-nav__link">
|
||||
<a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical form
|
||||
Canonical form and CanonicalKmerOf
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#sliding-window-helpers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Sliding window helpers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#hashing" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Hashing
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1045,12 +1195,43 @@
|
||||
|
||||
|
||||
<h1 id="kmer-implementation">Kmer — implementation</h1>
|
||||
<h2 id="memory-layout">Memory layout</h2>
|
||||
<p><code>Kmer</code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code>:</p>
|
||||
<h2 id="types-and-layout">Types and layout</h2>
|
||||
<p><code>KmerOf<L></code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code> parameterized by a <code>KmerLength</code> marker:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="cp">#[repr(transparent)]</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">Kmer</span><span class="p">(</span><span class="kt">u64</span><span class="p">);</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">KmerOf</span><span class="o"><</span><span class="n">L</span><span class="p">:</span><span class="w"> </span><span class="nc">KmerLength</span><span class="o">></span><span class="p">(</span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="n">PhantomData</span><span class="o"><</span><span class="n">L</span><span class="o">></span><span class="p">);</span>
|
||||
</code></pre></div>
|
||||
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is <strong>not stored</strong> — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.</p>
|
||||
<p>Three marker types implement <code>KmerLength</code>:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Marker</th>
|
||||
<th><code>len()</code> source</th>
|
||||
<th>Used for</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>KLen</code></td>
|
||||
<td><code>params::k()</code></td>
|
||||
<td>k-mers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>MLen</code></td>
|
||||
<td><code>params::m()</code></td>
|
||||
<td>minimizers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>ConstLen<N></code></td>
|
||||
<td>const generic <code>N</code></td>
|
||||
<td>tests</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>Public aliases:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Kmer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KmerOf</span><span class="o"><</span><span class="n">KLen</span><span class="o">></span><span class="p">;</span><span class="w"> </span><span class="c1">// k-mer, global k</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Minimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="o"><</span><span class="n">MLen</span><span class="o">></span><span class="p">;</span><span class="w"> </span><span class="c1">// canonical m-mer, global m</span>
|
||||
</code></pre></div>
|
||||
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is <strong>not stored</strong> — every operation reads it from <code>L::len()</code>.</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
@@ -1071,33 +1252,41 @@
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
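<p>The marker-type layout can be sketched in a few lines. Only <code>KmerOf</code>, <code>KmerLength</code> and <code>ConstLen</code> are names from this page; the methods <code>from_raw</code>, <code>raw</code> and <code>nucleotide</code> are hypothetical helpers added for illustration, and the sketch assumes 1 ≤ len ≤ 32.</p>

```rust
use std::marker::PhantomData;

/// Marker trait: supplies the length instead of storing it in each value.
pub trait KmerLength {
    fn len() -> usize;
}

/// Compile-time length marker, as used for tests.
pub struct ConstLen<const N: usize>;

impl<const N: usize> KmerLength for ConstLen<N> {
    fn len() -> usize {
        N
    }
}

/// Same size as a u64: PhantomData is zero-sized.
#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);

impl<L: KmerLength> KmerOf<L> {
    /// Hypothetical constructor: keeps the top 2*len bits and zeroes the rest,
    /// which maintains the "low bits are always zero" invariant.
    pub fn from_raw(raw: u64) -> Self {
        let k = L::len();
        assert!((1..=32).contains(&k), "length must be in 1..=32");
        let mask = !0u64 << (64 - 2 * k);
        KmerOf(raw & mask, PhantomData)
    }

    pub fn raw(&self) -> u64 {
        self.0
    }

    /// 2-bit code of nucleotide `i`; nucleotide 0 occupies bits 63..62.
    pub fn nucleotide(&self, i: usize) -> u8 {
        assert!(i < L::len(), "nucleotide index out of range");
        ((self.0 >> (62 - 2 * i)) & 3) as u8
    }
}
```

<p>Because the length lives in the type, a value of one length cannot be passed where another is expected, and the value itself stays 8 bytes.</p>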
|
||||
<h2 id="global-parameters">Global parameters</h2>
|
||||
<p><code>params::set_k(k)</code> / <code>params::k()</code> and <code>params::set_m(m)</code> / <code>params::m()</code> are backed by <code>OnceLock<usize></code> in production (write-once, panic on conflict) and by <code>thread_local! { Cell<usize> }</code> in test builds (per-thread, freely writable). <code>params::init(k, m)</code> sets both in one call.</p>
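<p>A minimal sketch of the production (write-once) variant, using <code>std::sync::OnceLock</code>. Two points are assumptions rather than documented behaviour: setting the same value twice is accepted here, and reading an unset parameter panics. The helper <code>set_once</code> is hypothetical.</p>

```rust
use std::sync::OnceLock;

static K: OnceLock<usize> = OnceLock::new();
static M: OnceLock<usize> = OnceLock::new();

/// Store `value` on first use; panic if a different value was stored earlier.
/// Assumption: repeating the same value is not treated as a conflict.
fn set_once(cell: &OnceLock<usize>, value: usize, name: &str) {
    let stored = *cell.get_or_init(|| value);
    assert!(
        stored == value,
        "{} already set to {}, cannot change to {}",
        name,
        stored,
        value
    );
}

pub fn set_k(k: usize) {
    set_once(&K, k, "k");
}

pub fn set_m(m: usize) {
    set_once(&M, m, "m");
}

/// Panics if k was never set (assumed behaviour).
pub fn k() -> usize {
    *K.get().expect("k has not been set")
}

/// Panics if m was never set (assumed behaviour).
pub fn m() -> usize {
    *M.get().expect("m has not been set")
}

/// Set both parameters in one call.
pub fn init(k: usize, m: usize) {
    set_k(k);
    set_m(m);
}
```

<p>After <code>init(31, 11)</code>, every later call to <code>k()</code> and <code>m()</code> returns 31 and 11 from any thread; a later <code>set_k(21)</code> would panic.</p>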
|
||||
<h2 id="encoding">Encoding</h2>
|
||||
<p><code>Kmer::from_ascii(ascii, k)</code> encodes the first k bytes of an ASCII slice using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p>
|
||||
<p><code>KmerOf::<L>::from_ascii(ascii)</code> encodes the first <code>L::len()</code> bytes using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="o">..</span><span class="n">k</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">val</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">encode_base</span><span class="p">(</span><span class="n">ascii</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="p">;</span>
|
||||
<span class="p">}</span>
|
||||
<span class="n">Kmer</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span>
|
||||
<span class="n">KmerOf</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
</code></pre></div>
|
||||
<p>Zero allocation — result lives on the stack.</p>
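<p>As a worked example of the packing loop, the sketch below works on a bare <code>u64</code>. The base codes A=0, C=1, G=2, T=3 are an assumption about the <code>ENC</code> table (they are the codes for which bitwise NOT gives the complement), and mapping unknown bytes to 0 is a placeholder choice; <code>encode_kmer</code> is a hypothetical name.</p>

```rust
/// Assumed 2-bit codes: A=0, C=1, G=2, T=3. Unknown bytes map to 0 here,
/// which is a placeholder and not necessarily what the ENC table does.
fn encode_base(b: u8) -> u8 {
    match b {
        b'A' | b'a' => 0,
        b'C' | b'c' => 1,
        b'G' | b'g' => 2,
        b'T' | b't' => 3,
        _ => 0,
    }
}

/// Pack the first `k` bytes of `ascii`, 2 bits per base, then left-align so
/// that nucleotide 0 ends up in bits 63..62.
pub fn encode_kmer(ascii: &[u8], k: usize) -> u64 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    assert!(ascii.len() >= k, "input shorter than k");
    let mut val = 0u64;
    for &b in &ascii[..k] {
        val = (val << 2) | encode_base(b) as u64;
    }
    val << (64 - 2 * k)
}
```

<p>For <code>ACGT</code> with k = 4 the loop builds <code>0b00011011</code> (0x1B), and the final shift by 56 gives <code>0x1B00_0000_0000_0000</code>.</p>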
|
||||
<h2 id="decoding">Decoding</h2>
|
||||
<p><code>write_ascii(k, buf)</code> appends k ASCII characters to a caller-supplied <code>Vec<u8></code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.</p>
|
||||
<p><code>to_ascii(k)</code> is a convenience wrapper that allocates and returns a <code>Vec<u8></code>; intended for tests and display only.</p>
|
||||
<p><code>write_ascii(writer)</code> writes the <code>L::len()</code> ASCII characters of the k-mer to any <code>W: Write</code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.</p>
|
||||
<p><code>to_ascii()</code> is a convenience wrapper that allocates and returns a <code>Vec<u8></code>; intended for tests and display only.</p>
|
||||
<h2 id="reverse-complement">Reverse complement</h2>
|
||||
<p>Computed as pure arithmetic — no lookup table, no memory access:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">!</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="c1">// complement</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">swap_bytes</span><span class="p">();</span><span class="w"> </span><span class="c1">// reverse bytes</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">4</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap nibbles</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">2</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap 2-bit groups</span>
|
||||
<span class="n">Kmer</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span>
|
||||
<span class="n">KmerOf</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
</code></pre></div>
|
||||
<p>After complementing, bytes are reversed (<code>swap_bytes</code>), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.</p>
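<p>The same steps, written as a free function on a bare <code>u64</code> so they can be checked by hand. The sketch assumes the codes A=0, C=1, G=2, T=3 (so that bitwise NOT is the complement) and 1 ≤ k ≤ 32; the function name <code>revcomp</code> taking an explicit <code>k</code> is for illustration only.</p>

```rust
/// Reverse complement of a left-aligned, 2-bit-packed k-mer.
pub fn revcomp(kmer: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    // Complement every base; the unused low bits become ones for now.
    let x = !kmer;
    // Reverse the order of the 32 two-bit groups in three steps.
    let x = x.swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The k bases now sit in the low 2k bits; shifting left re-aligns them to
    // the MSB and pushes out the complemented padding.
    x << (64 - 2 * k)
}
```

<p>Worked example with k = 4: <code>AAAC</code> is <code>0x0100_0000_0000_0000</code>. Complementing gives <code>0xFEFF_FFFF_FFFF_FFFF</code>, the three swaps give <code>0xFFFF_FFFF_FFFF_FFBF</code>, and the shift by 56 gives <code>0xBF00_0000_0000_0000</code>, which decodes to <code>GTTT</code>.</p>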
|
||||
<h2 id="canonical-form">Canonical form</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">Self</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">(</span><span class="n">k</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="o">*</span><span class="bp">self</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="p">}</span>
|
||||
<h2 id="canonical-form-and-canonicalkmerof">Canonical form and <code>CanonicalKmerOf</code></h2>
|
||||
<p><code>canonical()</code> returns a <code>CanonicalKmerOf<L></code> — a distinct newtype that carries the same <code>u64</code> layout but enforces the invariant that the stored value equals <code>min(kmer, revcomp)</code>:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">CanonicalKmerOf</span><span class="o"><</span><span class="n">L</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">();</span>
|
||||
<span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>The result is the lexicographic minimum of the forward kmer and its reverse complement, obtained by comparing the raw <code>u64</code> values directly (the left-aligned encoding makes this equivalent to nucleotide-wise comparison). No allocation takes place: the result lives on the stack.</p>
|
||||
<p><code>CanonicalKmerOf::from_raw_unchecked(raw)</code> is the only other public constructor, for trusted paths such as deserialisation.</p>
|
||||
<h2 id="sliding-window-helpers">Sliding window helpers</h2>
|
||||
<p><code>push_right(nuc)</code> / <code>push_left(nuc)</code> shift the window by one base in O(1). <code>is_overlapping(other)</code> checks whether the last k−1 nucleotides of <code>self</code> equal the first k−1 of <code>other</code>.</p>
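<p>As an illustration of why these helpers are O(1), here is a minimal sketch on a stand-in type. It assumes a left-aligned 2-bit encoding (A=0, C=1, G=2, T=3, first base in the two most significant bits) and a runtime <code>k</code>; the <code>Kmer</code> struct, <code>from_codes</code> and <code>last_shift</code> are hypothetical and are not the crate's actual <code>KmerOf</code> implementation.</p>

```rust
/// Hypothetical stand-in for a left-aligned 2-bit kmer (A=0, C=1, G=2, T=3).
/// The first nucleotide sits in the two most significant bits of `raw`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Kmer {
    raw: u64,
    k: u32, // assumed range: 2 <= k <= 32
}

impl Kmer {
    /// Bit offset of the last nucleotide of the window.
    fn last_shift(&self) -> u32 {
        64 - 2 * self.k
    }

    /// Build a kmer from 2-bit nucleotide codes (illustration helper).
    pub fn from_codes(codes: &[u8]) -> Self {
        let k = codes.len() as u32;
        assert!((2..=32).contains(&k));
        let mut raw = 0u64;
        for (i, &c) in codes.iter().enumerate() {
            raw |= u64::from(c & 3) << (62 - 2 * i as u32);
        }
        Kmer { raw, k }
    }

    /// Drop the first base and append `nuc` on the right: one shift, one OR.
    pub fn push_right(self, nuc: u8) -> Self {
        let raw = (self.raw << 2) | (u64::from(nuc & 3) << self.last_shift());
        Kmer { raw, k: self.k }
    }

    /// Drop the last base and prepend `nuc` on the left.
    pub fn push_left(self, nuc: u8) -> Self {
        let mask = u64::MAX << self.last_shift(); // keeps the top 2k bits
        let raw = ((self.raw >> 2) & mask) | (u64::from(nuc & 3) << 62);
        Kmer { raw, k: self.k }
    }

    /// True when the last k-1 bases of `self` equal the first k-1 of `other`.
    pub fn is_overlapping(self, other: Kmer) -> bool {
        let prefix_mask = u64::MAX << (self.last_shift() + 2); // top 2(k-1) bits
        self.k == other.k && (self.raw << 2) == (other.raw & prefix_mask)
    }
}
```

<p>With this layout, pushing a base never loops over the kmer: the window moves with a constant number of shifts and masks regardless of <code>k</code>.</p>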
|
||||
<h2 id="hashing">Hashing</h2>
|
||||
<p><code>hash_kmer(raw: u64) -> u64</code> computes <code>mix64(raw ^ 0x9e3779b97f4a7c15)</code>, the seeded splitmix64 finalizer. <code>CanonicalKmerOf::seq_hash()</code> delegates to <code>hash_kmer</code>.</p>
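<p>A minimal sketch of this hash, assuming <code>mix64</code> is the standard splitmix64 finalizer with its published constants. The body of <code>mix64</code> is an assumption made for illustration; only the seed value and the XOR-then-mix structure come from the description above.</p>

```rust
/// Seed quoted in the text (the 64-bit golden-ratio constant).
const SEED: u64 = 0x9e37_79b9_7f4a_7c15;

/// splitmix64 finalizer with its published constants.
/// Assumption: the crate's `mix64` matches this definition.
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Seeded hash of a raw kmer value: XOR with the seed, then mix.
pub fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ SEED)
}
```

<p>Each step of the finalizer is invertible, so distinct raw kmers always produce distinct 64-bit hashes; collisions only appear once the hash is truncated, for example to a b-bit fingerprint.</p>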
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -757,6 +757,28 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-modes" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence modes
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-functions" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build functions
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -840,6 +862,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -918,6 +968,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1165,6 +1271,28 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-modes" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence modes
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-functions" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build functions
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1226,26 +1354,26 @@
|
||||
<h2 id="why-two-phases-are-needed">Why two phases are needed</h2>
|
||||
<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of surviving unique kmers is not known until after counting and filtering low-abundance kmers.</p>
|
||||
<h3 id="phase-1-provisional-mphf-kmer-spectrum">Phase 1 — provisional MPHF + kmer spectrum</h3>
|
||||
<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code>.</p>
|
||||
<p>Implemented in <code>obikpartitionner::KmerPartition::count_kmer()</code> → <code>count_partition()</code>.</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1</strong>: read the dereplicated superkmer file; enumerate all unique canonical kmers into a <code>HashSet</code>. Exact count known after this pass.</li>
|
||||
<li><strong>Build a provisional MPHF</strong> (<code>GOFunction</code> from the <code>ph</code> crate) over the exact kmer set. Produces <code>mphf1.bin</code>.</li>
|
||||
<li><strong>Create <code>counts1.bin</code></strong>: one zero-initialised <code>u32</code> per MPHF slot (mmap'd).</li>
|
||||
<li><strong>Pass 2</strong>: re-read the dereplicated file; for each kmer, query <code>mphf1.get(kmer)</code> and atomically accumulate the superkmer count into <code>counts1[slot]</code>.</li>
|
||||
<li><strong>External sort</strong>: read the dereplicated superkmer file; extract the raw <code>u64</code> canonical kmer value for every kmer of every superkmer. Sort in RAM-bounded chunks (adaptive budget: 40% of available RAM ÷ n_threads, minimum 1 M kmers per chunk), then k-way merge with inline dedup. Result: <code>sorted_unique.bin</code> — a flat array of f0 distinct sorted <code>u64</code> values. Exact kmer count f0 is known at this point.</li>
|
||||
<li><strong>Build provisional MPHF</strong> (ptr_hash, same configuration as phase 2) over <code>sorted_unique.bin</code> using <code>new_from_par_iter</code>. Delete <code>sorted_unique.bin</code> immediately after. Persist to <code>mphf1.bin</code>.</li>
|
||||
<li><strong>Create <code>counts1.bin</code></strong>: <code>PersistentCompactIntVec</code> with f0 slots, zero-initialised.</li>
|
||||
<li><strong>Accumulation pass</strong>: re-read the dereplicated superkmer file; for each kmer in each superkmer, compute <code>slot = mphf.index(kmer.raw())</code> and increment <code>counts1[slot]</code> by the superkmer's COUNT.</li>
|
||||
<li><strong>Build kmer frequency spectrum</strong> from <code>counts1</code>: histogram <code>{count → n_kmers}</code>, totals f0 (distinct kmers) and f1 (total abundance). Written to <code>kmer_spectrum_raw.json</code> per partition, then merged globally.</li>
|
||||
</ol>
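<p>The spectrum step can be sketched as follows. This is a simplified illustration that reads counts from a plain slice rather than from <code>PersistentCompactIntVec</code>; the <code>Spectrum</code> struct and <code>build_spectrum</code> function are hypothetical names.</p>

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory form of a kmer frequency spectrum.
pub struct Spectrum {
    /// count -> number of distinct kmers observed with that count
    pub histogram: BTreeMap<u32, u64>,
    /// number of distinct kmers
    pub f0: u64,
    /// total abundance (sum of all counts)
    pub f1: u64,
}

/// Build the spectrum from one count per MPHF slot.
pub fn build_spectrum(counts: &[u32]) -> Spectrum {
    let mut histogram = BTreeMap::new();
    let mut f1 = 0u64;
    for &c in counts {
        *histogram.entry(c).or_insert(0u64) += 1;
        f1 += u64::from(c);
    }
    Spectrum { histogram, f0: counts.len() as u64, f1 }
}
```

<p>Because every MPHF slot corresponds to exactly one distinct kmer, f0 is simply the number of slots, and per-partition spectra can be merged globally by summing histogram entries.</p>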
|
||||
<p>Files produced per partition:</p>
|
||||
<div class="highlight"><pre><span></span><code>part_XXXXX/
|
||||
mphf1.bin — GOFunction (provisional MPHF, discarded after phase 2)
|
||||
counts1.bin — [u32; n_kmers] kmer counts, mmap'd
|
||||
mphf1.bin — ptr_hash provisional MPHF (discarded after phase 2)
|
||||
counts1.bin — PersistentCompactIntVec, f0 × u32 kmer counts
|
||||
kmer_spectrum_raw.json — local frequency spectrum
|
||||
</code></pre></div>
|
||||
<h3 id="phase-2-definitive-mphf">Phase 2 — definitive MPHF</h3>
|
||||
<p>After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see <a href="../pipeline/">Construction pipeline</a>), the exact filtered kmer set is available via <code>unitigs.bin</code>.</p>
|
||||
<p><code>MphfLayer::build</code> is called on the unitig file:</p>
|
||||
<p><code>MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)</code> is called on the unitig directory:</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1</strong>: iterate all canonical kmers from <code>unitigs.bin</code> in parallel, build and store <code>mphf.bin</code> (ptr_hash).</li>
|
||||
<li><strong>Pass 2</strong>: iterate sequentially, fill <code>evidence.bin</code>, call the mode-specific <code>fill_slot</code> callback.</li>
|
||||
<li><strong>Pass 1</strong> (parallel): a <code>CanonicalKmerIter</code> — clonable via <code>Arc<Mmap></code>, no file reopening — is passed directly to <code>new_from_par_iter</code> via <code>par_bridge()</code>. No <code>.idx</code> is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces <code>mphf.bin</code>.</li>
|
||||
<li><strong>Pass 2</strong> (sequential): iterate with <code>iter_indexed_canonical_kmers</code>; fill evidence files; call <code>fill_slot(slot, kmer)</code> callback per kmer. For Exact/Hybrid, <code>.idx</code> is written at the end of this pass — never earlier.</li>
|
||||
</ol>
|
||||
<p><code>mphf1.bin</code> and <code>counts1.bin</code> are no longer needed after phase 2 and can be deleted.</p>
|
||||
<hr />
|
||||
@@ -1265,13 +1393,11 @@
|
||||
<p><strong>FMPH/FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
|
||||
<ul>
|
||||
<li>~2.1 bits/key — most compact; good query speed; deterministic construction</li>
|
||||
<li>Works well from an exact or slightly overestimated count</li>
|
||||
<li><code>GOFunction</code> (group-oriented variant) is the specific type used</li>
|
||||
<li><code>GOFunction</code> (group-oriented variant) was the original phase-1 choice; eliminated when the external sort made the exact count available at phase 1 as well</li>
|
||||
</ul>
|
||||
<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
|
||||
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): <code>ph::fmph::GOFunction</code>. Compact, fast to build from the exact post-dedup kmer set. Query speed is secondary — the structure is only used during pass 2 of <code>count_kmer</code>.</p>
|
||||
<p><strong>Phase 2</strong> (persistent, queried repeatedly): <strong>ptr_hash</strong>. Exact key count is available from the unitig index; ptr_hash query speed (≥2.1×) and construction speed (≥3.1× over FMPH) are the decisive factors. The 2.4 bits/key overhead is acceptable.</p>
|
||||
<p>boomphf is eliminated: largest space overhead, streaming advantage does not apply.</p>
|
||||
<p><strong>Both phases</strong>: <strong>ptr_hash</strong>, same type alias and construction parameters. The external sort (phase 1) and the unitig index (phase 2) both provide the exact key count before MPHF construction, so ptr_hash's requirement is satisfied in both cases. Using a single MPHF implementation removes the <code>ph</code> crate dependency.</p>
|
||||
<p>boomphf: eliminated — largest space overhead, streaming advantage no longer needed. FMPH/GOFunction: eliminated — exact count available, ptr_hash is faster at equivalent compactness.</p>
|
||||
<hr />
|
||||
<h2 id="space-at-scale">Space at scale</h2>
|
||||
<p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
|
||||
@@ -1320,9 +1446,12 @@
|
||||
<h3 id="layer-structure">Layer structure</h3>
|
||||
<p>Each layer is a self-contained unit. See <a href="../obilayeredmap/">obilayeredmap</a> for the full on-disk layout. The MPHF-relevant files are:</p>
|
||||
<div class="highlight"><pre><span></span><code>layer_i/
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence)
|
||||
unitigs.bin — packed 2-bit nucleotide sequences (kmer evidence source)
|
||||
unitigs.bin.idx — random-access block index (block_bits controls granularity)
|
||||
mphf.bin — ptr_hash phase-2 MPHF
|
||||
evidence.bin — n × u32: (chunk_id: 25 bits | rank: 7 bits) per slot
|
||||
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
|
||||
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
|
||||
[no layer_meta.json — mode stored once in partition-level meta.json]
|
||||
</code></pre></div>
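<p>The exact-mode evidence entry can be illustrated with a small packing sketch. It assumes a 32-bit word with <code>chunk_id</code> in the high 25 bits and <code>rank</code> in the low 7 bits; the bit order and the function names are assumptions, since the layout above only gives the field widths.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7f
const CHUNK_MAX: u32 = (1 << 25) - 1;

/// Pack (chunk_id, rank) into one 32-bit evidence word.
/// Returns None when a field does not fit its bit width.
pub fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > CHUNK_MAX || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: recover (chunk_id, rank).
pub fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

<p>With 7 rank bits, a chunk can address at most 128 kmer positions, and 25 chunk bits allow about 33.5 million chunks per layer.</p>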
|
||||
<p>Layers are <strong>disjoint</strong>: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:</p>
|
||||
<ol>
|
||||
@@ -1330,17 +1459,43 @@
|
||||
<li>Collect kmers of B not present in any layer → set <code>B \ A</code>.</li>
|
||||
<li>Build layer 1 from <code>B \ A</code> (dereplicate → count → De Bruijn → unitigs → <code>MphfLayer::build</code>).</li>
|
||||
</ol>
|
||||
<h3 id="evidence-modes">Evidence modes</h3>
|
||||
<p>Three evidence modes are supported via <code>IndexMode</code>, stored once in <code>PartitionMeta</code> at partition root. There is no <code>layer_meta.json</code>.</p>
|
||||
<p><strong>Exact</strong> (<code>IndexMode::Exact</code>): <code>evidence.bin</code> stores one <code>(chunk_id, rank)</code> pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. <code>.idx</code> required at query time.</p>
|
||||
<p><strong>Approx</strong> (<code>IndexMode::Approx { b, z }</code>): <code>fingerprint.bin</code> stores a b-bit hash per slot. The false-positive rate is 1/2^b per kmer query; with the Findere z-parameter, z consecutive kmers must all match, which reduces the window false-positive rate to ≈ 1/2^(b·z). No <code>.idx</code> is written or needed.</p>
|
||||
<p><strong>Hybrid</strong> (<code>IndexMode::Hybrid { b, z }</code>): writes both <code>fingerprint.bin</code> and <code>evidence.bin</code> + <code>.idx</code>. <code>find()</code> uses the fingerprint (O(1)); <code>find_strict()</code> uses exact evidence (O(1)).</p>
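<p>To get a feel for the numbers, the two false-positive formulas can be evaluated directly. The function names below are hypothetical, and the window formula treats the z kmer matches as independent, which is the approximation behind 1/2^(b·z).</p>

```rust
/// Per-kmer false-positive rate of a b-bit fingerprint: 1 / 2^b.
pub fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Window false-positive rate when z consecutive kmers must all match,
/// assuming independent fingerprints: 1 / 2^(b*z).
pub fn window_fp_rate(b: u8, z: u8) -> f64 {
    0.5f64.powi(i32::from(b) * i32::from(z))
}
```

<p>For example, b = 8 gives a per-kmer rate of 1/256 (about 0.4%), while b = 8 with z = 3 gives 1/2^24, roughly 6 in 100 million windows.</p>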
|
||||
<h3 id="build-functions">Build functions</h3>
|
||||
<div class="highlight"><pre><span></span><code>MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
|
||||
Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
|
||||
Pass 2: sequential iter → fill evidence files + call fill_slot
|
||||
.idx written last for Exact/Hybrid (query-time only)
|
||||
|
||||
MphfLayer::build_exact_evidence(dir, block_bits)
|
||||
Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
|
||||
Uses open_sequential(); no .idx required on entry
|
||||
|
||||
MphfLayer::build_approx_evidence(dir, b, z)
|
||||
Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
|
||||
Uses open_sequential(); never writes .idx
|
||||
</code></pre></div>
|
||||
<p>There is no <code>build_evidence</code> dispatch wrapper. Callers choose the appropriate post-hoc build directly.</p>
|
||||
<p>In <code>obikpartitionner</code>, <code>build_index_layer</code> receives <code>block_bits: u8</code> from <code>IndexConfig::block_bits</code> and forwards it directly to <code>Layer::build</code> and <code>Layer::build_approx_evidence</code>.</p>
|
||||
<h3 id="membership-verification">Membership verification</h3>
|
||||
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry: decode the kmer from <code>(chunk_id, rank)</code> and compare to the query. A mismatch means the kmer is absent from this layer; probe the next layer.</p>
|
||||
<p>ptr_hash maps any input to a valid slot — it does not natively detect absent keys. Membership is verified using the evidence entry:</p>
|
||||
<ul>
|
||||
<li><strong>Exact</strong>: decode <code>(chunk_id, rank)</code> from <code>evidence.bin</code>; reconstruct the kmer via <code>unitigs.verify_canonical_kmer</code>; compare to query.</li>
|
||||
<li><strong>Approx</strong>: compare <code>kmer.seq_hash()</code> to the b-bit fingerprint stored at the slot.</li>
|
||||
</ul>
|
||||
<p>A mismatch in either mode means the kmer is absent from this layer; probe the next layer.</p>
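<p>The approximate check can be sketched as below. It assumes the fingerprint is the low b bits of the 64-bit kmer hash; which bits are actually kept is not stated here, so treat that choice and both function names as hypothetical.</p>

```rust
/// Truncate a 64-bit kmer hash to a b-bit fingerprint.
/// Assumption: the low b bits are kept.
pub fn fingerprint(hash: u64, b: u8) -> u64 {
    assert!((1..=63).contains(&b));
    hash & ((1u64 << b) - 1)
}

/// Approximate membership test: compare the fingerprint stored at the
/// slot with the fingerprint of the query hash.
pub fn approx_matches(stored: u64, query_hash: u64, b: u8) -> bool {
    stored == fingerprint(query_hash, b)
}
```

<p>Two different hashes that share their low b bits are indistinguishable to this check, which is exactly the source of the 1/2^b false-positive rate.</p>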
|
||||
<h3 id="query-algorithm">Query algorithm</h3>
|
||||
<div class="highlight"><pre><span></span><code>fn query(kmer) → Option<(layer_index, slot)>:
|
||||
for (i, layer) in layers.iter().enumerate():
|
||||
slot = layer.mphf.index(kmer)
|
||||
if layer.evidence.decode(slot) matches kmer:
|
||||
if layer.evidence.matches(slot, kmer): // exact or approx dispatch
|
||||
return Some((i, slot))
|
||||
return None
|
||||
</code></pre></div>
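<p>The probing loop above can be made concrete with a toy model. In this sketch <code>ToyLayer</code> is a hypothetical stand-in: <code>slot_of</code> imitates an MPHF by returning a valid slot for any input (here simply the kmer modulo the layer size), and <code>matches</code> imitates exact evidence by comparing against the key stored at that slot. None of these names belong to the crate.</p>

```rust
/// Hypothetical stand-in for one layer of the index.
pub struct ToyLayer {
    keys: Vec<u64>, // keys[slot] = kmer stored at that slot
}

impl ToyLayer {
    /// Build a layer from kmers whose residues modulo `kmers.len()` are
    /// all distinct, so that the modulo acts as a perfect hash.
    pub fn new(kmers: &[u64]) -> Self {
        let n = kmers.len();
        assert!(n > 0);
        let mut keys = vec![None; n];
        for &k in kmers {
            let slot = (k % n as u64) as usize;
            assert!(keys[slot].is_none(), "toy MPHF requires collision-free input");
            keys[slot] = Some(k);
        }
        ToyLayer { keys: keys.into_iter().map(|k| k.unwrap()).collect() }
    }

    /// Like an MPHF, this returns a valid slot even for absent kmers.
    fn slot_of(&self, kmer: u64) -> usize {
        (kmer % self.keys.len() as u64) as usize
    }

    /// Exact verification: the slot must hold the queried kmer.
    fn matches(&self, slot: usize, kmer: u64) -> bool {
        self.keys[slot] == kmer
    }
}

/// Probe layers in order; return (layer_index, slot) of the first match.
pub fn query(layers: &[ToyLayer], kmer: u64) -> Option<(usize, usize)> {
    for (i, layer) in layers.iter().enumerate() {
        let slot = layer.slot_of(kmer);
        if layer.matches(slot, kmer) {
            return Some((i, slot));
        }
    }
    None
}
```

<p>An absent kmer still lands on some slot in every layer; it is only the verification step that rejects it, which is why the cost of a miss grows with the number of layers.</p>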
|
||||
<p>Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence decode.</p>
|
||||
<p><code>MphfLayer::find</code> dispatches on <code>LayerEvidence</code> in O(1); there are no panicking <code>find_exact</code>/<code>find_approx</code> methods. <code>find_strict</code> always performs an exact check: O(1) for Exact/Hybrid, and an O(n) sequential scan for Approx. The expected probe depth is 1 for kmers in layer 0. Each probe costs one ptr_hash lookup (~10 ns) plus one evidence check.</p>
|
||||
<h3 id="merging-layers">Merging layers</h3>
|
||||
<p>Two layer chains can be merged by re-indexing their union through the full pipeline. This is expensive (full rebuild) but produces an optimal single-layer index. Merge is a maintenance operation, not a query-path requirement.</p>
|
||||
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
|
||||
|
||||
|
||||
<link rel="prev" href="../unitig_evidence/">
|
||||
<link rel="prev" href="../evidence_elimination/">
|
||||
|
||||
|
||||
<link rel="next" href="../persistent_compact_int_vec/">
|
||||
@@ -647,6 +647,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -729,6 +757,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#index-mode-homogeneity-invariant" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Index mode (homogeneity invariant)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -740,6 +779,34 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="MphfLayer — autonomous kmer → slot mapping">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-api" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query API
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-surface" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build surface
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -751,6 +818,73 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer\<D: LayerData> — MPHF + payload">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-signatures" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build signatures
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintvec-and-fingerprintvecwriter" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
FingerprintVec and FingerprintVecWriter
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layeredmapd-collection-of-layers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
LayeredMap\<D> — collection of layers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="LayeredMap\<D> — collection of layers">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#common-methods" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Common methods
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#push_layer" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
push_layer
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -776,10 +910,10 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-encoding" class="md-nav__link">
|
||||
<a href="#evidence-encoding-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence encoding
|
||||
Evidence encoding (exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -798,14 +932,53 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-path" class="md-nav__link">
|
||||
<a href="#column-append-and-merge-support" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query path
|
||||
Column append and merge support
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Column append and merge support">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-level-genome-column-append" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer-level genome column append
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-matrix-initialisation" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Presence matrix initialisation
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#why-the-mphf-is-never-rebuilt" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Why the MPHF is never rebuilt
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -895,6 +1068,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1058,6 +1287,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#index-mode-homogeneity-invariant" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Index mode (homogeneity invariant)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1069,6 +1309,34 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="MphfLayer — autonomous kmer → slot mapping">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-api" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query API
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-surface" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build surface
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1080,6 +1348,73 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer\<D: LayerData> — MPHF + payload">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#build-signatures" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Build signatures
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintvec-and-fingerprintvecwriter" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
FingerprintVec and FingerprintVecWriter
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layeredmapd-collection-of-layers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
LayeredMap\<D> — collection of layers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="LayeredMap\<D> — collection of layers">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#common-methods" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Common methods
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#push_layer" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
push_layer
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1105,10 +1440,10 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidence-encoding" class="md-nav__link">
|
||||
<a href="#evidence-encoding-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Evidence encoding
|
||||
Evidence encoding (exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1127,14 +1462,53 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#query-path" class="md-nav__link">
|
||||
<a href="#column-append-and-merge-support" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Query path
|
||||
Column append and merge support
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Column append and merge support">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-level-genome-column-append" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer-level genome column append
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-matrix-initialisation" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Presence matrix initialisation
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#why-the-mphf-is-never-rebuilt" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Why the MPHF is never rebuilt
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -1178,7 +1552,7 @@
|
||||
|
||||
<h1 id="obilayeredmap-layered-kmer-index-crate">obilayeredmap — layered kmer index crate</h1>
|
||||
<h2 id="purpose">Purpose</h2>
|
||||
<p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. The index is organised in three levels: <strong>index root → partition → layer</strong>. Each layer covers a disjoint kmer set and wraps a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p>
|
||||
<p><code>obilayeredmap</code> implements a persistent, incrementally extensible kmer index. Each layer covers a disjoint kmer set and wraps a <code>ptr_hash</code> MPHF with associated per-slot data. Adding a new dataset never rebuilds existing layers.</p>
|
||||
<hr />
|
||||
<h2 id="three-usage-modes">Three usage modes</h2>
|
||||
<p>The MPHF + evidence infrastructure is the same for all modes. The <strong>payload</strong> varies.</p>
|
||||
@@ -1214,34 +1588,65 @@
|
||||
</table>
|
||||
<p>Both <code>PersistentCompactIntMatrix</code> and <code>PersistentBitMatrix</code> come from the <code>obicompactvec</code> crate.</p>
|
||||
<hr />
|
||||
<h2 id="index-mode-homogeneity-invariant">Index mode (homogeneity invariant)</h2>
|
||||
<p>A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at <code>LayeredMap::open()</code> from <code>PartitionMeta.mode</code> and passed to each <code>Layer::open()</code> — no per-layer file is read.</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="cp">#[derive(Serialize, Deserialize, Default)]</span>
|
||||
<span class="cp">#[serde(tag = </span><span class="s">"type"</span><span class="cp">, rename_all = </span><span class="s">"snake_case"</span><span class="cp">)]</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="nc">IndexMode</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="cp">#[default]</span>
|
||||
<span class="w"> </span><span class="n">Exact</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">Approx</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="n">Hybrid</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>IndexMode</code> is stored once in <code>PartitionMeta</code> (<code>meta.json</code> at partition root). There is no <code>layer_meta.json</code>.</p>
|
||||
<ul>
|
||||
<li><strong>Exact</strong>: writes <code>evidence.bin</code> + <code>unitigs.bin.idx</code>. Zero false positives.</li>
|
||||
<li><strong>Approx</strong>: writes <code>fingerprint.bin</code> only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No <code>.idx</code> written or required.</li>
|
||||
<li><strong>Hybrid</strong>: writes both <code>fingerprint.bin</code> and <code>evidence.bin</code> + <code>.idx</code>. <code>find()</code> uses the fingerprint (fast, O(1)); <code>find_strict()</code> uses exact evidence.</li>
|
||||
</ul>
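<p>The mode-to-files mapping listed above can be summarised in a small sketch. The enum is repeated here without its serde derives so that the block compiles with the standard library alone, and <code>evidence_files</code> is a hypothetical helper, not part of the crate's API.</p>

```rust
/// Simplified copy of the mode enum (serde attributes omitted).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

/// Evidence-related files written for a layer in the given mode,
/// following the list above.
pub fn evidence_files(mode: IndexMode) -> &'static [&'static str] {
    match mode {
        IndexMode::Exact => &["evidence.bin", "unitigs.bin.idx"],
        IndexMode::Approx { .. } => &["fingerprint.bin"],
        IndexMode::Hybrid { .. } => &["fingerprint.bin", "evidence.bin", "unitigs.bin.idx"],
    }
}
```

<p>Because the mode is stored once per partition, a helper of this kind could tell a reader which files to expect in every layer directory without opening any per-layer metadata.</p>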
|
||||
<hr />
|
||||
<h2 id="mphflayer-autonomous-kmer-slot-mapping">MphfLayer — autonomous kmer → slot mapping</h2>
|
||||
<p><code>MphfLayer</code> encapsulates the MPHF + evidence + unitig spine for one layer. It is independent of any payload data.</p>
|
||||
<p><code>MphfLayer</code> encapsulates the MPHF and evidence store for one layer. It is independent of any payload.</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">MphfLayer</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">Mphf</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="c1">// number of indexed kmers = number of MPHF slots</span>
|
||||
<span class="w"> </span><span class="n">mphf</span><span class="p">:</span><span class="w"> </span><span class="nc">Mphf</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">ev</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerEvidence</span><span class="p">,</span><span class="w"> </span><span class="c1">// loaded at open() time</span>
|
||||
<span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Public API:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">MphfLayer</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span><span class="w"> </span><span class="c1">// Some(slot) or None</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">unitig_writer</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="n">UnitigFileWriter</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span>
|
||||
<span class="w"> </span><span class="n">dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">fill_slot</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">mut</span><span class="w"> </span><span class="k">impl</span><span class="w"> </span><span class="nb">FnMut</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<p><code>LayerEvidence</code> is an internal enum, not public:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">enum</span><span class="w"> </span><span class="nc">LayerEvidence</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">Exact</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="n">Approx</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="nc">FingerprintVec</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs_path</span><span class="p">:</span><span class="w"> </span><span class="nc">PathBuf</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="n">Hybrid</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">evidence</span><span class="p">:</span><span class="w"> </span><span class="nc">Evidence</span><span class="p">,</span><span class="w"> </span><span class="n">unitigs</span><span class="p">:</span><span class="w"> </span><span class="nc">UnitigFileReader</span><span class="p">,</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="nc">FingerprintVec</span><span class="w"> </span><span class="p">},</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>find</code> returns <code>Some(slot)</code> only after verifying via evidence that the kmer is actually indexed. It returns <code>None</code> for absent keys: ptr_hash maps any input to a valid slot, so evidence verification is the only reliable membership test.</p>
|
||||
<p><code>build</code> runs two sequential passes over <code>unitigs.bin</code>:</p>
|
||||
<p><code>MphfLayer::open(dir, mode: &IndexMode)</code> receives the mode from <code>PartitionMeta</code> — no per-layer file is read.</p>
|
||||
<h3 id="query-api">Query API</h3>
|
||||
<p>Two public query methods, both returning <code>Option<usize></code> (slot index):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">find_strict</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<ul>
|
||||
<li><code>find</code>: O(1) auto-dispatch. Exact → exact evidence check. Approx/Hybrid → fingerprint comparison.</li>
|
||||
<li><code>find_strict</code>: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no <code>.idx</code>).</li>
|
||||
</ul>
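<p>The dispatch described above can be sketched with toy types. Everything here is an assumption for illustration: <code>ToyLayer</code>, the <code>mix</code> hash, the modulo slot function, and kmers represented as plain <code>u64</code> values stand in for the real <code>Mphf</code>, <code>Evidence</code> and <code>FingerprintVec</code>. Keys must be given in slot order (<code>keys[i] % n == i</code>).</p>

```rust
/// Toy 64-bit mixer standing in for the real kmer hash (assumption).
fn mix(x: u64) -> u64 {
    let h = x.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    h ^ (h >> 32)
}

/// Keep only the low `b` bits of `h`.
fn low_bits(h: u64, b: u8) -> u64 {
    if b >= 64 { h } else { h & ((1u64 << b) - 1) }
}

/// Toy mirror of the internal enum: one variant per index mode.
enum LayerEvidence {
    Exact { keys: Vec<u64> },
    // `file_order` stands in for scanning the unitig file sequentially.
    Approx { fps: Vec<u64>, b: u8, file_order: Vec<u64> },
    Hybrid { keys: Vec<u64>, fps: Vec<u64>, b: u8 },
}

pub struct ToyLayer {
    ev: LayerEvidence,
    n: usize,
}

impl ToyLayer {
    pub fn exact(keys: Vec<u64>) -> Self {
        let n = keys.len();
        Self { ev: LayerEvidence::Exact { keys }, n }
    }

    pub fn approx(keys: Vec<u64>, b: u8) -> Self {
        let fps: Vec<u64> = keys.iter().map(|&k| low_bits(mix(k), b)).collect();
        let n = keys.len();
        Self { ev: LayerEvidence::Approx { fps, b, file_order: keys }, n }
    }

    pub fn hybrid(keys: Vec<u64>, b: u8) -> Self {
        let fps: Vec<u64> = keys.iter().map(|&k| low_bits(mix(k), b)).collect();
        let n = keys.len();
        Self { ev: LayerEvidence::Hybrid { keys, fps, b }, n }
    }

    /// Like an MPHF, this maps ANY input to a valid slot, indexed or not.
    fn slot(&self, kmer: u64) -> Option<usize> {
        if self.n == 0 { None } else { Some((kmer % self.n as u64) as usize) }
    }

    /// Fast path: exact check in Exact mode, fingerprint check otherwise.
    pub fn find(&self, kmer: u64) -> Option<usize> {
        let slot = self.slot(kmer)?;
        let hit = match &self.ev {
            LayerEvidence::Exact { keys } => keys[slot] == kmer,
            LayerEvidence::Approx { fps, b, .. }
            | LayerEvidence::Hybrid { fps, b, .. } => fps[slot] == low_bits(mix(kmer), *b),
        };
        hit.then_some(slot)
    }

    /// Always exact: O(1) with evidence, O(n) scan in Approx mode.
    pub fn find_strict(&self, kmer: u64) -> Option<usize> {
        let slot = self.slot(kmer)?;
        let hit = match &self.ev {
            LayerEvidence::Exact { keys } | LayerEvidence::Hybrid { keys, .. } => {
                keys[slot] == kmer
            }
            LayerEvidence::Approx { file_order, .. } => file_order.iter().any(|&k| k == kmer),
        };
        hit.then_some(slot)
    }
}
```

<p>Because both methods are total over all three variants, neither needs a panicking branch for an unsupported mode.</p>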
|
||||
<p>There are no <code>find_exact</code>/<code>find_approx</code> methods; panicking dispatch is eliminated.</p>
|
||||
<h3 id="build-surface">Build surface</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1">// Full MPHF + evidence build (two-pass)</span>
|
||||
<span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span><span class="w"> </span><span class="n">fill_slot</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
|
||||
<span class="c1">// Evidence-only post-hoc builds (MPHF already present)</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_exact_evidence</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_approx_evidence</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p><code>MphfLayer::build</code> runs two passes over <code>unitigs.bin</code>:</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1</strong>: iterate all canonical kmers in parallel via rayon, construct and store <code>mphf.bin</code>. <code>new_from_par_iter</code> avoids materialising a full key <code>Vec</code>.</li>
|
||||
<li><strong>Pass 2</strong>: iterate again sequentially, fill <code>evidence.bin</code>, call <code>fill_slot(slot, kmer)</code> once per kmer for payload population. A compact <code>n/8</code>-byte seen-bitset verifies MPHF injectivity inline.</li>
|
||||
<li><strong>Pass 1</strong> (parallel via rayon): a <code>CanonicalKmerIter</code> (clonable, <code>Arc<Mmap></code>, no file reopening) is passed to <code>new_from_par_iter</code> via <code>par_bridge()</code>. Produces <code>mphf.bin</code>. No <code>.idx</code> is read or created at this stage.</li>
|
||||
<li><strong>Pass 2</strong> (sequential): fill evidence files; call <code>fill_slot(slot, kmer)</code> per kmer. <code>.idx</code> is written last for Exact/Hybrid modes (query-time only).</li>
|
||||
</ol>
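<p>The two-pass shape can be sketched as follows. This is a simplified, sequential illustration under stated assumptions: <code>ToyMphf</code> is a hash-table stand-in for the real ptr_hash-based <code>Mphf</code>, kmers are plain <code>u64</code> values, errors are <code>String</code>s instead of <code>OLMResult</code>, and no files are written. It does show the <code>n/8</code>-byte seen-bitset used to check injectivity while <code>fill_slot</code> is called.</p>

```rust
use std::collections::HashMap;

/// Toy stand-in for the MPHF: an explicit kmer -> slot table (assumption).
pub struct ToyMphf {
    slots: HashMap<u64, usize>,
}

impl ToyMphf {
    /// Pass 1: assign each distinct key the next free slot.
    pub fn from_keys(keys: impl Iterator<Item = u64>) -> Self {
        let mut slots = HashMap::new();
        for k in keys {
            let next = slots.len();
            slots.entry(k).or_insert(next);
        }
        Self { slots }
    }

    pub fn slot(&self, kmer: u64) -> Option<usize> {
        self.slots.get(&kmer).copied()
    }

    pub fn n(&self) -> usize {
        self.slots.len()
    }
}

/// Hypothetical two-pass build over an in-memory kmer list.
pub fn build_two_pass(
    kmers: &[u64],
    fill_slot: &mut impl FnMut(usize, u64) -> Result<(), String>,
) -> Result<usize, String> {
    // Pass 1: construct the slot mapping from all kmers.
    let mphf = ToyMphf::from_keys(kmers.iter().copied());
    let n = mphf.n();

    // Pass 2: visit every kmer again, checking that no slot is hit twice.
    let mut seen = vec![0u8; (n + 7) / 8]; // one bit per slot
    for &kmer in kmers {
        let slot = mphf.slot(kmer).ok_or("kmer missing from MPHF")?;
        let (byte, bit) = (slot / 8, 1u8 << (slot % 8));
        if seen[byte] & bit != 0 {
            return Err(format!("slot {slot} assigned twice"));
        }
        seen[byte] |= bit;
        fill_slot(slot, kmer)?; // payload population hook
    }
    Ok(n)
}
```

<p>In this sketch a duplicated input kmer is reported as an injectivity failure, since it would hit an already-seen slot in pass 2.</p>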
|
||||
<p>For empty layers (n = 0), <code>build</code> returns <code>Ok(0)</code> immediately after creating empty <code>mphf.bin</code> and <code>evidence.bin</code>.</p>
|
||||
<p>There is no <code>build_evidence</code> dispatch wrapper — callers invoke <code>build_exact_evidence</code> or <code>build_approx_evidence</code> directly.</p>
|
||||
<p>For empty layers (n = 0), all build variants return <code>Ok(0)</code> immediately after creating empty output files.</p>
|
||||
<hr />
|
||||
<h2 id="layerd-layerdata-mphf-payload">Layer\<D: LayerData> — MPHF + payload</h2>
|
||||
<p><code>Layer<D></code> pairs an <code>MphfLayer</code> with one payload store.</p>
|
||||
@@ -1261,7 +1666,7 @@
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">T</span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). Build signatures differ between modes and are not in the trait.</p>
|
||||
<p><code>LayerData</code> covers the <strong>read path only</strong> (<code>open</code> + <code>read</code>). Build signatures differ between modes and are not part of the trait.</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
@@ -1288,28 +1693,89 @@
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p><strong>Build signatures:</strong></p>
|
||||
<h3 id="build-signatures">Build signatures</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="c1">// mode 2</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentCompactIntMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_from_map</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">HashMap</span><span class="o"><</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">></span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_from_map</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">HashMap</span><span class="o"><</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">></span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="c1">// mode 3</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentBitMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_presence</span><span class="p">(</span>
|
||||
<span class="w"> </span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">n_genomes</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">present_in</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_presence</span><span class="p">(</span><span class="n">out_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">n_genomes</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">present_in</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>All build impls delegate MPHF + evidence construction to <code>MphfLayer::build</code> via a mode-specific <code>fill_slot</code> callback. Mode 2 pre-reads <code>n_kmers</code> from <code>unitigs.bin</code> to size the <code>PersistentCompactIntMatrixBuilder</code> before calling <code>MphfLayer::build</code>. Mode 3 does the same for <code>PersistentBitMatrixBuilder</code>.</p>
|
||||
<p>All build impls delegate to <code>MphfLayer::build</code> via a mode-specific <code>fill_slot</code> callback. The <code>mode</code> parameter is forwarded directly — no <code>LayerMeta</code> is written.</p>
|
||||
<p>Evidence-only post-hoc builds are accessible directly on <code>Layer<D></code>:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="o"><</span><span class="n">D</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerData</span><span class="o">></span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_exact_evidence</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">block_bits</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">build_approx_evidence</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">:</span><span class="w"> </span><span class="kt">u8</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>There is no <code>build_evidence</code> dispatch wrapper.</p>
|
||||
<hr />
|
||||
<h2 id="fingerprintvec-and-fingerprintvecwriter">FingerprintVec and FingerprintVecWriter</h2>
|
||||
<p>Approximate evidence is stored as a packed b-bit array, one fingerprint per MPHF slot.</p>
|
||||
<div class="highlight"><pre><span></span><code>fingerprint.bin format:
|
||||
magic: b"FPVF" (4 bytes)
|
||||
b: u8 (bits per fingerprint, 1..=64)
|
||||
padding: [0u8; 3]
|
||||
n: u64 LE (number of slots)
|
||||
data: packed bits, ceil(n*b/8) bytes, Lsb0 order
|
||||
</code></pre></div>
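<p>A minimal sketch of reading and writing the 16-byte header laid out above. The function names and the <code>String</code> error type are hypothetical; only the field layout is taken from the format description.</p>

```rust
pub const MAGIC: &[u8; 4] = b"FPVF";
pub const HEADER_LEN: usize = 16; // 4 magic + 1 b + 3 padding + 8 n

/// Serialise the header fields into their on-disk layout.
pub fn encode_header(b: u8, n: u64) -> [u8; HEADER_LEN] {
    let mut h = [0u8; HEADER_LEN];
    h[..4].copy_from_slice(MAGIC);
    h[4] = b; // bytes 5..8 stay zero (padding)
    h[8..16].copy_from_slice(&n.to_le_bytes());
    h
}

/// Parse and validate a header, returning (b, n).
pub fn decode_header(bytes: &[u8]) -> Result<(u8, u64), String> {
    if bytes.len() < HEADER_LEN {
        return Err("truncated header".to_string());
    }
    if bytes[..4] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let b = bytes[4];
    if !(1..=64).contains(&b) {
        return Err(format!("invalid bits per fingerprint: {b}"));
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[8..16]);
    Ok((b, u64::from_le_bytes(raw)))
}

/// Size of the packed data section in bytes: ceil(n * b / 8).
pub fn data_len(b: u8, n: u64) -> u64 {
    ((u128::from(n) * u128::from(b) + 7) / 8) as u64
}
```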
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">FingerprintVec</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">path</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">get</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u64</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">matches</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">:</span><span class="w"> </span><span class="kt">u64</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span>
|
||||
<span class="w"> </span><span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">b</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u8</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>matches(slot, hash)</code> extracts the b-bit fingerprint stored at <code>slot</code> and compares it to the low b bits of <code>hash</code>. It is the core operation of the fingerprint path of <code>find</code> (Approx and Hybrid modes).</p>
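<p>The packed, Lsb0 bit layout behind <code>get</code> and <code>matches</code> can be illustrated with an in-memory toy. <code>PackedFingerprints</code> and its <code>set</code> method are hypothetical; the real type reads from <code>fingerprint.bin</code> and may well use a faster word-level extraction than this bit-by-bit loop.</p>

```rust
/// In-memory toy: `n` fingerprints of `b` bits each, packed Lsb0.
pub struct PackedFingerprints {
    data: Vec<u8>,
    b: u8,
    n: usize,
}

impl PackedFingerprints {
    pub fn new(b: u8, n: usize) -> Self {
        assert!((1..=64).contains(&b), "b must be in 1..=64");
        Self { data: vec![0u8; (n * b as usize + 7) / 8], b, n }
    }

    /// Mask selecting the low `b` bits of a u64.
    fn mask(&self) -> u64 {
        if self.b == 64 { u64::MAX } else { (1u64 << self.b) - 1 }
    }

    /// Store the low `b` bits of `value` at `slot`.
    pub fn set(&mut self, slot: usize, value: u64) {
        assert!(slot < self.n);
        let start = slot * self.b as usize;
        for i in 0..self.b as usize {
            let bit = start + i;
            let m = 1u8 << (bit % 8); // Lsb0: bit 0 is the least significant
            if (value >> i) & 1 == 1 {
                self.data[bit / 8] |= m;
            } else {
                self.data[bit / 8] &= !m;
            }
        }
    }

    /// Read back the `b`-bit fingerprint stored at `slot`.
    pub fn get(&self, slot: usize) -> u64 {
        assert!(slot < self.n);
        let start = slot * self.b as usize;
        let mut v = 0u64;
        for i in 0..self.b as usize {
            let bit = start + i;
            if (self.data[bit / 8] >> (bit % 8)) & 1 == 1 {
                v |= 1u64 << i;
            }
        }
        v
    }

    /// Compare the stored fingerprint with the low `b` bits of `hash`.
    pub fn matches(&self, slot: usize, hash: u64) -> bool {
        self.get(slot) == (hash & self.mask())
    }
}
```

<p>Fingerprints may straddle byte boundaries whenever b is not a multiple of 8, which is why each bit position is computed from the absolute bit offset <code>slot * b + i</code>.</p>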
|
||||
<hr />
|
||||
<h2 id="layeredmapd-collection-of-layers">LayeredMap\<D> — collection of layers</h2>
|
||||
<p><code>LayeredMap<D></code> wraps <code>Vec<Layer<D>></code> for a single partition directory.</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">LayeredMap</span><span class="o"><</span><span class="n">D</span><span class="p">:</span><span class="w"> </span><span class="nc">LayerData</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="nc">PathBuf</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">meta</span><span class="p">:</span><span class="w"> </span><span class="nc">PartitionMeta</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="n">layers</span><span class="p">:</span><span class="w"> </span><span class="nb">Vec</span><span class="o"><</span><span class="n">Layer</span><span class="o"><</span><span class="n">D</span><span class="o">>></span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>PartitionMeta</code> (<code>meta.json</code> at the partition root) stores <code>n_layers</code> and the partition-wide <code>IndexMode</code>.</p>
|
||||
<h3 id="common-methods">Common methods</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">open</span><span class="p">(</span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">create</span><span class="p">(</span><span class="n">root</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="p">:</span><span class="w"> </span><span class="nc">IndexMode</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="bp">Self</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">n_layers</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span>
|
||||
<span class="nc">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">layer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kp">&</span><span class="nc">Layer</span><span class="o"><</span><span class="n">D</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">mode</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kp">&</span><span class="nc">IndexMode</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">Hit</span><span class="o"><</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">></span><span class="p">)</span><span class="o">></span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">next_layer_writer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="n">UnitigFileWriter</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p><code>open</code> reads <code>PartitionMeta</code> once, extracts <code>mode</code>, and passes it to every <code>Layer::open</code> — no per-layer file is read. <code>create</code> stores the given mode in <code>PartitionMeta</code>.</p>
|
||||
<p><code>query</code> probes layers in order and returns <code>(layer_index, Hit)</code> on the first match. Probe depth is 1 for kmers in layer 0 and <code>i + 1</code> for kmers in layer <code>i</code>; a miss probes every layer.</p>
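<p>The first-match rule can be illustrated with a minimal sketch. <code>ToyLayer</code> and <code>query_layers</code> are hypothetical stand-ins, not the crate's types: each toy layer is a sorted list of <code>(kmer, payload)</code> pairs rather than an MPHF-backed <code>Layer<D></code>.</p>

```rust
/// Hypothetical stand-in for one layer: a sorted list of (kmer, payload) pairs.
/// The real layer resolves kmers through an MPHF; a binary search is used here
/// only to keep the sketch self-contained.
struct ToyLayer {
    entries: Vec<(u64, u32)>,
}

impl ToyLayer {
    /// Returns the payload if this layer holds the kmer.
    fn query(&self, kmer: u64) -> Option<u32> {
        self.entries
            .binary_search_by_key(&kmer, |&(k, _)| k)
            .ok()
            .map(|i| self.entries[i].1)
    }
}

/// Probes layers in order; the first layer that contains the kmer wins.
/// Returns the layer index together with the payload.
fn query_layers(layers: &[ToyLayer], kmer: u64) -> Option<(usize, u32)> {
    layers
        .iter()
        .enumerate()
        .find_map(|(i, layer)| layer.query(kmer).map(|hit| (i, hit)))
}
```

<p>Because layers within a partition are disjoint, at most one layer can answer, so stopping at the first hit loses no information.</p>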
|
||||
<h3 id="push_layer">push_layer</h3>
|
||||
<p><code>push_layer</code> builds the next layer from a <code>unitigs.bin</code> already written via <code>next_layer_writer</code>, using <code>DEFAULT_BLOCK_BITS</code>:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1">// mode 1</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">LayeredMap</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="c1">// mode 2</span>
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">LayeredMap</span><span class="o"><</span><span class="n">PersistentCompactIntMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">count_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="n">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">push_layer_from_map</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">counts</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">HashMap</span><span class="o"><</span><span class="n">CanonicalKmer</span><span class="p">,</span><span class="w"> </span><span class="kt">u32</span><span class="o">></span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="kt">usize</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Mode 3 (<code>PersistentBitMatrix</code>) has no <code>push_layer</code> on <code>LayeredMap</code>; callers build directly via <code>Layer<PersistentBitMatrix>::build_presence</code>.</p>
|
||||
<hr />
|
||||
<h2 id="layeredstores-and-aggregation-traits">LayeredStore<S> and aggregation traits</h2>
|
||||
<p><code>LayeredStore<S></code> is a generic aggregation wrapper over <code>Vec<S></code>. It propagates three traits from <code>obicompactvec::traits</code> up the hierarchy via blanket impls:</p>
|
||||
@@ -1320,11 +1786,6 @@
|
||||
<span class="k">impl</span><span class="o"><</span><span class="n">S</span><span class="p">:</span><span class="w"> </span><span class="nc">BitPartials</span><span class="o">></span><span class="w"> </span><span class="n">BitPartials</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">LayeredStore</span><span class="o"><</span><span class="n">S</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="err">…</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// element-wise Σ partials</span>
|
||||
</code></pre></div>
|
||||
<p>Because blanket impls compose, <code>LayeredStore<LayeredStore<S>></code> automatically inherits all three traits when <code>S</code> does — providing the partitioned level without a separate type.</p>
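<p>The composition argument can be checked on a toy model. Everything in this sketch is hypothetical: <code>ColumnSums</code> stands in for one of the aggregation traits, <code>ToyMatrix</code> for a leaf implementor, and <code>ToyStore</code> for the wrapper. The actual trait methods in <code>obicompactvec::traits</code> are not reproduced here.</p>

```rust
/// Hypothetical aggregation trait: one sum per column.
trait ColumnSums {
    fn column_sums(&self) -> Vec<u64>;
}

/// Hypothetical leaf implementor: an in-memory matrix stored column by column.
struct ToyMatrix {
    cols: Vec<Vec<u32>>,
}

impl ColumnSums for ToyMatrix {
    fn column_sums(&self) -> Vec<u64> {
        self.cols
            .iter()
            .map(|c| c.iter().map(|&v| u64::from(v)).sum::<u64>())
            .collect()
    }
}

/// Hypothetical wrapper over a list of stores.
struct ToyStore<S>(Vec<S>);

/// Blanket impl: any wrapper over implementors is itself an implementor,
/// obtained by adding the partial results element-wise.
impl<S: ColumnSums> ColumnSums for ToyStore<S> {
    fn column_sums(&self) -> Vec<u64> {
        let mut total: Vec<u64> = Vec::new();
        for inner in &self.0 {
            let part = inner.column_sums();
            if total.len() < part.len() {
                total.resize(part.len(), 0);
            }
            for (t, p) in total.iter_mut().zip(part) {
                *t += p;
            }
        }
        total
    }
}
```

<p>With this single impl, <code>ToyStore<ToyMatrix></code> and <code>ToyStore<ToyStore<ToyMatrix>></code> both satisfy the trait, which mirrors how the partitioned level needs no dedicated type.</p>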
|
||||
<p><strong>Aggregation hierarchy:</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>PersistentCompactIntMatrix implements CountPartials
|
||||
LayeredStore<PersistentCompactIntMatrix> via blanket impl (one partition)
|
||||
LayeredStore<LayeredStore<…>> via blanket impl (partitioned index)
|
||||
</code></pre></div>
|
||||
<p><strong>Leaf implementors</strong> (in <code>obicompactvec</code>):</p>
|
||||
<table>
|
||||
<thead>
|
||||
@@ -1344,69 +1805,77 @@ LayeredStore<LayeredStore<…>> via blanket impl
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p><code>PersistentCompactIntVec</code> and <code>PersistentBitVec</code> do not implement these traits — they are single-column primitives, not matrix-level aggregators.</p>
|
||||
<p>See <a href="../../architecture/index_architecture/">Kmer index architecture</a> for the full trait API and the two-pass normalised-metric pattern.</p>
|
||||
<hr />
|
||||
<h2 id="on-disk-structure">On-disk structure</h2>
|
||||
<div class="highlight"><pre><span></span><code>index_root/ ← LayeredMap (collection)
|
||||
meta.json
|
||||
part_00000/ ← Partition
|
||||
layer_0/ ← Layer
|
||||
mphf.bin — ptr_hash MPHF (epserde format)
|
||||
unitigs.bin — packed 2-bit nucleotide sequences
|
||||
unitigs.bin.idx — UIDX index: n_unitigs, n_kmers, seqls[], packed_offsets[]
|
||||
evidence.bin — n × u32, each = (chunk_id: 25 bits | rank: 7 bits), LE
|
||||
counts/ [mode 2] PersistentCompactIntMatrix
|
||||
meta.json {"n": N, "n_cols": 1}
|
||||
col_000000.pciv
|
||||
presence/ [mode 3] PersistentBitMatrix
|
||||
meta.json {"n": N, "n_cols": G}
|
||||
col_000000.pbiv
|
||||
…
|
||||
layer_1/
|
||||
…
|
||||
part_00001/
|
||||
<div class="highlight"><pre><span></span><code>partition_root/ ← LayeredMap (one partition)
|
||||
meta.json — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
|
||||
layer_0/ ← Layer
|
||||
mphf.bin — ptr_hash MPHF (epserde format)
|
||||
unitigs.bin — packed 2-bit nucleotide sequences
|
||||
unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
|
||||
evidence.bin — [u32; n], LE (Exact/Hybrid only)
|
||||
fingerprint.bin — packed b-bit array (Approx/Hybrid only)
|
||||
counts/ [mode 2] PersistentCompactIntMatrix
|
||||
meta.json
|
||||
col_000000.pciv
|
||||
presence/ [mode 3] PersistentBitMatrix
|
||||
meta.json
|
||||
col_000000.pbiv …
|
||||
layer_1/
|
||||
…
|
||||
</code></pre></div>
|
||||
<p><strong>Partition</strong> (<code>part_XXXXX/</code>): all kmers whose canonical minimiser hashes to this bucket. Partitions are independent and can be processed in parallel.</p>
|
||||
<p><strong>Layer</strong> (<code>layer_N/</code>): one <code>MphfLayer</code> plus optional payload. Layer 0 covers dataset A; layer 1 covers kmers in B absent from A; etc. Layers within a partition are always disjoint.</p>
|
||||
<p>There is no <code>layer_meta.json</code>. The mode is stored once in <code>PartitionMeta</code> and is valid for all layers. <code>unitigs.bin.idx</code> is built at the end of <code>build_exact_evidence</code> — never during MPHF construction — and is consumed at query time only.</p>
|
||||
<hr />
|
||||
<h2 id="evidence-encoding">Evidence encoding</h2>
|
||||
<h2 id="evidence-encoding-exact">Evidence encoding (exact)</h2>
|
||||
<p><code>evidence.bin</code> is a flat <code>[u32; n]</code> array with no header. Each u32 encodes one slot:</p>
|
||||
<div class="highlight"><pre><span></span><code>bits [31:7] = chunk_id (25 bits) — index of the unitig chunk
|
||||
bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||||
</code></pre></div>
|
||||
<p>Decoding: <code>chunk_id = raw >> 7</code>, <code>rank = raw & 0x7F</code>. Reconstructing the kmer: read k nucleotides at position <code>rank</code> within unitig <code>chunk_id</code>.</p>
|
||||
<p>For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity. The structural maximum from superkmer construction is k − m + 1 = 21 kmers/unitig; longer unitigs arise from paths spanning more than one superkmer.</p>
|
||||
<p><code>chunk_id = raw >> 7</code>, <code>rank = raw & 0x7F</code>. Reconstructing the kmer: read k nucleotides at position <code>rank</code> within unitig <code>chunk_id</code> (requires <code>unitigs.bin.idx</code> for random access).</p>
|
||||
<p>For k=31, m=11, the observed maximum is ~46 kmers per chunk — well within the 127-kmer u7 capacity.</p>
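<p>The bit layout above can be exercised with a short sketch. The function names are hypothetical; only the layout (25-bit <code>chunk_id</code> above a 7-bit <code>rank</code>, little-endian <code>u32</code> words, no header) comes from the description above.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Packs (chunk_id, rank) into one u32, or returns None if either field
/// does not fit in its bit width.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Splits a raw word back into (chunk_id, rank).
fn unpack_evidence(raw: u32) -> (u32, u32) {
    (raw >> RANK_BITS, raw & RANK_MASK)
}

/// Reads slot `slot` from a headerless little-endian [u32; n] byte buffer.
/// Returns None when the slot lies outside the buffer.
fn read_slot(bytes: &[u8], slot: usize) -> Option<(u32, u32)> {
    let start = slot.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let w = bytes.get(start..end)?;
    let raw = u32::from_le_bytes([w[0], w[1], w[2], w[3]]);
    Some(unpack_evidence(raw))
}
```

<p>A byte slice is used instead of a memory map so the sketch needs no external crate; the slot arithmetic is the same either way.</p>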
|
||||
<hr />
|
||||
<h2 id="ptr_hash-configuration">ptr_hash configuration</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">type</span><span class="w"> </span><span class="nc">Mphf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PtrHash</span><span class="o"><</span>
|
||||
<span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="c1">// key type: canonical kmer raw encoding</span>
|
||||
<span class="w"> </span><span class="n">CubicEps</span><span class="p">,</span><span class="w"> </span><span class="c1">// bucket fn: 2.4 bits/key, λ=3.5, α=0.99</span>
|
||||
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="n">CachelineEf</span><span class="o">>></span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: 11.6 bits/entry (Elias-Fano)</span>
|
||||
<span class="w"> </span><span class="n">CachelineEfVec</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="n">CachelineEf</span><span class="o">>></span><span class="p">,</span><span class="w"> </span><span class="c1">// remap: Elias-Fano</span>
|
||||
<span class="w"> </span><span class="n">Xx64</span><span class="p">,</span><span class="w"> </span><span class="c1">// hasher: XXH3-64 with seed</span>
|
||||
<span class="w"> </span><span class="nb">Vec</span><span class="o"><</span><span class="kt">u8</span><span class="o">></span><span class="p">,</span><span class="w"> </span><span class="c1">// pilots</span>
|
||||
<span class="o">></span><span class="p">;</span>
|
||||
</code></pre></div>
|
||||
<p><code>Xx64</code> is chosen over <code>FxHash</code> because canonical kmer raw values are left-aligned u64 with structural zeros in the low bits (42 zeros for k=11, 2 zeros for k=31), which single-multiply hashes distribute poorly.</p>
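<p>The zero counts quoted above follow from the encoding width. This sketch assumes 2 bits per nucleotide packed from the most significant bit downwards; <code>structural_zero_bits</code> and <code>left_align</code> are hypothetical helper names.</p>

```rust
/// Number of low bits that are always zero in a left-aligned 2-bit encoding
/// of a kmer of length k (valid for 1 <= k <= 32).
fn structural_zero_bits(k: u32) -> u32 {
    assert!((1..=32).contains(&k), "k must fit in a u64 at 2 bits per base");
    64 - 2 * k
}

/// Packs 2-bit base codes (0..=3) into a u64, most significant bits first,
/// leaving the unused low bits at zero.
fn left_align(bases: &[u8]) -> u64 {
    let k = bases.len() as u32;
    let unused = structural_zero_bits(k);
    let mut v = 0u64;
    for &b in bases {
        v = (v << 2) | u64::from(b & 3);
    }
    v << unused
}
```

<p>For k=11 this gives 64 − 22 = 42 zero bits and for k=31 it gives 64 − 62 = 2, matching the figures in the paragraph above.</p>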
|
||||
<p><code>CubicEps</code> with <code>PtrHashParams::<CubicEps>::default()</code> (λ=3.5) is a balanced tradeoff: 2× slower construction than <code>Linear/λ=3.0</code>, 20% less space.</p>
|
||||
<p><code>CubicEps</code> with <code>PtrHashParams::<CubicEps>::default()</code> (λ=3.5): 2× slower construction than <code>Linear/λ=3.0</code>, ~20% less space.</p>
|
||||
<hr />
|
||||
<h2 id="query-path">Query path</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">query</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">kmer</span><span class="p">:</span><span class="w"> </span><span class="nc">CanonicalKmer</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Option</span><span class="o"><</span><span class="n">Hit</span><span class="o"><</span><span class="n">D</span><span class="p">::</span><span class="n">Item</span><span class="o">>></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">mphf</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="n">kmer</span><span class="p">).</span><span class="n">map</span><span class="p">(</span><span class="o">|</span><span class="n">slot</span><span class="o">|</span><span class="w"> </span><span class="n">Hit</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">slot</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">:</span><span class="w"> </span><span class="nc">self</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span><span class="w"> </span><span class="p">})</span>
|
||||
<h2 id="column-append-and-merge-support">Column append and merge support</h2>
|
||||
<p>These methods extend existing layers with new genome columns without touching the MPHF.</p>
|
||||
<h3 id="layer-level-genome-column-append">Layer-level genome column append</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentBitMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">append_genome_column</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">value_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
|
||||
<span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="n">PersistentCompactIntMatrix</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">append_genome_column</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">value_of</span><span class="p">:</span><span class="w"> </span><span class="nc">impl</span><span class="w"> </span><span class="nb">Fn</span><span class="p">(</span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>MphfLayer::find</code> probes the MPHF, decodes evidence, and verifies the kmer — returning <code>Some(slot)</code> on match, <code>None</code> otherwise. <code>data.read(slot)</code> is called only on a confirmed hit.</p>
|
||||
<p>In <code>LayeredMap</code>, layers are probed in order; the first match wins. Expected probe depth: 1 for kmers in layer 0.</p>
|
||||
<p>Both delegate to the corresponding <code>PersistentBitMatrix::append_column</code> / <code>PersistentCompactIntMatrix::append_column</code>. They write a new column file (<code>col_NNNNNN.pbiv</code> / <code>col_NNNNNN.pciv</code>) and update <code>meta.json</code> to increment <code>n_cols</code>. <code>value_of</code> is called once per slot (0..n).</p>
|
||||
<h3 id="presence-matrix-initialisation">Presence matrix initialisation</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">impl</span><span class="w"> </span><span class="n">Layer</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">init_presence_matrix</span><span class="p">(</span><span class="n">layer_dir</span><span class="p">:</span><span class="w"> </span><span class="kp">&</span><span class="nc">Path</span><span class="p">,</span><span class="w"> </span><span class="n">n_kmers</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">OLMResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Called on the first merge of a Presence-mode index. Creates <code>presence/</code> with <code>meta.json {"n": n_kmers, "n_cols": 1}</code> and <code>col_000000.pbiv</code> set entirely to <code>true</code>. This retroactively records genome 0 (the original source) as present in every slot, satisfying the column-count invariant before any new-source column is appended.</p>
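<p>The interaction between presence initialisation and column append can be sketched in memory. <code>ToyBitMatrix</code> is a hypothetical model, not <code>PersistentBitMatrix</code>: it keeps columns in <code>Vec<bool></code> instead of writing <code>.pbiv</code> files and <code>meta.json</code>.</p>

```rust
/// Hypothetical in-memory model of a presence matrix: n slots, one Vec per column.
struct ToyBitMatrix {
    n: usize,
    cols: Vec<Vec<bool>>,
}

impl ToyBitMatrix {
    /// Models the initialisation step: a single column, all true, recording
    /// the original source as present in every slot.
    fn init_presence(n_kmers: usize) -> Self {
        ToyBitMatrix { n: n_kmers, cols: vec![vec![true; n_kmers]] }
    }

    fn n_cols(&self) -> usize {
        self.cols.len()
    }

    /// Models the append step: value_of is called once per slot (0..n) and the
    /// resulting column is added; existing columns are left untouched.
    fn append_column(&mut self, value_of: impl Fn(usize) -> bool) {
        let col: Vec<bool> = (0..self.n).map(value_of).collect();
        self.cols.push(col);
    }

    /// Presence of one slot across all genome columns.
    fn row(&self, slot: usize) -> Vec<bool> {
        self.cols.iter().map(|c| c[slot]).collect()
    }
}
```

<p>The slot numbering never changes between the two steps, which is why the sketch needs no hash function at all: appending a column only adds data indexed by the existing slots.</p>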
|
||||
<h3 id="why-the-mphf-is-never-rebuilt">Why the MPHF is never rebuilt</h3>
|
||||
<p>The MPHF, evidence, and unitigs are built once from the kmer set of a layer and are immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only appends a new data column indexed by the same slot numbers. The only disk writes are one new <code>.pciv</code>/<code>.pbiv</code> file and a single <code>meta.json</code> update.</p>
|
||||
<hr />
|
||||
<h2 id="add-layer-algorithm">Add-layer algorithm</h2>
|
||||
<p>When adding dataset B to an existing index:</p>
|
||||
<ol>
|
||||
<li>For each partition, probe existing layers for kmers of B routed to that partition.</li>
|
||||
<li>Collect kmers absent from all layers → <code>B \ index</code>.</li>
|
||||
<li>Write <code>B \ index</code> to a new <code>unitigs.bin</code> via <code>MphfLayer::unitig_writer</code>.</li>
|
||||
<li>Call <code>Layer<D>::build</code> on the new directory.</li>
|
||||
<li>Update <code>meta.json</code>.</li>
|
||||
<li>Write <code>B \ index</code> to a new <code>unitigs.bin</code> via <code>next_layer_writer()</code>.</li>
|
||||
<li>Call <code>Layer<D>::build</code> (or <code>build_presence</code>) on the new layer directory.</li>
|
||||
<li>Call <code>push_layer</code> (or <code>append_layer</code>) to register the new layer in <code>meta.json</code>.</li>
|
||||
</ol>
|
||||
<p>Each partition's new layer is built independently; the operation is fully parallel across partitions.</p>
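<p>The set-difference step of the algorithm above can be sketched for a single partition. Here each existing layer is modelled as a <code>HashSet<u64></code> of kmer encodings, and <code>new_layer_kmers</code> is a hypothetical name; the real implementation probes MPHF layers and writes unitigs rather than returning a vector.</p>

```rust
use std::collections::HashSet;

/// Returns the kmers of `incoming` that no existing layer contains,
/// deduplicated and sorted. This is the content of the next layer.
fn new_layer_kmers(layers: &[HashSet<u64>], incoming: &[u64]) -> Vec<u64> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut fresh: Vec<u64> = Vec::new();
    for &kmer in incoming {
        // Probe existing layers in order; stop at the first one that matches.
        let known = layers.iter().any(|layer| layer.contains(&kmer));
        // `insert` returns false for a kmer already collected in this pass.
        if !known && seen.insert(kmer) {
            fresh.push(kmer);
        }
    }
    fresh.sort_unstable();
    fresh
}
```

<p>Once the result is registered as a new layer, running the same input again yields an empty set, so layers stay disjoint by construction.</p>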
|
||||
<hr />
|
||||
@@ -1433,11 +1902,15 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>memmap2 0.9</code></td>
|
||||
<td>mmap of evidence and payload files</td>
|
||||
<td>mmap of evidence and fingerprint files</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>bitvec</code></td>
|
||||
<td>packed b-bit fingerprint storage</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>obiskio</code></td>
|
||||
<td>unitig file writer/reader</td>
|
||||
<td>unitig file writer/reader + <code>.idx</code> build</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>obicompactvec</code></td>
|
||||
@@ -1448,8 +1921,8 @@ bits [6:0] = rank (7 bits) — kmer index within the chunk (0-based)
|
||||
<td>parallel MPHF construction pass</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>ndarray 0.16</code></td>
|
||||
<td>aggregation output arrays</td>
|
||||
<td><code>serde / serde_json</code></td>
|
||||
<td><code>PartitionMeta</code> serialisation</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
@@ -662,6 +662,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#make_pipe-dsl" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
make_pipe! DSL
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
@@ -801,6 +812,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -879,6 +918,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1087,6 +1182,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#make_pipe-dsl" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
make_pipe! DSL
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
@@ -1145,7 +1251,7 @@
|
||||
|
||||
|
||||
<h1 id="obipipeline-parallel-pipeline-library">obipipeline — parallel pipeline library</h1>
|
||||
<p><code>obipipeline</code> is a generic, multi-threaded data pipeline crate. It connects a <strong>source</strong>, a chain of <strong>transforms</strong>, and a <strong>sink</strong> via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.</p>
|
||||
<p><code>obipipeline</code> is a generic, multi-threaded data pipeline crate. It connects a <strong>source</strong>, a chain of <strong>stages</strong>, and a <strong>sink</strong> via crossbeam channels, running each stage with a shared worker pool and a biased scheduler.</p>
|
||||
<h2 id="core-types">Core types</h2>
|
||||
<table>
|
||||
<thead>
|
||||
@@ -1158,22 +1264,33 @@
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>SourceFn<D></code></td>
|
||||
<td><code>Box<dyn FnMut() -> Result<D, PipelineError> + Send+Sync></code></td>
|
||||
<td><code>Box<dyn FnMut() -> Result<D, PipelineError> + Send></code></td>
|
||||
<td>Called repeatedly; <code>FnMut</code> because it holds iterator state</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>SharedFn<D></code></td>
|
||||
<td><code>Arc<dyn Fn(D) -> Result<D, PipelineError> + Send+Sync></code></td>
|
||||
<td>Shared across workers via <code>Arc::clone</code> (no copy of the closure)</td>
|
||||
<td><code>Arc<dyn Fn(D) -> Result<D, PipelineError> + Send + Sync></code></td>
|
||||
<td>1→1 transform shared across workers via <code>Arc::clone</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>SharedFlatFn<D></code></td>
|
||||
<td><code>Arc<dyn Fn(D, &Sender<Result<D, _>>, &Sender<isize>) + Send + Sync></code></td>
|
||||
<td>1→N transform; pushes items into channel, sends delta</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>SinkFn<D></code></td>
|
||||
<td><code>Box<dyn Fn(D) -> Result<(), PipelineError> + Send+Sync></code></td>
|
||||
<td><code>Box<dyn Fn(D) -> Result<(), PipelineError> + Send></code></td>
|
||||
<td>Final consumer; returns <code>Result</code> so errors propagate back</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p><code>Pipeline<D></code> holds one <code>SourceFn</code>, a <code>Vec<SharedFn></code>, and one <code>SinkFn</code>.<br />
|
||||
<p>Stages come in two variants:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="nc">Stage</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">Transform</span><span class="p">(</span><span class="n">SharedFn</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="p">),</span><span class="w"> </span><span class="c1">// 1→1</span>
|
||||
<span class="w"> </span><span class="n">Flat</span><span class="p">(</span><span class="n">SharedFlatFn</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="p">),</span><span class="w"> </span><span class="c1">// 1→N</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>Pipeline<D></code> holds one <code>SourceFn</code>, a <code>Vec<Stage></code>, and one <code>SinkFn</code>.<br />
|
||||
<code>WorkerPool<D></code> wraps a <code>Pipeline</code> with <code>n_workers</code> and channel <code>capacity</code>.</p>
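<p>The two stage variants can be modelled with the standard library alone. This is a sketch under several assumptions: <code>std::sync::mpsc</code> replaces crossbeam, <code>ToyError</code> replaces <code>PipelineError</code>, <code>run_stage</code>, <code>transform</code> and <code>expand_stage</code> are hypothetical helpers, and the delta is taken to mean "items produced minus the one consumed", which the table above does not spell out.</p>

```rust
use std::sync::mpsc::Sender;
use std::sync::Arc;

/// Stand-in for PipelineError (assumption: any error type would do here).
type ToyError = String;

type SharedFn<D> = Arc<dyn Fn(D) -> Result<D, ToyError> + Send + Sync>;
type SharedFlatFn<D> =
    Arc<dyn Fn(D, &Sender<Result<D, ToyError>>, &Sender<isize>) + Send + Sync>;

enum Stage<D> {
    Transform(SharedFn<D>), // 1→1
    Flat(SharedFlatFn<D>),  // 1→N
}

/// Hypothetical helper: wraps a plain closure as a 1→1 stage.
fn transform<D>(f: impl Fn(D) -> Result<D, ToyError> + Send + Sync + 'static) -> Stage<D> {
    Stage::Transform(Arc::new(f))
}

/// Hypothetical 1→N stage: expands n into 0..n and reports the change in
/// the number of in-flight items (assumed meaning of the delta).
fn expand_stage() -> Stage<u32> {
    Stage::Flat(Arc::new(
        |n: u32, out: &Sender<Result<u32, ToyError>>, delta: &Sender<isize>| {
            for i in 0..n {
                let _ = out.send(Ok(i));
            }
            let _ = delta.send(n as isize - 1);
        },
    ))
}

/// Hypothetical dispatcher: applies one stage to one item.
fn run_stage<D>(
    stage: &Stage<D>,
    item: D,
    out: &Sender<Result<D, ToyError>>,
    delta: &Sender<isize>,
) {
    match stage {
        // A 1→1 stage forwards exactly one result, so the item count is unchanged.
        Stage::Transform(f) => {
            let _ = out.send(f(item));
        }
        // A 1→N stage writes its own outputs and reports the delta itself.
        Stage::Flat(f) => f(item, out, delta),
    }
}
```

<p>Keeping both variants behind one enum lets a scheduler hold a single <code>Vec<Stage></code> and decide per stage whether item accounting needs adjusting.</p>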
|
||||
<h2 id="workerpool">WorkerPool</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="n">WorkerPool</span><span class="p">::</span><span class="n">new</span><span class="p">(</span><span class="n">pipeline</span><span class="p">:</span><span class="w"> </span><span class="nc">Pipeline</span><span class="o"><</span><span class="n">D</span><span class="o">></span><span class="p">,</span><span class="w"> </span><span class="n">n_workers</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">capacity</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">Self</span>
|
||||
@@ -1193,7 +1310,7 @@
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>capacity</code></td>
|
||||
<td>Bound on every crossbeam channel in the pipeline (source output, inter-stage channels, worker input, sink input, sink error). Controls memory and back-pressure: a full channel blocks the sender until a slot frees.</td>
|
||||
<td>Bound on every crossbeam channel in the pipeline. Controls memory and back-pressure: a full channel blocks the sender until a slot frees.</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
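<p>The back-pressure effect of <code>capacity</code> can be observed on any bounded channel. This sketch uses <code>std::sync::mpsc::sync_channel</code> as a stand-in for the crossbeam channels used by the crate; <code>fill_until_full</code> is a hypothetical helper.</p>

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes items into a bounded channel without ever receiving and returns
/// how many were accepted before the channel reported that it was full.
fn fill_until_full(capacity: usize) -> usize {
    // The receiver is kept alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = 0;
    loop {
        match tx.try_send(0) {
            Ok(()) => accepted += 1,
            // A blocking `send` would wait here until the consumer frees a slot.
            Err(TrySendError::Full(_)) => return accepted,
            Err(TrySendError::Disconnected(_)) => {
                unreachable!("receiver is still in scope")
            }
        }
    }
}
```

<p>The number of buffered items never exceeds the bound, so memory use per channel is limited to <code>capacity</code> items regardless of how fast the producer runs.</p>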
|
||||
@@ -1208,7 +1325,7 @@
|
||||
</code></pre></div>
|
||||
<p>Each variant carries the concrete type for one stage's output. The macros pattern-match on this enum to route values between stages.</p>
|
||||
<h2 id="macros">Macros</h2>
|
||||
<p>Six low-level macros build individual stages; one high-level macro (<code>make_pipeline!</code>) composes them.</p>
|
||||
<p>Eight low-level macros build individual stages; one high-level macro (<code>make_pipeline!</code>) composes them.</p>
|
||||
<h3 id="low-level">Low-level</h3>
|
||||
<div class="highlight"><pre><span></span><code><span class="n">make_source</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields T</span>
|
||||
<span class="n">make_source_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">iterator</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// iterator yields Result<T, E></span>
|
||||
@@ -1216,6 +1333,9 @@
|
||||
<span class="n">make_transform</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> U</span>
|
||||
<span class="n">make_transform_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> Result<U, E></span>
|
||||
|
||||
<span class="n">make_flat_transform</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> impl IntoIterator<Item=U></span>
|
||||
<span class="n">make_flat_transform_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">,</span><span class="w"> </span><span class="n">OutputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> Result<impl IntoIterator<Item=U>, E></span>
|
||||
|
||||
<span class="n">make_sink</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> ()</span>
|
||||
<span class="n">make_sink_fallible</span><span class="o">!</span><span class="p">(</span><span class="n">Enum</span><span class="p">,</span><span class="w"> </span><span class="n">func</span><span class="p">,</span><span class="w"> </span><span class="n">InputVariant</span><span class="p">)</span><span class="w"> </span><span class="c1">// func: T -> Result<(), E></span>
|
||||
</code></pre></div>
|
||||
@@ -1224,17 +1344,31 @@
|
||||
<div class="highlight"><pre><span></span><code>make_pipeline! {
|
||||
DataEnum,
|
||||
source iterator => OutputVariant, // or source? for fallible
|
||||
| func: In => Out, // non-fallible transform
|
||||
|? func: In => Out, // fallible transform
|
||||
| func: In => Out, // 1→1 non-fallible transform
|
||||
|? func: In => Out, // 1→1 fallible transform
|
||||
|| func: In => Out, // 1→N non-fallible flat transform
|
||||
||? func: In => Out, // 1→N fallible flat transform
|
||||
sink func @ InputVariant, // or sink? for fallible
|
||||
}
|
||||
</code></pre></div>
|
||||
<p><code>?</code> marks fallibility on source, individual transforms, or sink independently.<br />
|
||||
Implemented as a <strong>TT muncher</strong>: the internal rule <code>@build</code> recurses over transform tokens one at a time, accumulating them into a <code>vec![]</code>, then terminates on <code>sink</code>/<code>sink?</code>.</p>
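<p>The TT-muncher pattern named above can be illustrated with a toy macro. This is only a sketch under stated assumptions: <code>stage_names!</code> and its mini-grammar are hypothetical and much simpler than the real <code>make_pipeline!</code>; it only shows how an internal <code>@build</code> rule can consume one stage at a time, accumulate it, and terminate on <code>sink</code>.</p>

```rust
// Hypothetical toy DSL, NOT the real make_pipeline! grammar:
//   stage_names!(| parse, | count, sink write)
// It collects the stage names into a Vec and returns the sink name.
macro_rules! stage_names {
    // Terminal rule: `sink` ends the recursion and emits the accumulated list.
    (@build [$($acc:expr),*] sink $sink:ident) => {
        (vec![$($acc),*], stringify!($sink))
    };
    // Recursive rule: munch one `| name,` and append it to the accumulator.
    (@build [$($acc:expr),*] | $f:ident, $($rest:tt)*) => {
        stage_names!(@build [$($acc,)* stringify!($f)] $($rest)*)
    };
    // Entry point: listed last so it does not shadow the @build rules.
    ($($rest:tt)*) => {
        stage_names!(@build [] $($rest)*)
    };
}
```

<p>Each recursion step sees only the head of the remaining token stream, which is why the real macro can mix fallible and non-fallible stages freely: every stage form gets its own <code>@build</code> rule.</p>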
|
||||
<h3 id="make_pipe-dsl">make_pipe! DSL</h3>
|
||||
<p><code>make_pipe!</code> builds a sourceless/sinkless <code>Pipe<D, In, Out></code> — a reusable, composable stage sequence:</p>
|
||||
<div class="highlight"><pre><span></span><code>make_pipe! {
|
||||
DataEnum : InType => OutType,
|
||||
| func: InVariant => OutVariant,
|
||||
|? func: InVariant => OutVariant,
|
||||
|| func: InVariant => OutVariant,
|
||||
||? func: InVariant => OutVariant,
|
||||
}
|
||||
</code></pre></div>
|
||||
<p>Two pipes compose with <code>.then(other)</code>. Apply to an iterator with <code>.apply(iter, n_workers, capacity)</code> to get a <code>PipeIter<Out></code> — an iterator over the pipeline output, backed by a background <code>WorkerPool</code>. The scatter step in <code>obikmer</code> uses <code>make_pipe!</code> and <code>.apply()</code> rather than the full <code>make_pipeline!</code> / <code>WorkerPool</code> pattern.</p>
|
||||
<h2 id="scheduler-architecture">Scheduler architecture</h2>
|
||||
<div class="highlight"><pre><span></span><code>Source thread ──► [source_rx] ──► Scheduler ──► [worker_tx] ──► Workers (×N)
|
||||
▲ │
|
||||
[stage_rxs] ────────┘◄──────────────────────────────┘
|
||||
[flat_delta_rx] ──► Scheduler (in_flight adjustment)
|
||||
│
|
||||
[sink_err_rx] ← errors from sink (highest priority)
|
||||
│
|
||||
@@ -1242,20 +1376,20 @@ Implemented as a <strong>TT muncher</strong>: the internal rule <code>@build</co
|
||||
</code></pre></div>
|
||||
<p>The scheduler is a single thread running a biased <code>Select</code> over all input channels. Priority order (highest first):</p>
|
||||
<div class="highlight"><pre><span></span><code>index 0 sink_err_rx abort on sink error
|
||||
index 1 stage_rxs[N-1] drain last stage first
|
||||
...
|
||||
index N stage_rxs[0]
|
||||
index N+1 source_rx pull new data last
|
||||
index 1 flat_delta_rx adjust in_flight before dispatching
|
||||
index 2..=n+1 stage_rxs[n-1..0] drain last stage first
|
||||
index n+2 source_rx pull new data last
|
||||
</code></pre></div>
|
||||
<p>This back-pressure-friendly ordering ensures downstream stages are drained before new items enter the pipeline.</p>
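<p>As a sketch of the index layout in the newer priority table (sink errors, then the flat delta channel, then stages from last to first, then the source), the mapping from a select index to a channel can be written as a pure function. The names <code>Input</code> and <code>input_for_index</code> are illustrative assumptions, not identifiers from the scheduler.</p>

```rust
/// Which input channel a biased-select index refers to (illustrative type).
#[derive(Debug, PartialEq, Eq)]
enum Input {
    SinkErr,
    FlatDelta,
    /// Index into `stage_rxs`.
    Stage(usize),
    Source,
}

/// Maps a select index to its channel for a pipeline with `n` stages:
/// 0 = sink_err_rx, 1 = flat_delta_rx, 2..=n+1 = stage_rxs[n-1..0],
/// n+2 = source_rx. Returns `None` for an out-of-range index.
fn input_for_index(n: usize, index: usize) -> Option<Input> {
    match index {
        0 => Some(Input::SinkErr),
        1 => Some(Input::FlatDelta),
        // Index 2 is the last stage (n-1); index n+1 is stage 0.
        i if (2..=n + 1).contains(&i) => Some(Input::Stage(n + 1 - i)),
        i if i == n + 2 => Some(Input::Source),
        _ => None,
    }
}
```

<p>Because lower indices win in a biased select, the last stage is always polled before earlier stages, and the source is polled only when nothing downstream is ready.</p>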
|
||||
<p><strong>Workers</strong> are generic: each receives <code>(data, SharedFn, result_tx)</code> and calls <code>f(data)</code>, sending the result to the provided channel. The scheduler decides which transform to apply and where to route the result.</p>
|
||||
<p><strong>Termination</strong> uses an <code>in_flight</code> counter:</p>
|
||||
<p><strong>Workers</strong> are generic: each receives a <code>WorkerTask</code> — either <code>Transform(data, stage_idx)</code> or <code>Flat(data, stage_idx)</code>. For <code>Transform</code>, the worker calls <code>f(data)</code> and sends the result to <code>stage_txs[stage_idx]</code>. For <code>Flat</code>, the worker calls <code>f(data, &push_tx, &delta_tx)</code>: the closure pushes N items into <code>push_tx</code> then sends <code>N-1</code> to <code>delta_tx</code>. The scheduler uses the delta to adjust <code>in_flight</code> without knowing N in advance.</p>
|
||||
<p><strong>Termination</strong> uses an <code>in_flight: isize</code> counter and a <code>flat_workers_active: usize</code> counter:</p>
|
||||
<ul>
|
||||
<li>incremented when an item is dispatched from source to workers</li>
|
||||
<li>decremented when the item exits the last stage</li>
|
||||
<li>the loop exits only when <code>source_done && in_flight == 0</code></li>
|
||||
<li><code>in_flight</code> incremented when an item is dispatched from source to workers</li>
|
||||
<li><code>in_flight</code> decremented when the item exits the last stage to the sink</li>
|
||||
<li><code>flat_workers_active</code> incremented when a <code>Flat</code> task is dispatched, decremented when the delta arrives</li>
|
||||
<li>the loop exits only when <code>source_done && in_flight == 0 && flat_workers_active == 0</code></li>
|
||||
</ul>
|
||||
<p>This guarantees all in-flight items complete before <code>join()</code>.</p>
|
||||
<p>This guarantees all in-flight items complete (including all N outputs of a flat stage) before <code>join()</code>.</p>
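<p>The bookkeeping described above can be modelled in isolation. The following is a minimal sketch, assuming a hypothetical <code>Termination</code> struct and method names; only the two counters, the <code>N-1</code> delta, and the exit condition come from the text. It also suggests why <code>in_flight</code> is signed: a flat stage that yields zero items reports a delta of <code>-1</code>.</p>

```rust
/// Illustrative model of the scheduler's termination state (hypothetical type).
#[derive(Debug, Default)]
struct Termination {
    source_done: bool,
    in_flight: isize,
    flat_workers_active: usize,
}

impl Termination {
    /// An item leaves the source and is handed to the workers.
    fn dispatch_from_source(&mut self) {
        self.in_flight += 1;
    }

    /// A `Flat` task is handed to a worker; its output count is not yet known.
    fn dispatch_flat(&mut self) {
        self.flat_workers_active += 1;
    }

    /// The worker reports that one input item became `n_outputs` items,
    /// i.e. a delta of N-1 (negative when N == 0).
    fn flat_delta(&mut self, n_outputs: usize) {
        self.in_flight += n_outputs as isize - 1;
        self.flat_workers_active -= 1;
    }

    /// An item leaves the last stage and goes to the sink.
    fn exit_to_sink(&mut self) {
        self.in_flight -= 1;
    }

    /// Loop exit condition quoted in the list above.
    fn finished(&self) -> bool {
        self.source_done && self.in_flight == 0 && self.flat_workers_active == 0
    }
}
```

<p>Without <code>flat_workers_active</code>, a flat stage whose single input is still being expanded could be mistaken for a drained pipeline between the moment its input is consumed and the moment its delta arrives.</p>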
|
||||
<h2 id="error-handling">Error handling</h2>
|
||||
<p><code>PipelineError</code> has four variants:</p>
|
||||
<table>
|
||||
@@ -1279,7 +1413,7 @@ index N+1 source_rx pull new data last
|
||||
<td>Internal routing error</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>StepError(Box<dyn Error>)</code></td>
|
||||
<td><code>StepError(Box<dyn Error + Send + Sync>)</code></td>
|
||||
<td>Error from user code (wrapped by <code>make_*_fallible!</code>)</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
|
||||
@@ -12,7 +12,7 @@
|
||||
<link rel="prev" href="../persistent_compact_int_vec/">
|
||||
|
||||
|
||||
<link rel="next" href="../../architecture/sequences/invariant/">
|
||||
<link rel="next" href="../merge/">
|
||||
|
||||
|
||||
|
||||
@@ -649,6 +649,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -1002,6 +1030,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -649,6 +649,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -985,6 +1013,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -773,6 +773,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -851,6 +879,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1104,7 +1188,9 @@
|
||||
<li><strong>error valley</strong> → suggests min_count (typically the local minimum between the error peak and the coverage peak)</li>
|
||||
</ul>
|
||||
<h2 id="phase-1-scatter">Phase 1 — Scatter</h2>
|
||||
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored. For each read:</p>
|
||||
<p>Single streaming pass over raw input files (FASTA/FASTQ, gzip). FASTQ quality scores are ignored.</p>
|
||||
<p>Input files are read via <code>open_nuc_stream</code>, which opens and decompresses the file, auto-detects the format (FASTA / FASTQ / GenBank), and yields a sequence of <code>NucPage</code> buffers. Each <code>NucPage</code> is a flat 64 KB buffer of normalised bytes (<code>ACGT</code> + <code>\x00</code> separators), carrying a k−1 byte overlap from the preceding page so that no k-mer is lost at page boundaries. Per-record identity (sequence id, raw bytes) is not preserved; this is intentional — the scatter phase only needs normalised bases to produce superkmers.</p>
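<p>The k−1 overlap can be checked with a small model. This is a simplified sketch, not the <code>open_nuc_stream</code> implementation: <code>pages_with_overlap</code> and <code>kmer_count</code> are hypothetical helpers that split a single separator-free sequence into fixed-size pages, each starting with the last k−1 bytes of the previous one.</p>

```rust
/// Splits `seq` into pages of at most `page_size` bytes, where every page
/// after the first begins with the last k-1 bytes of the preceding page.
/// Simplified model: one sequence, no `\x00` separators.
fn pages_with_overlap(seq: &[u8], page_size: usize, k: usize) -> Vec<&[u8]> {
    assert!(k >= 1 && page_size >= k, "a page must hold at least one k-mer");
    let mut pages = Vec::new();
    let mut start = 0;
    loop {
        let end = (start + page_size).min(seq.len());
        pages.push(&seq[start..end]);
        if end == seq.len() {
            break;
        }
        // Step back k-1 bytes so the k-mers straddling the boundary are kept.
        start = end - (k - 1);
    }
    pages
}

/// Number of k-mers contained in a single page.
fn kmer_count(page: &[u8], k: usize) -> usize {
    page.len().checked_sub(k).map_or(0, |d| d + 1)
}
```

<p>With an overlap of exactly k−1 bytes, every k-mer of the input starts in exactly one page, so summing the per-page k-mer counts gives the count for the whole sequence, with nothing lost and nothing counted twice.</p>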
|
||||
<p>For each read fragment within a page:</p>
|
||||
<ol>
|
||||
<li><strong>Ambiguous base filter</strong>: cut at any non-ACGT base; discard fragments shorter than k.</li>
|
||||
<li><strong>Entropy filter</strong>: scan each fragment with a sliding window of size k. When the kmer <span class="arithmatex">\(K_i = S[i \mathinner{..} i+k-1]\)</span> ended by nucleotide <span class="arithmatex">\(S[j]\)</span> (with <span class="arithmatex">\(j = i+k-1\)</span>) has entropy below threshold <span class="arithmatex">\(\theta\)</span>, emit the current segment and start a new one (see algorithm below). <span class="arithmatex">\(K_i\)</span> belongs to neither segment, and no valid kmer is lost.</li>
|
||||
@@ -1154,8 +1240,13 @@ B ≈ 100 is tunable; RAM needed ≈ partition_size / B.</p>
|
||||
for each kmer in sequence:
|
||||
kmer_counts[canonical(kmer)] += COUNT
|
||||
</code></pre></div>
|
||||
<p>Implemented as an external sort or a temporary HashMap, depending on partition size. At the end of this phase, each distinct canonical kmer has its exact total count.</p>
|
||||
<p>Abundance filter applied here: kmers with <code>total_count < q</code> are discarded. <code>q</code> is a collection parameter (0 = keep all, including singletons for ≤1x data).</p>
|
||||
<p>Implemented as a three-step pipeline in <code>count_partition()</code>:</p>
|
||||
<ol>
|
||||
<li><strong>External sort</strong> (<code>kmer_sort::sort_unique_kmers</code>): read dereplicated superkmers, extract canonical kmer raw <code>u64</code> values, sort in RAM-bounded chunks (adaptive: 40% of available RAM ÷ n_threads, min 1 M kmers/chunk), k-way merge with inline dedup → <code>sorted_unique.bin</code>. The number of distinct canonical kmers (f0) is then known exactly.</li>
|
||||
<li><strong>Provisional MPHF</strong> (ptr_hash): built from <code>sorted_unique.bin</code> via <code>new_from_par_iter(f0, ...)</code>. Stored to <code>mphf1.bin</code>; <code>sorted_unique.bin</code> deleted immediately.</li>
|
||||
<li><strong>Accumulation pass</strong>: re-read dereplicated superkmers; for each kmer, <code>slot = mphf.index(kmer.raw())</code>, increment <code>counts1[slot]</code> by the superkmer COUNT. Stored in a <code>PersistentCompactIntVec</code> (<code>counts1.bin</code>).</li>
|
||||
</ol>
|
||||
<p>At the end of this phase, each distinct canonical kmer has its exact total count, and the frequency spectrum (<code>spectrums/{label}.json</code>) is written to the index root.</p>
|
||||
<p>No pre-filter on super-kmer COUNT is possible at phase 2: a super-kmer with COUNT=1 may contain only high-abundance kmers, each present in many other super-kmers across the partition.</p>
|
||||
<h2 id="phase-4-super-kmer-compaction">Phase 4 — Super-kmer compaction</h2>
|
||||
<p>The valid kmer set from phase 3 is used as a mask to rewrite the super-kmer files:</p>
|
||||
@@ -1188,14 +1279,52 @@ branching / dead-end → unitig start or end
|
||||
<p>Output: <code>unitigs.bin</code> — the permanent evidence structure for the partition. Each kmer in the partition appears at exactly one (unitig_id, offset) location.</p>
|
||||
<p><strong>Scope of local unitigs:</strong> these are unitigs of the partition's local de Bruijn graph, not global unitigs. A kmer whose k-1 successor or predecessor falls in another partition appears as a dead end locally and terminates the unitig. This does not affect correctness of verification but means partition-local unitigs cannot be directly reused for global assembly.</p>
|
||||
<h2 id="phase-6-mphf-construction-and-index-finalisation">Phase 6 — MPHF construction and index finalisation</h2>
|
||||
<p>Built once on the definitive kmer set (all kmers in all unitigs of the partition). See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for the current implementation.</p>
|
||||
<div class="highlight"><pre><span></span><code>kmers from unitigs → MPHF → mphf.bin
|
||||
→ evidence.bin : n × u32, each = (chunk_id: 25 bits | rank: 7 bits)
|
||||
→ payload : counts/ (mode 2) or presence/ (mode 3)
|
||||
<p><code>build_index_layer</code> is called per partition (in parallel via <code>build_layers</code>) with the following parameters sourced from <code>IndexConfig</code>:</p>
|
||||
<ul>
|
||||
<li><code>block_bits</code> — from <code>IndexConfig::block_bits</code>; controls the <code>.idx</code> block size (2^block_bits unitig chunks per block) for exact evidence</li>
|
||||
<li><code>evidence</code> — <code>EvidenceKind::Exact</code> or <code>EvidenceKind::Approx { b, z }</code>; propagated unchanged from <code>IndexConfig::evidence</code></li>
|
||||
<li><code>min_ab</code> / <code>max_ab</code> — abundance bounds applied before graph construction</li>
|
||||
<li><code>with_counts</code> — whether to store kmer counts alongside set membership</li>
|
||||
</ul>
|
||||
<p><strong>Abundance filtering:</strong> when <code>min_ab > 1</code> or <code>max_ab.is_some()</code>, the provisional <code>mphf1.bin</code> and <code>counts1.bin</code> produced in phase 3 are memory-mapped. Each canonical kmer is accepted only if its count in <code>counts1</code> satisfies the bounds. If either file is absent, filtering is skipped (all kmers accepted).</p>
|
||||
<div class="highlight"><pre><span></span><code>for each kmer in dereplicated super-kmer:
|
||||
ab = counts1[mphf1.index(kmer.raw())]
|
||||
if ab < min_ab || ab > max_ab: skip
|
||||
graph.push(kmer)
|
||||
</code></pre></div>
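<p>The acceptance test in the pseudocode above can be sketched as a pure predicate. The function names and the <code>u32</code> count type are assumptions made for illustration; only the bounds logic (<code>min_ab</code>, optional <code>max_ab</code>) and the condition for enabling the filter come from the text.</p>

```rust
/// Returns true when abundance filtering has any effect, i.e. when
/// `min_ab > 1` or an upper bound is set.
fn filtering_needed(min_ab: u32, max_ab: Option<u32>) -> bool {
    min_ab > 1 || max_ab.is_some()
}

/// Returns true when a kmer with total count `ab` satisfies the bounds.
/// A missing `max_ab` means "no upper bound".
fn within_bounds(ab: u32, min_ab: u32, max_ab: Option<u32>) -> bool {
    ab >= min_ab && max_ab.map_or(true, |max| ab <= max)
}
```

<p>Treating <code>max_ab</code> as an <code>Option</code> keeps the common "no upper bound" case free of sentinel values.</p>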
|
||||
<p>The MPHF is built in two passes over <code>unitigs.bin</code>: parallel pass for <code>mphf.bin</code>, sequential pass for <code>evidence.bin</code> and payload. The exact kmer count is available from the unitig index (<code>unitigs.bin.idx</code>) before the passes begin.</p>
|
||||
<p><strong>Exact verification via unitig evidence:</strong></p>
|
||||
<p><code>unitigs.bin</code> serves as the evidence structure. The MPHF maps every input to <code>[0, N)</code> including absent kmers — the unitig read-back (via <code>evidence.bin</code>) is the only correct membership test.</p>
|
||||
<p><strong>Graph build and unitig write:</strong></p>
|
||||
<p>The surviving kmers are fed into <code>GraphDeBruijn</code>, which computes degrees and yields unitigs. Unitigs are written to <code>layer_0/unitigs.bin</code> via a <code>UnitigFileWriter</code>.</p>
|
||||
<p><strong>MPHF and evidence build:</strong></p>
|
||||
<p><code>Layer::build</code> (membership-only) or <code>Layer::<PersistentCompactIntMatrix>::build</code> (with counts) is called next. Internally, <code>MphfLayer::build</code> performs two passes:</p>
|
||||
<ol>
|
||||
<li><strong>Pass 1 (parallel):</strong> build <code>unitigs.bin.idx</code> (block size = 2^<code>block_bits</code>) then construct the MPHF from all canonical kmers in <code>unitigs.bin</code>; store to <code>mphf.bin</code>.</li>
|
||||
<li><strong>Pass 2 (sequential):</strong> for each kmer in <code>unitigs.bin</code>, compute its slot and write <code>evidence.bin</code> (<code>chunk_id: 25 bits | rank: 7 bits</code> packed into a <code>u32</code>); also invoke the payload callback (<code>fill_slot</code>) to populate <code>counts/</code> if <code>with_counts</code>.</li>
|
||||
</ol>
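<p>The <code>evidence.bin</code> entry format from pass 2 can be sketched as a pair of pack/unpack helpers. The bit order is an assumption: this sketch places <code>chunk_id</code> in the high 25 bits and <code>rank</code> in the low 7 bits, and the function names are hypothetical.</p>

```rust
const RANK_BITS: u32 = 7;
const RANK_MASK: u32 = (1 << RANK_BITS) - 1; // 0x7F
const MAX_CHUNK_ID: u32 = (1 << 25) - 1;

/// Packs (chunk_id, rank) into one u32, or returns None if either field
/// does not fit its bit width. Assumed layout: chunk_id high, rank low.
fn pack_evidence(chunk_id: u32, rank: u32) -> Option<u32> {
    if chunk_id > MAX_CHUNK_ID || rank > RANK_MASK {
        return None;
    }
    Some((chunk_id << RANK_BITS) | rank)
}

/// Inverse of `pack_evidence`: returns (chunk_id, rank).
fn unpack_evidence(word: u32) -> (u32, u32) {
    (word >> RANK_BITS, word & RANK_MASK)
}
```

<p>Under this layout a 7-bit rank addresses at most 128 kmer positions per chunk, and 25 bits allow about 33.5 million chunks per layer.</p>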
|
||||
<p>After <code>Layer::build</code> completes, <code>layer_meta.json</code> records <code>EvidenceKind::Exact</code>.</p>
|
||||
<p><strong>Approximate evidence override:</strong></p>
|
||||
<p>If <code>evidence</code> is <code>EvidenceKind::Approx { b, z }</code>, <code>build_approx_evidence</code> is called immediately after <code>Layer::build</code>. It writes <code>fingerprint.bin</code> (a b-bit hash per slot), which supersedes the exact evidence at query time, and rewrites <code>layer_meta.json</code> with <code>EvidenceKind::Approx { b, z }</code>. No <code>.idx</code> file is needed at query time in this mode.</p>
|
||||
<div class="highlight"><pre><span></span><code>// Exact path → evidence.bin + unitigs.bin.idx + layer_meta.json(Exact)
|
||||
// Approx path → fingerprint.bin + layer_meta.json(Approx{b,z})
|
||||
// (evidence.bin left on disk but not used)
|
||||
</code></pre></div>
|
||||
<p><strong>Partition metadata:</strong></p>
|
||||
<p>After all layer files are written, <code>PartitionMeta { n_layers: 1 }</code> is serialised to <code>index/meta.json</code> inside the partition directory. This file is required by <code>LayeredMap::open</code> for subsequent merge operations.</p>
|
||||
<p><strong>File layout per partition after phase 6:</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>part_XXXXX/
|
||||
index/
|
||||
meta.json ← PartitionMeta { n_layers: 1 }
|
||||
layer_0/
|
||||
unitigs.bin ← permanent evidence (all modes)
|
||||
unitigs.bin.idx ← block index (exact mode only)
|
||||
mphf.bin ← MPHF
|
||||
evidence.bin ← exact evidence (exact mode)
|
||||
fingerprint.bin ← b-bit fingerprints (approx mode)
|
||||
layer_meta.json ← EvidenceKind tag
|
||||
counts/ ← PersistentCompactIntMatrix (with_counts only)
|
||||
</code></pre></div>
|
||||
<p><strong>Cleanup:</strong> unless <code>--keep-intermediate</code> is set, <code>remove_build_artifacts</code> deletes <code>dereplicated.skmer.zst</code>, <code>mphf1.bin</code>, and <code>counts1.bin</code> after all partitions are indexed.</p>
|
||||
<p>See <a href="../obilayeredmap/">obilayeredmap</a> and <a href="../mphf/">MPHF selection</a> for data structure details.</p>
|
||||
<p><strong>Query path (exact evidence):</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>query kmer q
|
||||
→ canonical_minimizer(q) → hash → PART → part_XXXXX/
|
||||
→ MPHF(q) → slot s
|
||||
@@ -1204,7 +1333,13 @@ branching / dead-end → unitig start or end
|
||||
→ match : return payload[s] ← exact hit
|
||||
→ no match: kmer absent ← MPHF collision on absent kmer
|
||||
</code></pre></div>
|
||||
<p><code>superkmers.bin.gz</code> is no longer needed at this point and can be deleted.</p>
|
||||
<p><strong>Query path (approximate evidence):</strong></p>
|
||||
<div class="highlight"><pre><span></span><code>query kmer q
|
||||
→ MPHF(q) → slot s
|
||||
→ fingerprint[s] matches seq_hash(q)?
|
||||
→ yes : probable hit (FP rate = 1/2^b per kmer, 1/2^(b·z) per z-window)
|
||||
→ no : kmer absent
|
||||
</code></pre></div>
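<p>The false-positive rates quoted in the approximate query path can be evaluated numerically. This sketch only restates the two formulas from the block above (<code>1/2^b</code> per kmer, <code>1/2^(b·z)</code> per window of z kmers); the function names are illustrative.</p>

```rust
/// False-positive probability for one absent kmer with b-bit fingerprints.
fn fp_rate_per_kmer(b: u32) -> f64 {
    0.5f64.powi(b as i32)
}

/// False-positive probability for a window of z kmers, using the
/// 1/2^(b*z) formula from the query path.
fn fp_rate_per_window(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

<p>For example, <code>b = 8</code> gives a per-kmer rate of 1/256, and requiring <code>z = 2</code> consecutive matches lowers it to 1/65536.</p>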
|
||||
<div class="footnote">
|
||||
<hr />
|
||||
<ol>
|
||||
|
||||
@@ -64,7 +64,7 @@
|
||||
<div data-md-component="skip">
|
||||
|
||||
|
||||
<a href="#on-disk-collection-structure" class="md-skip">
|
||||
<a href="#on-disk-index-layout" class="md-skip">
|
||||
Skip to content
|
||||
</a>
|
||||
|
||||
@@ -575,6 +575,24 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__link md-nav__link--active" for="__toc">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
On-disk storage
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
</label>
|
||||
|
||||
<a href="./" class="md-nav__link md-nav__link--active">
|
||||
|
||||
|
||||
@@ -592,6 +610,174 @@
|
||||
|
||||
</a>
|
||||
|
||||
|
||||
|
||||
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#directory-tree" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Directory tree
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#state-machine-sentinels" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
State machine (sentinels)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexmeta-indexmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
index.meta (IndexMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-files" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer files
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer files">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin.idx (Exact only)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphfbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
mphf.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer_metajson-layermeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
layer_meta.json (LayerMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidencebin-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
evidence.bin (Exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintbin-approx" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
fingerprint.bin (Approx)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
counts/ (PersistentCompactIntMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-persistentbitmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
presence/ (PersistentBitMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#metajson-partitionmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
meta.json (PartitionMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -659,6 +845,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -737,6 +951,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -874,6 +1144,163 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#directory-tree" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Directory tree
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#state-machine-sentinels" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
State machine (sentinels)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexmeta-indexmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
index.meta (IndexMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer-files" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Layer files
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Layer files">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#unitigsbinidx-exact-only" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
unitigs.bin.idx (Exact only)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphfbin" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
mphf.bin
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#layer_metajson-layermeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
layer_meta.json (LayerMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#evidencebin-exact" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
evidence.bin (Exact)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#fingerprintbin-approx" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
fingerprint.bin (Approx)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#counts-persistentcompactintmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
counts/ (PersistentCompactIntMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#presence-persistentbitmatrix" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
presence/ (PersistentBitMatrix)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#metajson-partitionmeta" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
meta.json (PartitionMeta)
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
</div>
|
||||
</div>
|
||||
@@ -889,9 +1316,131 @@
|
||||
|
||||
|
||||
|
||||
<h1 id="on-disk-collection-structure">On-disk collection structure</h1>
|
||||
<p>See <a href="../obilayeredmap/">obilayeredmap crate</a> for the current on-disk layout.</p>
|
||||
<p>The index root contains one <code>part_XXXXX/</code> directory per partition, each holding one or more <code>layer_N/</code> directories. Each layer directory contains <code>mphf.bin</code>, <code>unitigs.bin</code>, <code>unitigs.bin.idx</code>, <code>evidence.bin</code>, and optionally a <code>counts/</code> or <code>presence/</code> payload directory.</p>
|
||||
<h1 id="on-disk-index-layout">On-disk index layout</h1>
|
||||
<h2 id="directory-tree">Directory tree</h2>
|
||||
<div class="highlight"><pre><span></span><code><index_root>/
|
||||
index.meta ← JSON: IndexMeta
|
||||
scatter.done ← sentinel: scatter phase complete
|
||||
count.done ← sentinel: dereplicate + count complete
|
||||
index.done ← sentinel: MPHF index fully built
|
||||
spectrums/
|
||||
<label>.json ← kmer frequency spectrum per genome
|
||||
partitions/
|
||||
part_00000/ ← one dir per partition (zero-padded 5 digits, 0..2^n_bits−1)
|
||||
index/
|
||||
meta.json ← PartitionMeta { n_layers }
|
||||
layer_0/
|
||||
unitigs.bin ← binary unitig sequences (2-bit packed)
|
||||
unitigs.bin.idx ← block-sampled offset index (exact evidence only)
|
||||
mphf.bin ← serialised PtrHash MPHF
|
||||
layer_meta.json ← LayerMeta { evidence: EvidenceKind }
|
||||
evidence.bin ← chunk_id:rank per MPHF slot (Exact only)
|
||||
fingerprint.bin ← b-bit fingerprints per MPHF slot (Approx only)
|
||||
counts/ ← PersistentCompactIntMatrix (if with_counts=true)
|
||||
presence/ ← PersistentBitMatrix (if presence mode, merge)
|
||||
layer_1/ ← added by merge; same structure as layer_0
|
||||
layer_2/ …
|
||||
part_00001/ …
|
||||
</code></pre></div>
|
||||
<h2 id="state-machine-sentinels">State machine (sentinels)</h2>
|
||||
<p>Each sentinel is created atomically at the end of its pipeline stage.
A partial run (e.g. an interrupted scatter) leaves no sentinel for that stage; the
state is detected as the most advanced sentinel present.</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>State</th>
|
||||
<th>Sentinel present</th>
|
||||
<th>Meaning</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>Empty</code></td>
|
||||
<td>—</td>
|
||||
<td><code>index.meta</code> exists; scatter not started or interrupted</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>Scattered</code></td>
|
||||
<td><code>scatter.done</code></td>
|
||||
<td>All super-kmers routed to partition files</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>Counted</code></td>
|
||||
<td><code>count.done</code></td>
|
||||
<td>Partitions dereplicated; <code>spectrums/</code> written</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>Indexed</code></td>
|
||||
<td><code>index.done</code></td>
|
||||
<td>All MPHF layers built; index ready for queries</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
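<p>A minimal sketch of how the state in the table could be derived from the sentinel files. The <code>IndexState</code> enum and <code>detect_state</code> function are hypothetical names used for illustration, and the sketch assumes that sentinels from earlier stages stay in place, so the most advanced sentinel decides the state:</p>

```rust
/// Pipeline states from the table above (hypothetical enum, for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum IndexState {
    Empty,
    Scattered,
    Counted,
    Indexed,
}

/// `exists` reports whether a given file name is present in the index root.
/// Sentinels are checked from the most advanced stage down, so an index
/// that holds several sentinels reports the furthest completed stage.
fn detect_state(exists: impl Fn(&str) -> bool) -> IndexState {
    if exists("index.done") {
        IndexState::Indexed
    } else if exists("count.done") {
        IndexState::Counted
    } else if exists("scatter.done") {
        IndexState::Scattered
    } else {
        IndexState::Empty
    }
}
```

<p>Passing a closure keeps the sketch independent of the file system; a real caller would typically supply something like <code>|name| root.join(name).exists()</code>.</p>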
|
||||
<h2 id="indexmeta-indexmeta">index.meta (IndexMeta)</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span>
|
||||
<span class="w"> </span><span class="nt">"version"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"config"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="nt">"kmer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">31</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"minimizer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">11</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"n_bits"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"with_counts"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Exact"</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="nt">"block_bits"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span>
|
||||
<span class="w"> </span><span class="p">},</span>
|
||||
<span class="w"> </span><span class="nt">"genomes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
|
||||
<span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"label"</span><span class="p">:</span><span class="w"> </span><span class="s2">"genome_A"</span><span class="p">,</span><span class="w"> </span><span class="nt">"meta"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"species"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Homo sapiens"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="w"> </span><span class="p">]</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>n_bits</code> determines the partition count: <code>2^n_bits</code> directories under <code>partitions/</code>.</p>
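<p>As an illustration, the partition directory names implied by <code>n_bits</code> can be enumerated as below. The function name <code>partition_dirs</code> is hypothetical, and the upper bound of 16 bits is an assumption of this sketch (it keeps every index within five digits), not a documented limit:</p>

```rust
/// Lists the `part_XXXXX` directory names for a given `n_bits`.
/// Returns None above 16 bits, where indices would no longer fit in the
/// 5-digit zero-padded name (a guard chosen for this sketch only).
fn partition_dirs(n_bits: u32) -> Option<Vec<String>> {
    if n_bits > 16 {
        return None;
    }
    let count = 1u32 << n_bits; // 2^n_bits partitions
    Some((0..count).map(|i| format!("part_{:05}", i)).collect())
}
```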
|
||||
<p><code>evidence</code> is either the string <code>"Exact"</code> or <code>{"Approx": {"b": 8, "z": 1}}</code>.</p>
|
||||
<p><code>block_bits</code> controls the <code>.idx</code> granularity: one offset entry every <code>2^block_bits</code>
|
||||
chunks. <code>block_bits=0</code> stores one entry per chunk (O(1) random access, largest <code>.idx</code>).</p>
|
||||
<p><code>GenomeInfo.meta</code> is a free-form string→string map for categorical metadata (e.g.
|
||||
taxonomy, sample origin). It is optional; defaults to empty.</p>
|
||||
<h2 id="layer-files">Layer files</h2>
|
||||
<h3 id="unitigsbin">unitigs.bin</h3>
|
||||
<p>2-bit packed binary unitig sequences. Each record: 1 byte <code>seql_minus_k</code>
|
||||
(nucleotide length − k), followed by <code>ceil((seql_minus_k + k) / 4)</code> bytes of
|
||||
packed sequence. Long unitigs are transparently split into overlapping chunks
|
||||
(k−1 nucleotide overlap) so no k-mer crosses a chunk boundary.</p>
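<p>The record size and the chunk splitting described above can be sketched as follows. Both function names are hypothetical. The limit of 256 k-mers per chunk is passed in as a parameter; it follows from the one-byte <code>seql_minus_k</code> field (0 to 255) but is an assumption about the writer, not something this page states for unitigs:</p>

```rust
/// Size in bytes of one record: 1 length byte plus the 2-bit packed sequence.
/// Returns None if the sequence is shorter than k or too long for the u8 field.
fn record_len(seql: usize, k: usize) -> Option<usize> {
    let seql_minus_k = seql.checked_sub(k)?;
    if seql_minus_k > u8::MAX as usize {
        return None;
    }
    Some(1 + (seql + 3) / 4) // ceil(seql / 4) packed bytes
}

/// Splits a unitig of `seql` nucleotides into (start, length) spans.
/// Each span holds at most `max_kmers` k-mers, and consecutive spans
/// overlap by k-1 nucleotides so that every k-mer lies in exactly one span.
fn chunk_spans(seql: usize, k: usize, max_kmers: usize) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    if k == 0 || seql < k || max_kmers == 0 {
        return spans;
    }
    let n_kmers = seql - k + 1;
    let mut first = 0; // index of the first k-mer of the current chunk
    while first < n_kmers {
        let n = max_kmers.min(n_kmers - first);
        spans.push((first, n + k - 1)); // n k-mers span n + k - 1 nucleotides
        first += n;
    }
    spans
}
```

<p>For example, with k = 31 a 300-nucleotide unitig holds 270 k-mers and splits into a 286-nucleotide chunk (256 k-mers) starting at 0 and a 44-nucleotide chunk (14 k-mers) starting at 256; the two share nucleotides 256 to 285, i.e. k−1 = 30 positions.</p>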
|
||||
<h3 id="unitigsbinidx-exact-only">unitigs.bin.idx (Exact only)</h3>
|
||||
<p>Magic <code>UIX3</code>, little-endian header: <code>block_bits</code> (u32), <code>n_unitigs</code> (u32),
|
||||
<code>n_kmers</code> (u64), then <code>ceil(n_unitigs / 2^block_bits) + 1</code> byte-offset entries
|
||||
(u32 each, last entry is a sentinel past-end offset). Absent for Approx layers.</p>
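<p>A hedged sketch of reading that header. It assumes the four magic bytes come first and that the three fields follow immediately in the listed order with no padding; <code>IdxHeader</code>, <code>parse_idx_header</code> and <code>n_offsets</code> are illustrative names, not the crate's API:</p>

```rust
/// Header fields of `unitigs.bin.idx` as described above (illustrative struct).
#[derive(Debug, PartialEq)]
struct IdxHeader {
    block_bits: u32,
    n_unitigs: u32,
    n_kmers: u64,
}

impl IdxHeader {
    /// Number of u32 offset entries: ceil(n_unitigs / 2^block_bits) + 1,
    /// the extra entry being the past-end sentinel.
    fn n_offsets(&self) -> u64 {
        let block = 1u64 << self.block_bits;
        (self.n_unitigs as u64 + block - 1) / block + 1
    }
}

fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Parses the 20-byte prefix: magic (4) + block_bits (4) + n_unitigs (4) + n_kmers (8).
/// Assumed layout; returns None on a short buffer, wrong magic, or block_bits >= 32.
fn parse_idx_header(buf: &[u8]) -> Option<IdxHeader> {
    if buf.len() < 20 || &buf[0..4] != b"UIX3" {
        return None;
    }
    let block_bits = le_u32(&buf[4..8]);
    if block_bits >= 32 {
        return None; // guard chosen for this sketch, keeps the shift well-defined
    }
    Some(IdxHeader {
        block_bits,
        n_unitigs: le_u32(&buf[8..12]),
        n_kmers: le_u64(&buf[12..20]),
    })
}
```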
|
||||
<h3 id="mphfbin">mphf.bin</h3>
|
||||
<p>PtrHash MPHF serialised with epserde. Maps canonical kmer (u64, left-aligned
|
||||
2-bit) to a slot index in <code>[0, n_kmers)</code>.</p>
|
||||
<h3 id="layer_metajson-layermeta">layer_meta.json (LayerMeta)</h3>
|
||||
<p><div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"exact"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div>
|
||||
or
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"approx"</span><span class="p">,</span><span class="w"> </span><span class="nt">"b"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w"> </span><span class="nt">"z"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div></p>
|
||||
<h3 id="evidencebin-exact">evidence.bin (Exact)</h3>
|
||||
<p>One <code>(chunk_id: u32, rank: u8)</code> record per MPHF slot, packed. Used to verify
|
||||
that the kmer mapped to a slot is actually present: <code>unitigs.bin[chunk_id][rank]</code>
|
||||
is re-read and compared against the query.</p>
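<p>The verification step matters because a minimal perfect hash function maps any key, including a k-mer that was never indexed, to some slot. The sketch below shows the logic only, with in-memory stand-ins: <code>slot_of</code> plays the role of the MPHF and <code>kmer_at</code> the role of re-reading <code>unitigs.bin</code>. All names and signatures are hypothetical:</p>

```rust
/// One evidence record per MPHF slot (illustrative in-memory form).
struct Evidence {
    chunk_id: u32,
    rank: u8,
}

/// Returns true only if the k-mer stored at the slot's evidence location
/// equals the query. `slot_of` stands in for the MPHF, `kmer_at` for the
/// re-read of chunk `chunk_id` at k-mer position `rank`.
fn verified_lookup<S, R>(query: u64, slot_of: S, evidence: &[Evidence], kmer_at: R) -> bool
where
    S: Fn(u64) -> usize,
    R: Fn(u32, u8) -> Option<u64>,
{
    let slot = slot_of(query);
    match evidence.get(slot) {
        Some(ev) => kmer_at(ev.chunk_id, ev.rank) == Some(query),
        None => false, // slot outside the table: treat as absent
    }
}
```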
|
||||
<h3 id="fingerprintbin-approx">fingerprint.bin (Approx)</h3>
|
||||
<p><code>b</code>-bit fingerprint per MPHF slot derived from the kmer's sequence hash.
|
||||
False-positive rate per query ≈ <code>1/2^b</code>. With Findere parameter <code>z ≥ 2</code>,
|
||||
<code>z</code> consecutive k-mers must all match, reducing the effective FP rate to
|
||||
approximately <code>W / 2^(b·z)</code> per read of length <code>L</code>
|
||||
(where <code>W = L − k − z + 2</code>).</p>
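<p>To make the arithmetic concrete, the approximation above can be evaluated with a small helper. The function name is hypothetical, and the formula is the approximate per-read bound quoted in the text, not an exact probability:</p>

```rust
/// Approximate per-read false-positive rate W / 2^(b*z),
/// with W = L - k - z + 2 windows of z consecutive k-mers.
/// Returns None when the read is too short to contain one full window.
fn per_read_fp(read_len: u32, k: u32, b: u32, z: u32) -> Option<f64> {
    let w = (read_len + 2).checked_sub(k + z)?;
    if w == 0 {
        return None;
    }
    Some(w as f64 / 2f64.powi((b * z) as i32))
}
```

<p>For a 150-nucleotide read with k = 31, b = 8 and z = 2, W = 119 and the estimate is 119 / 65536 ≈ 0.0018, compared with 1/256 ≈ 0.0039 for a single k-mer query.</p>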
|
||||
<h3 id="counts-persistentcompactintmatrix">counts/ (PersistentCompactIntMatrix)</h3>
|
||||
<p>Present when <code>with_counts=true</code>. One column per genome; each row holds the
|
||||
per-genome k-mer count for the corresponding MPHF slot. Appended column-by-column
|
||||
during indexing and merge.</p>
|
||||
<h3 id="presence-persistentbitmatrix">presence/ (PersistentBitMatrix)</h3>
|
||||
<p>Present when the layer was built in presence/absence mode (merge path).
|
||||
One bit per genome per MPHF slot. Written during merge; never present on a
|
||||
freshly indexed single-genome layer.</p>
|
||||
<h2 id="metajson-partitionmeta">meta.json (PartitionMeta)</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"n_layers"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>Records how many <code>layer_N/</code> directories exist under <code>index/</code>. Incremented by
|
||||
each merge that adds a layer.</p>
|
||||
|
||||
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -751,6 +751,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -829,6 +857,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -1046,61 +1130,49 @@
|
||||
|
||||
<h1 id="superkmer-implementation">SuperKmer — implementation</h1>
|
||||
<h2 id="memory-layout">Memory layout</h2>
|
||||
<p>A super-kmer is stored as a <strong>32-bit header</strong> followed by a <strong>byte-aligned nucleotide sequence</strong> (2 bits/base, nucleotide 0 at the MSB of the first byte):</p>
|
||||
<p><code>SuperKmer</code> holds two separate fields:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">SuperKmer</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="n">count</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">,</span>
|
||||
<span class="w"> </span><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="n">inner</span><span class="p">:</span><span class="w"> </span><span class="nc">PackedSeq</span><span class="p">,</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>PackedSeq</code> stores a 2-bit packed DNA sequence as a heap-allocated <code>Box<[u8]></code> plus a <code>tail: u8</code> field:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Field</th>
|
||||
<th>Bits</th>
|
||||
<th>Type</th>
|
||||
<th>Role</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>COUNT</td>
|
||||
<td>24</td>
|
||||
<td>Occurrence count (≤ 16 M)</td>
|
||||
<td><code>tail</code></td>
|
||||
<td><code>u8</code></td>
|
||||
<td>Number of valid nucleotides in the last byte: 0 encodes 4, 1–3 are identity</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>NKMERS</td>
|
||||
<td>8</td>
|
||||
<td>Number of kmers (= seq_length − k + 1, range 1–255)</td>
|
||||
<td><code>seq</code></td>
|
||||
<td><code>Box<[u8]></code></td>
|
||||
<td>2-bit packed bytes, nucleotide 0 at bits 7–6 of <code>seq[0]</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>Bit layout (MSB to LSB): <code>[31:8] COUNT [7:0] NKMERS</code></p>
|
||||
<p>NKMERS is stored as a raw <code>u8</code> in <strong>kmer units</strong>, not nucleotides. The nucleotide length is recovered as <code>NKMERS + k − 1</code>. This avoids the awkward wrapping convention (<code>0 = 256</code>) that would be needed if nucleotide length were stored directly, and gains k−1 = 30 units of headroom:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>unit</th>
|
||||
<th>u8 covers</th>
|
||||
<th>max nucleotides</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>nucleotides</td>
|
||||
<td>255 nt</td>
|
||||
<td>225 kmers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><strong>kmers</strong></td>
|
||||
<td><strong>255 kmers</strong></td>
|
||||
<td><strong>285 nt</strong></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The public accessors:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">n_kmers</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">K</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">count</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">increment</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">set_count</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">);</span><span class="w"> </span><span class="p">}</span>
|
||||
<p>Nucleotide length is recovered without storing it explicitly:</p>
|
||||
<div class="highlight"><pre><span></span><code>seql = (seq.len() - 1) * 4 + tail_count(tail)
|
||||
</code></pre></div>
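<p>A small sketch of that length recovery, following the <code>tail</code> convention in the table (0 encodes 4, 1 to 3 are taken as-is). The free functions below are illustrative and are not the methods of <code>PackedSeq</code>:</p>

```rust
/// Number of valid nucleotides in the last byte: 0 encodes 4, 1..=3 are identity.
/// Values above 3 are rejected in this sketch.
fn tail_count(tail: u8) -> Option<usize> {
    match tail {
        0 => Some(4),
        1..=3 => Some(tail as usize),
        _ => None,
    }
}

/// Nucleotide length from the packed byte count and the tail field.
fn seql(n_bytes: usize, tail: u8) -> Option<usize> {
    if n_bytes == 0 {
        return None; // at least one byte is needed for the formula to hold
    }
    Some((n_bytes - 1) * 4 + tail_count(tail)?)
}

/// Inverse direction: the tail value for a given nucleotide length.
fn tail_for(seql: usize) -> u8 {
    (seql % 4) as u8
}
```

<p>A 31-nucleotide sequence occupies 8 bytes with <code>tail = 3</code>; a 32-nucleotide sequence also occupies 8 bytes, with <code>tail = 0</code>.</p>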
|
||||
<p>There is no packed header word — <code>count</code> and the sequence live in separate fields.</p>
|
||||
<p>The on-disk binary format (produced by <code>write_to_binary</code>) is:</p>
|
||||
<div class="highlight"><pre><span></span><code>[varint(count)] [u8: seql − k] [packed bytes…]
|
||||
</code></pre></div>
|
||||
<p><code>seql − k</code> fits in a <code>u8</code> when <code>n_kmers = seql − k + 1 ≤ MAX_KMERS_PER_CHUNK (= 256)</code>. If a super-kmer exceeds 256 kmers, <code>write_to_binary</code> splits it into overlapping chunks (k−1 nucleotide overlap, same count per chunk), each a self-contained record readable by <code>read_from_binary</code>.</p>
|
||||
<p>The public accessors operate on the struct fields directly:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">inner</span><span class="p">.</span><span class="n">seql</span><span class="p">()</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">count</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">increment</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">set_count</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers) — far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).</p>
|
||||
<p>The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.</p>
|
||||
<h2 id="ascii-encoding-and-decoding">ASCII encoding and decoding</h2>
|
||||
<p>Two lookup tables handle ASCII ↔ 2-bit conversion:</p>
|
||||
<ul>
|
||||
@@ -1125,7 +1197,7 @@
|
||||
</code></pre></div>
|
||||
<p><code>REVCOMP4</code> is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.</p>
|
||||
<p><strong>Step 2 — realignment.</strong> After step 1, <code>padding = n × 8 − seql × 2</code> spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using <code>BitSlice<u8, Msb0>::rotate_left(padding)</code> from the <code>bitvec</code> crate, which is SIMD-accelerated. The trailing <code>padding</code> bits are then zeroed:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">seql</span><span class="p">();</span>
|
||||
<span class="n">shift</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="c1">// number of padding bits</span>
|
||||
<span class="n">bits</span><span class="p">.</span><span class="n">rotate_left</span><span class="p">(</span><span class="n">shift</span><span class="p">)</span>
|
||||
<span class="n">bits</span><span class="p">[</span><span class="n">len</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">shift</span><span class="o">..</span><span class="p">].</span><span class="n">fill</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
|
||||
@@ -1143,7 +1215,7 @@
|
||||
</code></pre></div>
|
||||
</div>
|
||||
<h2 id="minimizer-sliding-window">Minimizer sliding window</h2>
|
||||
<p>Super-kmers are built by <code>SuperKmerIter</code> (crate <code>obiskbuilder</code>), which maintains the current minimizer with a <strong>monotonic deque</strong> over a sliding window of W = k − m + 1 m-mer positions.</p>
|
||||
<p>Super-kmers are built by <code>SuperKmerIter</code> (crate <code>obiskbuilder</code>), which tracks the current minimizer with a <strong>monotonic deque</strong> (<code>Ring<MmerItem, 32></code>) inside <code>RollingStat</code>, a rolling-window entropy and minimizer tracker.</p>
|
||||
<p>Each deque entry stores:</p>
|
||||
<table>
|
||||
<thead>
|
||||
@@ -1167,20 +1239,11 @@
|
||||
<tr>
|
||||
<td><code>hash</code></td>
|
||||
<td>u64</td>
|
||||
<td><span class="arithmatex">\(H(\text{canonical})\)</span> — ordering key for random minimizer selection</td>
|
||||
<td><code>hash_kmer(canonical << (64 − 2m))</code> — ordering key for random minimizer selection</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The hash <span class="arithmatex">\(H\)</span> is the seeded splitmix64 finalizer (see <a href="../../theory/minimizer/">Minimizer selection</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">hash_mmer</span><span class="p">(</span><span class="n">canonical</span><span class="p">:</span><span class="w"> </span><span class="kt">u64</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u64</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">canonical</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="mh">0x9e3779b97f4a7c15</span><span class="p">;</span><span class="w"> </span><span class="c1">// seed: eliminates fixed point at 0</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">30</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0xbf58476d1ce4e5b9</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">27</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0x94d049bb133111eb</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">31</span><span class="p">)</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>The hash uses the seeded splitmix64 finalizer (<code>mix64(raw ^ 0x9e3779b97f4a7c15)</code>), the same function as <code>kmer::hash_kmer</code>.</p>
|
||||
<p>On each new nucleotide, once the window is full, the deque is updated:</p>
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Algorithm — minimizer deque update</p>
|
||||
@@ -1196,17 +1259,21 @@
|
||||
</code></pre></div>
|
||||
</div>
|
||||
<p>The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.</p>
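<p>The amortized O(1) behaviour can be seen in a generic sliding-window minimum built on <code>std::collections::VecDeque</code>. This is a sketch of the technique, not the <code>Ring</code>/<code>RollingStat</code> code: it works on a precomputed slice of hashes, and its tie-breaking (an equal hash replaces the older entry) is an assumption:</p>

```rust
use std::collections::VecDeque;

/// For each full window of `w` consecutive hashes, returns the position of
/// the minimum hash. The deque stores positions whose hashes are strictly
/// increasing from front to back, so the front is always the window minimum.
fn window_minima(hashes: &[u64], w: usize) -> Vec<usize> {
    let mut deque: VecDeque<usize> = VecDeque::new();
    let mut out = Vec::new();
    if w == 0 {
        return out;
    }
    for (i, &h) in hashes.iter().enumerate() {
        // Entries at the back that are not smaller than the new hash can
        // never become the minimum again: drop them.
        while let Some(&j) = deque.back() {
            if hashes[j] >= h {
                deque.pop_back();
            } else {
                break;
            }
        }
        deque.push_back(i);
        // Drop front entries that have slid out of the window [i - w + 1, i].
        while let Some(&j) = deque.front() {
            if j + w <= i {
                deque.pop_front();
            } else {
                break;
            }
        }
        // Report a minimum once the first full window has been seen.
        if i + 1 >= w {
            if let Some(&j) = deque.front() {
                out.push(j);
            }
        }
    }
    out
}
```

<p>Each position is pushed once and popped at most once, which is where the amortized constant cost per nucleotide comes from. A super-kmer boundary corresponds to a change between two consecutive reported positions.</p>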
|
||||
<p>A super-kmer boundary is emitted when the minimizer changes: <code>deque.front.hash ≠ prev_hash</code>. The <code>canonical</code> field of the front entry is <strong>not</strong> used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key <span class="arithmatex">\(H(\text{canonical})\)</span> can be recomputed independently at routing time from the stored <code>minimizer_pos</code>, without inheriting the minimum-order-statistic bias (see <a href="../../theory/minimizer/#partition-key-independence">Minimizer selection — partition key independence</a>).</p>
|
||||
<p>A super-kmer boundary is emitted when the minimizer changes: <code>current_minimizer != prev_minimizer</code>. <code>SuperKmerIter</code> also emits a boundary when:</p>
|
||||
<ul>
|
||||
<li>entropy of the current k-mer falls at or below the threshold θ (cursor retreated by k−1)</li>
|
||||
<li>super-kmer length reaches 256 nucleotides (cursor retreated by k)</li>
|
||||
</ul>
|
||||
<h2 id="kmer-extraction">Kmer extraction</h2>
|
||||
<p>A k-mer is extracted from a super-kmer with <code>SuperKmer::kmer(i, k)</code>, which returns a <code>Kmer</code> — a left-aligned <code>u64</code> newtype (see <a href="../kmer/">Kmer implementation</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">kmer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Result</span><span class="o"><</span><span class="n">Kmer</span><span class="p">,</span><span class="w"> </span><span class="n">KmerError</span><span class="o">></span>
|
||||
<p>A k-mer is extracted from a super-kmer with <code>SuperKmer::kmer(i)</code>, which delegates to <code>PackedSeq::extract::<KLen>(i)</code> and returns a <code>Kmer</code> — a left-aligned <code>u64</code> newtype (see <a href="../kmer/">Kmer implementation</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">kmer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Result</span><span class="o"><</span><span class="n">Kmer</span><span class="p">,</span><span class="w"> </span><span class="n">KmerError</span><span class="o">></span>
|
||||
</code></pre></div>
|
||||
<p>The bit slice <code>seq[i*2 .. (i+k)*2]</code> (Msb0 order) is loaded as a big-endian <code>u64</code> via <code>bitvec::load_be</code>, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.</p>
|
||||
<p>The bit slice <code>seq[i*2 .. (i+k)*2]</code> (Msb0 order) is loaded as a <code>u64</code> via <code>bitvec::load_be</code>, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.</p>
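<p>To make the left-aligned layout concrete, here is a minimal sketch that reads the same bit range from a plain byte buffer. It is an assumption-laden stand-in: the function name <code>extract_left_aligned</code> is hypothetical, it returns <code>Option</code> rather than the documented <code>Result&lt;Kmer, KmerError&gt;</code>, and it uses an explicit per-nucleotide loop for readability instead of the single <code>bitvec</code> load described above.</p>

```rust
/// Hypothetical stand-in for the documented extraction (not the crate's
/// API): reads `k` nucleotides starting at nucleotide index `i` from a
/// buffer packed at 2 bits per nucleotide, most significant bits first,
/// and returns them left-aligned in a `u64`.
fn extract_left_aligned(packed: &[u8], i: usize, k: usize) -> Option<u64> {
    // A u64 holds at most 32 nucleotides; the bit range must be in bounds.
    if k == 0 || k > 32 || (i + k) * 2 > packed.len() * 8 {
        return None;
    }

    let mut value: u64 = 0;
    for n in i..i + k {
        let byte = packed[n / 4];
        // Nucleotide 0 of a byte occupies its two highest bits (Msb0 order).
        let shift = 6 - 2 * (n % 4);
        let code = (byte >> shift) & 0b11;
        value = (value << 2) | u64::from(code);
    }

    // Move the 2k significant bits to the top of the word; the low
    // 64 - 2k bits stay zero.
    Some(value << (64 - 2 * k))
}
```

<p>With the buffer <code>[0b00_01_10_11, 0b11_10_01_00]</code>, extracting three nucleotides at index 1 yields the bit pattern <code>011011</code> in the top six bits of the result.</p>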
|
||||
<hr />
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Algorithm — Super-kmer reverse complement</p>
|
||||
<div class="highlight"><pre><span></span><code>procedure SuperKmerRevcomp(seq, SEQL):
|
||||
seql ← NKMERS + k − 1 -- nucleotide length
|
||||
seql ← nucleotide length
|
||||
n ← ⌈seql / 4⌉ -- number of bytes
|
||||
shift ← n × 8 − seql × 2 -- padding bits to flush
|
||||
|
||||
|
||||
File diff suppressed because it is too large
Load Diff