docs: expand kmer indexing, filtering, and merging documentation
Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
@@ -514,10 +514,21 @@
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#memory-layout" class="md-nav__link">
|
||||
<a href="#types-and-layout" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Memory layout
|
||||
Types and layout
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#global-parameters" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Global parameters
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -558,10 +569,32 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-form" class="md-nav__link">
|
||||
<a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical form
|
||||
Canonical form and CanonicalKmerOf
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#sliding-window-helpers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Sliding window helpers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#hashing" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Hashing
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -751,6 +784,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../evidence_elimination/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Evidence elimination (discussion)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obilayeredmap/" class="md-nav__link">
|
||||
|
||||
@@ -829,6 +890,62 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../merge/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Merge command
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../rebuild_filter/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmer filtering (rebuild/dump/unitig)
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -973,10 +1090,21 @@
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#memory-layout" class="md-nav__link">
|
||||
<a href="#types-and-layout" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Memory layout
|
||||
Types and layout
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#global-parameters" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Global parameters
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1017,10 +1145,32 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-form" class="md-nav__link">
|
||||
<a href="#canonical-form-and-canonicalkmerof" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical form
|
||||
Canonical form and CanonicalKmerOf
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#sliding-window-helpers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Sliding window helpers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#hashing" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Hashing
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -1045,12 +1195,43 @@
|
||||
|
||||
|
||||
<h1 id="kmer-implementation">Kmer — implementation</h1>
|
||||
<h2 id="memory-layout">Memory layout</h2>
|
||||
<p><code>Kmer</code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code>:</p>
|
||||
<h2 id="types-and-layout">Types and layout</h2>
|
||||
<p><code>KmerOf<L></code> is a <code>#[repr(transparent)]</code> newtype over <code>u64</code> parameterized by a <code>KmerLength</code> marker:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="cp">#[repr(transparent)]</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">Kmer</span><span class="p">(</span><span class="kt">u64</span><span class="p">);</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">KmerOf</span><span class="o"><</span><span class="n">L</span><span class="p">:</span><span class="w"> </span><span class="nc">KmerLength</span><span class="o">></span><span class="p">(</span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="n">PhantomData</span><span class="o"><</span><span class="n">L</span><span class="o">></span><span class="p">);</span>
|
||||
</code></pre></div>
|
||||
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is <strong>not stored</strong> — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.</p>
|
||||
<p>Three marker types implement <code>KmerLength</code>:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Marker</th>
|
||||
<th><code>len()</code> source</th>
|
||||
<th>Used for</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>KLen</code></td>
|
||||
<td><code>params::k()</code></td>
|
||||
<td>k-mers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>MLen</code></td>
|
||||
<td><code>params::m()</code></td>
|
||||
<td>minimizers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>ConstLen<N></code></td>
|
||||
<td>const generic <code>N</code></td>
|
||||
<td>tests</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>Public aliases:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Kmer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KmerOf</span><span class="o"><</span><span class="n">KLen</span><span class="o">></span><span class="p">;</span><span class="w"> </span><span class="c1">// k-mer, global k</span>
|
||||
<span class="k">pub</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nc">Minimizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="o"><</span><span class="n">MLen</span><span class="o">></span><span class="p">;</span><span class="w"> </span><span class="c1">// canonical m-mer, global m</span>
|
||||
</code></pre></div>
|
||||
<p>Nucleotides are packed 2 bits each, <strong>left-aligned</strong>, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is <strong>not stored</strong> — every operation reads it from <code>L::len()</code>.</p>
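<p>As an illustration of the layout described above, the following is a minimal sketch, not the crate's actual code. The trait body, <code>ConstLen</code> implementation, and the helper methods <code>nucleotide</code> and <code>padding_is_zero</code> are hypothetical names introduced here to show how a length marker and the left-aligned packing could fit together.</p>

```rust
use std::marker::PhantomData;

/// Marker trait that supplies the length (assumed shape of `KmerLength`).
pub trait KmerLength {
    fn len() -> usize;
}

/// Compile-time length marker, in the spirit of `ConstLen<N>`.
pub struct ConstLen<const N: usize>;

impl<const N: usize> KmerLength for ConstLen<N> {
    fn len() -> usize {
        N
    }
}

/// Same shape as the struct shown above: a `u64` plus a zero-sized marker.
#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);

impl<L: KmerLength> KmerOf<L> {
    /// Hypothetical accessor: 2-bit code of nucleotide `i` (0 = leftmost).
    /// Nucleotide i occupies bits 63-2i and 62-2i.
    pub fn nucleotide(&self, i: usize) -> u8 {
        assert!(i < L::len());
        ((self.0 >> (62 - 2 * i)) & 0b11) as u8
    }

    /// Hypothetical check of the invariant that the low 64-2*len bits are zero.
    pub fn padding_is_zero(&self) -> bool {
        let used = 2 * L::len() as u32;
        used == 64 || self.0 << used == 0
    }
}
```

<p>Because the length comes from the marker type, a value built with one marker cannot be mixed up with a value built with another at compile time, while the in-memory representation stays a bare <code>u64</code>.</p>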
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
@@ -1071,33 +1252,41 @@
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<h2 id="global-parameters">Global parameters</h2>
|
||||
<p><code>params::set_k(k)</code> / <code>params::k()</code> and <code>params::set_m(m)</code> / <code>params::m()</code> are backed by <code>OnceLock<usize></code> in production (write-once, panic on conflict) and by <code>thread_local! { Cell<usize> }</code> in test builds (per-thread, freely writable). <code>params::init(k, m)</code> sets both in one call.</p>
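<p>The write-once behaviour can be sketched with <code>std::sync::OnceLock</code>. This is an assumption-laden illustration of the production path only: it treats "conflict" as a second call with a different value, which may not match the real module exactly.</p>

```rust
use std::sync::OnceLock;

// Process-wide storage for k, written at most once.
static K: OnceLock<usize> = OnceLock::new();

/// Store k on the first call. Later calls must pass the same value;
/// a different value is treated as a conflict and panics (assumed semantics).
pub fn set_k(k: usize) {
    let stored = *K.get_or_init(|| k);
    assert_eq!(stored, k, "k already set to {}, cannot change to {}", stored, k);
}

/// Read k; panics if it was never set.
pub fn k() -> usize {
    *K.get().expect("k not initialised; call set_k first")
}
```

<p>The test-build variant described above would swap the <code>OnceLock</code> for a <code>thread_local!</code> <code>Cell&lt;usize&gt;</code> so that each test thread can pick its own value.</p>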
|
||||
<h2 id="encoding">Encoding</h2>
|
||||
<p><code>Kmer::from_ascii(ascii, k)</code> encodes the first k bytes of an ASCII slice using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p>
|
||||
<p><code>KmerOf::<L>::from_ascii(ascii)</code> encodes the first <code>L::len()</code> bytes using the shared <code>ENC</code> table (see <a href="../superkmer/#ascii-encoding-and-decoding">SuperKmer — ASCII encoding</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="o">..</span><span class="n">k</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="n">val</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">encode_base</span><span class="p">(</span><span class="n">ascii</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="p">;</span>
|
||||
<span class="p">}</span>
|
||||
<span class="n">Kmer</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span>
|
||||
<span class="n">KmerOf</span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
</code></pre></div>
|
||||
<p>Zero allocation — result lives on the stack.</p>
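<p>A self-contained version of the encoding loop is sketched below. The base mapping A=0, C=1, G=2, T=3 is an assumption (the real values come from the shared <code>ENC</code> table), and <code>encode_left_aligned</code> is a hypothetical free function that takes the length as an argument instead of reading it from <code>L::len()</code>.</p>

```rust
/// 2-bit code for one base. Assumed mapping: A=0, C=1, G=2, T=3.
fn encode_base(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        other => panic!("not a nucleotide: {}", other as char),
    }
}

/// Pack the first `k` bytes of `ascii`, 2 bits each, MSB-first,
/// then shift left so that nucleotide 0 lands in bits 63-62.
fn encode_left_aligned(ascii: &[u8], k: usize) -> u64 {
    assert!((1..=32).contains(&k) && ascii.len() >= k);
    let mut val = 0u64;
    for &b in &ascii[..k] {
        val = (val << 2) | encode_base(b) as u64;
    }
    val << (64 - 2 * k)
}
```

<p>For example, <code>ACGT</code> packs to the bit pattern <code>00 01 10 11</code> in the top byte, that is <code>0x1B00_0000_0000_0000</code>.</p>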
|
||||
<h2 id="decoding">Decoding</h2>
|
||||
<p><code>write_ascii(k, buf)</code> appends k ASCII characters to a caller-supplied <code>Vec<u8></code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.</p>
|
||||
<p><code>to_ascii(k)</code> is a convenience wrapper that allocates and returns a <code>Vec<u8></code>; intended for tests and display only.</p>
|
||||
<p><code>write_ascii(writer)</code> writes <code>L::len()</code> ASCII characters to any <code>W: Write</code> using the shared <code>DEC4</code> table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.</p>
|
||||
<p><code>to_ascii()</code> is a convenience wrapper that allocates and returns a <code>Vec<u8></code>; intended for tests and display only.</p>
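<p>The sketch below shows the general shape of decoding into a writer. It is deliberately simpler than the description above: it decodes one nucleotide per step into a stack buffer instead of using the <code>DEC4</code> table, and both the function name <code>write_ascii_simple</code> and the A/C/G/T mapping are assumptions made for illustration.</p>

```rust
use std::io::{self, Write};

/// Decode `k` left-aligned 2-bit nucleotides from `raw` and write them
/// as ASCII. Uses a fixed stack buffer, so no heap allocation occurs here.
fn write_ascii_simple<W: Write>(raw: u64, k: usize, writer: &mut W) -> io::Result<()> {
    // Assumed mapping: 0=A, 1=C, 2=G, 3=T.
    const BASES: [u8; 4] = *b"ACGT";
    assert!(k <= 32);
    let mut buf = [0u8; 32];
    for i in 0..k {
        buf[i] = BASES[((raw >> (62 - 2 * i)) & 0b11) as usize];
    }
    writer.write_all(&buf[..k])
}
```

<p>A table-driven variant would replace the inner loop with one lookup per byte of <code>raw</code>, yielding 4 characters at a time.</p>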
|
||||
<h2 id="reverse-complement">Reverse complement</h2>
|
||||
<p>Computed as pure arithmetic — no lookup table, no memory access:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">!</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="c1">// complement</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">swap_bytes</span><span class="p">();</span><span class="w"> </span><span class="c1">// reverse bytes</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">4</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x0F0F0F0F0F0F0F0F</span><span class="p">)</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">4</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap nibbles</span>
|
||||
<span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">((</span><span class="n">x</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0x3333333333333333</span><span class="p">)</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">2</span><span class="p">);</span><span class="w"> </span><span class="c1">// swap 2-bit groups</span>
|
||||
<span class="n">Kmer</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">))</span>
|
||||
<span class="n">KmerOf</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="p">(</span><span class="mi">64</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
</code></pre></div>
|
||||
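<p>After complementing, bytes are reversed (<code>swap_bytes</code>), then the nibbles within each byte, then the 2-bit groups within each nibble. Together these steps reverse the order of all 32 two-bit slots while keeping the two bits of each nucleotide in order. The reversed k-mer then sits in the low 2k bits, with the complemented padding above it; the final left shift by 64−2k realigns the k-mer to the MSB and discards that padding. Zero allocation: the result lives on the stack.</p>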
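<p>To make the bit manipulation concrete, here is the same sequence of steps as a standalone function on raw <code>u64</code> values. The free-function form and the explicit <code>k</code> argument are assumptions for the sake of a runnable sketch; the A=0, C=1, G=2, T=3 mapping is also assumed, since it is what makes bitwise NOT act as the complement.</p>

```rust
/// Reverse complement of a left-aligned, 2-bit packed k-mer (1 <= k <= 32).
fn revcomp(raw: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    // Complement every base (and turn the zero padding into ones).
    let x = !raw;
    // Reverse the order of the 8 bytes.
    let x = x.swap_bytes();
    // Swap the two nibbles inside each byte.
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    // Swap the two 2-bit groups inside each nibble.
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The k-mer is now in the low 2k bits; shift it back to the top,
    // which also pushes the complemented padding out of the word.
    x << (64 - 2 * k)
}
```

<p>With k = 3, <code>AAC</code> (<code>0x0400_0000_0000_0000</code>) maps to <code>GTT</code> (<code>0xBC00_0000_0000_0000</code>), and applying the function twice returns the input.</p>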
|
||||
<h2 id="canonical-form">Canonical form</h2>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">Self</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">(</span><span class="n">k</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="o">*</span><span class="bp">self</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="p">}</span>
|
||||
<h2 id="canonical-form-and-canonicalkmerof">Canonical form and <code>CanonicalKmerOf</code></h2>
|
||||
<p><code>canonical()</code> returns a <code>CanonicalKmerOf<L></code> — a distinct newtype that carries the same <code>u64</code> layout but enforces the invariant that the stored value equals <code>min(kmer, revcomp)</code>:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">canonical</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nc">CanonicalKmerOf</span><span class="o"><</span><span class="n">L</span><span class="o">></span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">rc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">revcomp</span><span class="p">();</span>
|
||||
<span class="w"> </span><span class="n">CanonicalKmerOf</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">rc</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">PhantomData</span><span class="p">)</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p>The canonical form is the lexicographic minimum of the forward and reverse-complement k-mers. The raw <code>u64</code> values are compared directly, which the left-aligned encoding makes equivalent to nucleotide-wise comparison. Zero allocation: the result lives on the stack.</p>
|
||||
<p><code>CanonicalKmerOf::from_raw_unchecked(raw)</code> is the only other public constructor, for trusted paths such as deserialisation.</p>
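<p>The value of a separate canonical type is that both strands of a sequence collapse to one key. The sketch below illustrates this with a simplified, non-generic newtype; <code>Canonical</code>, its <code>raw</code> accessor, and the free functions are hypothetical stand-ins, and the base mapping A=0, C=1, G=2, T=3 is assumed.</p>

```rust
/// Simplified stand-in for a canonical k-mer: holds min(kmer, revcomp).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Canonical(u64);

impl Canonical {
    /// Raw left-aligned bits of the canonical form.
    pub fn raw(self) -> u64 {
        self.0
    }
}

/// Reverse complement of a left-aligned, 2-bit packed k-mer (1 <= k <= 32).
fn revcomp(raw: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    let x = (!raw).swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    x << (64 - 2 * k)
}

/// Pick the smaller of the two strands by plain integer comparison.
pub fn canonical(raw: u64, k: usize) -> Canonical {
    Canonical(raw.min(revcomp(raw, k)))
}
```

<p>Here <code>AAC</code> and its reverse complement <code>GTT</code> produce the same <code>Canonical</code> value, so either strand can be used as a lookup key.</p>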
|
||||
<h2 id="sliding-window-helpers">Sliding window helpers</h2>
|
||||
<p><code>push_right(nuc)</code> / <code>push_left(nuc)</code> shift the window by one base in O(1). <code>is_overlapping(other)</code> checks whether the last k−1 nucleotides of <code>self</code> equal the first k−1 of <code>other</code>.</p>
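<p>One plausible reading of these helpers, on raw left-aligned values, is sketched below. The exact semantics are an assumption: here <code>push_right</code> drops the leftmost base and appends <code>nuc</code> on the right, and <code>push_left</code> drops the rightmost base and prepends <code>nuc</code>. The explicit <code>k</code> argument stands in for <code>L::len()</code>.</p>

```rust
/// Drop nucleotide 0, shift everything left by one slot, append `nuc`.
fn push_right(raw: u64, nuc: u8, k: usize) -> u64 {
    assert!((1..=32).contains(&k) && nuc < 4);
    (raw << 2) | ((nuc as u64) << (64 - 2 * k))
}

/// Drop the last nucleotide, shift right by one slot, prepend `nuc`.
fn push_left(raw: u64, nuc: u8, k: usize) -> u64 {
    assert!((1..=32).contains(&k) && nuc < 4);
    // Mask keeping only the top 2k bits, so the padding stays zero.
    let mask = !0u64 << (64 - 2 * k);
    ((raw >> 2) & mask) | ((nuc as u64) << 62)
}

/// True if the last k-1 nucleotides of `a` equal the first k-1 of `b`.
fn is_overlapping(a: u64, b: u64, k: usize) -> bool {
    assert!((1..=32).contains(&k));
    let keep = 2 * (k - 1); // number of bits in the shared part
    if keep == 0 {
        return true; // k = 1: the shared part is empty
    }
    let mask = !0u64 << (64 - keep);
    (a << 2) == (b & mask)
}
```

<p>Under these assumptions, pushing <code>T</code> to the right of <code>ACG</code> gives <code>CGT</code>, and <code>ACG</code> overlaps <code>CGT</code> because they share <code>CG</code>.</p>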
|
||||
<h2 id="hashing">Hashing</h2>
|
||||
<p><code>hash_kmer(raw: u64) -> u64</code> computes <code>mix64(raw ^ 0x9e3779b97f4a7c15)</code>, the seeded splitmix64 finalizer. <code>CanonicalKmerOf::seq_hash()</code> delegates to <code>hash_kmer</code>.</p>
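<p>For reference, a sketch of this hash follows. It assumes <code>mix64</code> is the standard splitmix64 finalizer with its usual shift and multiplier constants; the source does not spell those out, so treat them as an assumption.</p>

```rust
/// splitmix64 finalizer (assumed definition of `mix64`).
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Hash of a raw k-mer value: XOR with the seed constant, then mix.
pub fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ 0x9e37_79b9_7f4a_7c15)
}
```

<p>Each step of the finalizer is invertible, so distinct raw values always hash to distinct outputs. The XOR seed also keeps the all-zero k-mer from hashing to zero, which the bare finalizer would do.</p>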
|
||||
|
||||
|
||||
|
||||
|
||||