docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
This commit is contained in:
Eric Coissac
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
+93 -6
View File
@@ -718,6 +718,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -796,6 +824,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
@@ -1010,17 +1094,20 @@
<p>The Watson-Crick complement of any base is its bitwise NOT on 2 bits: <code>complement(base) = ~base &amp; 0b11</code>.</p>
<h2 id="kmer-encoding">Kmer encoding</h2>
<p>A kmer fits in a single <code>u64</code>. Nucleotide 0 occupies bits 6362, nucleotide i occupies bits 632i and 622i, and the low 642k bits are zero. Extraction of nucleotide i (0 ≤ i &lt; k): <code>(kmer &gt;&gt; (62 - 2*i)) &amp; 0b11</code>.</p>
<p>Reverse complement is computed via a <strong>16-bit lookup table</strong> (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.</p>
<p>Reverse complement is computed by <strong>bit manipulation in four steps</strong>, with no lookup table:</p>
<div class="admonition abstract">
<p class="admonition-title">Algorithm — Kmer reverse complement</p>
<div class="highlight"><pre><span></span><code>procedure KmerRevcomp(kmer, k):
raw ← TABLE16[kmer &amp; 0xFFFF] &lt;&lt; 48
| TABLE16[(kmer &gt;&gt; 16) &amp; 0xFFFF] &lt;&lt; 32
| TABLE16[(kmer &gt;&gt; 32) &amp; 0xFFFF] &lt;&lt; 16
| TABLE16[(kmer &gt;&gt; 48) &amp; 0xFFFF]
return raw &lt;&lt; (64 - 2*k)
x ← ~kmer -- complement all bases
x ← swap_bytes(x) -- reverse byte order
x ← ((x &gt;&gt; 4) &amp; 0x0F0F0F0F0F0F0F0F)
| ((x &amp; 0x0F0F0F0F0F0F0F0F) &lt;&lt; 4) -- swap nibbles within each byte
x ← ((x &gt;&gt; 2) &amp; 0x3333333333333333)
| ((x &amp; 0x3333333333333333) &lt;&lt; 2) -- swap 2-bit pairs within each nibble
return x &lt;&lt; (64 - 2*k) -- re-align to MSB
</code></pre></div>
</div>
<p>The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 642k padding bits.</p>
<p>The <strong>canonical form</strong> is the lexicographic minimum of the kmer and its reverse complement:</p>
<div class="highlight"><pre><span></span><code>canonical(kmer) = min(kmer, revcomp(kmer))
</code></pre></div>