docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
Eric Coissac
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
File diff suppressed because it is too large
+93 -6
@@ -718,6 +718,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -796,6 +824,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
@@ -1010,17 +1094,20 @@
<p>The Watson-Crick complement of any base is its bitwise NOT on 2 bits: <code>complement(base) = ~base &amp; 0b11</code>.</p>
<h2 id="kmer-encoding">Kmer encoding</h2>
<p>A kmer fits in a single <code>u64</code>. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i &lt; k): <code>(kmer &gt;&gt; (62 - 2*i)) &amp; 0b11</code>.</p>
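<p>As an illustration of the layout described above, here is a minimal Rust sketch of packing and extraction. The function names and the base codes A=0, C=1, G=2, T=3 are assumptions made for this example (the codes are only one assignment compatible with the NOT-based complement); this is not the project's actual API.</p>

```rust
/// Map one ASCII base to a 2-bit code.
/// Assumed encoding: A=00, C=01, G=10, T=11 (hypothetical, chosen so that
/// complement(base) = !base & 0b11 yields A<->T and C<->G).
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0b00),
        b'C' | b'c' => Some(0b01),
        b'G' | b'g' => Some(0b10),
        b'T' | b't' => Some(0b11),
        _ => None,
    }
}

/// Pack 1..=32 bases into a u64, left-aligned: nucleotide i lands on
/// bits 63-2i and 62-2i, and the low 64-2k bits stay zero.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        kmer |= encode_base(b)? << (62 - 2 * i);
    }
    Some(kmer)
}

/// Extract the 2-bit code of nucleotide i (0 <= i < k).
fn nucleotide(kmer: u64, i: usize) -> u64 {
    (kmer >> (62 - 2 * i)) & 0b11
}

/// Watson-Crick complement of a single 2-bit base code.
fn complement(base: u64) -> u64 {
    !base & 0b11
}
```

<p>Under these assumptions, the 4-mer ACGT packs to the byte 0b00011011 in the top 8 bits of the word, with the remaining 56 bits zero.</p>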
<p>Reverse complement is computed via a <strong>16-bit lookup table</strong> (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.</p>
<p>Reverse complement is computed by <strong>bit manipulation in four steps</strong>, with no lookup table:</p>
<div class="admonition abstract">
<p class="admonition-title">Algorithm — Kmer reverse complement</p>
<div class="highlight"><pre><span></span><code>procedure KmerRevcomp(kmer, k):
raw ← TABLE16[kmer &amp; 0xFFFF] &lt;&lt; 48
| TABLE16[(kmer &gt;&gt; 16) &amp; 0xFFFF] &lt;&lt; 32
| TABLE16[(kmer &gt;&gt; 32) &amp; 0xFFFF] &lt;&lt; 16
| TABLE16[(kmer &gt;&gt; 48) &amp; 0xFFFF]
return raw &lt;&lt; (64 - 2*k)
x ← ~kmer -- complement all bases
x ← swap_bytes(x) -- reverse byte order
x ← ((x &gt;&gt; 4) &amp; 0x0F0F0F0F0F0F0F0F)
| ((x &amp; 0x0F0F0F0F0F0F0F0F) &lt;&lt; 4) -- swap nibbles within each byte
x ← ((x &gt;&gt; 2) &amp; 0x3333333333333333)
| ((x &amp; 0x3333333333333333) &lt;&lt; 2) -- swap 2-bit pairs within each nibble
return x &lt;&lt; (64 - 2*k) -- re-align to MSB
</code></pre></div>
</div>
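<p>The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). After the reversal the result sits in the low 2k bits; the final left shift re-aligns it to the most significant bits, discarding the complemented padding and leaving the low 64−2k bits zero.</p>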
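<p>The four-step procedure translates almost line for line into Rust. The sketch below is a transcription of the pseudocode, not the project's code; the name <code>revcomp</code> and its signature are assumptions, and it expects a left-aligned kmer with 1 ≤ k ≤ 32.</p>

```rust
/// Reverse complement of a left-aligned 2-bit-packed kmer of length k.
/// Hypothetical free function; 1 <= k <= 32 is assumed.
fn revcomp(kmer: u64, k: u32) -> u64 {
    debug_assert!((1..=32).contains(&k));
    // Step 1: complement every base (padding bits become ones).
    let mut x = !kmer;
    // Step 2: reverse the order of the eight bytes.
    x = x.swap_bytes();
    // Step 3: swap the two nibbles inside each byte.
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    // Step 4: swap the two 2-bit codes inside each nibble.
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The reversed kmer now sits in the low 2k bits: shift it back to the MSB,
    // which also pushes the complemented padding out of the word.
    x << (64 - 2 * k)
}
```

<p>With the base codes A=0, C=1, G=2, T=3 (an assumption of this example), AAAC encodes to <code>0x01</code> in the top byte and its reverse complement GTTT to <code>0xBF</code>; applying the function twice returns the original kmer.</p>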
<p>The <strong>canonical form</strong> is the lexicographic minimum of the kmer and its reverse complement:</p>
<div class="highlight"><pre><span></span><code>canonical(kmer) = min(kmer, revcomp(kmer))
</code></pre></div>
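<p>Because kmers are left-aligned with zero padding, comparing two kmers of the same length as plain integers orders them base by base from the first nucleotide, so an integer <code>min</code> can serve as the canonical form. A minimal sketch, with hypothetical function names and a compact copy of the bitwise reverse complement so the block stands alone:</p>

```rust
/// Bitwise reverse complement of a left-aligned kmer (1 <= k <= 32 assumed).
fn revcomp(kmer: u64, k: u32) -> u64 {
    let mut x = (!kmer).swap_bytes();
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    x << (64 - 2 * k)
}

/// Canonical form: the smaller of the kmer and its reverse complement.
/// Both strands of the same site map to the same value.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

<p>A useful property to check in tests is strand independence: <code>canonical(x, k)</code> and <code>canonical(revcomp(x, k), k)</code> are equal for any valid kmer.</p>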
File diff suppressed because it is too large
+85 -1
@@ -773,6 +773,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -851,6 +879,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
@@ -1109,7 +1193,7 @@
<h2 id="final-score">Final score</h2>
<p>The filter computes <span class="arithmatex">\(\hat{H}(ws)\)</span> for each word size ws from 1 to ws_max and returns the <strong>minimum</strong>:</p>
<div class="arithmatex">\[\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)\]</div>
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) \leq \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) &lt; \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
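<p>The decision rule itself is small enough to sketch. The example below assumes the normalised scores for ws = 1 to ws_max have already been computed and are passed in as a slice; how each score is obtained is outside this excerpt. The names are hypothetical, and treating an empty slice as "not rejected" is a choice made for the example only.</p>

```rust
/// Default rejection threshold stated in the updated documentation.
const DEFAULT_THETA: f64 = 0.7;

/// Final score: the minimum of the per-word-size normalised entropies.
/// `h_hat[j]` is assumed to hold the score for ws = j + 1.
fn entropy_score(h_hat: &[f64]) -> Option<f64> {
    h_hat.iter().copied().reduce(f64::min)
}

/// A kmer is rejected when its score is strictly below theta.
/// With no scores at all, this sketch keeps the kmer (hypothetical choice).
fn is_rejected(h_hat: &[f64], theta: f64) -> bool {
    match entropy_score(h_hat) {
        Some(score) => score < theta,
        None => false,
    }
}
```

<p>Note the strict comparison: a kmer whose score equals the threshold exactly is kept, which is the behavioural difference between the old (≤) and new (&lt;) wording of the rule.</p>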
<h2 id="interpretation-as-an-effective-number-of-classes">Interpretation as an effective number of classes</h2>
<p><span class="arithmatex">\(H_{\text{corr}}\)</span> is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: <span class="arithmatex">\(N_{\text{eff}} = e^{H_{\text{corr}}}\)</span> is the number of equiprobable classes that would yield the same entropy.</p>
<p>For the normalised score <span class="arithmatex">\(\hat{H}\)</span>, dividing by <span class="arithmatex">\(H_{\text{max}}\)</span> changes the logarithm base:</p>
File diff suppressed because it is too large
+84
@@ -718,6 +718,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -796,6 +824,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
File diff suppressed because it is too large
+84
@@ -762,6 +762,34 @@
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
@@ -840,6 +868,62 @@
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>