docs: expand kmer indexing, filtering, and merging documentation

Expands MkDocs navigation and documentation for evidence elimination, the merge command, and kmer filtering. Refactors kmer representation to a generic `KmerOf<L>` type with a bitwise reverse complement algorithm. Unifies MPHF construction, introduces approximate fingerprint-based indexing, and updates the pipeline, chunkreader, and storage layouts. Adds code coverage reports and clarifies architectural invariants for improved maintainability.
2026-06-04 21:27:01 +02:00
parent 9306ec1c56
commit bb7adc1154
50 changed files with 34226 additions and 1576 deletions
@@ -718,6 +718,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../../implementation/evidence_elimination/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../../implementation/obilayeredmap/" class="md-nav__link">
        
@@ -796,6 +824,62 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/merge/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/rebuild_filter/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -1010,17 +1094,20 @@
 <p>The Watson-Crick complement of any base is its bitwise NOT on 2 bits: <code>complement(base) = ~base &amp; 0b11</code>.</p>
 <h2 id="kmer-encoding">Kmer encoding</h2>
 <p>A kmer fits in a single <code>u64</code>. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i &lt; k): <code>(kmer &gt;&gt; (62 - 2*i)) &amp; 0b11</code>.</p>
-<p>Reverse complement is computed via a <strong>16-bit lookup table</strong> (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.</p>
+<p>Reverse complement is computed by <strong>bit manipulation</strong> (four bitwise steps followed by a re-aligning shift), with no lookup table:</p>
 <div class="admonition abstract">
 <p class="admonition-title">Algorithm — Kmer reverse complement</p>
 <div class="highlight"><pre><span></span><code>procedure KmerRevcomp(kmer, k):
-    raw ←   TABLE16[kmer &amp; 0xFFFF]         &lt;&lt; 48
-          | TABLE16[(kmer &gt;&gt; 16) &amp; 0xFFFF] &lt;&lt; 32
-          | TABLE16[(kmer &gt;&gt; 32) &amp; 0xFFFF] &lt;&lt; 16
-          | TABLE16[(kmer &gt;&gt; 48) &amp; 0xFFFF]
-    return raw &lt;&lt; (64 - 2*k)
+    x ← ~kmer                                           -- complement all bases
+    x ← swap_bytes(x)                                   -- reverse byte order
+    x ← ((x &gt;&gt; 4) &amp; 0x0F0F0F0F0F0F0F0F)
+       | ((x &amp; 0x0F0F0F0F0F0F0F0F) &lt;&lt; 4)               -- swap nibbles within each byte
+    x ← ((x &gt;&gt; 2) &amp; 0x3333333333333333)
+       | ((x &amp; 0x3333333333333333) &lt;&lt; 2)                -- swap 2-bit pairs within each nibble
+    return x &lt;&lt; (64 - 2*k)                              -- re-align to MSB
 </code></pre></div>
 </div>
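+<p>The three reorder passes together reverse the order of all 32 2-bit base codes in the 64-bit word, and the bitwise NOT in the first step complements each base (A↔T, C↔G). After these steps the k bases of the reverse complement sit in the low 2k bits, while the complemented padding (all ones) sits in the high 64−2k bits. The final left shift discards that padding and re-aligns the result to the MSB, so the low 64−2k bits are zero again.</p>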
 <p>The <strong>canonical form</strong> is the lexicographic minimum of the kmer and its reverse complement:</p>
 <div class="highlight"><pre><span></span><code>canonical(kmer) = min(kmer, revcomp(kmer))
 </code></pre></div>
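<p>The pseudocode above can be sketched in Rust as follows. This is a minimal illustration, not the project's actual <code>KmerOf</code> implementation: the function names, the <code>encode</code> helper, and the base codes A=0, C=1, G=2, T=3 are assumptions chosen to be consistent with the bitwise-NOT complement rule. It assumes 1 ≤ k ≤ 32 and a left-aligned kmer whose low padding bits are zero.</p>

```rust
/// Reverse complement of a left-aligned, 2-bit-encoded kmer (1 <= k <= 32).
/// Mirrors the KmerRevcomp pseudocode step by step.
pub fn kmer_revcomp(kmer: u64, k: u32) -> u64 {
    debug_assert!((1..=32).contains(&k));
    let mut x = !kmer; // complement every base (and the zero padding)
    x = x.swap_bytes(); // reverse byte order
    // swap the two nibbles inside each byte
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    // swap the two 2-bit pairs inside each nibble
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // drop the complemented padding and re-align to the MSB
    x << (64 - 2 * k)
}

/// Canonical form: the smaller of the kmer and its reverse complement.
pub fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(kmer_revcomp(kmer, k))
}

/// Hypothetical helper for the examples: encode an ACGT string as a
/// left-aligned kmer, assuming A=0, C=1, G=2, T=3.
pub fn encode(seq: &str) -> u64 {
    let k = seq.len() as u32;
    assert!((1..=32).contains(&k), "kmer length must be in 1..=32");
    let mut value = 0u64;
    for base in seq.bytes() {
        let code = match base {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("non-ACGT base"),
        };
        value = (value << 2) | code;
    }
    value << (64 - 2 * k)
}
```

<p>Under these assumptions, <code>kmer_revcomp(encode("AAAC"), 4)</code> equals <code>encode("GTTT")</code>, and applying <code>kmer_revcomp</code> twice returns the original kmer. Because the kmer is left-aligned, the numeric minimum taken in <code>canonical</code> coincides with the lexicographic minimum under the ordering A, C, G, T.</p>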
@@ -773,6 +773,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../../implementation/evidence_elimination/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../../implementation/obilayeredmap/" class="md-nav__link">
        
@@ -851,6 +879,62 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/merge/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/rebuild_filter/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -1109,7 +1193,7 @@
 <h2 id="final-score">Final score</h2>
 <p>The filter computes <span class="arithmatex">\(\hat{H}(ws)\)</span> for each word size ws from 1 to ws_max and returns the <strong>minimum</strong>:</p>
 <div class="arithmatex">\[\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)\]</div>
-<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) \leq \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
+<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) &lt; \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
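<p>The rejection rule can be illustrated with a short Rust sketch. It assumes the normalised entropies Ĥ(ws) have already been computed for ws = 1 to ws_max and are passed as a slice. The function names are hypothetical, and treating an empty slice as "not rejected" is an assumption of this sketch, not documented behaviour.</p>

```rust
/// Final score: the minimum of the per-word-size normalised entropies.
/// `scores[i]` is assumed to hold the value for word size ws = i + 1.
/// Returns None when no score is supplied.
pub fn entropy_score(scores: &[f64]) -> Option<f64> {
    scores.iter().copied().reduce(f64::min)
}

/// A kmer is rejected when its final score is strictly below theta.
/// With no scores at all, this sketch keeps the kmer (an assumption).
pub fn is_rejected(scores: &[f64], theta: f64) -> bool {
    match entropy_score(scores) {
        Some(score) => score < theta,
        None => false,
    }
}
```

<p>With the default threshold 0.7, a kmer whose scores are 0.8, 0.65 and 0.9 is rejected because the minimum falls at ws=2, while a kmer whose minimum score is exactly 0.7 is kept, since the comparison is strict.</p>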
 <h2 id="interpretation-as-an-effective-number-of-classes">Interpretation as an effective number of classes</h2>
 <p><span class="arithmatex">\(H_{\text{corr}}\)</span> is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: <span class="arithmatex">\(N_{\text{eff}} = e^{H_{\text{corr}}}\)</span> is the number of equiprobable classes that would yield the same entropy.</p>
 <p>For the normalised score <span class="arithmatex">\(\hat{H}\)</span>, dividing by <span class="arithmatex">\(H_{\text{max}}\)</span> changes the logarithm base:</p>
@@ -718,6 +718,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../../implementation/evidence_elimination/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../../implementation/obilayeredmap/" class="md-nav__link">
        
@@ -796,6 +824,62 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/merge/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/rebuild_filter/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -762,6 +762,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../../implementation/evidence_elimination/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Evidence elimination (discussion)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../../implementation/obilayeredmap/" class="md-nav__link">
        
@@ -840,6 +868,62 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/merge/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Merge command
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../implementation/rebuild_filter/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Kmer filtering (rebuild/dump/unitig)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>