refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer). Refactor the FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst). Revise the SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap). Update the documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-04-29 22:52:42 +02:00
parent 4e26e3bd40
commit 27f5e88a7b
72 changed files with 10093 additions and 1626 deletions
@@ -12,7 +12,7 @@
        <link rel="prev" href="../storage/">
      
      
-        <link rel="next" href="../../architecture/sequences/invariant/">
+        <link rel="next" href="../unitig_evidence/">
      
      
        
@@ -64,7 +64,7 @@
    <div data-md-component="skip">
      
        
-        <a href="#mphf-selection-analysis-in-progress" class="md-skip">
+        <a href="#mphf-selection-two-phase-indexing-architecture" class="md-skip">
          Skip to content
        </a>
      
@@ -230,7 +230,7 @@
  
  
    <li class="md-nav__item">
-      <a href="../../theory/kmers/" class="md-nav__link">
+      <a href="../../kmers/" class="md-nav__link">
        
  
  
@@ -313,6 +313,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../../theory/minimizer/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Minimizer selection
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../../theory/indexing/" class="md-nav__link">
        
@@ -509,6 +537,34 @@
  
  
  
+    <li class="md-nav__item">
+      <a href="../obipipeline/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    obipipeline library
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+  
    <li class="md-nav__item">
      <a href="../storage/" class="md-nav__link">
        
@@ -597,6 +653,56 @@
    </label>
    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
      
+        <li class="md-nav__item">
+  <a href="#indexing-architecture" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Indexing architecture
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Indexing architecture">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#superkmer-vs-kmer-counts" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Superkmer vs kmer counts
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Phase 1 — provisional index and spectrum
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#phase-2-definitive-index" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Phase 2 — definitive index
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
+</li>
+      
        <li class="md-nav__item">
  <a href="#candidates" class="md-nav__link">
    <span class="md-ellipsis">
@@ -606,6 +712,17 @@
    </span>
  </a>
  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#mphf-choice-per-phase" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        MPHF choice per phase
+      
+    </span>
+  </a>
+  
 </li>
      
        <li class="md-nav__item">
@@ -650,6 +767,34 @@

              
            
+              
+                
+  
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../unitig_evidence/" class="md-nav__link">
+        
+  
+  
+  <span class="md-ellipsis">
+    
+  
+    Unitig evidence encoding
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
          </ul>
        </nav>
      
@@ -765,6 +910,56 @@
    </label>
    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
      
+        <li class="md-nav__item">
+  <a href="#indexing-architecture" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Indexing architecture
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Indexing architecture">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#superkmer-vs-kmer-counts" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Superkmer vs kmer counts
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Phase 1 — provisional index and spectrum
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#phase-2-definitive-index" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Phase 2 — definitive index
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
+</li>
+      
        <li class="md-nav__item">
  <a href="#candidates" class="md-nav__link">
    <span class="md-ellipsis">
@@ -774,6 +969,17 @@
    </span>
  </a>
  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#mphf-choice-per-phase" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        MPHF choice per phase
+      
+    </span>
+  </a>
+  
 </li>
      
        <li class="md-nav__item">
@@ -826,29 +1032,50 @@



-<h1 id="mphf-selection-analysis-in-progress">MPHF selection — analysis in progress</h1>
-<p>The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.</p>
+<h1 id="mphf-selection-two-phase-indexing-architecture">MPHF selection — two-phase indexing architecture</h1>
+<h2 id="indexing-architecture">Indexing architecture</h2>
+<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.</p>
+<h3 id="superkmer-vs-kmer-counts">Superkmer vs kmer counts</h3>
+<p>The <code>SKFileMeta</code> sidecar written by <code>SKFileWriter</code> records <code>instances</code> (unique superkmers) and <code>length_sum</code> (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as <code>length_sum − instances × (k − 1)</code>. This figure counts every kmer position across the stored superkmers, so it is an <strong>overestimate</strong> (an upper bound) of the number of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.</p>
+<p>Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.</p>
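+<p>As a minimal sketch of the estimate above (Python; the function name <code>estimate_kmer_count</code> and its argument checks are hypothetical and not taken from the codebase), the two sidecar fields can be turned into a capacity bound as follows:</p>

```python
def estimate_kmer_count(length_sum: int, instances: int, k: int) -> int:
    """Upper bound on the number of unique kmers in one partition.

    length_sum: total nucleotides over the dereplicated superkmers
    instances:  number of dereplicated superkmers
    k:          kmer length
    """
    if k < 1 or instances < 0:
        raise ValueError("k must be >= 1 and instances must be >= 0")
    # Assumption: every stored superkmer holds at least one kmer,
    # i.e. it is at least k nucleotides long.
    if length_sum < instances * k:
        raise ValueError("length_sum is too small for this many superkmers")
    # Sum over superkmers of (L - k + 1) = length_sum - instances * (k - 1)
    return length_sum - instances * (k - 1)


# Three superkmers of lengths 35, 40 and 31 with k = 31
# contain 5 + 10 + 1 = 16 kmer positions in total.
print(estimate_kmer_count(length_sum=106, instances=3, k=31))
```

+<p>The result bounds the key count from above; any kmers shared between superkmers make the true number of unique kmers smaller.</p>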
+<h3 id="phase-1-provisional-index-and-spectrum">Phase 1 — provisional index and spectrum</h3>
+<ol>
+<li>Enumerate all kmers from the dereplicated superkmers of the partition.</li>
+<li>Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).</li>
+<li>Accumulate counts: for each kmer in each superkmer, <code>count[MPHF(kmer)] += sk.count()</code>.</li>
+<li>Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).</li>
+<li>Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.</li>
+<li>Discard the provisional MPHF.</li>
+</ol>
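+<p>The steps above can be sketched in a few lines of Python. This is an illustration only: a plain dictionary stands in for the provisional MPHF and its count array, kmers are canonicalised as the lexicographic minimum of the kmer and its reverse complement (an assumption, not confirmed by this page), and the names <code>phase1</code>, <code>canonical</code> and <code>min_count</code> are hypothetical.</p>

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Assumed convention: smaller of the kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def phase1(superkmers, k: int, min_count: int = 2):
    """superkmers: iterable of (sequence, count) pairs for one partition.

    Returns (spectrum, kept), where spectrum maps an occurrence count to
    the number of kmers having that count, and kept is the set of kmers
    that survive the count filter.
    """
    counts = Counter()  # stand-in for count[MPHF(kmer)]
    for seq, sk_count in superkmers:
        # Steps 1 and 3: enumerate kmers and accumulate superkmer counts.
        for i in range(len(seq) - k + 1):
            counts[canonical(seq[i:i + k])] += sk_count
    # Step 4: frequency spectrum (occurrences -> number of kmers).
    spectrum = Counter(counts.values())
    # Step 5: count filter; len(kept) is the exact key count for phase 2.
    kept = {kmer for kmer, c in counts.items() if c >= min_count}
    return spectrum, kept


spectrum, kept = phase1([("AAAC", 3), ("AACC", 1)], k=3)
print(dict(spectrum), sorted(kept))
```

+<p>In this toy input the kmer <code>AAC</code> is shared by both superkmers, so three distinct kmers come out of four kmer positions, and the singleton <code>ACC</code> is dropped by the filter.</p>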
+<h3 id="phase-2-definitive-index">Phase 2 — definitive index</h3>
+<p>Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).</p>
+<hr />
 <h2 id="candidates">Candidates</h2>
 <p><strong>boomphf</strong> (BBHash algorithm, maintained by 10X Genomics):</p>
 <ul>
 <li>~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)</li>
 <li>Parallel construction; well-tested with DNA kmer data at scale</li>
-<li>Drawback: largest space footprint of the three</li>
+<li>Drawback: largest space footprint of the three. Its main differentiator, streaming construction that needs no exact key count, is irrelevant here because the exact count is available at phase 2</li>
 </ul>
 <p><strong>ptr_hash</strong> (PtrHash algorithm, Groot Koerkamp, SEA 2025):</p>
 <ul>
 <li>~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)</li>
-<li>Theoretical foundation solid; paper and Rust crate from the same author</li>
+<li>Requires exact key count at construction — available at phase 2</li>
 <li>Drawback: published February 2025 — very young, no production track record</li>
 </ul>
 <p><strong>FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
 <ul>
 <li>~2.1 bits/key — most compact of the three; good query speed; parallelisable construction</li>
 <li>More established than ptr_hash; actively maintained</li>
-<li>Currently preferred candidate</li>
+<li>Works well with overestimated capacity → natural fit for phase 1</li>
 </ul>
+<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
+<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.</p>
+<p><strong>Phase 2</strong> (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.</p>
+<p>boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.</p>
+<hr />
 <h2 id="space-at-scale">Space at scale</h2>
-<p>For 1 024 partitions × 100 M kmers/partition:</p>
+<p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
 <table>
 <thead>
 <tr>
@@ -875,15 +1102,15 @@
 </tr>
 </tbody>
 </table>
-<p>In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.</p>
+<p>For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.</p>
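+<p>The per-partition sizes can be checked with a short calculation. The sketch below (Python) uses only the bits/key figures quoted in the Candidates section, assumes decimal megabytes, and ignores any fixed per-structure overhead; the helper name <code>mphf_megabytes</code> is hypothetical.</p>

```python
# Approximate bits per key, as quoted for each candidate above.
BITS_PER_KEY = {"boomphf": 3.7, "ptr_hash": 2.4, "FMPHGO": 2.1}


def mphf_megabytes(n_keys: int, bits_per_key: float) -> float:
    """Approximate MPHF size in MB (1 MB = 10**6 bytes), payload bits only."""
    return n_keys * bits_per_key / 8 / 1e6


for name, bits in BITS_PER_KEY.items():
    small = mphf_megabytes(3_000_000, bits)
    large = mphf_megabytes(30_000_000, bits)
    print(f"{name}: {small:.1f} to {large:.1f} MB per partition")
```

+<p>With 2.1 bits/key, 3 M and 30 M keys give roughly 0.8 MB and 7.9 MB, which matches the 1–8 MB range quoted above.</p>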
 <h2 id="on-disk-and-mmap-considerations">On-disk and mmap considerations</h2>
 <p>All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the <code>ph</code> crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.</p>
 <p>No established Rust crate provides a natively on-disk MPHF. <strong>SSHash</strong> (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.</p>
 <h2 id="open-questions">Open questions</h2>
 <ul>
-<li>Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.</li>
-<li>Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.</li>
-<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.</li>
+<li>Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.</li>
+<li>Revisit ptr_hash for phase 2 once the crate has a broader production track record.</li>
+<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.</li>
 <li>Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.</li>
 </ul>