refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer). Refactor the FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst). Revise the SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap). Update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
Eric Coissac
2026-04-29 22:52:42 +02:00
parent 4e26e3bd40
commit 27f5e88a7b
72 changed files with 10093 additions and 1626 deletions
+240 -13
@@ -12,7 +12,7 @@
<link rel="prev" href="../storage/">
<link rel="next" href="../../architecture/sequences/invariant/">
<link rel="next" href="../unitig_evidence/">
@@ -64,7 +64,7 @@
<div data-md-component="skip">
<a href="#mphf-selection-analysis-in-progress" class="md-skip">
<a href="#mphf-selection-two-phase-indexing-architecture" class="md-skip">
Skip to content
</a>
@@ -230,7 +230,7 @@
<li class="md-nav__item">
<a href="../../theory/kmers/" class="md-nav__link">
<a href="../../kmers/" class="md-nav__link">
@@ -313,6 +313,34 @@
<li class="md-nav__item">
<a href="../../theory/minimizer/" class="md-nav__link">
<span class="md-ellipsis">
Minimizer selection
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../theory/indexing/" class="md-nav__link">
@@ -509,6 +537,34 @@
<li class="md-nav__item">
<a href="../obipipeline/" class="md-nav__link">
<span class="md-ellipsis">
obipipeline library
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../storage/" class="md-nav__link">
@@ -597,6 +653,56 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#indexing-architecture" class="md-nav__link">
<span class="md-ellipsis">
Indexing architecture
</span>
</a>
<nav class="md-nav" aria-label="Indexing architecture">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#superkmer-vs-kmer-counts" class="md-nav__link">
<span class="md-ellipsis">
Superkmer vs kmer counts
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
<span class="md-ellipsis">
Phase 1 — provisional index and spectrum
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#phase-2-definitive-index" class="md-nav__link">
<span class="md-ellipsis">
Phase 2 — definitive index
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#candidates" class="md-nav__link">
<span class="md-ellipsis">
@@ -606,6 +712,17 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#mphf-choice-per-phase" class="md-nav__link">
<span class="md-ellipsis">
MPHF choice per phase
</span>
</a>
</li>
<li class="md-nav__item">
@@ -650,6 +767,34 @@
<li class="md-nav__item">
<a href="../unitig_evidence/" class="md-nav__link">
<span class="md-ellipsis">
Unitig evidence encoding
</span>
</a>
</li>
</ul>
</nav>
@@ -765,6 +910,56 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#indexing-architecture" class="md-nav__link">
<span class="md-ellipsis">
Indexing architecture
</span>
</a>
<nav class="md-nav" aria-label="Indexing architecture">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#superkmer-vs-kmer-counts" class="md-nav__link">
<span class="md-ellipsis">
Superkmer vs kmer counts
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
<span class="md-ellipsis">
Phase 1 — provisional index and spectrum
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#phase-2-definitive-index" class="md-nav__link">
<span class="md-ellipsis">
Phase 2 — definitive index
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#candidates" class="md-nav__link">
<span class="md-ellipsis">
@@ -774,6 +969,17 @@
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#mphf-choice-per-phase" class="md-nav__link">
<span class="md-ellipsis">
MPHF choice per phase
</span>
</a>
</li>
<li class="md-nav__item">
@@ -826,29 +1032,50 @@
<h1 id="mphf-selection-analysis-in-progress">MPHF selection — analysis in progress</h1>
<p>The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.</p>
<h1 id="mphf-selection-two-phase-indexing-architecture">MPHF selection — two-phase indexing architecture</h1>
<h2 id="indexing-architecture">Indexing architecture</h2>
<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.</p>
<h3 id="superkmer-vs-kmer-counts">Superkmer vs kmer counts</h3>
<p>The <code>SKFileMeta</code> sidecar written by <code>SKFileWriter</code> records <code>instances</code> (unique superkmers) and <code>length_sum</code> (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as <code>length_sum − instances × (k − 1)</code>. This is an <strong>overestimate</strong> of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.</p>
<p>Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.</p>
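<p>The estimate above can be written as a small helper. This is a minimal sketch: the function name <code>kmer_upper_bound</code> and its signature are hypothetical and not part of the project's API; only the formula comes from the text.</p>

```rust
/// Upper bound on the number of kmers in a partition, computed from the
/// sidecar totals as `length_sum - instances * (k - 1)`.
/// (Hypothetical helper; name and signature are assumptions.)
///
/// Returns `None` if `k` is zero, if the multiplication overflows, or if
/// `length_sum` is smaller than `instances * (k - 1)`.
pub fn kmer_upper_bound(length_sum: u64, instances: u64, k: u64) -> Option<u64> {
    if k == 0 {
        return None;
    }
    // Each superkmer of length L holds L - k + 1 kmers, so every superkmer
    // "loses" k - 1 positions relative to its nucleotide count.
    let overhead = instances.checked_mul(k - 1)?;
    length_sum.checked_sub(overhead)
}
```

<p>For example, three superkmers of lengths 31, 33 and 36 with k = 31 give <code>length_sum = 100</code> and <code>instances = 3</code>, hence 100 − 3 × 30 = 10 kmers (1 + 3 + 6).</p>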
<h3 id="phase-1-provisional-index-and-spectrum">Phase 1 — provisional index and spectrum</h3>
<ol>
<li>Enumerate all kmers from the dereplicated superkmers of the partition.</li>
<li>Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).</li>
<li>Accumulate counts: for each kmer in each superkmer, <code>count[MPHF(kmer)] += sk.count()</code>.</li>
<li>Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).</li>
<li>Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.</li>
<li>Discard the provisional MPHF.</li>
</ol>
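<p>The steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: a <code>HashMap</code> stands in for the provisional MPHF and its count array, <code>CountedSuperKmer</code> is a hypothetical type, and kmers are assumed to be already in canonical orientation.</p>

```rust
use std::collections::{BTreeMap, HashMap};

/// A dereplicated superkmer with its occurrence count.
/// (Hypothetical stand-in type, not the project's SuperKmer.)
pub struct CountedSuperKmer {
    pub seq: Vec<u8>,
    pub count: u64,
}

/// Steps 1-3: enumerate kmers and accumulate counts.
/// The HashMap plays the role of `count[MPHF(kmer)]`.
pub fn accumulate_counts(sks: &[CountedSuperKmer], k: usize) -> HashMap<Vec<u8>, u64> {
    assert!(k > 0, "k must be non-zero");
    let mut counts = HashMap::new();
    for sk in sks {
        // `windows` yields nothing when the sequence is shorter than k.
        for kmer in sk.seq.windows(k) {
            *counts.entry(kmer.to_vec()).or_insert(0) += sk.count;
        }
    }
    counts
}

/// Step 4: frequency spectrum (occurrences -> number of distinct kmers).
pub fn spectrum(counts: &HashMap<Vec<u8>, u64>) -> BTreeMap<u64, u64> {
    let mut hist = BTreeMap::new();
    for &c in counts.values() {
        *hist.entry(c).or_insert(0) += 1;
    }
    hist
}

/// Step 5: keep kmers seen at least `min_count` times, in sorted order.
/// The length of the result is the exact key count needed for phase 2.
pub fn filter_kmers(counts: HashMap<Vec<u8>, u64>, min_count: u64) -> Vec<Vec<u8>> {
    let mut kept: Vec<Vec<u8>> = counts
        .into_iter()
        .filter(|(_, c)| *c >= min_count)
        .map(|(kmer, _)| kmer)
        .collect();
    kept.sort();
    kept
}
```

<p>Step 6 (discarding the provisional structure) corresponds here to <code>filter_kmers</code> consuming the map. With <code>min_count = 2</code> the filter discards singletons.</p>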
<h3 id="phase-2-definitive-index">Phase 2 — definitive index</h3>
<p>Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).</p>
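<p>To make the role of the definitive index concrete, the sketch below uses a sorted key list with binary search as a stand-in. This is an assumption for illustration only: it maps each of the n filtered kmers to a distinct slot in 0..n, as an MPHF does, but unlike a real MPHF it stores the keys and can report absent kmers. <code>SortedIndex</code> and the packing of kmers into <code>u64</code> are hypothetical.</p>

```rust
/// Stand-in for the persistent per-partition index.
/// (Hypothetical type; a real MPHF does not store its keys.)
pub struct SortedIndex {
    keys: Vec<u64>,
}

impl SortedIndex {
    /// Build from the filtered kmer set; the exact key count is known here.
    pub fn build(mut keys: Vec<u64>) -> Self {
        keys.sort_unstable();
        keys.dedup();
        Self { keys }
    }

    /// Number of indexed kmers.
    pub fn len(&self) -> usize {
        self.keys.len()
    }

    /// Slot of `kmer` in 0..len(), or None if it was filtered out or never seen.
    pub fn slot(&self, kmer: u64) -> Option<usize> {
        self.keys.binary_search(&kmer).ok()
    }
}
```

<p>Per-kmer payloads such as counts would then live in a plain array of length <code>len()</code> addressed by <code>slot</code>.</p>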
<hr />
<h2 id="candidates">Candidates</h2>
<p><strong>boomphf</strong> (BBHash algorithm, maintained by 10X Genomics):</p>
<ul>
<li>~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)</li>
<li>Parallel construction; well-tested with DNA kmer data at scale</li>
<li>Drawback: largest space footprint of the three</li>
<li>Drawback: largest space footprint; streaming construction (no exact count needed) was its main differentiator — irrelevant here since exact count is available at phase 2</li>
</ul>
<p><strong>ptr_hash</strong> (PtrHash algorithm, Groot Koerkamp, SEA 2025):</p>
<ul>
<li>~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)</li>
<li>Theoretical foundation solid; paper and Rust crate from the same author</li>
<li>Requires exact key count at construction — available at phase 2</li>
<li>Drawback: published February 2025 — very young, no production track record</li>
</ul>
<p><strong>FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
<ul>
<li>~2.1 bits/key — most compact of the three; good query speed; parallelisable construction</li>
<li>More established than ptr_hash; actively maintained</li>
<li>Currently preferred candidate</li>
<li>Works well with overestimated capacity → natural fit for phase 1</li>
</ul>
<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.</p>
<p><strong>Phase 2</strong> (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.</p>
<p>boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.</p>
<hr />
<h2 id="space-at-scale">Space at scale</h2>
<p>For 1 024 partitions × 100 M kmers/partition:</p>
<p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
<table>
<thead>
<tr>
@@ -875,15 +1102,15 @@
</tr>
</tbody>
</table>
<p>In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.</p>
<p>For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.</p>
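<p>The size figures follow from multiplying the key count by the bits/key of each candidate. A minimal sketch of that arithmetic, where <code>mphf_bytes</code> is a hypothetical helper and the bits/key values are the approximate figures quoted in the Candidates section:</p>

```rust
/// Approximate MPHF size in bytes for `n_keys` keys at `bits_per_key`.
/// (Hypothetical helper; ignores any fixed per-structure overhead.)
pub fn mphf_bytes(n_keys: u64, bits_per_key: f64) -> f64 {
    n_keys as f64 * bits_per_key / 8.0
}

/// Approximate bits/key quoted for each candidate.
pub const BOOMPHF_BITS: f64 = 3.7;
pub const PTR_HASH_BITS: f64 = 2.4;
pub const FMPHGO_BITS: f64 = 2.1;
```

<p>For instance, 30 M keys at 2.1 bits/key come to about 7.9 MB, and 3 M keys to about 0.8 MB.</p>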
<h2 id="on-disk-and-mmap-considerations">On-disk and mmap considerations</h2>
<p>All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the <code>ph</code> crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently; strict zero-copy is a refinement, not a blocker.</p>
<p>No established Rust crate provides a natively on-disk MPHF. <strong>SSHash</strong> (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.</p>
<h2 id="open-questions">Open questions</h2>
<ul>
<li>Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.</li>
<li>Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.</li>
<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.</li>
<li>Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.</li>
<li>Revisit ptr_hash for phase 2 once the crate has a broader production track record.</li>
<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.</li>
<li>Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.</li>
</ul>