refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline (builder, scatterer) to yield RoutableSuperKmers throughout. Refactor the FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst). Revise the SuperKmer header to store n_kmers instead of seql, avoiding the 0 = 256 wrapping convention. Update the documentation to cover minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
File diff suppressed because it is too large
@@ -230,7 +230,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -313,6 +313,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/indexing/" class="md-nav__link">
|
||||
|
||||
@@ -611,6 +639,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../storage/" class="md-nav__link">
|
||||
|
||||
@@ -661,6 +717,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -12,7 +12,7 @@
|
||||
<link rel="prev" href="../storage/">
|
||||
|
||||
|
||||
<link rel="next" href="../../architecture/sequences/invariant/">
|
||||
<link rel="next" href="../unitig_evidence/">
|
||||
|
||||
|
||||
|
||||
@@ -64,7 +64,7 @@
|
||||
<div data-md-component="skip">
|
||||
|
||||
|
||||
<a href="#mphf-selection-analysis-in-progress" class="md-skip">
|
||||
<a href="#mphf-selection-two-phase-indexing-architecture" class="md-skip">
|
||||
Skip to content
|
||||
</a>
|
||||
|
||||
@@ -230,7 +230,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -313,6 +313,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/indexing/" class="md-nav__link">
|
||||
|
||||
@@ -509,6 +537,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../storage/" class="md-nav__link">
|
||||
|
||||
@@ -597,6 +653,56 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexing-architecture" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Indexing architecture
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Indexing architecture">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#superkmer-vs-kmer-counts" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Superkmer vs kmer counts
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Phase 1 — provisional index and spectrum
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#phase-2-definitive-index" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Phase 2 — definitive index
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#candidates" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
@@ -606,6 +712,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphf-choice-per-phase" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
MPHF choice per phase
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -650,6 +767,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -765,6 +910,56 @@
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#indexing-architecture" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Indexing architecture
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Indexing architecture">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#superkmer-vs-kmer-counts" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Superkmer vs kmer counts
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#phase-1-provisional-index-and-spectrum" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Phase 1 — provisional index and spectrum
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#phase-2-definitive-index" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Phase 2 — definitive index
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#candidates" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
@@ -774,6 +969,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#mphf-choice-per-phase" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
MPHF choice per phase
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -826,29 +1032,50 @@
|
||||
|
||||
|
||||
|
||||
<h1 id="mphf-selection-analysis-in-progress">MPHF selection — analysis in progress</h1>
|
||||
<p>The choice of Minimal Perfect Hash Function for phase 6 is not yet settled. Three candidates were evaluated.</p>
|
||||
<h1 id="mphf-selection-two-phase-indexing-architecture">MPHF selection — two-phase indexing architecture</h1>
|
||||
<h2 id="indexing-architecture">Indexing architecture</h2>
|
||||
<p>Kmer indexing per partition proceeds in two phases. The separation is necessary because the exact number of unique kmers in a partition is not known until after counting and filtering.</p>
|
||||
<h3 id="superkmer-vs-kmer-counts">Superkmer vs kmer counts</h3>
|
||||
<p>The <code>SKFileMeta</code> sidecar written by <code>SKFileWriter</code> records <code>instances</code> (unique superkmers) and <code>length_sum</code> (total nucleotides). A superkmer of length L contains L − k + 1 kmers, so the kmer count per partition can be estimated as <code>length_sum − instances × (k − 1)</code>. This is an <strong>overestimate</strong> of unique kmers: two distinct superkmers (different flanking contexts, same minimizer) can share kmers. The exact count of unique kmers is only known after enumerating and deduplicating them.</p>
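<p>The estimate above is a one-line computation. The following is a minimal sketch, assuming the sidecar totals are available as plain integers; the function name <code>estimate_kmers</code> is hypothetical and not part of the crate.</p>

```rust
/// Upper bound on the number of kmers in a partition, derived from the
/// sidecar totals. A superkmer of length L holds L - k + 1 kmers, so summing
/// over all superkmers gives length_sum - instances * (k - 1).
/// Assumes k >= 1. `estimate_kmers` is a hypothetical helper name.
fn estimate_kmers(length_sum: u64, instances: u64, k: u64) -> u64 {
    // saturating_sub guards against inconsistent metadata instead of panicking.
    length_sum.saturating_sub(instances * (k - 1))
}
```

<p>For example, three superkmers of 40 nt each with k = 31 give 120 − 3 × 30 = 30 kmers, which is an upper bound on the unique kmers of the partition.</p>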
|
||||
<p>Note: two superkmers sharing a kmer necessarily share the same minimizer and therefore always land in the same partition — no kmer can appear in two different partitions.</p>
|
||||
<h3 id="phase-1-provisional-index-and-spectrum">Phase 1 — provisional index and spectrum</h3>
|
||||
<ol>
|
||||
<li>Enumerate all kmers from the dereplicated superkmers of the partition.</li>
|
||||
<li>Build a provisional MPHF over this key set; capacity is pre-allocated from the sidecar estimate (slight overestimate, harmless).</li>
|
||||
<li>Accumulate counts: for each kmer in each superkmer, <code>count[MPHF(kmer)] += sk.count()</code>.</li>
|
||||
<li>Compute the kmer frequency spectrum (histogram: occurrences → number of kmers).</li>
|
||||
<li>Apply count filter (e.g. discard singletons). After filtering, the exact number of surviving kmers is known.</li>
|
||||
<li>Discard the provisional MPHF.</li>
|
||||
</ol>
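<p>The phase 1 steps can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: a <code>HashMap</code> stands in for the provisional MPHF plus its count array, superkmers are plain ASCII strings, kmers are not canonicalised, and the names <code>Sk</code>, <code>count_kmers</code>, <code>spectrum</code> and <code>surviving</code> are hypothetical.</p>

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for a dereplicated superkmer: ASCII sequence + count.
struct Sk<'a> {
    seq: &'a str,
    count: u32,
}

/// Steps 1-3: enumerate the kmers of every superkmer and accumulate counts.
/// The HashMap replaces `count[MPHF(kmer)]` for the sake of a short example.
fn count_kmers<'a>(sks: &[Sk<'a>], k: usize) -> HashMap<&'a str, u32> {
    let mut counts = HashMap::new();
    for sk in sks {
        let seq: &'a str = sk.seq;
        if k == 0 || seq.len() < k {
            continue; // too short to contain a kmer
        }
        for i in 0..=seq.len() - k {
            *counts.entry(&seq[i..i + k]).or_insert(0) += sk.count;
        }
    }
    counts
}

/// Step 4: frequency spectrum (occurrences -> number of distinct kmers).
fn spectrum(counts: &HashMap<&str, u32>) -> BTreeMap<u32, usize> {
    let mut hist = BTreeMap::new();
    for &c in counts.values() {
        *hist.entry(c).or_insert(0) += 1;
    }
    hist
}

/// Step 5: exact number of kmers that survive a minimum-count filter.
fn surviving(counts: &HashMap<&str, u32>, min_count: u32) -> usize {
    counts.values().filter(|&&c| c >= min_count).count()
}
```

<p>The value returned by <code>surviving</code> is the exact key count that phase 2 needs; the temporary map is then dropped, mirroring step 6.</p>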
|
||||
<h3 id="phase-2-definitive-index">Phase 2 — definitive index</h3>
|
||||
<p>Build a new MPHF over the filtered kmer set only, with the exact key count available. This is the persistent per-partition index used for all downstream operations (queries, set operations).</p>
|
||||
<hr />
|
||||
<h2 id="candidates">Candidates</h2>
|
||||
<p><strong>boomphf</strong> (BBHash algorithm, maintained by 10X Genomics):</p>
|
||||
<ul>
|
||||
<li>~3.7 bits/key; mature crate, used in production bioinformatics (Pufferfish, Piscem)</li>
|
||||
<li>Parallel construction; well-tested with DNA kmer data at scale</li>
|
||||
<li>Drawback: largest space footprint of the three</li>
|
||||
<li>Drawback: largest space footprint; streaming construction (no exact count needed) was its main differentiator — irrelevant here since exact count is available at phase 2</li>
|
||||
</ul>
|
||||
<p><strong>ptr_hash</strong> (PtrHash algorithm, Groot Koerkamp, SEA 2025):</p>
|
||||
<ul>
|
||||
<li>~2.4 bits/key; fastest queries (≥2.1× over alternatives, 8–12 ns/key for u64 in tight loops) and fastest construction (≥3.1×)</li>
|
||||
<li>Theoretical foundation solid; paper and Rust crate from the same author</li>
|
||||
<li>Requires exact key count at construction — available at phase 2</li>
|
||||
<li>Drawback: published February 2025 — very young, no production track record</li>
|
||||
</ul>
|
||||
<p><strong>FMPHGO</strong> (<code>ph</code> crate, Beling, ACM JEA 2023):</p>
|
||||
<ul>
|
||||
<li>~2.1 bits/key — most compact of the three; good query speed; parallelisable construction</li>
|
||||
<li>More established than ptr_hash; actively maintained</li>
|
||||
<li>Currently preferred candidate</li>
|
||||
<li>Works well with overestimated capacity → natural fit for phase 1</li>
|
||||
</ul>
|
||||
<h2 id="mphf-choice-per-phase">MPHF choice per phase</h2>
|
||||
<p><strong>Phase 1</strong> (provisional, discarded after spectrum computation): FMPHGO. Tolerates overestimated capacity, compact, no need to optimise for query speed on a temporary structure.</p>
|
||||
<p><strong>Phase 2</strong> (persistent, queried repeatedly): open between FMPHGO and ptr_hash. Exact key count is available, so both operate optimally. ptr_hash's query speed advantage (2.1–3.3×) is meaningful for the persistent index but carries the risk of a very young crate. FMPHGO is the conservative default; ptr_hash is worth revisiting once it has broader production use.</p>
|
||||
<p>boomphf is effectively eliminated: its space overhead is the largest and its streaming-construction advantage does not apply here.</p>
|
||||
<hr />
|
||||
<h2 id="space-at-scale">Space at scale</h2>
|
||||
<p>For 1 024 partitions × 100 M kmers/partition:</p>
|
||||
<p>For 1 024 partitions × 100 M kmers/partition (phase 2 index, after filtering):</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
@@ -875,15 +1102,15 @@
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>In practice, partition sizes depend on the dataset. For a human genome at 30× coverage with p=10 (1 024 partitions), realistic partition sizes are 3–30 M kmers → 1–8 MB per MPHF, well within RAM.</p>
|
||||
<p>For a human genome at 30× coverage with 1 024 partitions, realistic partition sizes are 3–30 M unique kmers → 1–8 MB per phase-2 MPHF, well within RAM.</p>
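<p>The per-partition figures follow from keys × bits per key. A minimal sketch of that arithmetic, assuming the bits/key values quoted for the candidates; the helper <code>mphf_bytes</code> is hypothetical and takes the density in tenths of a bit so that everything stays in integers.</p>

```rust
/// Approximate MPHF size in bytes for `n_keys` keys.
/// `decibits_per_key` is the density in tenths of a bit (21 means 2.1 bits/key),
/// which keeps the computation in exact integer arithmetic.
/// `mphf_bytes` is a hypothetical helper, not part of any MPHF crate.
fn mphf_bytes(n_keys: u64, decibits_per_key: u64) -> u64 {
    let total_decibits = n_keys * decibits_per_key;
    // Divide by 10 (tenths of a bit) and by 8 (bits per byte), rounding up.
    (total_decibits + 79) / 80
}
```

<p>At 2.1 bits/key (the FMPHGO figure), 3 M keys come to about 0.8 MB and 30 M keys to about 7.9 MB, consistent with the range given above.</p>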
|
||||
<h2 id="on-disk-and-mmap-considerations">On-disk and mmap considerations</h2>
|
||||
<p>All three are in-memory structures. Their internal representation is flat bit arrays (no heap pointers), making them serialisable as contiguous byte blobs and mmappable per partition. True zero-copy access would require rkyv integration; the <code>ph</code> crate currently uses serde, so loading involves a copy. Given per-partition MPHF sizes of 1–8 MB, the OS page cache handles this transparently — strict zero-copy is a refinement, not a blocker.</p>
|
||||
<p>No established Rust crate provides a natively on-disk MPHF. <strong>SSHash</strong> (Sparse and Skew Hash) is a complete kmer dictionary designed for disk access and is order-preserving (overlapping kmers receive consecutive indices → cache-friendly count access), but it is C++-only and covers more than just the MPHF layer.</p>
|
||||
<h2 id="open-questions">Open questions</h2>
|
||||
<ul>
|
||||
<li>Confirm actual partition sizes on representative metagenomic datasets before fixing the choice.</li>
|
||||
<li>Evaluate whether ptr_hash's query speed advantage (2.1–3.3×) justifies adopting a crate that is less than a year old.</li>
|
||||
<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary.</li>
|
||||
<li>Confirm actual partition sizes and overestimation factor on representative metagenomic datasets.</li>
|
||||
<li>Revisit ptr_hash for phase 2 once the crate has broader production track record.</li>
|
||||
<li>Assess rkyv integration cost for FMPHGO if true zero-copy mmap becomes necessary for the persistent index.</li>
|
||||
<li>Keep SSHash in mind if the indexing architecture is reconsidered at a higher level.</li>
|
||||
</ul>
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -12,7 +12,7 @@
|
||||
<link rel="prev" href="../chunkreader/">
|
||||
|
||||
|
||||
<link rel="next" href="../storage/">
|
||||
<link rel="next" href="../obipipeline/">
|
||||
|
||||
|
||||
|
||||
@@ -230,7 +230,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -313,6 +313,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/indexing/" class="md-nav__link">
|
||||
|
||||
@@ -633,6 +661,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../storage/" class="md-nav__link">
|
||||
|
||||
@@ -683,6 +739,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
|
||||
|
||||
|
||||
<link rel="prev" href="../pipeline/">
|
||||
<link rel="prev" href="../obipipeline/">
|
||||
|
||||
|
||||
<link rel="next" href="../mphf/">
|
||||
@@ -230,7 +230,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -313,6 +313,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/indexing/" class="md-nav__link">
|
||||
|
||||
@@ -507,6 +535,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -639,6 +695,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -230,7 +230,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -313,6 +313,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../theory/indexing/" class="md-nav__link">
|
||||
|
||||
@@ -488,6 +516,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#minimizer-sliding-window" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Minimizer sliding window
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -600,6 +639,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../storage/" class="md-nav__link">
|
||||
|
||||
@@ -650,6 +717,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -796,6 +891,17 @@
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#minimizer-sliding-window" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Minimizer sliding window
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -828,7 +934,7 @@
|
||||
|
||||
<h1 id="superkmer-implementation">SuperKmer — implementation</h1>
|
||||
<h2 id="memory-layout">Memory layout</h2>
|
||||
<p>A super-kmer is stored as a <strong>32-bit header</strong> followed by a <strong>byte-aligned nucleotide sequence</strong> (2 bits/base, nucleotide 0 at the MSB of the first byte, max 256 nt):</p>
|
||||
<p>A super-kmer is stored as a <strong>32-bit header</strong> followed by a <strong>byte-aligned nucleotide sequence</strong> (2 bits/base, nucleotide 0 at the MSB of the first byte):</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
@@ -844,21 +950,44 @@
|
||||
<td>Occurrence count (≤ 16 M)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>SEQL</td>
|
||||
<td>NKMERS</td>
|
||||
<td>8</td>
|
||||
<td>Sequence length in nucleotides (1–256)</td>
|
||||
<td>Number of kmers (= seq_length − k + 1, range 1–255)</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>Bit layout (MSB to LSB): <code>[31:8] COUNT [7:0] SEQL</code></p>
|
||||
<p>SEQL is stored as a raw <code>u8</code>: values 1–255 represent lengths 1–255; <strong>0 represents 256</strong> (wrapping convention). The public accessor returns a <code>usize</code> and performs the conversion:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="mi">256</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span>
|
||||
<p>Bit layout (MSB to LSB): <code>[31:8] COUNT [7:0] NKMERS</code></p>
|
||||
<p>NKMERS is stored as a raw <code>u8</code> in <strong>kmer units</strong>, not nucleotides. The nucleotide length is recovered as <code>NKMERS + k − 1</code>. This avoids the awkward wrapping convention (<code>0 = 256</code>) that would be needed if nucleotide length were stored directly, and extends the maximum representable length by k − 1 nucleotides (30 for k = 31):</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>unit</th>
|
||||
<th>u8 covers</th>
|
||||
<th>equivalent in the other unit (k = 31)</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>nucleotides</td>
|
||||
<td>255 nt</td>
|
||||
<td>225 kmers</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><strong>kmers</strong></td>
|
||||
<td><strong>255 kmers</strong></td>
|
||||
<td><strong>285 nt</strong></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The public accessors:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">n_kmers</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">seql</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">usize</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">K</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">count</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u32</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">increment</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||||
<span class="k">fn</span><span class="w"> </span><span class="nf">set_count</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">:</span><span class="w"> </span><span class="kt">u32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="mi">0</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mh">0xFF</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">8</span><span class="p">);</span><span class="w"> </span><span class="p">}</span>
|
||||
</code></pre></div>
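<p>To see the accessors working together, here is a self-contained sketch of the header. It assumes k = 31 as a compile-time constant, and the type name <code>SkHeader</code> and its <code>new</code> constructor are hypothetical additions for the example.</p>

```rust
/// Assumed value of k, matching the k = 31 used in the text.
const K: usize = 31;

/// 32-bit header, bit layout [31:8] COUNT, [7:0] NKMERS.
/// `SkHeader` is a hypothetical name used for this sketch only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SkHeader(u32);

impl SkHeader {
    /// Hypothetical constructor: packs a kmer count and an occurrence count.
    fn new(n_kmers: u8, count: u32) -> Self {
        debug_assert!(count < (1 << 24)); // COUNT is only 24 bits wide
        SkHeader((count << 8) | n_kmers as u32)
    }
    /// Number of kmers, read from the low byte.
    fn n_kmers(&self) -> usize {
        (self.0 & 0xFF) as usize
    }
    /// Nucleotide length, recovered as NKMERS + k - 1.
    fn seql(&self) -> usize {
        self.n_kmers() + K - 1
    }
    /// Occurrence count, read from the upper 24 bits.
    fn count(&self) -> u32 {
        self.0 >> 8
    }
    /// Adds one occurrence without touching NKMERS.
    fn increment(&mut self) {
        self.0 += 1 << 8;
    }
}
```

<p>A header built with 25 kmers reports a sequence length of 55 nt, and one built with the maximum of 255 kmers reports 285 nt, matching the table above.</p>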
|
||||
<p>The SEQL field is 8 bits, capping the stored sequence at 256 nt. Given the expected length of ~40 nt, this cap is almost never reached; when it is, the super-kmer is split at 256 nt with a k−1 overlap, preserving all kmers without duplication.</p>
|
||||
<p>In practice, observed super-kmer lengths on metagenomic data (k=31) are below 55 nucleotides (≤ 25 kmers) — far from the 255-kmer cap. If a super-kmer ever exceeds 255 kmers, it is split with a k−1 nucleotide overlap, preserving all kmers without duplication (identical mechanism to partition-boundary splits).</p>
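<p>The split rule can be sketched as a range computation. This is an illustration only: the function <code>split_ranges</code> is hypothetical, and it returns half-open nucleotide ranges rather than building super-kmers.</p>

```rust
/// Split a sequence of `len` nucleotides into half-open (start, end) ranges,
/// each holding at most `max_kmers` kmers. Consecutive ranges overlap by
/// k - 1 nucleotides, so every kmer lands in exactly one range.
/// `split_ranges` is a hypothetical helper for illustration.
fn split_ranges(len: usize, k: usize, max_kmers: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    if k == 0 || len < k || max_kmers == 0 {
        return out; // no kmer to emit
    }
    let total_kmers = len - k + 1;
    let mut first = 0; // index of the first kmer of the current chunk
    while first < total_kmers {
        let n = max_kmers.min(total_kmers - first);
        // n kmers starting at `first` span n + k - 1 nucleotides.
        out.push((first, first + n + k - 1));
        first += n;
    }
    out
}
```

<p>With k = 31 and a cap of 255 kmers, a 300 nt sequence (270 kmers) splits into the ranges 0..285 and 255..300, which share 30 = k − 1 nucleotides.</p>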
|
||||
<p>The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.</p>
|
||||
<h2 id="ascii-encoding-and-decoding">ASCII encoding and decoding</h2>
|
||||
<p>Two lookup tables handle ASCII ↔ 2-bit conversion:</p>
|
||||
@@ -883,8 +1012,9 @@
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
|
||||
<p><code>REVCOMP4</code> is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.</p>
|
||||
<p><strong>Step 2 — realignment.</strong> After step 1, <code>padding = n × 8 − SEQL × 2</code> spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using <code>BitSlice<u8, Msb0>::rotate_left(padding)</code> from the <code>bitvec</code> crate, which is SIMD-accelerated. The trailing <code>padding</code> bits are then zeroed:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="n">shift</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">SEQL</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="c1">// number of padding bits</span>
|
||||
<p><strong>Step 2 — realignment.</strong> After step 1, <code>padding = n × 8 − seql × 2</code> spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using <code>BitSlice<u8, Msb0>::rotate_left(padding)</code> from the <code>bitvec</code> crate, which is SIMD-accelerated. The trailing <code>padding</code> bits are then zeroed:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="kd">let</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">n_kmers</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
|
||||
<span class="n">shift</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">seql</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="c1">// number of padding bits</span>
|
||||
<span class="n">bits</span><span class="p">.</span><span class="n">rotate_left</span><span class="p">(</span><span class="n">shift</span><span class="p">)</span>
|
||||
<span class="n">bits</span><span class="p">[</span><span class="n">len</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">shift</span><span class="o">..</span><span class="p">].</span><span class="n">fill</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
|
||||
</code></pre></div>
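<p>Because the trailing bits are zeroed afterwards, rotating left by <code>padding</code> bits and then clearing the tail gives the same result as a plain left shift with zero fill. The following is a minimal standard-library sketch of that equivalent formulation, not the <code>bitvec</code>-based code described above; the function name <code>realign</code> and its signature are hypothetical, chosen only for illustration.</p>

```rust
/// Hypothetical helper: realign a 2-bit-packed, MSB-first sequence after the
/// byte-wise reverse complement. `bytes` holds ceil(seql / 4) bytes whose
/// first `shift` bits are spurious; they are pushed out on the left and the
/// trailing `shift` bits end up zeroed.
fn realign(bytes: &mut [u8], seql: usize) {
    let n = bytes.len();
    assert_eq!(n, (seql + 3) / 4, "byte length must match seql");

    // Number of padding bits: always 0, 2, 4 or 6, so every shift below is < 8.
    let shift = n * 8 - seql * 2;
    if shift == 0 {
        return; // sequence fills the last byte exactly, nothing to flush
    }

    for i in 0..n {
        // Bits carried in from the next byte; the last byte is filled with zeros.
        let next = if i + 1 < n { bytes[i + 1] } else { 0 };
        bytes[i] = (bytes[i] << shift) | (next >> (8 - shift));
    }
}
```

<p>With <code>seql = 3</code> the single byte <code>0b11_01_10_11</code> carries two spurious leading bits; after the call it reads <code>0b01_10_11_00</code>, i.e. the three nucleotides left-aligned and the padding cleared.</p>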
|
||||
@@ -900,6 +1030,61 @@
|
||||
return seq -- palindrome: either orientation valid
|
||||
</code></pre></div>
|
||||
</div>
|
||||
<h2 id="minimizer-sliding-window">Minimizer sliding window</h2>
|
||||
<p>Super-kmers are built by <code>SuperKmerIter</code> (crate <code>obiskbuilder</code>), which maintains the current minimizer with a <strong>monotonic deque</strong> over a sliding window of W = k − m + 1 m-mer positions.</p>
|
||||
<p>Each deque entry stores:</p>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Field</th>
|
||||
<th>Type</th>
|
||||
<th>Purpose</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>position</code></td>
|
||||
<td>usize</td>
|
||||
<td>0-based start of this m-mer in the segment</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>canonical</code></td>
|
||||
<td>u64</td>
|
||||
<td>right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>hash</code></td>
|
||||
<td>u64</td>
|
||||
<td><span class="arithmatex">\(H(\text{canonical})\)</span> — ordering key for random minimizer selection</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>The hash <span class="arithmatex">\(H\)</span> is the seeded splitmix64 finalizer (see <a href="../../theory/minimizer/">Minimizer selection</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">fn</span><span class="w"> </span><span class="nf">hash_mmer</span><span class="p">(</span><span class="n">canonical</span><span class="p">:</span><span class="w"> </span><span class="kt">u64</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="kt">u64</span><span class="w"> </span><span class="p">{</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">canonical</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="mh">0x9e3779b97f4a7c15</span><span class="p">;</span><span class="w"> </span><span class="c1">// seed: eliminates fixed point at 0</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">30</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0xbf58476d1ce4e5b9</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">27</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">wrapping_mul</span><span class="p">(</span><span class="mh">0x94d049bb133111eb</span><span class="p">);</span>
|
||||
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="mi">31</span><span class="p">)</span>
|
||||
<span class="p">}</span>
|
||||
</code></pre></div>
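<p>The role of the seed can be checked directly. Every step of the unseeded finalizer (xor-shift, multiplication by an odd constant) maps 0 to 0, so without the initial XOR the m-mer whose canonical value is 0 would always hash to 0 and always win the minimizer comparison. The sketch below splits the function into a hypothetical unseeded <code>mix</code> and the seeded <code>hash_mmer</code> to show the difference; the names <code>mix</code> and <code>SEED</code> are introduced here for illustration only.</p>

```rust
/// Seed taken from the `hash_mmer` listing above.
const SEED: u64 = 0x9e3779b97f4a7c15;

/// Hypothetical unseeded splitmix64 finalizer: same steps as `hash_mmer`
/// minus the initial XOR. Each step is a bijection that maps 0 to 0.
fn mix(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Seeded version, equivalent to the listing above.
fn hash_mmer(canonical: u64) -> u64 {
    mix(canonical ^ SEED)
}
```

<p>Here <code>mix(0)</code> is 0, whereas <code>hash_mmer(0)</code> is non-zero: the only input that hashes to 0 is now <code>canonical == SEED</code>. Since the seed has its most significant bit set, a right-aligned m-mer value can only equal it when all 64 bits are in use.</p>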
|
||||
<p>On each new nucleotide, once the window is full, the deque is updated:</p>
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Algorithm — minimizer deque update</p>
|
||||
<div class="highlight"><pre><span></span><code>procedure UpdateMinimizer(deque, position, canonical, hash, k, received):
|
||||
-- pop dominated entries from the back
|
||||
while deque is not empty and deque.back.hash ≥ hash:
|
||||
deque.pop_back()
|
||||
deque.push_back({position, canonical, hash})
|
||||
|
||||
-- evict expired entries from the front
|
||||
while deque.front.position + k < received:
|
||||
deque.pop_front()
|
||||
</code></pre></div>
|
||||
</div>
|
||||
<p>The deque is kept in strictly increasing hash order from front to back, so its front is always the current minimizer. Each entry is pushed once and popped at most once (from either end), so the update costs O(1) amortized per nucleotide.</p>
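<p>The update procedure can be sketched with <code>std::collections::VecDeque</code>. This is an illustration under stated assumptions rather than the code of <code>SuperKmerIter</code>: the <code>Entry</code> struct and the <code>update_minimizer</code> signature are hypothetical, and <code>received</code> is taken to be the number of nucleotides consumed so far, so that the m-mer starting at <code>position</code> has just been completed when <code>received = position + m</code>.</p>

```rust
use std::collections::VecDeque;

/// Hypothetical deque entry mirroring the table above.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Entry {
    position: usize, // 0-based start of the m-mer in the segment
    canonical: u64,  // right-aligned canonical m-mer value
    hash: u64,       // ordering key H(canonical)
}

/// Insert the newest m-mer and restore the deque invariants.
fn update_minimizer(deque: &mut VecDeque<Entry>, new: Entry, k: usize, received: usize) {
    // Pop dominated entries from the back: an older entry whose hash is
    // greater than or equal to the new one can never become the minimizer.
    while deque.back().map_or(false, |b| b.hash >= new.hash) {
        deque.pop_back();
    }
    deque.push_back(new);

    // Evict entries that start before the current k-mer, i.e. whose
    // position is smaller than received - k.
    while deque.front().map_or(false, |f| f.position + k < received) {
        deque.pop_front();
    }
}
```

<p>The emptiness checks (<code>map_or(false, ...)</code>) matter on the first call and whenever the new hash dominates every stored entry. The front eviction loop cannot empty the deque, because the entry just pushed always lies inside the current window.</p>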
|
||||
<p>A super-kmer boundary is emitted when the minimizer changes: <code>deque.front.hash ≠ prev_hash</code>. The <code>canonical</code> field of the front entry is <strong>not</strong> used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key <span class="arithmatex">\(H(\text{canonical})\)</span> can be recomputed independently at routing time from the stored <code>minimizer_pos</code>, without inheriting the minimum-order-statistic bias (see <a href="../../theory/minimizer/#partition-key-independence">Minimizer selection — partition key independence</a>).</p>
|
||||
<h2 id="kmer-extraction">Kmer extraction</h2>
|
||||
<p>A k-mer is extracted from a super-kmer with <code>SuperKmer::kmer(i, k)</code>, which returns a <code>Kmer</code> — a left-aligned <code>u64</code> newtype (see <a href="../kmer/">Kmer implementation</a>):</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="k">pub</span><span class="w"> </span><span class="k">fn</span><span class="w"> </span><span class="nf">kmer</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">:</span><span class="w"> </span><span class="kt">usize</span><span class="p">)</span><span class="w"> </span><span class="p">-></span><span class="w"> </span><span class="nb">Result</span><span class="o"><</span><span class="n">Kmer</span><span class="p">,</span><span class="w"> </span><span class="n">KmerError</span><span class="o">></span>
|
||||
@@ -909,8 +1094,9 @@
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Algorithm — Super-kmer reverse complement</p>
|
||||
<div class="highlight"><pre><span></span><code>procedure SuperKmerRevcomp(seq, SEQL):
|
||||
n ← ⌈SEQL / 4⌉ -- number of bytes
|
||||
shift ← n × 8 − SEQL × 2 -- padding bits to flush
|
||||
seql ← NKMERS + k − 1 -- nucleotide length
|
||||
n ← ⌈seql / 4⌉ -- number of bytes
|
||||
shift ← n × 8 − seql × 2 -- padding bits to flush
|
||||
|
||||
-- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
|
||||
lo ← 0 ; hi ← n − 1
|
||||
|
||||