refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
@@ -9,7 +9,7 @@
|
||||
|
||||
|
||||
|
||||
<link rel="prev" href="../kmers/">
|
||||
<link rel="prev" href="../../kmers/">
|
||||
|
||||
|
||||
<link rel="next" href="../entropy/">
|
||||
@@ -232,7 +232,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -384,6 +384,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../indexing/" class="md-nav__link">
|
||||
|
||||
@@ -578,6 +606,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/storage/" class="md-nav__link">
|
||||
|
||||
@@ -628,6 +684,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -12,7 +12,7 @@
|
||||
<link rel="prev" href="../encoding/">
|
||||
|
||||
|
||||
<link rel="next" href="../indexing/">
|
||||
<link rel="next" href="../minimizer/">
|
||||
|
||||
|
||||
|
||||
@@ -232,7 +232,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -439,6 +439,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../indexing/" class="md-nav__link">
|
||||
|
||||
@@ -633,6 +661,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/storage/" class="md-nav__link">
|
||||
|
||||
@@ -683,6 +739,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -9,7 +9,7 @@
|
||||
|
||||
|
||||
|
||||
<link rel="prev" href="../entropy/">
|
||||
<link rel="prev" href="../minimizer/">
|
||||
|
||||
|
||||
<link rel="next" href="../../implementation/superkmer/">
|
||||
@@ -232,7 +232,7 @@
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../kmers/" class="md-nav__link">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -313,6 +313,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../minimizer/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -578,6 +606,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/storage/" class="md-nav__link">
|
||||
|
||||
@@ -628,6 +684,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
|
||||
@@ -9,10 +9,10 @@
|
||||
|
||||
|
||||
|
||||
<link rel="prev" href="../..">
|
||||
<link rel="prev" href="../entropy/">
|
||||
|
||||
|
||||
<link rel="next" href="../encoding/">
|
||||
<link rel="next" href="../indexing/">
|
||||
|
||||
|
||||
|
||||
@@ -23,7 +23,7 @@
|
||||
|
||||
|
||||
|
||||
<title>Kmers and super-kmers - obikmer</title>
|
||||
<title>Minimizer selection - obikmer</title>
|
||||
|
||||
|
||||
|
||||
@@ -64,7 +64,7 @@
|
||||
<div data-md-component="skip">
|
||||
|
||||
|
||||
<a href="#kmers-and-super-kmers" class="md-skip">
|
||||
<a href="#minimizer-selection" class="md-skip">
|
||||
Skip to content
|
||||
</a>
|
||||
|
||||
@@ -100,7 +100,7 @@
|
||||
<div class="md-header__topic" data-md-component="header-topic">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Kmers and super-kmers
|
||||
Minimizer selection
|
||||
|
||||
</span>
|
||||
</div>
|
||||
@@ -229,37 +229,10 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item md-nav__item--active">
|
||||
|
||||
<input class="md-nav__toggle md-toggle" type="checkbox" id="__toc">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__link md-nav__link--active" for="__toc">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Kmers and super-kmers
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
</label>
|
||||
|
||||
<a href="./" class="md-nav__link md-nav__link--active">
|
||||
<li class="md-nav__item">
|
||||
<a href="../../kmers/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
@@ -275,76 +248,6 @@
|
||||
|
||||
|
||||
</a>
|
||||
|
||||
|
||||
|
||||
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#kmers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Kmers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#super-kmers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Super-kmers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Super-kmers">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-super-kmers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical super-kmers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#expected-length-of-a-super-kmer" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Expected length of a super-kmer
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
@@ -410,6 +313,147 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item md-nav__item--active">
|
||||
|
||||
<input class="md-nav__toggle md-toggle" type="checkbox" id="__toc">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__link md-nav__link--active" for="__toc">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
</label>
|
||||
|
||||
<a href="./" class="md-nav__link md-nav__link--active">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Minimizer selection
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
|
||||
|
||||
|
||||
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#definition" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Definition
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#lexicographic-ordering-and-its-bias" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Lexicographic ordering and its bias
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#random-minimizer" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Random minimizer
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#why-the-canonical-form-remains-lexicographic" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Why the canonical form remains lexicographic
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#partition-key-independence" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Partition key independence
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#seed-and-fixed-point-elimination" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Seed and fixed-point elimination
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
@@ -606,6 +650,34 @@
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/obipipeline/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
obipipeline library
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/storage/" class="md-nav__link">
|
||||
|
||||
@@ -656,6 +728,34 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../../implementation/unitig_evidence/" class="md-nav__link">
|
||||
|
||||
|
||||
|
||||
<span class="md-ellipsis">
|
||||
|
||||
|
||||
Unitig evidence encoding
|
||||
|
||||
|
||||
|
||||
</span>
|
||||
|
||||
|
||||
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
@@ -772,10 +872,10 @@
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#kmers" class="md-nav__link">
|
||||
<a href="#definition" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Kmers
|
||||
Definition
|
||||
|
||||
</span>
|
||||
</a>
|
||||
@@ -783,41 +883,57 @@
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#super-kmers" class="md-nav__link">
|
||||
<a href="#lexicographic-ordering-and-its-bias" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Super-kmers
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
<nav class="md-nav" aria-label="Super-kmers">
|
||||
<ul class="md-nav__list">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#canonical-super-kmers" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Canonical super-kmers
|
||||
Lexicographic ordering and its bias
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#expected-length-of-a-super-kmer" class="md-nav__link">
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#random-minimizer" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Expected length of a super-kmer
|
||||
Random minimizer
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#why-the-canonical-form-remains-lexicographic" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Why the canonical form remains lexicographic
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#partition-key-independence" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Partition key independence
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#seed-and-fixed-point-elimination" class="md-nav__link">
|
||||
<span class="md-ellipsis">
|
||||
|
||||
Seed and fixed-point elimination
|
||||
|
||||
</span>
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
@@ -838,37 +954,50 @@
|
||||
|
||||
|
||||
|
||||
<h1 id="kmers-and-super-kmers">Kmers and super-kmers</h1>
|
||||
<h2 id="kmers">Kmers</h2>
|
||||
<p>A <strong>kmer</strong> is a DNA subsequence of fixed length k. Two constraints govern the choice of k:</p>
|
||||
<h1 id="minimizer-selection">Minimizer selection</h1>
|
||||
<h2 id="definition">Definition</h2>
|
||||
<p>A <strong>minimizer</strong> of a k-mer window is the m-mer (m < k) with the smallest value under some total order ≺ among all k − m + 1 overlapping m-mers in the window. The minimizer is always taken in <strong>canonical form</strong> (lexicographic minimum of forward and reverse complement) to ensure strand-independence.</p>
|
||||
<p>The minimizer partitions the sequence into <strong>super-kmers</strong>: maximal contiguous runs of overlapping k-mers that share the same minimizer. A single minimizer anchors each super-kmer, enabling partitioned storage and indexing.</p>
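<p>The definition above can be illustrated with a minimal Python sketch. It assumes the classical lexicographic order on canonical m-mers and groups consecutive k-mers by minimizer <em>value</em>; the helper names (<code>revcomp</code>, <code>canonical</code>, <code>minimizer</code>, <code>super_kmers</code>) are hypothetical and are not taken from the obikmer code base.</p>

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(mmer):
    """Lexicographic minimum of an m-mer and its reverse complement."""
    return min(mmer, revcomp(mmer))


def minimizer(kmer, m):
    """Smallest canonical m-mer among the k - m + 1 m-mers of the k-mer."""
    return min(canonical(kmer[i:i + m]) for i in range(len(kmer) - m + 1))


def super_kmers(seq, k, m):
    """Split seq into (minimizer, super-kmer) pairs.

    Each super-kmer is a maximal run of consecutive k-mers that share
    the same minimizer value (an assumption of this sketch).
    """
    runs = []
    start = 0
    current = None
    for i in range(len(seq) - k + 1):
        mini = minimizer(seq[i:i + k], m)
        if current is None:
            current = mini
        elif mini != current:
            # The previous run ends with the k-mer starting at i - 1.
            runs.append((current, seq[start:i + k - 1]))
            start, current = i, mini
    if current is not None:
        runs.append((current, seq[start:]))
    return runs
```

<p>Consecutive super-kmers produced this way overlap by k − 1 nucleotides, so every k-mer of the input belongs to exactly one run.</p>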
|
||||
<h2 id="lexicographic-ordering-and-its-bias">Lexicographic ordering and its bias</h2>
|
||||
<p>The classical definition uses lexicographic order on the canonical m-mer value. In 2-bit encoding (A=00, C=01, G=10, T=11), the canonical form is <span class="arithmatex">\(\min_{\text{lex}}(\text{fwd}, \text{rc})\)</span>, so AT-rich m-mers have systematically small values:</p>
|
||||
<div class="arithmatex">\[\text{canonical}(\text{AAAA}\cdots\text{A}) = \text{canonical}(\text{TTTT}\cdots\text{T}) = 0\]</div>
|
||||
<p>Since small values always win the lex comparison, low-complexity AT-rich m-mers dominate as minimizers across large genomic regions. On real metagenomics data with k=31, m=11 and 256 partitions, this produces a max/min partition ratio of ≈ 2.75 — and a single pathological partition when the hash function has a fixed point at 0.</p>
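<p>As a small illustration of why AT-rich m-mers sort first, the sketch below packs an m-mer into an integer with the 2-bit code given above and takes the smaller of the forward and reverse-complement values. The function names are illustrative assumptions, not the library's API.</p>

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(mmer):
    """Pack an ACGT string into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in mmer:
        value = (value << 2) | CODE[base]
    return value


def revcomp_value(value, m):
    """Reverse complement of a 2-bit packed m-mer.

    With A=0, C=1, G=2, T=3 the complement of a base code c is 3 - c.
    Bases are read from the low end and pushed onto the result, which
    reverses their order.
    """
    rc = 0
    for _ in range(m):
        rc = (rc << 2) | (3 - (value & 3))
        value >>= 2
    return rc


def canonical_value(mmer):
    """Smaller of the forward and reverse-complement packed values."""
    forward = encode(mmer)
    return min(forward, revcomp_value(forward, len(mmer)))
```

<p>For equal-length m-mers, comparing packed integers is the same as comparing the strings lexicographically, so poly-A and poly-T m-mers both collapse to the smallest possible value, 0.</p>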
|
||||
<h2 id="random-minimizer">Random minimizer</h2>
|
||||
<p>A <strong>random minimizer</strong> replaces lex order with a hash order: define <span class="arithmatex">\(H : \{0,1\}^{2m} \to \{0,1\}^{64}\)</span> and select the m-mer with the <strong>minimum <span class="arithmatex">\(H\)</span> value</strong> in the window.</p>
|
||||
<p>The key property: because <span class="arithmatex">\(H\)</span> is injective with well-distributed outputs, distinct m-mers never tie and each distinct m-mer in the window has equal probability of holding the minimum hash value. Selection probability is no longer correlated with nucleotide composition.</p>
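<p>A hedged sketch of hash-ordered selection follows. To keep it independent of any particular mixing function, it uses BLAKE2b from Python's standard <code>hashlib</code>, truncated to 64 bits, as a stand-in for <span class="arithmatex">\(H\)</span>; this stand-in and the function names are assumptions made for illustration only.</p>

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(mmer):
    """Lexicographic canonical form: the strand choice stays lexicographic."""
    return min(mmer, revcomp(mmer))


def stand_in_hash(mmer):
    """64-bit stand-in for H (assumption: any well-mixing hash would do here)."""
    digest = hashlib.blake2b(mmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def random_minimizer(kmer, m):
    """Canonical m-mer of the k-mer with the smallest hash value."""
    candidates = (canonical(kmer[i:i + m]) for i in range(len(kmer) - m + 1))
    return min(candidates, key=stand_in_hash)
```

<p>Because a k-mer and its reverse complement contain the same set of canonical m-mers, both strands select the same minimizer.</p>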
|
||||
<h2 id="why-the-canonical-form-remains-lexicographic">Why the canonical form remains lexicographic</h2>
|
||||
<p>An apparent alternative is to redefine the canonical form of each m-mer as the strand with the smaller hash value:</p>
|
||||
<div class="arithmatex">\[\text{canonical}_H(v) = \arg\min(H(\text{fwd}),\ H(\text{rc}))\]</div>
|
||||
<p>This must be rejected. The hash of this new canonical is <span class="arithmatex">\(\min(H(\text{fwd}), H(\text{rc}))\)</span> — the minimum of two i.i.d. Uniform<span class="arithmatex">\([0, 2^{64})\)</span> values. Its distribution is:</p>
|
||||
<div class="arithmatex">\[F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^2\]</div>
|
||||
<p>with density <span class="arithmatex">\(f(x) = 2(1 - x/2^{64})\)</span> (in units of the uniform density), which is <strong>twice the uniform density near 0 and falls to zero near <span class="arithmatex">\(2^{64}\)</span></strong>. A partition key derived from this value inherits the bias: the partition covering the smallest hash values receives roughly twice its uniform share of super-kmers, while the partition covering the largest values receives almost none.</p>
|
||||
<p>The lex canonical form does not have this problem: <span class="arithmatex">\(\text{canonical}_{\text{lex}}(v)\)</span> is a fixed, deterministic representative of each equivalence class, and <span class="arithmatex">\(H(\text{canonical}_{\text{lex}})\)</span> is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span> independently of the min/max relationship between the two strands.</p>
|
||||
<h2 id="partition-key-independence">Partition key independence</h2>
|
||||
<p>A further subtlety arises when the selection hash is used directly as the partition key. The selected minimizer is the m-mer with the <strong>minimum</strong> <span class="arithmatex">\(H\)</span> value in a window of <span class="arithmatex">\(W = k - m + 1\)</span> positions. The minimum of <span class="arithmatex">\(W\)</span> i.i.d. Uniform<span class="arithmatex">\([0,2^{64})\)</span> values has distribution:</p>
|
||||
<div class="arithmatex">\[F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^W \approx \frac{Wx}{2^{64}}\]</div>
|
||||
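<p>where the linear approximation holds for <span class="arithmatex">\(x \ll 2^{64}\)</span>: the mass is concentrated near 0 relative to the full range. Using this minimum hash directly as the partition key creates the same kind of bias as lex ordering, just distributed differently.</p>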
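<p>To get a feel for the size of this effect, the sketch below evaluates the distribution exactly over equal-width slices of the hash range. It makes two labeled assumptions: hash values are treated as continuous uniform variables on [0, 1), and a partition is taken to be one slice of that range. A different mapping from hash to partition would give different numbers.</p>

```python
from fractions import Fraction


def slice_share(i, partitions, window):
    """Probability that the minimum of `window` i.i.d. uniform values
    falls in slice i of `partitions` equal-width slices of [0, 1).

    Uses F(u) = 1 - (1 - u)**window, so the share is F(hi) - F(lo).
    """
    lo = Fraction(i, partitions)
    hi = Fraction(i + 1, partitions)
    return (1 - lo) ** window - (1 - hi) ** window


k, m, partitions = 31, 11, 256
window = k - m + 1  # W = k - m + 1 positions per k-mer window
uniform = Fraction(1, partitions)
first = slice_share(0, partitions, window)
last = slice_share(partitions - 1, partitions, window)
print("first slice / uniform share:", float(first / uniform))
print("last slice / uniform share:", float(last / uniform))
```

<p>With <code>window = 1</code> every slice receives exactly the uniform share, which is the situation the decoupled routing hash aims for.</p>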
|
||||
<p>The correct approach is to decouple selection from partition routing:</p>
|
||||
<ul>
|
||||
<li><strong>k ∈ [11, 31]</strong>: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.</li>
|
||||
<li><strong>k is odd</strong>: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form <code>min(kmer, revcomp(kmer))</code> is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.</li>
|
||||
<li><strong>Selection</strong> uses <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(m\text{-mer}))\)</span> to pick the minimizer in the window.</li>
|
||||
<li><strong>Partition routing</strong> recomputes <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(\text{minimizer}))\)</span> from the stored minimizer position. This is the hash of a specific m-mer value, not the minimum of a window, so it is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span>.</li>
|
||||
</ul>
|
||||
<h2 id="super-kmers">Super-kmers</h2>
|
||||
<p>A <strong>super-kmer</strong> is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides. Each kmer of the run carries the same <strong>canonical minimizer</strong>. The <strong>canonical minimizer</strong> of a kmer is the smallest value of <code>min(m-mer, revcomp(m-mer))</code> over all m-mers within the kmer (m < k, m odd).</p>
|
||||
<h3 id="canonical-super-kmers">Canonical super-kmers</h3>
|
||||
<p>A <strong>canonical super-kmer</strong> is the lexicographic minimum of a super-kmer and its reverse complement:</p>
|
||||
<div class="highlight"><pre><span></span><code>canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
|
||||
<h2 id="seed-and-fixed-point-elimination">Seed and fixed-point elimination</h2>
|
||||
<p>The splitmix64 finalizer has a fixed point at 0:</p>
|
||||
<div class="arithmatex">\[\text{mix64}(0) = 0\]</div>
|
||||
<p>Since <span class="arithmatex">\(\text{canonical}_{\text{lex}}(\text{AAAA}\cdots\text{A}) = 0\)</span>, using unseeded mix64 causes all-A m-mers to win every window comparison, recreating a pathological partition identical to the lex-ordering bias.</p>
|
||||
<p>The fix is a non-zero XOR seed applied before mixing:</p>
|
||||
<div class="arithmatex">\[H(x) = \text{mix64}(x \oplus s), \quad s = \lfloor 2^{64}/\varphi \rfloor = \texttt{0x9e3779b97f4a7c15}\]</div>
|
||||
<p>where <span class="arithmatex">\(\varphi\)</span> is the golden ratio. This maps 0 to <span class="arithmatex">\(\text{mix64}(s)\)</span>, a well-distributed non-zero value. No canonical m-mer value has a systematically small <span class="arithmatex">\(H\)</span>.</p>
|
||||
<div class="admonition abstract">
|
||||
<p class="admonition-title">Hash function <span class="arithmatex">\(H\)</span></p>
|
||||
<div class="highlight"><pre><span></span><code>H(x):
|
||||
x ← x ⊕ 0x9e3779b97f4a7c15
|
||||
x ← x ⊕ (x >> 30)
|
||||
x ← x × 0xbf58476d1ce4e5b9
|
||||
x ← x ⊕ (x >> 27)
|
||||
x ← x × 0x94d049bb133111eb
|
||||
return x ⊕ (x >> 31)
|
||||
</code></pre></div>
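<p>The pseudocode for <span class="arithmatex">\(H\)</span> can be transcribed into Python as a rough sketch. Python integers are unbounded, so the 64-bit wrap-around of the multiplications is emulated with an explicit mask; the names <code>MASK64</code>, <code>SEED</code>, <code>mix64</code> and <code>H</code> are chosen here for illustration.</p>

```python
MASK64 = (1 << 64) - 1
SEED = 0x9E3779B97F4A7C15  # non-zero XOR seed from the text


def mix64(x):
    """splitmix64 finalizer on a 64-bit value."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64  # emulate 64-bit overflow
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def H(x):
    """Seeded hash: XOR with the seed first, then mix."""
    return mix64(x ^ SEED)
```

<p>Each step of <code>mix64</code> is invertible on 64-bit words (xor-shifts and multiplications by odd constants), so <code>H</code> never maps two distinct m-mer values to the same hash, and the seed moves the fixed point away from the all-A value 0.</p>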
|
||||
<p>When a read and its reverse-complement are both sequenced, they produce super-kmers that are reverse complements of each other. Both map to the same canonical form: the same genomic region is represented by a single canonical super-kmer regardless of which strand was read.</p>
|
||||
<h3 id="expected-length-of-a-super-kmer">Expected length of a super-kmer</h3>
|
||||
<p>For a random minimizer of length m over k-mers of length k, the density of minimizer positions is approximately 2/(k−m+2) (Golan & Shur 2025; Zheng <em>et al.</em> 2020)<sup id="fnref:Zheng2020-ji"><a class="footnote-ref" href="#fn:Zheng2020-ji">2</a></sup> <sup id="fnref:Golan2025-xf"><a class="footnote-ref" href="#fn:Golan2025-xf">3</a></sup>, so the expected number of consecutive k-mers per super-kmer is (k−m+2)/2. A run of n k-mers spans n + k − 1 nucleotides, giving:</p>
|
||||
<div class="arithmatex">\[L_{\text{nt}} = \frac{k-m+2}{2} + k - 1\]</div>
|
||||
<p>For k=31, m=13: expected ≈ 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.<sup id="fnref:superkmer_length"><a class="footnote-ref" href="#fn:superkmer_length">1</a></sup></p>
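<p>The arithmetic can be checked with a few lines of Python. This is only a direct evaluation of the formula above under the stated density approximation 2/(k−m+2); the function name is hypothetical.</p>

```python
from fractions import Fraction


def expected_superkmer_length(k, m):
    """Expected super-kmer length in nucleotides for a random minimizer."""
    density = Fraction(2, k - m + 2)   # approximate density of minimizer positions
    kmers_per_run = 1 / density        # expected k-mers per super-kmer
    return kmers_per_run + k - 1       # a run of n k-mers spans n + k - 1 nt


print(expected_superkmer_length(31, 13))
```

<p>Smaller values of m lengthen the expected run: with k=31 and m=11 the same formula gives one extra nucleotide.</p>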
|
||||
<div class="footnote">
|
||||
<hr />
|
||||
<ol>
|
||||
<li id="fn:superkmer_length">
|
||||
<p>The expected length formula and the density approximation 2/(k−m+2) should be verified against the values reported in (Zheng <em>et al.</em> 2020)<sup id="fnref2:Zheng2020-ji"><a class="footnote-ref" href="#fn:Zheng2020-ji">2</a></sup> and (Golan & Shur 2025)<sup id="fnref2:Golan2025-xf"><a class="footnote-ref" href="#fn:Golan2025-xf">3</a></sup>. <a class="footnote-backref" href="#fnref:superkmer_length" title="Jump back to footnote 1 in the text">↩</a></p>
|
||||
</li>
|
||||
<li id="fn:Zheng2020-ji">
|
||||
<p>Zheng, H., Kingsford, C. & Marçais, G. (2020). <a href="https://doi.org/10.1093/bioinformatics/btaa472">Improved design and analysis of practical minimizers</a>. <em>Bioinformatics (Oxford, England)</em>, 36, i119–i127. <a class="footnote-backref" href="#fnref:Zheng2020-ji" title="Jump back to footnote 2 in the text">↩</a><a class="footnote-backref" href="#fnref2:Zheng2020-ji" title="Jump back to footnote 2 in the text">↩</a></p>
|
||||
</li>
|
||||
<li id="fn:Golan2025-xf">
|
||||
<p>Golan, S. & Shur, A.M. (2025). <a href="https://doi.org/10.1007/978-3-031-82670-2_25">Expected density of random minimizers</a>. In: <em>Lecture notes in computer science</em>. Springer Nature Switzerland, Cham, pp. 347–360. <a class="footnote-backref" href="#fnref:Golan2025-xf" title="Jump back to footnote 3 in the text">↩</a><a class="footnote-backref" href="#fnref2:Golan2025-xf" title="Jump back to footnote 3 in the text">↩</a></p>
|
||||
</li>
|
||||
</ol>
|
||||
</div>
|
||||
|
||||
|
||||