<!doctype html>
<html lang="en" class="no-js">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link rel="prev" href="../entropy/">
<link rel="next" href="../indexing/">
<link rel="icon" href="../../assets/images/favicon.png">
<meta name="generator" content="mkdocs-1.6.1, mkdocs-material-9.7.6">
<title>Minimizer selection - obikmer</title>
<link rel="stylesheet" href="../../assets/stylesheets/main.484c7ddc.min.css">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&display=fallback">
<style>:root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}</style>
<script>__md_scope=new URL("../..",location),__md_hash=e=>[...e].reduce(((e,_)=>(e<<5)-e+_.charCodeAt(0)),0),__md_get=(e,_=localStorage,t=__md_scope)=>JSON.parse(_.getItem(t.pathname+"."+e)),__md_set=(e,_,t=localStorage,a=__md_scope)=>{try{t.setItem(a.pathname+"."+e,JSON.stringify(_))}catch(e){}}</script>
</head>
<body dir="ltr">
<input class="md-toggle" data-md-toggle="drawer" type="checkbox" id="__drawer" autocomplete="off">
<input class="md-toggle" data-md-toggle="search" type="checkbox" id="__search" autocomplete="off">
<label class="md-overlay" for="__drawer"></label>
<div data-md-component="skip">
<a href="#minimizer-selection" class="md-skip">
Skip to content
</a>
</div>
<div data-md-component="announce">
</div>
<header class="md-header md-header--shadow" data-md-component="header">
<nav class="md-header__inner md-grid" aria-label="Header">
<a href="../.." title="obikmer" class="md-header__button md-logo" aria-label="obikmer" data-md-component="logo">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"/></svg>
</a>
<label class="md-header__button md-icon" for="__drawer">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 6h18v2H3zm0 5h18v2H3zm0 5h18v2H3z"/></svg>
</label>
<div class="md-header__title" data-md-component="header-title">
<div class="md-header__ellipsis">
<div class="md-header__topic">
<span class="md-ellipsis">
obikmer
</span>
</div>
<div class="md-header__topic" data-md-component="header-topic">
<span class="md-ellipsis">
Minimizer selection
</span>
</div>
</div>
</div>
<script>var palette=__md_get("__palette");if(palette&&palette.color){if("(prefers-color-scheme)"===palette.color.media){var media=matchMedia("(prefers-color-scheme: light)"),input=document.querySelector(media.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']");palette.color.media=input.getAttribute("data-md-color-media"),palette.color.scheme=input.getAttribute("data-md-color-scheme"),palette.color.primary=input.getAttribute("data-md-color-primary"),palette.color.accent=input.getAttribute("data-md-color-accent")}for(var[key,value]of Object.entries(palette.color))document.body.setAttribute("data-md-color-"+key,value)}</script>
</nav>
</header>
<div class="md-container" data-md-component="container">
<main class="md-main" data-md-component="main">
<div class="md-main__inner md-grid">
<div class="md-sidebar md-sidebar--primary" data-md-component="sidebar" data-md-type="navigation" >
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav class="md-nav md-nav--primary" aria-label="Navigation" data-md-level="0">
<label class="md-nav__title" for="__drawer">
<a href="../.." title="obikmer" class="md-nav__button md-logo" aria-label="obikmer" data-md-component="logo">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"/></svg>
</a>
obikmer
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../.." class="md-nav__link">
<span class="md-ellipsis">
Home
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--active md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_2" checked>
<label class="md-nav__link" for="__nav_2" id="__nav_2_label" tabindex="0">
<span class="md-ellipsis">
Theory
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_2_label" aria-expanded="true">
<label class="md-nav__title" for="__nav_2">
<span class="md-nav__icon md-icon"></span>
Theory
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../../kmers/" class="md-nav__link">
<span class="md-ellipsis">
Kmers and super-kmers
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../encoding/" class="md-nav__link">
<span class="md-ellipsis">
DNA encoding
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../entropy/" class="md-nav__link">
<span class="md-ellipsis">
Entropy filter
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--active">
<input class="md-nav__toggle md-toggle" type="checkbox" id="__toc">
<label class="md-nav__link md-nav__link--active" for="__toc">
<span class="md-ellipsis">
Minimizer selection
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<a href="./" class="md-nav__link md-nav__link--active">
<span class="md-ellipsis">
Minimizer selection
</span>
</a>
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#definition" class="md-nav__link">
<span class="md-ellipsis">
Definition
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#lexicographic-ordering-and-its-bias" class="md-nav__link">
<span class="md-ellipsis">
Lexicographic ordering and its bias
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#random-minimizer" class="md-nav__link">
<span class="md-ellipsis">
Random minimizer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#why-the-canonical-form-remains-lexicographic" class="md-nav__link">
<span class="md-ellipsis">
Why the canonical form remains lexicographic
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#partition-key-independence" class="md-nav__link">
<span class="md-ellipsis">
Partition key independence
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#seed-and-fixed-point-elimination" class="md-nav__link">
<span class="md-ellipsis">
Seed and fixed-point elimination
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="../indexing/" class="md-nav__link">
<span class="md-ellipsis">
Partitioning architecture
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3" >
<label class="md-nav__link" for="__nav_3" id="__nav_3_label" tabindex="0">
<span class="md-ellipsis">
Implementation
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_3_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_3">
<span class="md-nav__icon md-icon"></span>
Implementation
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../../implementation/superkmer/" class="md-nav__link">
<span class="md-ellipsis">
SuperKmer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/kmer/" class="md-nav__link">
<span class="md-ellipsis">
Kmer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/chunkreader/" class="md-nav__link">
<span class="md-ellipsis">
Chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/pipeline/" class="md-nav__link">
<span class="md-ellipsis">
Construction pipeline
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obipipeline/" class="md-nav__link">
<span class="md-ellipsis">
obipipeline library
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/storage/" class="md-nav__link">
<span class="md-ellipsis">
On-disk storage
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/mphf/" class="md-nav__link">
<span class="md-ellipsis">
MPHF selection
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/unitig_evidence/" class="md-nav__link">
<span class="md-ellipsis">
Unitig evidence encoding
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/evidence_elimination/" class="md-nav__link">
<span class="md-ellipsis">
Evidence elimination (discussion)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obilayeredmap/" class="md-nav__link">
<span class="md-ellipsis">
obilayeredmap crate
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/persistent_compact_int_vec/" class="md-nav__link">
<span class="md-ellipsis">
PersistentCompactIntVec
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/persistent_bit_vec/" class="md-nav__link">
<span class="md-ellipsis">
PersistentBitVec
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/merge/" class="md-nav__link">
<span class="md-ellipsis">
Merge command
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/rebuild_filter/" class="md-nav__link">
<span class="md-ellipsis">
Kmer filtering (rebuild/dump/unitig)
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_4" >
<label class="md-nav__link" for="__nav_4" id="__nav_4_label" tabindex="0">
<span class="md-ellipsis">
Architecture
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_4_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_4">
<span class="md-nav__icon md-icon"></span>
Architecture
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../../architecture/sequences/invariant/" class="md-nav__link">
<span class="md-ellipsis">
Sequences
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../architecture/index_architecture/" class="md-nav__link">
<span class="md-ellipsis">
Kmer index
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-sidebar md-sidebar--secondary" data-md-component="sidebar" data-md-type="toc" >
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#definition" class="md-nav__link">
<span class="md-ellipsis">
Definition
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#lexicographic-ordering-and-its-bias" class="md-nav__link">
<span class="md-ellipsis">
Lexicographic ordering and its bias
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#random-minimizer" class="md-nav__link">
<span class="md-ellipsis">
Random minimizer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#why-the-canonical-form-remains-lexicographic" class="md-nav__link">
<span class="md-ellipsis">
Why the canonical form remains lexicographic
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#partition-key-independence" class="md-nav__link">
<span class="md-ellipsis">
Partition key independence
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#seed-and-fixed-point-elimination" class="md-nav__link">
<span class="md-ellipsis">
Seed and fixed-point elimination
</span>
</a>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-content" data-md-component="content">
<article class="md-content__inner md-typeset">
<h1 id="minimizer-selection">Minimizer selection</h1>
<h2 id="definition">Definition</h2>
<p>A <strong>minimizer</strong> of a k-mer window is the m-mer (m &lt; k) with the smallest value under some total order ≺ among all k − m + 1 overlapping m-mers in the window. The minimizer is always taken in <strong>canonical form</strong> (lexicographic minimum of forward and reverse complement) to ensure strand-independence.</p>
<p>The minimizer partitions the sequence into <strong>super-kmers</strong>: maximal contiguous runs of overlapping k-mers that share the same minimizer. A single minimizer anchors each super-kmer, enabling partitioned storage and indexing.</p>
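<p>The definition above can be sketched in a few lines. This is a minimal illustration, not the obikmer implementation: it works on plain strings rather than 2-bit words, recomputes every window from scratch, and the names <code>canonical</code>, <code>window_minimizer</code> and <code>super_kmers</code> are hypothetical. The <code>order</code> argument stands for the total order ≺ and defaults to lexicographic order.</p>

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(mmer):
    """Lexicographic minimum of an m-mer and its reverse complement."""
    rc = mmer.translate(COMPLEMENT)[::-1]
    return min(mmer, rc)

def window_minimizer(kmer, m, order=lambda s: s):
    """Canonical m-mer with the smallest `order` key among the k - m + 1 m-mers of a k-mer."""
    candidates = [canonical(kmer[i:i + m]) for i in range(len(kmer) - m + 1)]
    return min(candidates, key=order)

def super_kmers(seq, k, m, order=lambda s: s):
    """Split seq (length at least k) into maximal runs of consecutive k-mers sharing a minimizer."""
    runs = []
    start, current = 0, window_minimizer(seq[:k], m, order)
    for i in range(1, len(seq) - k + 1):
        mini = window_minimizer(seq[i:i + k], m, order)
        if mini != current:
            # The run covers k-mers start .. i-1, i.e. bases start .. i-1+k (exclusive).
            runs.append((seq[start:i - 1 + k], current))
            start, current = i, mini
    runs.append((seq[start:], current))
    return runs

print(super_kmers("ACGTTGCAAT", k=5, m=3))
# [('ACGTTGC', 'AAC'), ('TTGCAA', 'CAA'), ('GCAAT', 'AAT')]
```

<p>Consecutive super-kmers overlap by k − 1 bases, so every k-mer of the input belongs to exactly one run.</p>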
<h2 id="lexicographic-ordering-and-its-bias">Lexicographic ordering and its bias</h2>
<p>The classical definition uses lexicographic order on the canonical m-mer value. In 2-bit encoding (A=00, C=01, G=10, T=11), the canonical form is <span class="arithmatex">\(\min_{\text{lex}}(\text{fwd}, \text{rc})\)</span>, so AT-rich m-mers have systematically small values:</p>
<div class="arithmatex">\[\text{canonical}(\text{AAAA}\cdots\text{A}) = \text{canonical}(\text{TTTT}\cdots\text{T}) = 0\]</div>
<p>Since small values always win the lex comparison, low-complexity AT-rich m-mers dominate as minimizers across large genomic regions. On real metagenomics data with k=31, m=11 and 256 partitions, this produces a max/min partition ratio of ≈ 2.75, and a single pathological partition when the hash function has a fixed point at 0 (see <a href="#seed-and-fixed-point-elimination">Seed and fixed-point elimination</a>).</p>
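<p>The identity above is easy to check numerically. The sketch below assumes the first base occupies the most significant bits, so that integer order matches lexicographic order; the helper names <code>encode</code> and <code>canonical_value</code> are illustrative only.</p>

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(mmer):
    """Pack an m-mer into an integer, two bits per base, first base most significant."""
    value = 0
    for base in mmer:
        value = value * 4 + CODE[base]
    return value

def canonical_value(mmer):
    """Smaller of the 2-bit encodings of an m-mer and of its reverse complement."""
    m = len(mmer)
    fwd = encode(mmer)
    # With A=00, C=01, G=10, T=11 the complement of a base is 3 - code,
    # so complementing every base gives (4**m - 1) - fwd.
    comp = (4**m - 1) - fwd
    # Reverse the base-4 digits to obtain the reverse complement.
    rc = 0
    for _ in range(m):
        rc = rc * 4 + comp % 4
        comp //= 4
    return min(fwd, rc)

print(canonical_value("A" * 11), canonical_value("T" * 11))  # 0 0
print(canonical_value("ACG"), canonical_value("CGT"))        # 6 6
```

<p>Both homopolymers collapse to the smallest possible value, which is why they win every lexicographic comparison they take part in.</p>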
<h2 id="random-minimizer">Random minimizer</h2>
<p>A <strong>random minimizer</strong> replaces lex order with a hash order: define <span class="arithmatex">\(H : \{0,1\}^{2m} \to \{0,1\}^{64}\)</span> and select the m-mer with the <strong>minimum <span class="arithmatex">\(H\)</span> value</strong> in the window.</p>
<p>The key property: <span class="arithmatex">\(H\)</span> is injective on m-mer encodings (it is the restriction of a bijection on 64-bit words), so distinct m-mers never tie, and its outputs are well distributed, so each distinct m-mer in the window has essentially equal probability of holding the minimum hash value. Selection probability is no longer correlated with nucleotide composition.</p>
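<p>A minimal sketch of hash-ordered selection follows. It transcribes the seeded function <span class="arithmatex">\(H\)</span> listed at the end of this page and assumes the same first-base-most-significant 2-bit packing as the lexicographic order; <code>canonical_value</code> and <code>random_minimizer</code> are hypothetical names, and a real implementation would use a sliding window rather than rescanning each k-mer.</p>

```python
def H(x):
    """Seeded splitmix64 finalizer, transcribed from the listing at the end of this page."""
    x ^= 0x9E3779B97F4A7C15
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) % 2**64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) % 2**64
    return x ^ (x >> 31)

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_value(mmer):
    """2-bit encoding of the lexicographically smaller strand of an m-mer."""
    rc = mmer.translate(COMPLEMENT)[::-1]
    value = 0
    for base in min(mmer, rc):
        value = value * 4 + CODE[base]
    return value

def random_minimizer(kmer, m):
    """Return (hash, canonical value) of the m-mer with the smallest H in the k-mer."""
    values = [canonical_value(kmer[i:i + m]) for i in range(len(kmer) - m + 1)]
    return min((H(v), v) for v in values)

kmer = "ACGTTGCAATGGCCATTAGCAGGTACCTGAA"
revcomp = kmer.translate(COMPLEMENT)[::-1]
# Both strands expose the same set of canonical m-mers, so they agree on the minimizer.
print(random_minimizer(kmer, 11) == random_minimizer(revcomp, 11))  # True
```

<p>Canonicalisation still happens in lexicographic order; only the comparison between the window's m-mers uses the hash.</p>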
<h2 id="why-the-canonical-form-remains-lexicographic">Why the canonical form remains lexicographic</h2>
<p>An apparent alternative is to redefine the canonical form of each m-mer as the strand with the smaller hash value:</p>
<div class="arithmatex">\[\text{canonical}_H(v) = \arg\min(H(\text{fwd}),\ H(\text{rc}))\]</div>
<p>This must be rejected. The hash of this new canonical is <span class="arithmatex">\(\min(H(\text{fwd}), H(\text{rc}))\)</span> — the minimum of two i.i.d. Uniform<span class="arithmatex">\([0, 2^{64})\)</span> values. Its distribution is:</p>
<div class="arithmatex">\[F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^2\]</div>
<p>with density <span class="arithmatex">\(f(x) = \frac{2}{2^{64}}\left(1 - \frac{x}{2^{64}}\right)\)</span>, which is <strong>twice the uniform density near 0 and falls to zero near <span class="arithmatex">\(2^{64}\)</span></strong>. Any partition key that follows the magnitude of this value inherits the bias: the partition holding the smallest values receives roughly twice its fair share of super-kmers, and the partition holding the largest values receives almost none.</p>
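<p>As a worked example, assume (purely for illustration) that the 256 partitions are read from the top 8 bits of the hash, so that each partition covers an interval of width <span class="arithmatex">\(2^{64}/256\)</span>. Evaluating <span class="arithmatex">\(F\)</span> at the boundaries of the first and last intervals gives:</p>
<div class="arithmatex">\[P\!\left(x &lt; \tfrac{2^{64}}{256}\right) = 1 - \left(\tfrac{255}{256}\right)^{2} = \tfrac{511}{65536} \approx 1.996 \times \tfrac{1}{256}, \qquad P\!\left(x \ge \tfrac{255}{256}\, 2^{64}\right) = \left(\tfrac{1}{256}\right)^{2} = \tfrac{1}{65536} \approx 0.004 \times \tfrac{1}{256}\]</div>
<p>Under that assumption the first partition holds about twice the uniform share and the last holds well under one percent of it.</p>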
<p>The lex canonical form does not have this problem: <span class="arithmatex">\(\text{canonical}_{\text{lex}}(v)\)</span> is a fixed, deterministic representative of each equivalence class, and <span class="arithmatex">\(H(\text{canonical}_{\text{lex}})\)</span> is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span> independently of the min/max relationship between the two strands.</p>
<h2 id="partition-key-independence">Partition key independence</h2>
<p>A further subtlety arises when the selection hash is used directly as the partition key. The selected minimizer is the m-mer with the <strong>minimum</strong> <span class="arithmatex">\(H\)</span> value in a window of <span class="arithmatex">\(W = k - m + 1\)</span> positions. The minimum of <span class="arithmatex">\(W\)</span> i.i.d. Uniform<span class="arithmatex">\([0,2^{64})\)</span> values has distribution:</p>
<div class="arithmatex">\[F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^W \approx \frac{Wx}{2^{64}} \quad \text{for } x \ll \frac{2^{64}}{W}\]</div>
<p>so the mass is concentrated near 0 relative to the full range: the expected minimum is about <span class="arithmatex">\(2^{64}/(W+1)\)</span>. Using this minimum-hash directly as the partition key creates the same bias as lex ordering, just distributed differently.</p>
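<p>To put a number on this concentration, the distribution above can be evaluated exactly. The sketch below normalises the hash range to [0, 1) and uses the hypothetical helper <code>share_of_minima</code>; it takes k = 31 and m = 11 from the metagenomics example, giving W = 21, and inherits the i.i.d. uniform model of the formula.</p>

```python
from fractions import Fraction

def share_of_minima(W, lo, hi):
    """Probability that the minimum of W i.i.d. uniform hashes lies in [lo, hi) of the unit range."""
    cdf = lambda u: 1 - (1 - Fraction(u)) ** W
    return cdf(hi) - cdf(lo)

W = 31 - 11 + 1                                   # k - m + 1 positions per window
lowest = share_of_minima(W, 0, Fraction(1, 256))  # lowest 1/256 of the hash range
print(float(lowest))        # about 0.079
print(float(lowest * 256))  # about 20 times the uniform share of 1/256
```

<p>Roughly 8% of window minima land in the lowest 1/256 of the hash range, against 0.4% for a uniformly distributed key.</p>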
<p>The correct approach is to decouple selection from partition routing:</p>
<ul>
<li><strong>Selection</strong> uses <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(m\text{-mer}))\)</span> to pick the minimizer in the window.</li>
<li><strong>Partition routing</strong> recomputes <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(\text{minimizer}))\)</span> from the stored minimizer position. This is the hash of a specific m-mer value, not the minimum of a window; it is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span>.</li>
</ul>
<h2 id="seed-and-fixed-point-elimination">Seed and fixed-point elimination</h2>
<p>The splitmix64 finalizer has a fixed point at 0:</p>
<div class="arithmatex">\[\text{mix64}(0) = 0\]</div>
<p>Since <span class="arithmatex">\(\text{canonical}_{\text{lex}}(\text{AAAA}\cdots\text{A}) = 0\)</span>, using unseeded mix64 causes all-A m-mers to win every window comparison, recreating a pathological partition identical to the lex-ordering bias.</p>
<p>The fix is a non-zero XOR seed applied before mixing:</p>
<div class="arithmatex">\[H(x) = \text{mix64}(x \oplus s), \quad s = \lfloor 2^{64}/\varphi \rfloor = \texttt{0x9e3779b97f4a7c15}\]</div>
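<p>where <span class="arithmatex">\(\varphi\)</span> is the golden ratio. This maps 0 to <span class="arithmatex">\(\text{mix64}(s)\)</span>, a well-distributed non-zero value. The only input now mapped to 0 is <span class="arithmatex">\(x = s\)</span>, and since <span class="arithmatex">\(s \ge 2^{63}\)</span> while an m-mer encoding is below <span class="arithmatex">\(2^{2m} \le 2^{62}\)</span> for <span class="arithmatex">\(m \le 31\)</span>, it is never a valid m-mer. No canonical m-mer value has a systematically small <span class="arithmatex">\(H\)</span>.</p>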
<div class="admonition abstract">
<p class="admonition-title">Hash function <span class="arithmatex">\(H\)</span></p>
<div class="highlight"><pre><span></span><code>H(x):                                   # x is a 64-bit unsigned word
  x ← x ⊕ 0x9e3779b97f4a7c15            # seed: removes the fixed point at 0
  x ← x ⊕ (x &gt;&gt; 30)
  x ← (x × 0xbf58476d1ce4e5b9) mod 2^64
  x ← x ⊕ (x &gt;&gt; 27)
  x ← (x × 0x94d049bb133111eb) mod 2^64
  return x ⊕ (x &gt;&gt; 31)
</code></pre></div>
</div>
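<p>The listing above translates directly into runnable code. The following is a sketch in Python, where integers are unbounded and the multiplications therefore need an explicit reduction modulo 2<sup>64</sup>; the split into <code>mix64</code> and <code>H</code> and the name <code>SEED</code> are choices made for this illustration.</p>

```python
SEED = 0x9E3779B97F4A7C15  # floor(2**64 / golden ratio), as given above

def mix64(x):
    """Unseeded splitmix64 finalizer on a 64-bit unsigned word."""
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) % 2**64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) % 2**64
    return x ^ (x >> 31)

def H(x):
    """Seeded hash: XOR the seed in before mixing."""
    return mix64(x ^ SEED)

print(mix64(0))   # 0: the fixed point of the unseeded finalizer
print(H(0) == 0)  # False: the all-A m-mer no longer hashes to 0
print(H(SEED))    # 0: the fixed point has moved to x = SEED, which is not an m-mer code
```

<p>Each step (XOR with a right shift, multiplication by an odd constant) is invertible on 64-bit words, so distinct inputs always produce distinct outputs.</p>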
</article>
</div>
<script>var target=document.getElementById(location.hash.slice(1));target&&target.name&&(target.checked=target.name.startsWith("__tabbed_"))</script>
</div>
</main>
<footer class="md-footer">
<div class="md-footer-meta md-typeset">
<div class="md-footer-meta__inner md-grid">
<div class="md-copyright">
Made with
<a href="https://squidfunk.github.io/mkdocs-material/" target="_blank" rel="noopener">
Material for MkDocs
</a>
</div>
</div>
</div>
</footer>
</div>
<div class="md-dialog" data-md-component="dialog">
<div class="md-dialog__inner md-typeset"></div>
</div>
<script id="__config" type="application/json">{"annotate": null, "base": "../..", "features": [], "search": "../../assets/javascripts/workers/search.2c215733.min.js", "tags": null, "translations": {"clipboard.copied": "Copied to clipboard", "clipboard.copy": "Copy to clipboard", "search.result.more.one": "1 more on this page", "search.result.more.other": "# more on this page", "search.result.none": "No matching documents", "search.result.one": "1 matching document", "search.result.other": "# matching documents", "search.result.placeholder": "Type to start searching", "search.result.term.missing": "Missing", "select.version": "Select version"}, "version": null}</script>
<script src="../../assets/javascripts/bundle.79ae519e.min.js"></script>
<script src="https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js"></script>
</body>
</html>