obikmer/doc/theory/minimizer/index.html
Eric Coissac 27f5e88a7b refactor: implement RoutableSuperKmer and update k-mer indexing pipeline
Replace raw SuperKmer routing with a new RoutableSuperKmer type that embeds canonical sequences and precomputed minimizers, enabling direct partition routing via hash. Update the build pipeline to yield RoutableSuperKmers throughout (builder, scatterer), refactor FASTA/unitig export commands to use the new type and compressed outputs (.fasta.gz, .unitigs.fasta.zst), revise SuperKmer header to store n_kmers instead of seql (avoiding 256-byte wrap), and update documentation to reflect minimizer-based theory, two evidence-encoding strategies for unitig-MPHF indexing (global offset vs. ID+rank), and the new obipipeline library architecture with parallel workers, biased scheduling, and error handling.
2026-05-01 09:33:26 +02:00

<!doctype html>
<html lang="en" class="no-js">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link rel="prev" href="../entropy/">
<link rel="next" href="../indexing/">
<link rel="icon" href="../../assets/images/favicon.png">
<meta name="generator" content="mkdocs-1.6.1, mkdocs-material-9.7.6">
<title>Minimizer selection - obikmer</title>
<link rel="stylesheet" href="../../assets/stylesheets/main.484c7ddc.min.css">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&display=fallback">
<style>:root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}</style>
<script>__md_scope=new URL("../..",location),__md_hash=e=>[...e].reduce(((e,_)=>(e<<5)-e+_.charCodeAt(0)),0),__md_get=(e,_=localStorage,t=__md_scope)=>JSON.parse(_.getItem(t.pathname+"."+e)),__md_set=(e,_,t=localStorage,a=__md_scope)=>{try{t.setItem(a.pathname+"."+e,JSON.stringify(_))}catch(e){}}</script>
</head>
<body dir="ltr">
<input class="md-toggle" data-md-toggle="drawer" type="checkbox" id="__drawer" autocomplete="off">
<input class="md-toggle" data-md-toggle="search" type="checkbox" id="__search" autocomplete="off">
<label class="md-overlay" for="__drawer"></label>
<div data-md-component="skip">
<a href="#minimizer-selection" class="md-skip">
Skip to content
</a>
</div>
<div data-md-component="announce">
</div>
<header class="md-header md-header--shadow" data-md-component="header">
<nav class="md-header__inner md-grid" aria-label="Header">
<a href="../.." title="obikmer" class="md-header__button md-logo" aria-label="obikmer" data-md-component="logo">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"/></svg>
</a>
<label class="md-header__button md-icon" for="__drawer">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 6h18v2H3zm0 5h18v2H3zm0 5h18v2H3z"/></svg>
</label>
<div class="md-header__title" data-md-component="header-title">
<div class="md-header__ellipsis">
<div class="md-header__topic">
<span class="md-ellipsis">
obikmer
</span>
</div>
<div class="md-header__topic" data-md-component="header-topic">
<span class="md-ellipsis">
Minimizer selection
</span>
</div>
</div>
</div>
<script>var palette=__md_get("__palette");if(palette&&palette.color){if("(prefers-color-scheme)"===palette.color.media){var media=matchMedia("(prefers-color-scheme: light)"),input=document.querySelector(media.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']");palette.color.media=input.getAttribute("data-md-color-media"),palette.color.scheme=input.getAttribute("data-md-color-scheme"),palette.color.primary=input.getAttribute("data-md-color-primary"),palette.color.accent=input.getAttribute("data-md-color-accent")}for(var[key,value]of Object.entries(palette.color))document.body.setAttribute("data-md-color-"+key,value)}</script>
</nav>
</header>
<div class="md-container" data-md-component="container">
<main class="md-main" data-md-component="main">
<div class="md-main__inner md-grid">
<div class="md-sidebar md-sidebar--primary" data-md-component="sidebar" data-md-type="navigation" >
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav class="md-nav md-nav--primary" aria-label="Navigation" data-md-level="0">
<label class="md-nav__title" for="__drawer">
<a href="../.." title="obikmer" class="md-nav__button md-logo" aria-label="obikmer" data-md-component="logo">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 8a3 3 0 0 0 3-3 3 3 0 0 0-3-3 3 3 0 0 0-3 3 3 3 0 0 0 3 3m0 3.54C9.64 9.35 6.5 8 3 8v11c3.5 0 6.64 1.35 9 3.54 2.36-2.19 5.5-3.54 9-3.54V8c-3.5 0-6.64 1.35-9 3.54"/></svg>
</a>
obikmer
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../.." class="md-nav__link">
<span class="md-ellipsis">
Home
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--active md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_2" checked>
<label class="md-nav__link" for="__nav_2" id="__nav_2_label" tabindex="0">
<span class="md-ellipsis">
Theory
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_2_label" aria-expanded="true">
<label class="md-nav__title" for="__nav_2">
<span class="md-nav__icon md-icon"></span>
Theory
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../../kmers/" class="md-nav__link">
<span class="md-ellipsis">
Kmers and super-kmers
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../encoding/" class="md-nav__link">
<span class="md-ellipsis">
DNA encoding
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../entropy/" class="md-nav__link">
<span class="md-ellipsis">
Entropy filter
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--active">
<input class="md-nav__toggle md-toggle" type="checkbox" id="__toc">
<label class="md-nav__link md-nav__link--active" for="__toc">
<span class="md-ellipsis">
Minimizer selection
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<a href="./" class="md-nav__link md-nav__link--active">
<span class="md-ellipsis">
Minimizer selection
</span>
</a>
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#definition" class="md-nav__link">
<span class="md-ellipsis">
Definition
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#lexicographic-ordering-and-its-bias" class="md-nav__link">
<span class="md-ellipsis">
Lexicographic ordering and its bias
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#random-minimizer" class="md-nav__link">
<span class="md-ellipsis">
Random minimizer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#why-the-canonical-form-remains-lexicographic" class="md-nav__link">
<span class="md-ellipsis">
Why the canonical form remains lexicographic
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#partition-key-independence" class="md-nav__link">
<span class="md-ellipsis">
Partition key independence
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#seed-and-fixed-point-elimination" class="md-nav__link">
<span class="md-ellipsis">
Seed and fixed-point elimination
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="../indexing/" class="md-nav__link">
<span class="md-ellipsis">
Partitioning architecture
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3" >
<label class="md-nav__link" for="__nav_3" id="__nav_3_label" tabindex="0">
<span class="md-ellipsis">
Implementation
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_3_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_3">
<span class="md-nav__icon md-icon"></span>
Implementation
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../../implementation/superkmer/" class="md-nav__link">
<span class="md-ellipsis">
SuperKmer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/kmer/" class="md-nav__link">
<span class="md-ellipsis">
Kmer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/chunkreader/" class="md-nav__link">
<span class="md-ellipsis">
Chunk reader
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/pipeline/" class="md-nav__link">
<span class="md-ellipsis">
Construction pipeline
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/obipipeline/" class="md-nav__link">
<span class="md-ellipsis">
obipipeline library
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/storage/" class="md-nav__link">
<span class="md-ellipsis">
On-disk storage
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/mphf/" class="md-nav__link">
<span class="md-ellipsis">
MPHF selection
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../../implementation/unitig_evidence/" class="md-nav__link">
<span class="md-ellipsis">
Unitig evidence encoding
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_4" >
<label class="md-nav__link" for="__nav_4" id="__nav_4_label" tabindex="0">
<span class="md-ellipsis">
Architecture
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_4_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_4">
<span class="md-nav__icon md-icon"></span>
Architecture
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../../architecture/sequences/invariant/" class="md-nav__link">
<span class="md-ellipsis">
Sequences
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-sidebar md-sidebar--secondary" data-md-component="sidebar" data-md-type="toc" >
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#definition" class="md-nav__link">
<span class="md-ellipsis">
Definition
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#lexicographic-ordering-and-its-bias" class="md-nav__link">
<span class="md-ellipsis">
Lexicographic ordering and its bias
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#random-minimizer" class="md-nav__link">
<span class="md-ellipsis">
Random minimizer
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#why-the-canonical-form-remains-lexicographic" class="md-nav__link">
<span class="md-ellipsis">
Why the canonical form remains lexicographic
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#partition-key-independence" class="md-nav__link">
<span class="md-ellipsis">
Partition key independence
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#seed-and-fixed-point-elimination" class="md-nav__link">
<span class="md-ellipsis">
Seed and fixed-point elimination
</span>
</a>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-content" data-md-component="content">
<article class="md-content__inner md-typeset">
<h1 id="minimizer-selection">Minimizer selection</h1>
<h2 id="definition">Definition</h2>
<p>A <strong>minimizer</strong> of a k-mer window is the m-mer (m &lt; k) with the smallest value under some total order ≺ among all k − m + 1 overlapping m-mers in the window. The minimizer is always taken in <strong>canonical form</strong> (the lexicographic minimum of the forward strand and its reverse complement) to ensure strand-independence.</p>
<p>The minimizer partitions the sequence into <strong>super-kmers</strong>: maximal contiguous runs of overlapping k-mers that share the same minimizer. A single minimizer anchors each super-kmer, enabling partitioned storage and indexing.</p>
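<p>The definition above can be sketched in a few lines of Python. This is only an illustration, not the obikmer implementation: it works on plain strings rather than on an encoded form, it uses the lexicographic order as ≺, it compares minimizers by value, and the names <code>canonical</code>, <code>minimizer</code> and <code>super_kmers</code> are hypothetical.</p>

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(mmer):
    """Lexicographic minimum of an m-mer and its reverse complement."""
    rc = mmer.translate(COMPLEMENT)[::-1]
    return min(mmer, rc)


def minimizer(kmer, m):
    """Smallest canonical m-mer among the k - m + 1 m-mers of a k-mer."""
    return min(canonical(kmer[i:i + m]) for i in range(len(kmer) - m + 1))


def super_kmers(seq, k, m):
    """Split seq into maximal runs of consecutive k-mers sharing a minimizer.

    Returns a list of (minimizer, substring) pairs.
    """
    runs = []
    start, current = 0, None
    for i in range(len(seq) - k + 1):
        mini = minimizer(seq[i:i + k], m)
        if current is not None and mini != current:
            # k-mers start..i-1 form one super-kmer spanning seq[start:i-1+k]
            runs.append((current, seq[start:i - 1 + k]))
            start = i
        current = mini
    if current is not None:
        # the last run extends to the end of the sequence
        runs.append((current, seq[start:]))
    return runs
```

<p>With k = 5 and m = 3, <code>super_kmers("CCCAAAGGG", 5, 3)</code> yields three super-kmers anchored on CAA, AAA and AAG. The middle one, CCAAAGG, covers three consecutive k-mers.</p>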
<h2 id="lexicographic-ordering-and-its-bias">Lexicographic ordering and its bias</h2>
<p>The classical definition uses lexicographic order on the canonical m-mer value. In 2-bit encoding (A=00, C=01, G=10, T=11), the canonical form is <span class="arithmatex">\(\min_{\text{lex}}(\text{fwd}, \text{rc})\)</span>, so AT-rich m-mers have systematically small values:</p>
<div class="arithmatex">\[\text{canonical}(\text{AAAA}\cdots\text{A}) = \text{canonical}(\text{TTTT}\cdots\text{T}) = 0\]</div>
<p>Since small values always win the lex comparison, low-complexity AT-rich m-mers dominate as minimizers across large genomic regions. On real metagenomics data with k=31, m=11 and 256 partitions, this produces a max/min partition ratio of ≈ 2.75 — and a single pathological partition when the hash function has a fixed point at 0.</p>
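<p>The zero value of poly-A and poly-T m-mers can be checked with a small sketch of the 2-bit encoding. The helper names <code>encode</code>, <code>revcomp_value</code> and <code>canonical_value</code> are hypothetical, and the code assumes only the A=00, C=01, G=10, T=11 mapping stated above, under which the complement of a base code c is 3 − c.</p>

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(mmer):
    """Pack an m-mer into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in mmer:
        value = value * 4 + CODE[base]
    return value


def revcomp_value(value, m):
    """Reverse complement computed directly on the 2-bit integer."""
    rc = 0
    for _ in range(m):
        rc = rc * 4 + (3 - value % 4)  # complement the lowest base, append it
        value //= 4
    return rc


def canonical_value(mmer):
    """Integer value of the lexicographically smaller strand."""
    v = encode(mmer)
    return min(v, revcomp_value(v, len(mmer)))
```

<p>For m = 11, both <code>canonical_value("A" * 11)</code> and <code>canonical_value("T" * 11)</code> evaluate to 0, the smallest possible value, which is why these m-mers always win a lexicographic comparison.</p>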
<h2 id="random-minimizer">Random minimizer</h2>
<p>A <strong>random minimizer</strong> replaces lex order with a hash order: define <span class="arithmatex">\(H : \{0,1\}^{2m} \to \{0,1\}^{64}\)</span> and select the m-mer with the <strong>minimum <span class="arithmatex">\(H\)</span> value</strong> in the window.</p>
<p>The key property: because <span class="arithmatex">\(H\)</span> is injective (a bijection on 64-bit words, restricted to the 2m-bit m-mer values) with well-distributed outputs, each distinct m-mer in the window has equal probability of holding the minimum hash value. Selection probability is no longer correlated with nucleotide composition.</p>
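<p>A minimal sketch of hash-ordered selection follows. It assumes the seeded mixing function <span class="arithmatex">\(H\)</span> given at the end of this page and the 2-bit encoding A=00, C=01, G=10, T=11. The helper names <code>revcomp</code>, <code>canonical_value</code>, <code>candidates</code> and <code>random_minimizer</code> are hypothetical and not taken from obikmer.</p>

```python
SEED = 0x9E3779B97F4A7C15
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def H(x):
    """Seeded 64-bit mixing function (multiplications wrap modulo 2**64)."""
    x ^= SEED
    x ^= x >> 30
    x = x * 0xBF58476D1CE4E5B9 % 2**64
    x ^= x >> 27
    x = x * 0x94D049BB133111EB % 2**64
    return x ^ (x >> 31)


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical_value(mmer):
    """2-bit integer value of the lexicographically smaller strand."""
    value = 0
    for base in min(mmer, revcomp(mmer)):
        value = value * 4 + CODE[base]
    return value


def candidates(kmer, m):
    """Canonical values of the k - m + 1 m-mers of a k-mer."""
    return [canonical_value(kmer[i:i + m]) for i in range(len(kmer) - m + 1)]


def random_minimizer(kmer, m):
    """Canonical m-mer value with the smallest hash in the k-mer window."""
    return min(candidates(kmer, m), key=H)
```

<p>Because the candidates are canonical values, a k-mer and its reverse complement produce the same set of candidates and therefore the same minimizer in this sketch.</p>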
<h2 id="why-the-canonical-form-remains-lexicographic">Why the canonical form remains lexicographic</h2>
<p>An apparent alternative is to redefine the canonical form of each m-mer as the strand with the smaller hash value:</p>
<div class="arithmatex">\[\text{canonical}_H(v) = \arg\min(H(\text{fwd}),\ H(\text{rc}))\]</div>
<p>This must be rejected. The hash of this new canonical is <span class="arithmatex">\(\min(H(\text{fwd}), H(\text{rc}))\)</span> — the minimum of two i.i.d. Uniform<span class="arithmatex">\([0, 2^{64})\)</span> values. Its distribution is:</p>
<div class="arithmatex">\[F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^2\]</div>
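<p>with density <span class="arithmatex">\(f(x) = \frac{2}{2^{64}}\left(1 - \frac{x}{2^{64}}\right)\)</span>, which is <strong>twice the uniform density near 0 and falls linearly to zero near <span class="arithmatex">\(2^{64}\)</span></strong>. Any partition key that preserves the order of the hash values (for example, one read from the high-order bits) inherits this bias: partition 0 receives roughly twice its uniform share of super-kmers, while the last partition receives almost none.</p>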
<p>The lex canonical form does not have this problem: <span class="arithmatex">\(\text{canonical}_{\text{lex}}(v)\)</span> is a fixed, deterministic representative of each equivalence class, and <span class="arithmatex">\(H(\text{canonical}_{\text{lex}})\)</span> is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span> independently of the min/max relationship between the two strands.</p>
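<p>The size of the bias in the rejected scheme can be computed exactly. The sketch below is a model, not obikmer code: it assumes the two strand hashes are independent uniform values rescaled to [0, 1), and that partitions are equal slices of the hash range (as with a key read from the high-order bits). The function names are hypothetical.</p>

```python
from fractions import Fraction


def cdf_min_of_two(u):
    """CDF of min(U1, U2) for two independent uniforms on [0, 1)."""
    return 1 - (1 - u) ** 2


def partition_share(j, n_partitions):
    """Probability that the minimum falls in slice j of n equal slices."""
    lo = Fraction(j, n_partitions)
    hi = Fraction(j + 1, n_partitions)
    return cdf_min_of_two(hi) - cdf_min_of_two(lo)
```

<p>Under these assumptions, <code>partition_share(0, 256)</code> is 511/65536, close to twice the uniform share 1/256, and <code>cdf_min_of_two(Fraction(1, 2))</code> shows that three quarters of all values fall in the lower half of the range.</p>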
<h2 id="partition-key-independence">Partition key independence</h2>
<p>A further subtlety arises when the selection hash is used directly as the partition key. The selected minimizer is the m-mer with the <strong>minimum</strong> <span class="arithmatex">\(H\)</span> value in a window of <span class="arithmatex">\(W = k - m + 1\)</span> positions. The minimum of <span class="arithmatex">\(W\)</span> i.i.d. Uniform<span class="arithmatex">\([0,2^{64})\)</span> values has distribution:</p>
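<div class="arithmatex">\[F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^W \approx \frac{Wx}{2^{64}} \quad \text{for } x \ll \frac{2^{64}}{W}\]</div>
<p>so the probability mass is concentrated near 0: for small <span class="arithmatex">\(x\)</span> the distribution function is about <span class="arithmatex">\(W\)</span> times the uniform one. Using this minimum hash directly as the partition key creates the same kind of bias as lex ordering, just distributed differently.</p>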
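<p>As a worked illustration, take the parameters quoted earlier on this page, k = 31 and m = 11, so that <span class="arithmatex">\(W = 21\)</span>. Assume, purely for the sake of the example, that the partition key is read from the top 8 bits of the minimum hash, giving 256 equal slices of the range. The first slice then receives the fraction</p>
<div class="arithmatex">\[F\!\left(\frac{2^{64}}{256}\right) = 1 - \left(1 - \frac{1}{256}\right)^{21} \approx 0.079 \approx \frac{20}{256}\]</div>
<p>of all values, about 20 times the uniform share <span class="arithmatex">\(1/256 \approx 0.0039\)</span>.</p>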
<p>The correct approach is to decouple selection from partition routing:</p>
<ul>
<li><strong>Selection</strong> uses <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(m\text{-mer}))\)</span> to pick the minimizer in the window.</li>
<li><strong>Partition routing</strong> recomputes <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(\text{minimizer}))\)</span> from the stored minimizer position. This is the hash of a specific m-mer value, not the minimum of a window, and it is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span>.</li>
</ul>
<h2 id="seed-and-fixed-point-elimination">Seed and fixed-point elimination</h2>
<p>The splitmix64 finalizer has a fixed point at 0:</p>
<div class="arithmatex">\[\text{mix64}(0) = 0\]</div>
<p>Since <span class="arithmatex">\(\text{canonical}_{\text{lex}}(\text{AAAA}\cdots\text{A}) = 0\)</span>, using unseeded mix64 causes all-A m-mers to win every window comparison, recreating a pathological partition identical to the lex-ordering bias.</p>
<p>The fix is a non-zero XOR seed applied before mixing:</p>
<div class="arithmatex">\[H(x) = \text{mix64}(x \oplus s), \quad s = \lfloor 2^{64}/\varphi \rfloor = \texttt{0x9e3779b97f4a7c15}\]</div>
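<p>where <span class="arithmatex">\(\varphi\)</span> is the golden ratio. This maps 0 to <span class="arithmatex">\(\text{mix64}(s)\)</span>, a well-distributed non-zero value. The zero output moves to the input <span class="arithmatex">\(x = s\)</span>, whose top bit is set, so it cannot be the 2-bit encoding of any m-mer with m ≤ 31. No canonical m-mer value has a systematically small <span class="arithmatex">\(H\)</span>.</p>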
<div class="admonition abstract">
<p class="admonition-title">Hash function <span class="arithmatex">\(H\)</span></p>
<div class="highlight"><pre><span></span><code>H(x):
x ← x ⊕ 0x9e3779b97f4a7c15
x ← x ⊕ (x &gt;&gt; 30)
x ← x × 0xbf58476d1ce4e5b9
x ← x ⊕ (x &gt;&gt; 27)
x ← x × 0x94d049bb133111eb
return x ⊕ (x &gt;&gt; 31)
</code></pre></div>
</div>
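<p>The pseudocode above translates directly into runnable Python, which also makes the fixed point easy to check. This is a sketch rather than the obikmer source. Python integers are unbounded, so the multiplications are reduced modulo 2<sup>64</sup> explicitly, and the split into <code>mix64</code> and <code>H</code> is chosen here only to expose the unseeded function.</p>

```python
SEED = 0x9E3779B97F4A7C15


def mix64(x):
    """Unseeded splitmix64 finalizer on a 64-bit unsigned value."""
    x ^= x >> 30
    x = x * 0xBF58476D1CE4E5B9 % 2**64
    x ^= x >> 27
    x = x * 0x94D049BB133111EB % 2**64
    return x ^ (x >> 31)


def H(x):
    """Seeded hash: XOR with the golden-ratio constant, then mix."""
    return mix64(x ^ SEED)
```

<p>Here <code>mix64(0)</code> returns 0, the fixed point described above, while <code>H(0)</code> equals <code>mix64(SEED)</code> and is non-zero because <code>mix64</code> is a bijection that already maps 0 to 0.</p>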
</article>
</div>
<script>var target=document.getElementById(location.hash.slice(1));target&&target.name&&(target.checked=target.name.startsWith("__tabbed_"))</script>
</div>
</main>
<footer class="md-footer">
<div class="md-footer-meta md-typeset">
<div class="md-footer-meta__inner md-grid">
<div class="md-copyright">
Made with
<a href="https://squidfunk.github.io/mkdocs-material/" target="_blank" rel="noopener">
Material for MkDocs
</a>
</div>
</div>
</div>
</footer>
</div>
<div class="md-dialog" data-md-component="dialog">
<div class="md-dialog__inner md-typeset"></div>
</div>
<script id="__config" type="application/json">{"annotate": null, "base": "../..", "features": [], "search": "../../assets/javascripts/workers/search.2c215733.min.js", "tags": null, "translations": {"clipboard.copied": "Copied to clipboard", "clipboard.copy": "Copy to clipboard", "search.result.more.one": "1 more on this page", "search.result.more.other": "# more on this page", "search.result.none": "No matching documents", "search.result.one": "1 matching document", "search.result.other": "# matching documents", "search.result.placeholder": "Type to start searching", "search.result.term.missing": "Missing", "select.version": "Select version"}, "version": null}</script>
<script src="../../assets/javascripts/bundle.79ae519e.min.js"></script>
<script src="https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js"></script>
</body>
</html>