<p>A <strong>minimizer</strong> of a k-mer window is the m-mer (m &lt; k) with the smallest value under some total order ≺ among all k − m + 1 overlapping m-mers in the window. The minimizer is always taken in <strong>canonical form</strong> (lexicographic minimum of forward and reverse complement) to ensure strand-independence.</p>
<p>The minimizer partitions the sequence into <strong>super-kmers</strong>: maximal contiguous runs of overlapping k-mers that share the same minimizer. A single minimizer anchors each super-kmer, enabling partitioned storage and indexing.</p>
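<p>As a minimal sketch of these two definitions, the following Python computes lex-ordered canonical minimizers and splits a sequence into super-kmers. The function names (<code>canonical_lex</code>, <code>minimizer</code>, <code>super_kmers</code>) are illustrative and not taken from any particular implementation. The sequence is assumed to contain only A, C, G, T and to be at least k long, and the minimizer is recomputed for every window for clarity rather than speed.</p>

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_lex(mmer: str) -> str:
    """Lexicographic minimum of an m-mer and its reverse complement.

    For equal-length strings over A, C, G, T, string order matches the
    numeric order of the 2-bit encoding A=00, C=01, G=10, T=11.
    """
    return min(mmer, mmer[::-1].translate(COMPLEMENT))


def minimizer(kmer: str, m: int) -> str:
    """Smallest canonical m-mer among the k - m + 1 m-mers of the window."""
    return min(canonical_lex(kmer[i:i + m]) for i in range(len(kmer) - m + 1))


def super_kmers(seq: str, k: int, m: int):
    """Yield (minimizer, substring) for maximal runs of consecutive k-mers
    that share the same minimizer value."""
    start, current = 0, minimizer(seq[:k], m)
    for i in range(1, len(seq) - k + 1):
        mz = minimizer(seq[i:i + k], m)
        if mz != current:
            # k-mers start..i-1 form one run; the last one ends at i + k - 2.
            yield current, seq[start:i + k - 1]
            start, current = i, mz
    yield current, seq[start:]
```

<p>Each yielded substring covers all k-mers of one run, so consecutive super-kmers overlap by k − 1 bases.</p>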
<h2 id="lexicographic-ordering-and-its-bias">Lexicographic ordering and its bias</h2>
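<p>The classical definition uses lexicographic order on the canonical m-mer value. In 2-bit encoding (A=00, C=01, G=10, T=11), the canonical form is <span class="arithmatex">\(\min_{\text{lex}}(\text{fwd}, \text{rc})\)</span>, so AT-rich m-mers have systematically small values. For example, the all-A m-mer encodes to 0, and the all-T m-mer has the all-A m-mer as its canonical form, so both map to the smallest possible value.</p>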
<p>Since small values always win the lex comparison, low-complexity AT-rich m-mers dominate as minimizers across large genomic regions. On real metagenomics data with k=31, m=11 and 256 partitions, this produces a max/min partition ratio of ≈ 2.75, plus a single pathological partition when the hash function has a fixed point at 0.</p>
<h2 id="random-minimizer">Random minimizer</h2>
<p>A <strong>random minimizer</strong> replaces lex order with a hash order: define <span class="arithmatex">\(H : \{0,1\}^{2m} \to \{0,1\}^{64}\)</span> and select the m-mer with the <strong>minimum <span class="arithmatex">\(H\)</span> value</strong> in the window.</p>
<p>The key property: because <span class="arithmatex">\(H\)</span> is injective with well-distributed outputs, each distinct m-mer in the window has equal probability of holding the minimum hash value. Selection probability is no longer correlated with nucleotide composition.</p>
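<p>A minimal sketch of hash-ordered selection follows. It assumes that <code>mix64</code> is the MurmurHash3 64-bit finalizer and that a non-zero golden-ratio constant is XORed in before mixing; both are stand-ins chosen for illustration, and the helper names are hypothetical.</p>

```python
MASK64 = 2**64 - 1
SEED = 0x9E3779B97F4A7C15  # assumed non-zero seed (common golden-ratio constant)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mix64(x: int) -> int:
    """MurmurHash3 64-bit finalizer, used here as an assumed stand-in mixer."""
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK64
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & MASK64
    x ^= x >> 33
    return x


def H(value: int) -> int:
    """Hash order used for selection: seed first, then mix."""
    return mix64(value ^ SEED)


def encode(mmer: str) -> int:
    """2-bit encoding with A=00, C=01, G=10, T=11."""
    value = 0
    for base in mmer:
        value = value * 4 + CODE[base]
    return value


def canonical_lex(mmer: str) -> int:
    """Encoded lexicographic minimum of the m-mer and its reverse complement."""
    return min(encode(mmer), encode(mmer[::-1].translate(COMPLEMENT)))


def random_minimizer(kmer: str, m: int) -> int:
    """Canonical m-mer value with the smallest hash in the window."""
    candidates = [canonical_lex(kmer[i:i + m]) for i in range(len(kmer) - m + 1)]
    return min(candidates, key=H)
```

<p>Because every candidate is canonicalized before hashing, a k-mer and its reverse complement yield the same minimizer.</p>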
<h2 id="why-the-canonical-form-remains-lexicographic">Why the canonical form remains lexicographic</h2>
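<p>An apparent alternative is to redefine the canonical form of each m-mer as the strand with the smaller hash value:</p>
<div class="arithmatex">\[\text{canonical}_{H}(v) = \operatorname*{arg\,min}_{u \in \{\text{fwd},\, \text{rc}\}} H(u)\]</div>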
<p>This must be rejected. The hash of this new canonical is <span class="arithmatex">\(\min(H(\text{fwd}), H(\text{rc}))\)</span>, the minimum of two i.i.d. Uniform<span class="arithmatex">\([0, 2^{64})\)</span> values. Its distribution is:</p>
<div class="arithmatex">\[F(x) = P(\min \le x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^{2}\]</div>
<p>with density <span class="arithmatex">\(f(x) = \frac{2}{2^{64}}\,(1 - x/2^{64})\)</span>, which is approximately <strong>twice the uniform density near 0</strong> and falls to zero near <span class="arithmatex">\(2^{64}\)</span>. Any partition key that preserves the order of the hash (for example its high-order bits) inherits this bias: partition 0 receives roughly twice its fair share of super-kmers, while the last partition receives almost none.</p>
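<p>The skew can be checked empirically with a small simulation. This sketch models the two strand hashes as independent random 64-bit draws and, as an assumption made only for illustration, takes the partition index from the top two bits of the hash (four partitions).</p>

```python
import random
from collections import Counter


def partition(hash_value: int, bits: int = 2) -> int:
    """Partition index taken from the top `bits` bits of a 64-bit hash."""
    return hash_value >> (64 - bits)


def simulate(n: int = 20000, seed: int = 0):
    """Count partition hits for a single hash versus the min of two hashes."""
    rng = random.Random(seed)
    single, min_of_two = Counter(), Counter()
    for _ in range(n):
        a, b = rng.getrandbits(64), rng.getrandbits(64)
        single[partition(a)] += 1               # hash of one fixed representative
        min_of_two[partition(min(a, b))] += 1   # hash-selected canonical form
    return single, min_of_two


single, min_of_two = simulate()
for p in range(4):
    print(p, single[p], min_of_two[p])
```

<p>With four partitions the minimum-of-two counts are expected to fall in proportions of about 7 : 5 : 3 : 1, while the single-hash counts stay close to uniform.</p>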
<p>The lex canonical form does not have this problem: <span class="arithmatex">\(\text{canonical}_{\text{lex}}(v)\)</span> is a fixed, deterministic representative of each equivalence class, and <span class="arithmatex">\(H(\text{canonical}_{\text{lex}})\)</span> is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span> independently of the min/max relationship between the two strands.</p>
<p>A further subtlety arises when the selection hash is used directly as the partition key. The selected minimizer is the m-mer with the <strong>minimum</strong> <span class="arithmatex">\(H\)</span> value in a window of <span class="arithmatex">\(W = k - m + 1\)</span> positions. The minimum of <span class="arithmatex">\(W\)</span> i.i.d. Uniform<span class="arithmatex">\([0,2^{64})\)</span> values has distribution:</p>
<div class="arithmatex">\[F_W(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^{W}\]</div>
<p>concentrated near 0 relative to the full range: its mean is approximately <span class="arithmatex">\(2^{64}/(W+1)\)</span>. Using this minimum-hash directly as the partition key creates the same bias as lex ordering, just distributed differently.</p>
<p>The correct approach is to decouple selection from partition routing:</p>
<ul>
<li><strong>Selection</strong> uses <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(m\text{-mer}))\)</span> to pick the minimizer in the window.</li>
<li><strong>Partition routing</strong> recomputes <span class="arithmatex">\(H(\text{canonical}_{\text{lex}}(\text{minimizer}))\)</span> from the stored minimizer position. This is the hash of a specific m-mer value, not the minimum of a window; it is uniformly distributed over <span class="arithmatex">\([0, 2^{64})\)</span>.</li>
</ul>
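<p>The two steps above can be sketched as separate functions: one returns only the minimizer position, the other recomputes the hash from that stored position. The mixer (MurmurHash3 finalizer), the seed constant, the modulo reduction to a partition index and all function names are assumptions made for this illustration.</p>

```python
MASK64 = 2**64 - 1
SEED = 0x9E3779B97F4A7C15  # assumed non-zero seed
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mix64(x: int) -> int:
    """MurmurHash3 64-bit finalizer (assumed stand-in mixer)."""
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK64
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & MASK64
    x ^= x >> 33
    return x


def H(value: int) -> int:
    return mix64(value ^ SEED)


def canonical_lex(mmer: str) -> int:
    """2-bit encoded lexicographic minimum of the m-mer and its reverse complement."""
    def encode(s: str) -> int:
        value = 0
        for base in s:
            value = value * 4 + CODE[base]
        return value
    return min(encode(mmer), encode(mmer[::-1].translate(COMPLEMENT)))


def select_position(kmer: str, m: int) -> int:
    """Selection: index of the m-mer whose canonical hash is smallest."""
    return min(range(len(kmer) - m + 1),
               key=lambda i: H(canonical_lex(kmer[i:i + m])))


def route(kmer: str, position: int, m: int, n_partitions: int = 256) -> int:
    """Routing: recompute the hash of the minimizer at the stored position."""
    minimizer = kmer[position:position + m]
    return H(canonical_lex(minimizer)) % n_partitions
```

<p>Keeping only the position means the routing key is always derived from the minimizer value itself, never from a hash cached during window scanning.</p>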
<p>Since <span class="arithmatex">\(\text{canonical}_{\text{lex}}(\text{AAAA}\cdots\text{A}) = 0\)</span> and unseeded mix64 maps 0 to 0, the all-A m-mer receives the smallest possible hash value. Using unseeded mix64 therefore causes all-A m-mers to win every window in which they occur, recreating a pathological partition identical to the lex-ordering bias.</p>
<p>The fix is a non-zero XOR seed applied before mixing:</p>
<div class="arithmatex">\[H(v) = \text{mix64}(v \oplus s)\]</div>
<p>where the seed <span class="arithmatex">\(s\)</span> is a non-zero 64-bit constant derived from <span class="arithmatex">\(\varphi\)</span>, the golden ratio. This maps 0 to <span class="arithmatex">\(\text{mix64}(s)\)</span>, a well-distributed non-zero value. No canonical m-mer value has a systematically small <span class="arithmatex">\(H\)</span>.</p>
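<p>The effect of the seed can be illustrated with a short check. This sketch assumes mix64 is the MurmurHash3 64-bit finalizer and takes the seed to be the widely used constant <code>0x9E3779B97F4A7C15</code>, which is 2<sup>64</sup> divided by the golden ratio, rounded down; the exact mixer and seed of a given implementation may differ.</p>

```python
MASK64 = 2**64 - 1
SEED = 0x9E3779B97F4A7C15  # assumed seed: floor(2**64 / golden ratio)


def mix64(x: int) -> int:
    """MurmurHash3 64-bit finalizer (assumed). Every step maps 0 to 0."""
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK64
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & MASK64
    x ^= x >> 33
    return x


def H(value: int) -> int:
    """Seeded hash: XOR with a non-zero seed, then mix."""
    return mix64(value ^ SEED)


# Unseeded: the all-A m-mer (value 0) hashes to 0 and wins any window it is in.
print(mix64(0) == 0)
# Seeded: value 0 is mapped to mix64(SEED), which is non-zero because the
# mixer is a bijection on 64-bit words and SEED is non-zero.
print(H(0) == mix64(SEED), H(0) != 0)
```

<p>Any non-zero seed removes the fixed point at 0; the golden-ratio constant is simply a conventional choice with no obvious bit pattern.</p>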
<div class="admonition abstract">
<p class="admonition-title">Hash function <span class="arithmatex">\(H\)</span></p>
</div>