Minimizer selection
Definition
A minimizer of a k-mer window is the m-mer (m < k) with the smallest value under some total order ≺ among all k − m + 1 overlapping m-mers in the window. The minimizer is always taken in canonical form (lexicographic minimum of forward and reverse complement) to ensure strand-independence.
The minimizer partitions the sequence into super-kmers: maximal contiguous runs of overlapping k-mers that share the same minimizer. A single minimizer anchors each super-kmer, enabling partitioned storage and indexing.
Lexicographic ordering and its bias
The classical definition uses lexicographic order on the canonical m-mer value. In 2-bit encoding (A=00, C=01, G=10, T=11), the canonical form is \(\min_{\text{lex}}(\text{fwd}, \text{rc})\), so AT-rich m-mers have systematically small values:
Since small values always win the lex comparison, low-complexity AT-rich m-mers dominate as minimizers across large genomic regions. On real metagenomics data with k=31, m=11 and 256 partitions, this produces a max/min partition ratio of ≈ 2.75 — and a single pathological partition when the hash function has a fixed point at 0.
Random minimizer
A random minimizer replaces lex order with a hash order: define \(H : \{0,1\}^{2m} \to \{0,1\}^{64}\) and select the m-mer with the minimum \(H\) value in the window.
The key property: because \(H\) is a bijection with well-distributed outputs, each distinct m-mer in the window has equal probability of holding the minimum hash value. Selection probability is no longer correlated with nucleotide composition.
Why the canonical form remains lexicographic
An apparent alternative is to redefine the canonical form of each m-mer as the strand with the smaller hash value:
This must be rejected. The hash of this new canonical is \(\min(H(\text{fwd}), H(\text{rc}))\) — the minimum of two i.i.d. Uniform\([0, 2^{64})\) values. Its distribution is:
with density \(f(x) = 2(1 - x/2^{64})\), which is approximately twice as large near 0 than near \(2^{64}\). The low-order partition bits inherit this bias: partition 0 receives roughly twice as many super-kmers as the last partition.
The lex canonical form does not have this problem: \(\text{canonical}_{\text{lex}}(v)\) is a fixed, deterministic representative of each equivalence class, and \(H(\text{canonical}_{\text{lex}})\) is uniformly distributed over \([0, 2^{64})\) independently of the min/max relationship between the two strands.
Partition key independence
A further subtlety arises when the selection hash is used directly as the partition key. The selected minimizer is the m-mer with the minimum \(H\) value in a window of \(W = k - m + 1\) positions. The minimum of \(W\) i.i.d. Uniform\([0,2^{64})\) values has distribution:
concentrated near 0 relative to the full range. Using this minimum-hash directly as the partition key creates the same bias as lex ordering, just distributed differently.
The correct approach is to decouple selection from partition routing:
- Selection uses \(H(\text{canonical}_{\text{lex}}(m\text{-mer}))\) to pick the minimizer in the window.
- Partition routing recomputes \(H(\text{canonical}_{\text{lex}}(\text{minimizer}))\) from the stored minimizer position. This is the hash of a specific kmer value, not the minimum of a window — it is uniformly distributed over \([0, 2^{64})\).
Seed and fixed-point elimination
The splitmix64 finalizer has a fixed point at 0:
Since \(\text{canonical}_{\text{lex}}(\text{AAAA}\cdots\text{A}) = 0\), using unseeded mix64 causes all-A m-mers to win every window comparison, recreating a pathological partition identical to the lex-ordering bias.
The fix is a non-zero XOR seed applied before mixing:
where \(\varphi\) is the golden ratio. This maps 0 to \(\text{mix64}(s)\), a well-distributed non-zero value. No canonical m-mer value has a systematically small \(H\).
Hash function \(H\)
H(x):
x ← x ⊕ 0x9e3779b97f4a7c15
x ← x ⊕ (x >> 30)
x ← x × 0xbf58476d1ce4e5b9
x ← x ⊕ (x >> 27)
x ← x × 0x94d049bb133111eb
return x ⊕ (x >> 31)