🔧 Replace degenerate minimizer logic with hash-based random ordering
- Add `hash` field to MmerItem for stable, randomized minimizer ordering - Introduce hash_mMER() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T) - Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only - Update push logic: compare hashes instead of canonical values with degeneracy checks
This commit is contained in:
@@ -82,6 +82,50 @@ bits[len - shift..].fill(false)
|
||||
return seq -- palindrome: either orientation valid
|
||||
```
|
||||
|
||||
## Minimizer sliding window
|
||||
|
||||
Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which maintains the current minimizer with a **monotonic deque** over a sliding window of W = k − m + 1 m-mer positions.
|
||||
|
||||
Each deque entry stores:
|
||||
|
||||
| Field | Type | Purpose |
|
||||
|------------|-------|----------------------------------------------|
|
||||
| `position` | usize | 0-based start of this m-mer in the segment |
|
||||
| `canonical`| u64 | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
|
||||
| `hash` | u64 | $H(\text{canonical})$ — ordering key for random minimizer selection |
|
||||
|
||||
The hash $H$ is the seeded splitmix64 finalizer (see [Minimizer selection](../theory/minimizer.md)):
|
||||
|
||||
```rust
|
||||
fn hash_mmer(canonical: u64) -> u64 {
|
||||
let x = canonical ^ 0x9e3779b97f4a7c15; // seed: eliminates fixed point at 0
|
||||
let x = x ^ (x >> 30);
|
||||
let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
|
||||
let x = x ^ (x >> 27);
|
||||
let x = x.wrapping_mul(0x94d049bb133111eb);
|
||||
x ^ (x >> 31)
|
||||
}
|
||||
```
|
||||
|
||||
On each new nucleotide, once the window is full, the deque is updated:
|
||||
|
||||
!!! abstract "Algorithm — minimizer deque update"
|
||||
```text
|
||||
procedure UpdateMinimizer(deque, position, canonical, hash, k, received):
|
||||
-- pop dominated entries from the back
|
||||
while deque.back.hash ≥ hash:
|
||||
deque.pop_back()
|
||||
deque.push_back({position, canonical, hash})
|
||||
|
||||
-- evict expired entries from the front
|
||||
while deque.front.position + k < received:
|
||||
deque.pop_front()
|
||||
```
|
||||
|
||||
The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.
|
||||
|
||||
A super-kmer boundary is emitted when the minimizer changes: `deque.front.hash ≠ prev_hash`. The `canonical` field of the front entry is **not** used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key $H(\text{canonical})$ can be recomputed independently at routing time from the stored `minimizer_pos`, without inheriting the minimum-order-statistic bias (see [Minimizer selection — partition key independence](../theory/minimizer.md#partition-key-independence)).
|
||||
|
||||
## Kmer extraction
|
||||
|
||||
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
|
||||
|
||||
@@ -0,0 +1,73 @@
|
||||
# Minimizer selection
|
||||
|
||||
## Definition
|
||||
|
||||
A **minimizer** of a k-mer window is the m-mer (m < k) with the smallest value under some total order ≺ among all k − m + 1 overlapping m-mers in the window. The minimizer is always taken in **canonical form** (lexicographic minimum of forward and reverse complement) to ensure strand-independence.
|
||||
|
||||
The minimizer partitions the sequence into **super-kmers**: maximal contiguous runs of overlapping k-mers that share the same minimizer. A single minimizer anchors each super-kmer, enabling partitioned storage and indexing.
|
||||
|
||||
## Lexicographic ordering and its bias
|
||||
|
||||
The classical definition uses lexicographic order on the canonical m-mer value. In 2-bit encoding (A=00, C=01, G=10, T=11), the canonical form is $\min_{\text{lex}}(\text{fwd}, \text{rc})$, so AT-rich m-mers have systematically small values:
|
||||
|
||||
$$\text{canonical}(\text{AAAA}\cdots\text{A}) = \text{canonical}(\text{TTTT}\cdots\text{T}) = 0$$
|
||||
|
||||
Since small values always win the lex comparison, low-complexity AT-rich m-mers dominate as minimizers across large genomic regions. On real metagenomics data with k=31, m=11 and 256 partitions, this produces a max/min partition ratio of ≈ 2.75 — and a single pathological partition when the hash function has a fixed point at 0.
|
||||
|
||||
## Random minimizer
|
||||
|
||||
A **random minimizer** replaces lex order with a hash order: define $H : \{0,1\}^{2m} \to \{0,1\}^{64}$ and select the m-mer with the **minimum $H$ value** in the window.
|
||||
|
||||
The key property: because $H$ is a bijection with well-distributed outputs, each distinct m-mer in the window has equal probability of holding the minimum hash value. Selection probability is no longer correlated with nucleotide composition.
|
||||
|
||||
## Why the canonical form remains lexicographic
|
||||
|
||||
An apparent alternative is to redefine the canonical form of each m-mer as the strand with the smaller hash value:
|
||||
|
||||
$$\text{canonical}_H(v) = \arg\min(H(\text{fwd}),\ H(\text{rc}))$$
|
||||
|
||||
This must be rejected. The hash of this new canonical is $\min(H(\text{fwd}), H(\text{rc}))$ — the minimum of two i.i.d. Uniform$[0, 2^{64})$ values. Its distribution is:
|
||||
|
||||
$$F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^2$$
|
||||
|
||||
with density $f(x) = 2(1 - x/2^{64})$, which is approximately **twice as large near 0 than near $2^{64}$**. The low-order partition bits inherit this bias: partition 0 receives roughly twice as many super-kmers as the last partition.
|
||||
|
||||
The lex canonical form does not have this problem: $\text{canonical}_{\text{lex}}(v)$ is a fixed, deterministic representative of each equivalence class, and $H(\text{canonical}_{\text{lex}})$ is uniformly distributed over $[0, 2^{64})$ independently of the min/max relationship between the two strands.
|
||||
|
||||
## Partition key independence
|
||||
|
||||
A further subtlety arises when the selection hash is used directly as the partition key. The selected minimizer is the m-mer with the **minimum** $H$ value in a window of $W = k - m + 1$ positions. The minimum of $W$ i.i.d. Uniform$[0,2^{64})$ values has distribution:
|
||||
|
||||
$$F(x) = 1 - \left(1 - \frac{x}{2^{64}}\right)^W \approx \frac{Wx}{2^{64}}$$
|
||||
|
||||
concentrated near 0 relative to the full range. Using this minimum-hash directly as the partition key creates the same bias as lex ordering, just distributed differently.
|
||||
|
||||
The correct approach is to decouple selection from partition routing:
|
||||
|
||||
- **Selection** uses $H(\text{canonical}_{\text{lex}}(m\text{-mer}))$ to pick the minimizer in the window.
|
||||
- **Partition routing** recomputes $H(\text{canonical}_{\text{lex}}(\text{minimizer}))$ from the stored minimizer position. This is the hash of a specific kmer value, not the minimum of a window — it is uniformly distributed over $[0, 2^{64})$.
|
||||
|
||||
## Seed and fixed-point elimination
|
||||
|
||||
The splitmix64 finalizer has a fixed point at 0:
|
||||
|
||||
$$\text{mix64}(0) = 0$$
|
||||
|
||||
Since $\text{canonical}_{\text{lex}}(\text{AAAA}\cdots\text{A}) = 0$, using unseeded mix64 causes all-A m-mers to win every window comparison, recreating a pathological partition identical to the lex-ordering bias.
|
||||
|
||||
The fix is a non-zero XOR seed applied before mixing:
|
||||
|
||||
$$H(x) = \text{mix64}(x \oplus s), \quad s = \lfloor 2^{64}/\varphi \rfloor = \texttt{0x9e3779b97f4a7c15}$$
|
||||
|
||||
where $\varphi$ is the golden ratio. This maps 0 to $\text{mix64}(s)$, a well-distributed non-zero value. No canonical m-mer value has a systematically small $H$.
|
||||
|
||||
!!! abstract "Hash function $H$"
|
||||
```
|
||||
H(x):
|
||||
x ← x ⊕ 0x9e3779b97f4a7c15
|
||||
x ← x ⊕ (x >> 30)
|
||||
x ← x × 0xbf58476d1ce4e5b9
|
||||
x ← x ⊕ (x >> 27)
|
||||
x ← x × 0x94d049bb133111eb
|
||||
return x ⊕ (x >> 31)
|
||||
```
|
||||
Reference in New Issue
Block a user