🔧 Replace degenerate minimizer logic with hash-based random ordering
- Add `hash` field to MmerItem for stable, randomized minimizer ordering
- Introduce hash_mMER() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T)
- Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only
- Update push logic: compare hashes instead of canonical values with degeneracy checks
@@ -82,6 +82,50 @@ bits[len - shift..].fill(false)
return seq -- palindrome: either orientation valid
```

## Minimizer sliding window
Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which maintains the current minimizer with a **monotonic deque** over a sliding window of W = k − m + 1 m-mer positions.

Each deque entry stores:

| Field       | Type  | Purpose |
|-------------|-------|---------|
| `position`  | usize | 0-based start of this m-mer in the segment |
| `canonical` | u64   | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
| `hash`      | u64   | $H(\text{canonical})$; ordering key for random minimizer selection |

The hash $H$ is the seeded splitmix64 finalizer (see [Minimizer selection](../theory/minimizer.md)):
```rust
fn hash_mmer(canonical: u64) -> u64 {
    let x = canonical ^ 0x9e3779b97f4a7c15; // seed: eliminates fixed point at 0
    let x = x ^ (x >> 30);
    let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    let x = x ^ (x >> 27);
    let x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}
```
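
To illustrate why the seed matters, the sketch below compares the unseeded finalizer with the seeded `hash_mmer`. It assumes a 2-bit encoding in which A = 0, so that a poly-A m-mer packs to the value 0. The `encode` helper is hypothetical, and the body of `mix64` is an assumption derived from the finalizer steps shown above, not code taken from the crate. Without the seed, the value 0 maps to itself and would always win the minimizer comparison; with the seed it lands on an ordinary nonzero hash.

```rust
/// Unseeded splitmix64 finalizer, shown only for comparison (assumed body).
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Seeded variant, equivalent to the `hash_mmer` function above.
fn hash_mmer(canonical: u64) -> u64 {
    mix64(canonical ^ 0x9e3779b97f4a7c15)
}

/// Hypothetical helper: pack an ASCII m-mer (m <= 32) into a right-aligned
/// 2-bit value, assuming the encoding A=0, C=1, G=2, T=3.
fn encode(mmer: &str) -> u64 {
    mmer.bytes().fold(0u64, |acc, b| {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("unexpected nucleotide"),
        };
        (acc << 2) | code
    })
}
```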
On each new nucleotide, once the window is full, the deque is updated:

!!! abstract "Algorithm: minimizer deque update"

    ```text
    procedure UpdateMinimizer(deque, position, canonical, hash, k, received):
        -- pop dominated entries from the back
        while deque is not empty and deque.back.hash ≥ hash:
            deque.pop_back()
        deque.push_back({position, canonical, hash})

        -- evict expired entries from the front
        while deque.front.position + k < received:
            deque.pop_front()
    ```
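
The update procedure above can be sketched in Rust with `std::collections::VecDeque`. This is a minimal sketch, not the `SuperKmerIter` implementation: the `MmerItem` fields follow the table of deque entries, while the function name `update_minimizer` and its signature are assumptions. `received` is taken to be the number of nucleotides consumed so far, including the one that completed the new m-mer.

```rust
use std::collections::VecDeque;

/// One deque entry, mirroring the field table (layout assumed).
#[derive(Clone, Copy, Debug, PartialEq)]
struct MmerItem {
    position: usize,
    canonical: u64,
    hash: u64,
}

/// Hypothetical single-step update of the monotonic deque.
fn update_minimizer(deque: &mut VecDeque<MmerItem>, item: MmerItem, k: usize, received: usize) {
    // Pop dominated entries from the back; `back()` returns None on an
    // empty deque, which ends the loop.
    while deque.back().map_or(false, |b| b.hash >= item.hash) {
        deque.pop_back();
    }
    deque.push_back(item);

    // Evict entries whose m-mer starts before the current k-mer.
    // The entry just pushed is never expired, so the deque stays non-empty.
    while deque.front().map_or(false, |f| f.position + k < received) {
        deque.pop_front();
    }
}
```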
The deque is kept in strictly increasing hash order from front to back, so the front is always the current minimizer. Each entry is pushed once and popped at most once, which makes the update O(1) amortized per nucleotide.

A super-kmer boundary is emitted when the minimizer changes: `deque.front.hash ≠ prev_hash`. Boundary detection uses the hash alone; the `canonical` field of the front entry plays no part in it. The canonical value is stored so that the partition key $H(\text{canonical})$ can be recomputed independently at routing time from the stored `minimizer_pos`, without inheriting the minimum-order-statistic bias (see [Minimizer selection: partition key independence](../theory/minimizer.md#partition-key-independence)).
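
The boundary rule can be illustrated with a small sketch that takes the minimizer hash of each consecutive k-mer (the front hash after every update) and groups the k-mers into super-kmers. The function name `superkmer_ranges` and the half-open range representation are assumptions made for this example, not part of `obiskbuilder`.

```rust
/// Hypothetical helper: split a run of consecutive k-mers into super-kmers.
/// `minimizer_hashes[i]` is the minimizer hash of the i-th k-mer.
/// Returns half-open ranges `(start, end)` of k-mer indices.
fn superkmer_ranges(minimizer_hashes: &[u64]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..minimizer_hashes.len() {
        // A boundary is emitted when the hash differs from the previous one.
        if minimizer_hashes[i] != minimizer_hashes[i - 1] {
            ranges.push((start, i));
            start = i;
        }
    }
    // Close the last super-kmer, if there was any k-mer at all.
    if !minimizer_hashes.is_empty() {
        ranges.push((start, minimizer_hashes.len()));
    }
    ranges
}
```

Comparing hashes loses no information here: the finalizer is built from xor-shifts and multiplications by odd constants, each of which is invertible on `u64`, so two m-mers have equal hashes only if their canonical values are equal.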
## Kmer extraction
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):