58391886a3
- Add `hash` field to MmerItem for stable, randomized minimizer ordering - Introduce hash_mMER() using mix64 with XOR seed to avoid fixed points (e.g., poly-A/T) - Remove is_degenerate() and minimizer_worse(), simplifying comparison to hash-only - Update push logic: compare hashes instead of canonical values with degeneracy checks
159 lines
7.4 KiB
Markdown
159 lines
7.4 KiB
Markdown
# SuperKmer — implementation
|
||
|
||
## Memory layout
|
||
|
||
A super-kmer is stored as a **32-bit header** followed by a **byte-aligned nucleotide sequence** (2 bits/base, nucleotide 0 at the MSB of the first byte, max 256 nt):
|
||
|
||
| Field | Bits | Role |
|
||
|-------|------|------|
|
||
| COUNT | 24 | Occurrence count (≤ 16 M) |
|
||
| SEQL | 8 | Sequence length in nucleotides (1–256) |
|
||
|
||
Bit layout (MSB to LSB): `[31:8] COUNT [7:0] SEQL`
|
||
|
||
SEQL is stored as a raw `u8`: values 1–255 represent lengths 1–255; **0 represents 256** (wrapping convention). The public accessor returns a `usize` and performs the conversion:
|
||
|
||
```rust
|
||
fn seql(&self) -> usize { if s == 0 { 256 } else { s as usize } }
|
||
fn count(&self) -> u32 { self.0 >> 8 }
|
||
fn increment(&mut self) { self.0 += 1 << 8; }
|
||
fn add(&mut self, n: u32) { self.0 += n << 8; }
|
||
fn set_count(&mut self, n: u32) { self.0 = (self.0 & 0xFF) | (n << 8); }
|
||
```
|
||
|
||
The SEQL field is 8 bits, capping the stored sequence at 256 nt. Given the expected length of ~40 nt, this cap is almost never reached; when it is, the super-kmer is split at 256 nt with a k−1 overlap, preserving all kmers without duplication.
|
||
|
||
The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
|
||
|
||
## ASCII encoding and decoding
|
||
|
||
Two lookup tables handle ASCII ↔ 2-bit conversion:
|
||
|
||
- **`ENC: [u8; 32]`** — indexed by `b & 0x1F` (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically.
|
||
- **`DEC4: [u32; 256]`** — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian `u32`. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters.
|
||
|
||
Encoding 4 nucleotides into one byte:
|
||
|
||
```rust
|
||
byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]
|
||
```
|
||
|
||
Decoding one byte into 4 ASCII characters:
|
||
|
||
```rust
|
||
DEC4[byte].to_be_bytes() // [nuc0, nuc1, nuc2, nuc3] in ASCII
|
||
```
|
||
|
||
## Reverse complement
|
||
|
||
The reverse complement is computed **in place** with zero allocation in two steps.
|
||
|
||
**Step 1 — byte swap with `REVCOMP4`.** A 256-byte lookup table `REVCOMP4` maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying `REVCOMP4` to each:
|
||
|
||
```rust
|
||
const fn revcomp4(x: u8) -> u8 {
|
||
let x = !x; // complement all bases
|
||
let x = (x >> 4) | (x << 4); // swap nibbles
|
||
let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2); // swap 2-bit groups
|
||
x
|
||
}
|
||
```
|
||
|
||
`REVCOMP4` is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
|
||
|
||
**Step 2 — realignment.** After step 1, `padding = n × 8 − SEQL × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:
|
||
|
||
```rust
|
||
shift = n * 8 - SEQL * 2 // number of padding bits
|
||
bits.rotate_left(shift)
|
||
bits[len - shift..].fill(false)
|
||
```
|
||
|
||
`Msb0` ordering makes the bit layout hardware-independent.
|
||
|
||
!!! abstract "Algorithm — Super-kmer canonisation"
|
||
```text
|
||
procedure SuperKmerCanonical(seq, SEQL):
|
||
for i ← 0 to SEQL − 1:
|
||
fwd ← nucleotide(seq, i)
|
||
rev ← complement(nucleotide(seq, SEQL − 1 − i))
|
||
if fwd < rev: return seq -- forward is canonical
|
||
if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical
|
||
return seq -- palindrome: either orientation valid
|
||
```
|
||
|
||
## Minimizer sliding window
|
||
|
||
Super-kmers are built by `SuperKmerIter` (crate `obiskbuilder`), which maintains the current minimizer with a **monotonic deque** over a sliding window of W = k − m + 1 m-mer positions.
|
||
|
||
Each deque entry stores:
|
||
|
||
| Field | Type | Purpose |
|
||
|------------|-------|----------------------------------------------|
|
||
| `position` | usize | 0-based start of this m-mer in the segment |
|
||
| `canonical`| u64 | right-aligned canonical m-mer value (lex-min of fwd and rc); used as partition key |
|
||
| `hash` | u64 | $H(\text{canonical})$ — ordering key for random minimizer selection |
|
||
|
||
The hash $H$ is the seeded splitmix64 finalizer (see [Minimizer selection](../theory/minimizer.md)):
|
||
|
||
```rust
|
||
fn hash_mmer(canonical: u64) -> u64 {
|
||
let x = canonical ^ 0x9e3779b97f4a7c15; // seed: eliminates fixed point at 0
|
||
let x = x ^ (x >> 30);
|
||
let x = x.wrapping_mul(0xbf58476d1ce4e5b9);
|
||
let x = x ^ (x >> 27);
|
||
let x = x.wrapping_mul(0x94d049bb133111eb);
|
||
x ^ (x >> 31)
|
||
}
|
||
```
|
||
|
||
On each new nucleotide, once the window is full, the deque is updated:
|
||
|
||
!!! abstract "Algorithm — minimizer deque update"
|
||
```text
|
||
procedure UpdateMinimizer(deque, position, canonical, hash, k, received):
|
||
-- pop dominated entries from the back
|
||
while deque.back.hash ≥ hash:
|
||
deque.pop_back()
|
||
deque.push_back({position, canonical, hash})
|
||
|
||
-- evict expired entries from the front
|
||
while deque.front.position + k < received:
|
||
deque.pop_front()
|
||
```
|
||
|
||
The front of the deque is always the current minimizer. Because the deque is maintained in strictly increasing hash order, each entry is popped at most once — O(1) amortized per nucleotide.
|
||
|
||
A super-kmer boundary is emitted when the minimizer changes: `deque.front.hash ≠ prev_hash`. The `canonical` field of the front entry is **not** used for boundary detection — that uses the hash alone. The canonical value is stored so that the partition key $H(\text{canonical})$ can be recomputed independently at routing time from the stored `minimizer_pos`, without inheriting the minimum-order-statistic bias (see [Minimizer selection — partition key independence](../theory/minimizer.md#partition-key-independence)).
|
||
|
||
## Kmer extraction
|
||
|
||
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
|
||
|
||
```rust
|
||
pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
|
||
```
|
||
|
||
The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a big-endian `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
|
||
|
||
---
|
||
|
||
!!! abstract "Algorithm — Super-kmer reverse complement"
|
||
```text
|
||
procedure SuperKmerRevcomp(seq, SEQL):
|
||
n ← ⌈SEQL / 4⌉ -- number of bytes
|
||
shift ← n × 8 − SEQL × 2 -- padding bits to flush
|
||
|
||
-- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
|
||
lo ← 0 ; hi ← n − 1
|
||
while lo < hi:
|
||
seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
|
||
lo ← lo + 1 ; hi ← hi − 1
|
||
if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]
|
||
|
||
-- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
|
||
if shift > 0:
|
||
bits.rotate_left(shift)
|
||
bits[n×8 − shift .. n×8].fill(0)
|
||
```
|