Files
obikmer/docmd/implementation/superkmer.md
T
2026-04-19 12:17:16 +02:00

115 lines
5.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# SuperKmer — implementation
## Memory layout
A super-kmer is stored as a **32-bit header** followed by a **byte-aligned nucleotide sequence** (2 bits/base, nucleotide 0 at the MSB of the first byte, max 256 nt):
| Field | Bits | Role |
|-------|------|------|
| COUNT | 24 | Occurrence count (≤ 16 M) |
| SEQL | 8 | Sequence length in nucleotides (1256) |
Bit layout (MSB to LSB): `[31:8] COUNT [7:0] SEQL`
SEQL is stored as a raw `u8`: values 1255 represent lengths 1255; **0 represents 256** (wrapping convention). The public accessor returns a `usize` and performs the conversion:
```rust
fn seql(&self) -> usize { if s == 0 { 256 } else { s as usize } }
fn count(&self) -> u32 { self.0 >> 8 }
fn increment(&mut self) { self.0 += 1 << 8; }
fn add(&mut self, n: u32) { self.0 += n << 8; }
fn set_count(&mut self, n: u32) { self.0 = (self.0 & 0xFF) | (n << 8); }
```
The SEQL field is 8 bits, capping the stored sequence at 256 nt. Given the expected length of ~40 nt, this cap is almost never reached; when it is, the super-kmer is split at 256 nt with a k1 overlap, preserving all kmers without duplication.
The sequence is always stored in canonical form (lexicographic minimum of forward and reverse complement), with nucleotide 0 at the MSB of the first byte. The byte array can be hashed directly without any adjustment.
## ASCII encoding and decoding
Two lookup tables handle ASCII ↔ 2-bit conversion:
- **`ENC: [u8; 32]`** — indexed by `b & 0x1F` (lower 5 bits of the ASCII byte). Maps A/a→0, C/c→1, G/g→2, T/t and U/u→3; ambiguous bases and unknowns silently map to 0 (A). 32 entries, fits entirely in L1 cache. Upper- and lowercase are handled identically.
- **`DEC4: [u32; 256]`** — maps a packed byte (4 nucleotides) to 4 ASCII characters packed as a big-endian `u32`. 1 KB total, fits in L1 cache. One lookup per output byte yields 4 decoded characters.
Encoding 4 nucleotides into one byte:
```rust
byte = ENC[c0 & 0x1F] << 6 | ENC[c1 & 0x1F] << 4 | ENC[c2 & 0x1F] << 2 | ENC[c3 & 0x1F]
```
Decoding one byte into 4 ASCII characters:
```rust
DEC4[byte].to_be_bytes() // [nuc0, nuc1, nuc2, nuc3] in ASCII
```
## Reverse complement
The reverse complement is computed **in place** with zero allocation in two steps.
**Step 1 — byte swap with `REVCOMP4`.** A 256-byte lookup table `REVCOMP4` maps each byte (4 nucleotides) to its reverse complement. Bytes are swapped from the outside in, applying `REVCOMP4` to each:
```rust
const fn revcomp4(x: u8) -> u8 {
let x = !x; // complement all bases
let x = (x >> 4) | (x << 4); // swap nibbles
let x = ((x >> 2) & 0x33) | ((x & 0x33) << 2); // swap 2-bit groups
x
}
```
`REVCOMP4` is 256 bytes (fits in L1 cache), computed at compile time. No endianness dependency — all operations are pure arithmetic on byte values.
**Step 2 — realignment.** After step 1, `padding = n × 8 SEQL × 2` spurious bits (complements of the original padding A's) appear at the start of the array. They are flushed left using `BitSlice<u8, Msb0>::rotate_left(padding)` from the `bitvec` crate, which is SIMD-accelerated. The trailing `padding` bits are then zeroed:
```rust
shift = n * 8 - SEQL * 2 // number of padding bits
bits.rotate_left(shift)
bits[len - shift..].fill(false)
```
`Msb0` ordering makes the bit layout hardware-independent.
!!! abstract "Algorithm — Super-kmer canonisation"
```text
procedure SuperKmerCanonical(seq, SEQL):
for i ← 0 to SEQL 1:
fwd ← nucleotide(seq, i)
rev ← complement(nucleotide(seq, SEQL 1 i))
if fwd < rev: return seq -- forward is canonical
if fwd > rev: return SuperKmerRevcomp(seq, SEQL) -- revcomp is canonical
return seq -- palindrome: either orientation valid
```
## Kmer extraction
A k-mer is extracted from a super-kmer with `SuperKmer::kmer(i, k)`, which returns a `Kmer` — a left-aligned `u64` newtype (see [Kmer implementation](kmer.md)):
```rust
pub fn kmer(&self, i: usize, k: usize) -> Result<Kmer, KmerError>
```
The bit slice `seq[i*2 .. (i+k)*2]` (Msb0 order) is loaded as a big-endian `u64` via `bitvec::load_be`, then left-shifted to produce the canonical left-aligned layout. One call — no loop, no allocation.
---
!!! abstract "Algorithm — Super-kmer reverse complement"
```text
procedure SuperKmerRevcomp(seq, SEQL):
n ← ⌈SEQL / 4⌉ -- number of bytes
shift ← n × 8 SEQL × 2 -- padding bits to flush
-- step 1: swap bytes outside-in, applying REVCOMP4 to each (256-byte L1 table)
lo ← 0 ; hi ← n 1
while lo < hi:
seq[lo], seq[hi] ← REVCOMP4[seq[hi]], REVCOMP4[seq[lo]]
lo ← lo + 1 ; hi ← hi 1
if lo == hi: seq[lo] ← REVCOMP4[seq[lo]]
-- step 2: left-rotate entire bit array by shift, zero trailing bits (SIMD via bitvec)
if shift > 0:
bits.rotate_left(shift)
bits[n×8 shift .. n×8].fill(0)
```