2026-04-19 12:17:16 +02:00

DNA encoding

2-bit nucleotide encoding

All nucleotides are encoded on 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5′ end across all sequence types:

| Base | Encoding |
|------|----------|
| A    | 00       |
| C    | 01       |
| G    | 10       |
| T    | 11       |

The Watson-Crick complement of any base is its bitwise NOT on 2 bits: complement(base) = ~base & 0b11.
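
The table and the complement rule can be sketched in Rust as follows. This is a minimal illustration, not the project's actual code: the function names `encode_base` and `complement` are hypothetical, and accepting lower-case input is an assumption made here for convenience.

```rust
/// Encode one nucleotide on 2 bits (A=00, C=01, G=10, T=11).
/// Returns None for any byte other than A, C, G, T.
/// Accepting lower-case letters is an assumption of this sketch.
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Watson-Crick complement: bitwise NOT restricted to the low 2 bits.
fn complement(code: u8) -> u8 {
    !code & 0b11
}
```

With this numbering, A (00) and T (11) map to each other, as do C (01) and G (10), so no branch or lookup is needed to complement a base.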

Kmer encoding

A kmer fits in a single u64 (so k ≤ 32). Nucleotide 0 occupies bits 63 and 62, nucleotide i occupies bits 63 − 2i and 62 − 2i, and the low 64 − 2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): (kmer >> (62 - 2*i)) & 0b11.
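
As an illustration of this layout, the sketch below packs an ASCII sequence into a left-aligned u64 and reads a nucleotide back. The names `encode_kmer` and `nucleotide` are hypothetical, and rejecting anything other than upper-case A, C, G, T is an assumption of the example.

```rust
/// Pack 1 to 32 ASCII bases into a left-aligned u64:
/// nucleotide i lands in bits 63-2i and 62-2i, unused low bits stay zero.
/// Returns None for an empty or over-long sequence, or a non-ACGT byte.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        let code: u64 = match b {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        kmer |= code << (62 - 2 * i);
    }
    Some(kmer)
}

/// Extract nucleotide i (0 <= i < k) as its 2-bit code.
fn nucleotide(kmer: u64, i: usize) -> u8 {
    ((kmer >> (62 - 2 * i)) & 0b11) as u8
}
```

For example, `ACGT` packs to the bit pattern `00 01 10 11` in the top byte, that is `0x1B00_0000_0000_0000`, with the remaining 56 bits zero.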

Reverse complement is computed via a lookup table indexed by 16-bit words (65 536 entries × 2 bytes = 128 KiB, small enough for a typical L2 cache) that stores the reverse complement of every 8-base chunk.

!!! abstract "Algorithm — Kmer reverse complement"

    ```text
    procedure KmerRevcomp(kmer, k):
        raw ← TABLE16[kmer & 0xFFFF] << 48
            | TABLE16[(kmer >> 16) & 0xFFFF] << 32
            | TABLE16[(kmer >> 32) & 0xFFFF] << 16
            | TABLE16[(kmer >> 48) & 0xFFFF]
        return raw << (64 - 2*k)
    ```
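
A possible Rust rendering of this procedure is sketched below. How the real table is built is not stated in this document, so `revcomp_chunk` and `build_table16` are assumptions made for illustration; only `kmer_revcomp` follows the pseudocode directly. The final shift works because the zero padding of a left-aligned kmer reverse-complements to a run of 1 bits at the top of `raw`, which the shift discards.

```rust
/// Reverse complement of one 16-bit chunk (8 bases), computed base by base.
/// Hypothetical helper used only to fill the table.
fn revcomp_chunk(chunk: u16) -> u16 {
    let mut out = 0u16;
    for j in 0..8 {
        let base = (chunk >> (2 * j)) & 0b11; // base at bit offset 2j from the LSB
        out |= (!base & 0b11) << (14 - 2 * j); // complement it, mirror its position
    }
    out
}

/// Build the 65 536-entry table (2 bytes per entry).
fn build_table16() -> Vec<u16> {
    (0..=u16::MAX).map(revcomp_chunk).collect()
}

/// Reverse complement of a left-aligned kmer, 1 <= k <= 32.
fn kmer_revcomp(table: &[u16], kmer: u64, k: u32) -> u64 {
    debug_assert!((1..=32).contains(&k));
    // Look up the reverse complement of the low 16 bits of x.
    let t = |x: u64| table[(x & 0xFFFF) as usize] as u64;
    // Swap the four chunks end to end while reverse-complementing each one.
    let raw = (t(kmer) << 48) | (t(kmer >> 16) << 32) | (t(kmer >> 32) << 16) | t(kmer >> 48);
    // Drop the reverse-complemented padding and left-align the result.
    raw << (64 - 2 * k)
}
```

In practice the table would be built once and shared; it is passed explicitly here only to keep the sketch free of global state.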

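The canonical form is the lexicographic minimum of the kmer and its reverse complement. Because A < C < G < T in the 2-bit code and kmers are left-aligned MSB-first, for a fixed k this is simply the numeric minimum of the two u64 values: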

canonical(kmer) = min(kmer, revcomp(kmer))

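This roughly halves the kmer space and ensures strand-independent counting. The halving is exact for odd k; for even k, a kmer can be its own reverse complement, and such a kmer is already canonical.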
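
The canonical form can be sketched as below. To keep the example self-contained it uses a simple per-base reverse complement (`revcomp_naive`, a hypothetical helper) rather than the table-driven routine described above; `canonical` itself is just an integer minimum.

```rust
/// Reference reverse complement, one base at a time.
/// This is NOT the table-driven version; it exists only for the example.
fn revcomp_naive(kmer: u64, k: u32) -> u64 {
    let mut out = 0u64;
    for i in 0..k {
        let base = (kmer >> (62 - 2 * i)) & 0b11; // nucleotide i
        let j = k - 1 - i; // its position after reversal
        out |= (!base & 0b11) << (62 - 2 * j);
    }
    out
}

/// Canonical form: the smaller of the kmer and its reverse complement.
/// Integer order matches lexicographic order for left-aligned kmers of equal k.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp_naive(kmer, k))
}
```

For instance, `AAC` (`0x0400_0000_0000_0000`) and its reverse complement `GTT` (`0xBC00_0000_0000_0000`) both map to the encoding of `AAC`, so the two strands are counted under one key.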