# DNA encoding

## 2-bit nucleotide encoding

Each nucleotide is encoded in 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5′ end across all sequence types:

| Base | Encoding |
|------|----------|
| A    | `00`     |
| C    | `01`     |
| G    | `10`     |
| T    | `11`     |

The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complement(base) = ~base & 0b11`. For example, C (`01`) maps to G (`10`).

## Kmer encoding

A kmer of length k ≤ 32 fits in a single `u64`. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero.

Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.

Reverse complement is computed via a **16-bit lookup table** (65 536 entries × 2 bytes = 128 KiB, small enough to fit in a typical L2 cache) storing the reverse complement of every 8-base chunk. Reversing the order of the four chunks and reverse-complementing each one gives the reverse complement of the full 64-bit word; the final left shift discards the bits that came from the zero padding and restores the left-aligned layout.

!!! abstract "Algorithm — Kmer reverse complement"

    ```text
    procedure KmerRevcomp(kmer, k):
        raw ← TABLE16[kmer & 0xFFFF]         << 48
            | TABLE16[(kmer >> 16) & 0xFFFF] << 32
            | TABLE16[(kmer >> 32) & 0xFFFF] << 16
            | TABLE16[(kmer >> 48) & 0xFFFF]
        return raw << (64 - 2*k)
    ```

The **canonical form** is the lexicographic minimum of the kmer and its reverse complement:

```
canonical(kmer) = min(kmer, revcomp(kmer))
```

Because the encoding is MSB-first and left-aligned with A < C < G < T, comparing two kmers of the same length as unsigned integers is the same as comparing them lexicographically. Canonicalization roughly halves the kmer space (for even k, palindromic kmers equal their own reverse complement) and ensures strand-independent counting.
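
The encoding, extraction, table-driven reverse complement, and canonical form described above can be sketched in Rust as follows. This is a minimal illustration, not the reference implementation: the function names (`encode_kmer`, `nucleotide`, `build_table16`, `kmer_revcomp`, `canonical`) are hypothetical, the table is built at runtime into a `Vec<u16>` for simplicity, input is assumed to be uppercase `ACGT` only, and `k` is assumed to satisfy 1 ≤ k ≤ 32.

```rust
/// Encode one base in 2 bits (A=00, C=01, G=10, T=11).
/// Assumption: only uppercase ACGT is accepted; anything else yields None.
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Pack up to 32 bases into a left-aligned u64 (base 0 in bits 63-62).
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        kmer |= encode_base(b)? << (62 - 2 * i);
    }
    Some(kmer)
}

/// Extract nucleotide i (0 <= i < k) as its 2-bit code.
fn nucleotide(kmer: u64, i: usize) -> u64 {
    (kmer >> (62 - 2 * i)) & 0b11
}

/// Build the 65 536-entry table: entry x is the reverse complement
/// of the 8 bases packed in the 16-bit value x.
fn build_table16() -> Vec<u16> {
    (0..=u16::MAX)
        .map(|x| {
            let mut rc = 0u16;
            for j in 0..8 {
                // Read the base in 2-bit slot j, complement it (bitwise NOT
                // on 2 bits), and write it to the mirrored slot 7 - j.
                let base = (x >> (2 * j)) & 0b11;
                rc |= (!base & 0b11) << (2 * (7 - j));
            }
            rc
        })
        .collect()
}

/// Reverse complement of a left-aligned kmer of length k (1 <= k <= 32).
fn kmer_revcomp(table: &[u16], kmer: u64, k: u32) -> u64 {
    debug_assert!((1..=32).contains(&k));
    let t = |chunk: u64| table[(chunk & 0xFFFF) as usize] as u64;
    // Swap the order of the four 16-bit chunks, reverse-complementing each.
    let raw = (t(kmer) << 48) | (t(kmer >> 16) << 32) | (t(kmer >> 32) << 16) | t(kmer >> 48);
    // Drop the bits produced by the zero padding and left-align the result.
    raw << (64 - 2 * k)
}

/// Canonical form: the smaller of the kmer and its reverse complement.
fn canonical(table: &[u16], kmer: u64, k: u32) -> u64 {
    kmer.min(kmer_revcomp(table, kmer, k))
}
```

With these helpers, `AACG` and its reverse complement `CGTT` map to the same canonical value, namely the encoding of `AACG`. Building the table at runtime keeps the sketch short; a production implementation might instead generate it at compile time or store it as a static array.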