obikmer/docmd/theory/encoding.md

# DNA encoding

## 2-bit nucleotide encoding

All nucleotides are encoded on 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5′ end across all sequence types:

| Base | Encoding |
|------|----------|
| A    | `00`     |
| C    | `01`     |
| G    | `10`     |
| T    | `11`     |

The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complement(base) = ~base & 0b11`.

## Kmer encoding

A kmer fits in a single `u64`. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.

Reverse complement is computed by **bit manipulation in four steps**, with no lookup table:

!!! abstract "Algorithm — Kmer reverse complement"
    ```text
    procedure KmerRevcomp(kmer, k):
        x ← ~kmer                                           -- complement all bases
        x ← swap_bytes(x)                                   -- reverse byte order
        x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
           | ((x & 0x0F0F0F0F0F0F0F0F) << 4)               -- swap nibbles within each byte
        x ← ((x >> 2) & 0x3333333333333333)
           | ((x & 0x3333333333333333) << 2)                -- swap 2-bit pairs within each nibble
        return x << (64 - 2*k)                              -- re-align to MSB
    ```

The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 64−2k padding bits.

The **canonical form** is the lexicographic minimum of the kmer and its reverse complement:

```
canonical(kmer) = min(kmer, revcomp(kmer))
```

This halves the kmer space and ensures strand-independent counting.