2026-04-16 22:38:20 +02:00
# DNA encoding
## 2-bit nucleotide encoding
All nucleotides are encoded on 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5′ end across all sequence types:
| Base | Encoding |
|------|----------|
| A | `00` |
| C | `01` |
| G | `10` |
| T | `11` |
The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complement(base) = ~base & 0b11`.
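The table and the complement rule can be sketched in Rust as follows. The function names `encode_base`, `decode_base` and `complement` are hypothetical, and returning `None` for any byte other than A, C, G or T is an assumption, since the handling of ambiguous bases is not described here.

```rust
/// Map an ASCII nucleotide to its 2-bit code (A=00, C=01, G=10, T=11).
/// Assumption: anything else (e.g. `N`) is rejected with `None`.
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Map a 2-bit code back to its ASCII nucleotide.
fn decode_base(code: u8) -> u8 {
    b"ACGT"[(code & 0b11) as usize]
}

/// Watson-Crick complement: bitwise NOT restricted to the low 2 bits.
fn complement(code: u8) -> u8 {
    !code & 0b11
}
```

With this encoding A (`00`) and T (`11`) map to each other, as do C (`01`) and G (`10`), so no lookup is needed for complementation.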
## Kmer encoding
A kmer fits in a single `u64`. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.
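As an illustration of this left-aligned layout, here is a minimal sketch that packs an ASCII sequence into a `u64` and reads a nucleotide back. The names `encode_kmer` and `nucleotide` are hypothetical, and rejecting empty input, k > 32 and non-ACGT bytes with `None` is an assumption.

```rust
/// Pack an ASCII kmer (1 ≤ k ≤ 32) into a left-aligned u64.
/// Nucleotide i lands in bits 63-2i and 62-2i; the low 64-2k bits stay zero.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    let k = seq.len();
    if k == 0 || k > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        let code: u64 = match b {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None, // assumption: no ambiguous bases
        };
        kmer |= code << (62 - 2 * i);
    }
    Some(kmer)
}

/// Extract the 2-bit code of nucleotide i (0 ≤ i < k).
fn nucleotide(kmer: u64, i: usize) -> u8 {
    ((kmer >> (62 - 2 * i)) & 0b11) as u8
}
```

For example, `ACGT` packs to the bit pattern `00 01 10 11` in the top byte, that is `0x1B00_0000_0000_0000`, and `nucleotide(kmer, 2)` returns `0b10` (G).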
Reverse complement is computed via a **16-bit lookup table** (65 536 entries × 2 bytes = 128 KiB, small enough for a typical L2 cache). The table is indexed by an 8-base chunk, and each entry stores that chunk's reverse complement.
!!! abstract "Algorithm — Kmer reverse complement"
    ```text
    procedure KmerRevcomp(kmer, k):
        raw ← TABLE16[kmer & 0xFFFF] << 48
            | TABLE16[(kmer >> 16) & 0xFFFF] << 32
            | TABLE16[(kmer >> 32) & 0xFFFF] << 16
            | TABLE16[(kmer >> 48) & 0xFFFF]
        return raw << (64 - 2*k)
    ```
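The procedure above could be written in Rust roughly as follows. This is a sketch under assumptions: the table is built at runtime into a `Vec<u16>`, and the names `build_table16` and `kmer_revcomp` are hypothetical. The actual implementation may construct or store its table differently.

```rust
/// Build the 65 536-entry table: entry `x` holds the reverse complement
/// of the 8-base chunk whose 2-bit encoding is `x`.
fn build_table16() -> Vec<u16> {
    (0..=u16::MAX)
        .map(|chunk| {
            let mut rc = 0u16;
            for j in 0..8u32 {
                // j-th base counted from the low (3′) end of the chunk
                let base = (chunk >> (2 * j)) & 0b11;
                // complement it and mirror its position within the chunk
                rc |= (!base & 0b11) << (14 - 2 * j);
            }
            rc
        })
        .collect()
}

/// Reverse complement of a left-aligned kmer, 1 ≤ k ≤ 32.
fn kmer_revcomp(table: &[u16], kmer: u64, k: usize) -> u64 {
    debug_assert!((1..=32).contains(&k));
    // Reverse-complement all 32 slots: each 16-bit chunk is looked up
    // and the chunk order is swapped.
    let raw = (table[(kmer & 0xFFFF) as usize] as u64) << 48
        | (table[((kmer >> 16) & 0xFFFF) as usize] as u64) << 32
        | (table[((kmer >> 32) & 0xFFFF) as usize] as u64) << 16
        | (table[((kmer >> 48) & 0xFFFF) as usize] as u64);
    // The zero padding became T's in the high bits; shifting left by
    // 64-2k discards them and left-aligns the result again.
    raw << (64 - 2 * k)
}
```

The final shift is what makes the padding harmless: the unused low slots (all `00`) turn into `11` pairs at the top of `raw`, and the shift pushes exactly those 64−2k bits out while zero-filling the low end.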
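The **canonical form** is the lexicographic minimum of the kmer and its reverse complement. Because kmers are left-aligned and the codes follow A < C < G < T, this is simply the smaller of the two `u64` values: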
```
canonical(kmer) = min(kmer, revcomp(kmer))
```
This roughly halves the kmer space (exactly for odd k, where no kmer is its own reverse complement) and ensures strand-independent counting.
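A self-contained sketch of canonicalization is given below. To keep it short it uses a simple per-base loop for the reverse complement rather than the lookup table; `revcomp_naive` and `canonical` are hypothetical names, and the loop is a swapped-in technique for illustration only.

```rust
/// Per-base reverse complement of a left-aligned kmer (1 ≤ k ≤ 32).
/// Illustrative only; slower than the table-based procedure.
fn revcomp_naive(kmer: u64, k: usize) -> u64 {
    let mut rc = 0u64;
    for i in 0..k {
        let base = (kmer >> (62 - 2 * i)) & 0b11;
        // nucleotide i moves to position k-1-i, complemented
        rc |= (!base & 0b11) << (62 - 2 * (k - 1 - i));
    }
    rc
}

/// Canonical form: the numerically smaller of the kmer and its reverse
/// complement, which matches lexicographic order for left-aligned kmers
/// of the same length.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp_naive(kmer, k))
}
```

For instance, `AAC` and its reverse complement `GTT` both canonicalize to the encoding of `AAC`, so the two strands of the same site are counted under one key.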