# DNA encoding
## 2-bit nucleotide encoding
All nucleotides are encoded on 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5′ end across all sequence types:
| Base | Encoding |
|------|----------|
| A | `00` |
| C | `01` |
| G | `10` |
| T | `11` |
The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complement(base) = ~base & 0b11`.
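
The table and the complement rule above can be sketched in Rust as follows. The function names `encode_base` and `complement` are hypothetical, and returning `None` for anything other than uppercase `A`, `C`, `G`, `T` is an assumption of this sketch, since the text does not say how other symbols are handled.

```rust
/// Map an uppercase nucleotide to its 2-bit code (A=00, C=01, G=10, T=11).
/// Any other byte yields None (an assumption of this sketch).
fn encode_base(base: u8) -> Option<u8> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Watson-Crick complement: bitwise NOT restricted to the low 2 bits.
fn complement(code: u8) -> u8 {
    !code & 0b11
}
```

With these codes, A (`00`) and T (`11`) map to each other, as do C (`01`) and G (`10`), so applying `complement` twice returns the original code.
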
## Kmer encoding
A kmer fits in a single `u64`. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63 − 2i and 62 − 2i, and the low 64 − 2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.
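
As an illustration of this left-aligned layout, the sketch below packs an ASCII kmer into a `u64` and reads a nucleotide back. The names `encode_kmer` and `nucleotide_at` are hypothetical, and rejecting inputs longer than 32 bases or containing symbols other than uppercase `A`, `C`, `G`, `T` is an assumption of the sketch.

```rust
/// Pack an ASCII kmer (k <= 32, uppercase ACGT only) into a left-aligned u64.
/// Returns None for longer inputs or other symbols (assumption of this sketch).
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut kmer = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        let code: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        // Nucleotide i occupies bits 63 - 2i and 62 - 2i.
        kmer |= code << (62 - 2 * i);
    }
    // Bits below position 64 - 2k were never written, so they stay zero.
    Some(kmer)
}

/// Extract the 2-bit code of nucleotide i (0 <= i < k).
fn nucleotide_at(kmer: u64, i: usize) -> u64 {
    (kmer >> (62 - 2 * i)) & 0b11
}
```

For example, `ACGT` packs to `0x1B00_0000_0000_0000`: the top byte is `00 01 10 11` and the remaining 56 bits are zero.
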
Reverse complement is computed via a **16-bit lookup table** (65 536 entries × 2 bytes = 128 KiB, small enough for a typical L2 cache) that stores the reverse complement of every 8-base chunk.
!!! abstract "Algorithm — Kmer reverse complement"
```text
procedure KmerRevcomp(kmer, k):
    raw ← TABLE16[kmer & 0xFFFF] << 48
        | TABLE16[(kmer >> 16) & 0xFFFF] << 32
        | TABLE16[(kmer >> 32) & 0xFFFF] << 16
        | TABLE16[(kmer >> 48) & 0xFFFF]
    return raw << (64 - 2*k)
```
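
The procedure above could be written in Rust roughly as follows. This is a sketch under assumptions: the text does not say how the table is built or stored, so `build_table16` (a hypothetical name) fills a `Vec<u16>` at runtime, and `kmer_revcomp` takes the table as a parameter and expects 1 ≤ k ≤ 32.

```rust
/// Build the lookup table: entry x holds the reverse complement of the
/// 8-base chunk whose 16-bit encoding is x.
fn build_table16() -> Vec<u16> {
    (0..=u16::MAX)
        .map(|chunk| {
            let mut rc = 0u16;
            for j in 0..8u32 {
                // Base j counted from the low end of the chunk...
                let base = (chunk >> (2 * j)) & 0b11;
                // ...is complemented and placed j bases from the high end.
                rc |= (!base & 0b11) << (14 - 2 * j);
            }
            rc
        })
        .collect()
}

/// Reverse complement of a left-aligned kmer of length k (1 <= k <= 32).
fn kmer_revcomp(table: &[u16], kmer: u64, k: u32) -> u64 {
    debug_assert!((1..=32).contains(&k));
    // Reverse-complement one 16-bit chunk taken at the given bit offset.
    let rc = |shift: u32| table[((kmer >> shift) & 0xFFFF) as usize] as u64;
    // Swap the chunk order: the lowest chunk ends up highest.
    let raw = (rc(0) << 48) | (rc(16) << 32) | (rc(32) << 16) | rc(48);
    // The zero padding of the input became leading T's; shift them out.
    raw << (64 - 2 * k)
}
```

The final shift is what keeps the result left-aligned: the 64 − 2k zero bits at the low end of the input are complemented to ones and moved to the high end of `raw`, and shifting left by the same amount discards them and restores zeros at the low end.
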
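The **canonical form** is the lexicographic minimum of the kmer and its reverse complement. Because kmers are left-aligned and the codes follow the order A < C < G < T, this is also the numeric minimum of the two `u64` values: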
```
canonical(kmer) = min(kmer, revcomp(kmer))
```
This roughly halves the kmer space (exactly so for odd k, where no kmer is its own reverse complement) and ensures strand-independent counting.
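
A minimal sketch of canonicalization, assuming the left-aligned layout described earlier. To stay self-contained it uses a base-by-base reverse complement, `revcomp_naive` (a hypothetical helper, not the table-based routine), and both function names are illustrative.

```rust
/// Reference reverse complement, one base at a time. This is a simple
/// stand-in for illustration, not the table-based routine.
fn revcomp_naive(kmer: u64, k: u32) -> u64 {
    let mut rc = 0u64;
    for i in 0..k {
        let base = (kmer >> (62 - 2 * i)) & 0b11;
        // Base i of the input becomes base k-1-i of the output, complemented.
        rc |= (!base & 0b11) << (62 - 2 * (k - 1 - i));
    }
    rc
}

/// Canonical form: the smaller of the kmer and its reverse complement.
/// Integer comparison matches lexicographic order for equal-length,
/// left-aligned kmers.
fn canonical(kmer: u64, k: u32) -> u64 {
    kmer.min(revcomp_naive(kmer, k))
}
```

For instance, `GTT` (`0xBC00_0000_0000_0000`) and its reverse complement `AAC` (`0x0400_0000_0000_0000`) both canonicalize to `AAC`, so reads from either strand contribute to the same count.
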