036d044291
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
43 lines
1.8 KiB
Markdown
43 lines
1.8 KiB
Markdown
# DNA encoding
|
||
|
||
## 2-bit nucleotide encoding
|
||
|
||
All nucleotides are encoded on 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5′ end across all sequence types:
|
||
|
||
| Base | Encoding |
|
||
|------|----------|
|
||
| A | `00` |
|
||
| C | `01` |
|
||
| G | `10` |
|
||
| T | `11` |
|
||
|
||
The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complement(base) = ~base & 0b11`.
|
||
|
||
## Kmer encoding
|
||
|
||
A kmer fits in a single `u64`. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.
|
||
|
||
Reverse complement is computed by **bit manipulation in four steps**, with no lookup table:
|
||
|
||
!!! abstract "Algorithm — Kmer reverse complement"
|
||
```text
|
||
procedure KmerRevcomp(kmer, k):
|
||
x ← ~kmer -- complement all bases
|
||
x ← swap_bytes(x) -- reverse byte order
|
||
x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
|
||
| ((x & 0x0F0F0F0F0F0F0F0F) << 4) -- swap nibbles within each byte
|
||
x ← ((x >> 2) & 0x3333333333333333)
|
||
| ((x & 0x3333333333333333) << 2) -- swap 2-bit pairs within each nibble
|
||
return x << (64 - 2*k) -- re-align to MSB
|
||
```
|
||
|
||
The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 64−2k padding bits.
|
||
|
||
The **canonical form** is the lexicographic minimum of the kmer and its reverse complement:
|
||
|
||
```
|
||
canonical(kmer) = min(kmer, revcomp(kmer))
|
||
```
|
||
|
||
This halves the kmer space and ensures strand-independent counting.
|