Files
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00

43 lines
1.8 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# DNA encoding
## 2-bit nucleotide encoding
All nucleotides are encoded on 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5 end across all sequence types:
| Base | Encoding |
|------|----------|
| A | `00` |
| C | `01` |
| G | `10` |
| T | `11` |
The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complement(base) = ~base & 0b11`.
## Kmer encoding
A kmer fits in a single `u64`. Nucleotide 0 occupies bits 6362, nucleotide i occupies bits 632i and 622i, and the low 642k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.
Reverse complement is computed by **bit manipulation in four steps**, with no lookup table:
!!! abstract "Algorithm — Kmer reverse complement"
```text
procedure KmerRevcomp(kmer, k):
x ← ~kmer -- complement all bases
x ← swap_bytes(x) -- reverse byte order
x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
| ((x & 0x0F0F0F0F0F0F0F0F) << 4) -- swap nibbles within each byte
x ← ((x >> 2) & 0x3333333333333333)
| ((x & 0x3333333333333333) << 2) -- swap 2-bit pairs within each nibble
return x << (64 - 2*k) -- re-align to MSB
```
The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 642k padding bits.
The **canonical form** is the lexicographic minimum of the kmer and its reverse complement:
```
canonical(kmer) = min(kmer, revcomp(kmer))
```
This halves the kmer space and ensures strand-independent counting.