Files
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00

1.8 KiB
Raw Permalink Blame History

DNA encoding

2-bit nucleotide encoding

All nucleotides are encoded on 2 bits, MSB-first within each word. Nucleotides are numbered 0-based from the 5 end across all sequence types:

Base Encoding
A 00
C 01
G 10
T 11

The Watson-Crick complement of any base is its bitwise NOT on 2 bits: complement(base) = ~base & 0b11.

Kmer encoding

A kmer fits in a single u64. Nucleotide 0 occupies bits 6362, nucleotide i occupies bits 632i and 622i, and the low 642k bits are zero. Extraction of nucleotide i (0 ≤ i < k): (kmer >> (62 - 2*i)) & 0b11.

Reverse complement is computed by bit manipulation in four steps, with no lookup table:

!!! abstract "Algorithm — Kmer reverse complement" text procedure KmerRevcomp(kmer, k): x ← ~kmer -- complement all bases x ← swap_bytes(x) -- reverse byte order x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4) -- swap nibbles within each byte x ← ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2) -- swap 2-bit pairs within each nibble return x << (64 - 2*k) -- re-align to MSB

The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 642k padding bits.

The canonical form is the lexicographic minimum of the kmer and its reverse complement:

canonical(kmer) = min(kmer, revcomp(kmer))

This halves the kmer space and ensures strand-independent counting.