# Kmer — implementation
## Memory layout
`Kmer` is a `#[repr(transparent)]` newtype over `u64`:
```rust
#[repr(transparent)]
pub struct Kmer(u64);
```
Nucleotides are packed 2 bits each, **left-aligned**, MSB-first, so a single word holds at most 32 nucleotides (k ≤ 32). Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is **not stored**: it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.

| Bits | 63–62 | 61–60 | … | (65−2k)–(64−2k) | (63−2k)–0 |
|------|-------|-------|---|-----------------|-----------|
| Content | nt 0 | nt 1 | … | nt k−1 | zero padding |
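
The bit positions in this layout can be checked with a small sketch. `nucleotide_at` and `padding_is_zero` are hypothetical helpers that work on the raw `u64`; they are not part of the `Kmer` API, and the example word in the note below assumes the code order A=0, C=1, G=2, T=3.

```rust
/// Hypothetical helper (not part of the documented API): returns the 2-bit
/// code of nucleotide `i` from a left-aligned k-mer word.
fn nucleotide_at(word: u64, i: usize) -> u8 {
    debug_assert!(i < 32);
    // Nucleotide i occupies bits 63-2i and 62-2i, so shift it down to bits 1-0.
    ((word >> (62 - 2 * i)) & 0b11) as u8
}

/// Hypothetical helper: true if the low 64-2k padding bits are all zero.
fn padding_is_zero(word: u64, k: usize) -> bool {
    debug_assert!(k <= 32);
    // For k == 32 there is no padding; skip the shift, since shifting a u64
    // by 64 would overflow.
    k == 32 || word << (2 * k) == 0
}
```

Under the assumed mapping, the 4-mer `ACGT` is the word `0x1B00_0000_0000_0000`: `nucleotide_at` returns 0, 1, 2, 3 for positions 0 to 3, and `padding_is_zero(word, 4)` holds.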
## Encoding
`Kmer::from_ascii(ascii, k)` encodes the first k bytes of an ASCII slice using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)):
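```rust
let mut val: u64 = 0;
for i in 0..k {
    // Append the 2-bit code of base i at the low end.
    val = (val << 2) | encode_base(ascii[i]) as u64;
}
// Left-align the 2k significant bits (assumes 1 <= k <= 32).
Kmer(val << (64 - 2 * k))
```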
Zero allocation — result lives on the stack.
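
As a self-contained illustration of the loop above, here is a minimal sketch. The `encode_base` shown is only a stand-in for the shared `ENC` table: the mapping A=0, C=1, G=2, T=3, the case folding, and the panic on any other byte are assumptions of this sketch, as are the explicit bounds check on `k` and the derived traits.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct Kmer(u64);

/// Stand-in for the shared `ENC` lookup (assumed mapping, see lead-in).
fn encode_base(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        other => panic!("unexpected base byte {other}"),
    }
}

impl Kmer {
    /// Encodes the first `k` bytes of `ascii` into a left-aligned word.
    pub fn from_ascii(ascii: &[u8], k: usize) -> Self {
        // Guard added for the sketch: k == 0 would shift by 64 below.
        assert!((1..=32).contains(&k) && ascii.len() >= k);
        let mut val: u64 = 0;
        for &b in &ascii[..k] {
            val = (val << 2) | encode_base(b) as u64;
        }
        Kmer(val << (64 - 2 * k))
    }
}
```

With this mapping, `Kmer::from_ascii(b"ACGT", 4)` yields `Kmer(0x1B00_0000_0000_0000)`, and bytes beyond the first k are ignored.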
## Decoding
`write_ascii(k, buf)` appends k ASCII characters to a caller-supplied `Vec<u8>` using the shared `DEC4` table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. The hot path performs no allocation of its own; the only possible allocation is growth of the caller's buffer when it lacks spare capacity.

`to_ascii(k)` is a convenience wrapper that allocates and returns a new `Vec<u8>`; it is intended for tests and display only.
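
`DEC4` is defined elsewhere, so the following is only a table-free reference sketch of the same append-to-buffer contract, decoding one nucleotide per iteration instead of four. `write_ascii_naive` is a hypothetical name, and the `ACGT` code order is an assumption; a slow decoder like this could serve as a cross-check for the table-driven version in tests.

```rust
/// Hypothetical reference decoder: appends the k nucleotides of a
/// left-aligned word to `buf` without clearing it first.
fn write_ascii_naive(word: u64, k: usize, buf: &mut Vec<u8>) {
    // Assumed inverse of the encoding table: code 0..=3 -> A, C, G, T.
    const BASES: [u8; 4] = *b"ACGT";
    debug_assert!(k <= 32);
    buf.reserve(k);
    let mut w = word;
    for _ in 0..k {
        // The next nucleotide is always in the top two bits.
        buf.push(BASES[(w >> 62) as usize]);
        w <<= 2;
    }
}
```

Decoding `0x1B00_0000_0000_0000` with k = 4 into a buffer that already holds `>` leaves the buffer as `>ACGT`.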
## Reverse complement
Computed as pure arithmetic — no lookup table, no memory access:
```rust
let x = !self.0;        // complement
let x = x.swap_bytes(); // reverse bytes
let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2); // swap 2-bit groups
Kmer(x << (64 - 2 * k)) // realign to MSB
```
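After complementing, bytes are reversed (`swap_bytes`), then the nibbles within each byte, then the 2-bit groups within each nibble, which leaves the complemented nucleotides in reverse order. At this point they sit in the low 2k bits, with the complemented padding (now all ones) above them; the final left shift by 64−2k discards that padding and realigns the result to the MSB. Zero allocation: the result lives on the stack.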
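
The bit trick can be cross-checked against a slow nucleotide-by-nucleotide version. The sketch below works on raw `u64` words rather than `Kmer`; `revcomp_naive` is a hypothetical reference written for this comparison, and both functions assume 1 ≤ k ≤ 32 and an encoding in which complementary bases have bitwise-inverted codes (for example A=00, C=01, G=10, T=11).

```rust
/// Reverse complement of a left-aligned 2-bit k-mer word (bit-trick version).
fn revcomp(word: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    let x = !word; // complement every nucleotide (and the padding)
    let x = x.swap_bytes(); // reverse byte order
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The result sits in the low 2k bits; shift out the inverted padding.
    x << (64 - 2 * k)
}

/// Hypothetical slow reference: read nucleotides last to first and
/// complement each 2-bit code individually.
fn revcomp_naive(word: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k), "k must be in 1..=32");
    let mut out: u64 = 0;
    for i in (0..k).rev() {
        let code = (word >> (62 - 2 * i)) & 0b11;
        out = (out << 2) | (code ^ 0b11);
    }
    out << (64 - 2 * k)
}
```

Under the example encoding, `AAC` (k = 3) is `0x0400_0000_0000_0000` and its reverse complement `GTT` is `0xBC00_0000_0000_0000`; applying `revcomp` twice with the same k returns the original word.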
## Canonical form
```rust
pub fn canonical(&self, k: usize) -> Self {
    let rc = self.revcomp(k);
    if self.0 <= rc.0 { *self } else { rc }
}
```
The canonical form is the lexicographic minimum of the forward and reverse-complement k-mers. The raw `u64` values are compared directly: both are left-aligned with the same k, so integer order matches nucleotide-wise order under the `ENC` code order. Zero allocation: the result lives on the stack.
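
The property that makes the canonical form useful is strand independence: a k-mer and its reverse complement map to the same value. A minimal sketch follows; the `Copy` derive (needed for `*self`), the other derives, and the bounds check are assumptions of the sketch, and the example words in the note below assume A=0, C=1, G=2, T=3.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct Kmer(u64);

impl Kmer {
    /// Reverse complement, as in the section above (assumes 1 <= k <= 32).
    pub fn revcomp(&self, k: usize) -> Self {
        assert!((1..=32).contains(&k), "k must be in 1..=32");
        let x = (!self.0).swap_bytes();
        let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
        let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
        Kmer(x << (64 - 2 * k))
    }

    /// Smaller of the forward and reverse-complement words.
    pub fn canonical(&self, k: usize) -> Self {
        let rc = self.revcomp(k);
        if self.0 <= rc.0 { *self } else { rc }
    }
}
```

With `AAC` encoded as `Kmer(0x0400_0000_0000_0000)` and `GTT` as `Kmer(0xBC00_0000_0000_0000)`, both `canonical(3)` calls return the `AAC` word, since `AAC` sorts before `GTT`.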