# Kmer — implementation
## Types and layout
`KmerOf<L>` is a `#[repr(transparent)]` newtype over `u64` parameterized by a `KmerLength` marker:
```rust
#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);
```
Three marker types implement `KmerLength`:

| Marker | `len()` source | Used for |
|--------|----------------|----------|
| `KLen` | `params::k()` | k-mers |
| `MLen` | `params::m()` | minimizers |
| `ConstLen<N>` | const generic `N` | tests |

Public aliases:
```rust
pub type Kmer = KmerOf<KLen>; // k-mer, global k
pub type Minimizer = CanonicalKmerOf<MLen>; // canonical m-mer, global m
```
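
The marker-parameterized newtype can be sketched as follows. Only the struct definition comes from the text above; the shape of the trait (`fn len() -> usize`), the `ConstLen` implementation and the `k()` accessor are assumptions made for illustration.

```rust
use std::marker::PhantomData;

/// Assumed shape of the marker trait: the length is a property of the type.
pub trait KmerLength {
    fn len() -> usize;
}

/// Compile-time length marker, as used in tests.
pub struct ConstLen<const N: usize>;

impl<const N: usize> KmerLength for ConstLen<N> {
    fn len() -> usize {
        N
    }
}

#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);

impl<L: KmerLength> KmerOf<L> {
    /// Hypothetical accessor: the length is read from the marker, not from `self`.
    pub fn k(&self) -> usize {
        L::len()
    }
}
```

Because `PhantomData<L>` is zero-sized, the value keeps the size of a plain `u64` whatever marker is used.
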
Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is **not stored**: every operation reads it from `L::len()`.

| 63–62 | 61–60 | … | 63−2(k−1) to 62−2(k−1) | 63−2k down to 0 |
|-------|-------|---|------------------------|-----------------|
| nt 0 | nt 1 | … | nt k−1 | zero padding |
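
To make the bit positions concrete, here is a small sketch that reads nucleotide i out of a left-aligned word and checks the zero-padding invariant. Both helpers are hypothetical (they are not part of the documented API) and work on a bare `u64` with the length passed explicitly.

```rust
/// Hypothetical helper: 2-bit code of nucleotide `i` (0-based) in a left-aligned word.
fn nucleotide_at(raw: u64, i: usize) -> u8 {
    assert!(i < 32);
    // Nucleotide i occupies bits 63-2i and 62-2i; shift them down to bits 1..0.
    ((raw >> (62 - 2 * i)) & 0b11) as u8
}

/// Hypothetical check: the low 64 - 2*len bits must all be zero.
fn padding_is_zero(raw: u64, len: usize) -> bool {
    assert!((1..=32).contains(&len));
    // Shifting out the 2*len payload bits leaves only the padding.
    len == 32 || raw << (2 * len) == 0
}
```

For instance, the word `0x1B00_0000_0000_0000` holds the four codes 0, 1, 2, 3 in its top byte and nothing below.
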
## Global parameters
`params::set_k(k)` / `params::k()` and `params::set_m(m)` / `params::m()` are backed by `OnceLock<usize>` in production (write-once, panic on conflict) and by `thread_local! { Cell<usize> }` in test builds (per-thread, freely writable). `params::init(k, m)` sets both in one call.
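
A write-once global of this kind can be sketched with `std::sync::OnceLock`. This is an illustration of the production variant only, not the crate's actual code; in particular, treating a repeated call with the *same* value as harmless is an assumption.

```rust
use std::sync::OnceLock;

static K: OnceLock<usize> = OnceLock::new();

/// Sets k once. A later call with a different value panics
/// (assumption: a repeated call with the same value is accepted).
pub fn set_k(k: usize) {
    let stored = *K.get_or_init(|| k);
    assert_eq!(stored, k, "k already set to {stored}, cannot change to {k}");
}

/// Reads k; panics if it was never set.
pub fn k() -> usize {
    *K.get().expect("k has not been set")
}
```

The test-build variant described above would swap the `OnceLock` for a `thread_local!` `Cell<usize>` so that each test thread can pick its own value.
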
## Encoding
`KmerOf::<L>::from_ascii(ascii)` encodes the first `L::len()` bytes using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)):
```rust
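let k = L::len(); // length comes from the marker type
let mut val: u64 = 0;
for i in 0..k {
    // Append the 2-bit code of base i at the low end.
    val = (val << 2) | encode_base(ascii[i]) as u64;
}
KmerOf(val << (64 - 2 * k), PhantomData) // left-align to the MSB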
```
Zero allocation — result lives on the stack.
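
A standalone version of the same loop, as a sketch: it works on a bare `u64`, takes `k` as an argument, and uses a `match` in place of the shared `ENC` table. The coding A=0, C=1, G=2, T=3 is an assumption; the real table may differ.

```rust
/// Assumed 2-bit coding (A=0, C=1, G=2, T=3, case-insensitive).
fn encode_base(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        other => panic!("unexpected base {:?}", other as char),
    }
}

/// Encodes the first `k` bytes of `ascii` into a left-aligned word.
fn encode(ascii: &[u8], k: usize) -> u64 {
    assert!((1..=32).contains(&k) && ascii.len() >= k);
    let mut val = 0u64;
    for &b in &ascii[..k] {
        val = (val << 2) | encode_base(b) as u64;
    }
    // Move the 2k payload bits to the top of the word.
    val << (64 - 2 * k)
}
```

Under this coding, `encode(b"ACGT", 4)` yields `0x1B00_0000_0000_0000`: the byte `00 01 10 11` followed by 56 zero bits.
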
## Decoding
`write_ascii(writer)` writes k ASCII characters to any `W: Write` using the shared `DEC4` table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.

`to_ascii()` is a convenience wrapper that allocates and returns a `Vec<u8>`; it is intended for tests and display only.
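
The table-driven decoding can be sketched like this. The table layout (256 entries of 4 bytes, indexed by one packed byte) and the A=0, C=1, G=2, T=3 coding are assumptions, and the table is rebuilt on each call only to keep the example short; a real implementation would use a precomputed static.

```rust
use std::io::{self, Write};

const BASES: [u8; 4] = *b"ACGT"; // assumed code order

/// Builds a 256-entry table mapping one packed byte to its four bases.
fn build_dec4() -> [[u8; 4]; 256] {
    let mut table = [[0u8; 4]; 256];
    for (byte, entry) in table.iter_mut().enumerate() {
        for (j, slot) in entry.iter_mut().enumerate() {
            // Base j of a byte sits at bits 7-2j and 6-2j.
            *slot = BASES[(byte >> (6 - 2 * j)) & 0b11];
        }
    }
    table
}

/// Writes the `k` nucleotides of a left-aligned word as ASCII.
fn write_ascii<W: Write>(raw: u64, k: usize, w: &mut W) -> io::Result<()> {
    assert!(k <= 32);
    let dec4 = build_dec4();
    let bytes = raw.to_be_bytes(); // byte 0 holds nucleotides 0..=3
    let full = k / 4;
    for &b in &bytes[..full] {
        w.write_all(&dec4[b as usize])?; // one lookup per 4 nucleotides
    }
    let rem = k % 4;
    if rem > 0 {
        w.write_all(&dec4[bytes[full] as usize][..rem])?; // partial lookup
    }
    Ok(())
}
```

Any `Write` sink works; passing a `&mut Vec<u8>` gives the behaviour of an allocating `to_ascii()`-style wrapper.
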
## Reverse complement
Computed as pure arithmetic — no lookup table, no memory access:
```rust
let k = L::len(); // length comes from the marker type
let x = !self.0; // complement
let x = x.swap_bytes(); // reverse bytes
let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2); // swap 2-bit groups
KmerOf(x << (64 - 2 * k), PhantomData)
```
After complementing, bytes are reversed (`swap_bytes`), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.
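
The same steps as a free function over the raw `u64`, with `k` passed explicitly; both simplifications are for illustration only. The sketch assumes the A=0, C=1, G=2, T=3 coding, under which bitwise NOT maps each base to its complement.

```rust
/// Reverse complement of a left-aligned 2-bit word holding `k` nucleotides.
fn revcomp(raw: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    // Complement every base; the zero padding becomes all ones for now.
    let x = !raw;
    // Reverse the order of the 32 two-bit groups in three steps.
    let x = x.swap_bytes();
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // The payload now sits in the low 2k bits; shifting left realigns it
    // and pushes the complemented padding out of the word.
    x << (64 - 2 * k)
}
```

With this coding, `AAC` (`0x04…`) maps to `GTT` (`0xBC…`), the palindrome `ACGT` maps to itself, and applying the function twice returns the input.
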
## Canonical form and `CanonicalKmerOf`
`canonical()` returns a `CanonicalKmerOf<L>` — a distinct newtype that carries the same `u64` layout but enforces the invariant that the stored value equals `min(kmer, revcomp)`:
```rust
pub fn canonical(&self) -> CanonicalKmerOf<L> {
    let rc = self.revcomp();
    CanonicalKmerOf(if self.0 <= rc.0 { self.0 } else { rc.0 }, PhantomData)
}
```
The canonical form is the lexicographic minimum of the forward k-mer and its reverse complement. The raw `u64` values are compared directly: because the encoding is left-aligned and both values have the same length, integer order matches nucleotide-wise order. Zero allocation: the result lives on the stack.

`CanonicalKmerOf::from_raw_unchecked(raw)` is the only other public constructor, for trusted paths such as deserialisation.
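
The claim that comparing raw words equals comparing sequences can be checked with a small sketch. It assumes the A=0, C=1, G=2, T=3 coding (so code order matches ASCII order) and uses hypothetical helpers `encode` and `same_order` that are not part of the documented API.

```rust
/// Left-aligned 2-bit encoding of an uppercase ACGT string (assumed coding).
fn encode(ascii: &[u8]) -> u64 {
    assert!((1..=32).contains(&ascii.len()));
    let mut val = 0u64;
    for &b in ascii {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("not an ACGT base"),
        };
        val = (val << 2) | code;
    }
    val << (64 - 2 * ascii.len())
}

/// True when integer order of the encodings matches byte-wise order of the text.
fn same_order(a: &[u8], b: &[u8]) -> bool {
    assert_eq!(a.len(), b.len());
    encode(a).cmp(&encode(b)) == a.cmp(b)
}
```

Taking `AAC` and its reverse complement `GTT`, the smaller raw word is the one for `AAC`, which is also the lexicographically smaller string.
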
## Sliding window helpers
`push_right(nuc)` / `push_left(nuc)` shift the window by one base in O(1). `is_overlapping(other)` checks whether the last k−1 nucleotides of `self` equal the first k−1 of `other`.
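
One way these helpers could work on the left-aligned layout is sketched below. The semantics are assumptions: `push_right` is taken to drop nucleotide 0 and append `nuc` at the end, `push_left` to drop the last nucleotide and prepend `nuc`. The functions operate on a bare `u64` with `k` passed explicitly, and `top_mask` is a hypothetical helper.

```rust
/// Mask keeping the top 2*n bits (n nucleotides), for 1 <= n <= 32.
fn top_mask(n: usize) -> u64 {
    u64::MAX << (64 - 2 * n)
}

/// Drops nucleotide 0 and appends `nuc` as the new last nucleotide.
fn push_right(raw: u64, nuc: u8, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    (raw << 2) | ((nuc as u64 & 0b11) << (64 - 2 * k))
}

/// Drops the last nucleotide and prepends `nuc` as the new nucleotide 0.
fn push_left(raw: u64, nuc: u8, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    // The mask clears the old last nucleotide once it slides into the padding.
    ((raw >> 2) & top_mask(k)) | ((nuc as u64 & 0b11) << 62)
}

/// True when the last k-1 nucleotides of `a` equal the first k-1 of `b`.
fn is_overlapping(a: u64, b: u64, k: usize) -> bool {
    assert!((2..=32).contains(&k));
    (a << 2) == (b & top_mask(k - 1))
}
```

With the A=0, C=1, G=2, T=3 coding, pushing `A` to the right of `ACGT` gives `CGTA`, and `ACGT` overlaps `CGTA` on `CGT`.
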
## Hashing
`hash_kmer(raw: u64) -> u64` computes `mix64(raw ^ 0x9e3779b97f4a7c15)`, the seeded splitmix64 finalizer. `CanonicalKmerOf::seq_hash()` delegates to `hash_kmer`.
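
For reference, a sketch of the hash under the assumption that `mix64` is the standard splitmix64 finalizer (two xor-shift-multiply rounds and a final xor-shift); the crate's actual `mix64` may differ.

```rust
const SEED: u64 = 0x9e37_79b9_7f4a_7c15;

/// splitmix64 finalizer (assumed to be what `mix64` implements).
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Seeded hash of a raw k-mer word.
fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ SEED)
}
```

One visible effect of the seed: this finalizer maps 0 to 0, so without the XOR an all-zero raw word would hash to 0. The finalizer is a bijection on `u64`, so distinct raw words never collide in the full 64-bit output.
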