Files
obikmer/docmd/implementation/kmer.md
T
Eric Coissac 036d044291 refactor: update core types and add approximate evidence support
Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 10:04:25 +02:00

92 lines
3.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Kmer — implementation
## Types and layout
`KmerOf<L>` is a `#[repr(transparent)]` newtype over `u64` parameterized by a `KmerLength` marker:
```rust
#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);
```
Three marker types implement `KmerLength`:
| Marker | `len()` source | Used for |
|--------|---------------|---------|
| `KLen` | `params::k()` | k-mers |
| `MLen` | `params::m()` | minimizers |
| `ConstLen<N>` | const generic `N` | tests |
Public aliases:
```rust
pub type Kmer = KmerOf<KLen>; // k-mer, global k
pub type Minimizer = CanonicalKmerOf<MLen>; // canonical m-mer, global m
```
Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 6362; nucleotide i occupies bits 632i and 622i. The low 642·len bits are always zero. The length is **not stored** — every operation reads it from `L::len()`.
| 6362 | 6160 | … | 632(k1)1 to 632(k1) | 632k down to 0 |
|-------|-------|---|--------------------------|-----------------|
| nt 0 | nt 1 | … | nt k1 | zero padding |
## Global parameters
`params::set_k(k)` / `params::k()` and `params::set_m(m)` / `params::m()` are backed by `OnceLock<usize>` in production (write-once, panic on conflict) and by `thread_local! { Cell<usize> }` in test builds (per-thread, freely writable). `params::init(k, m)` sets both in one call.
## Encoding
`KmerOf::<L>::from_ascii(ascii)` encodes the first `L::len()` bytes using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)):
```rust
for i in 0..k {
val = (val << 2) | encode_base(ascii[i]) as u64;
}
KmerOf(val << (64 - 2 * k), PhantomData)
```
Zero allocation — result lives on the stack.
## Decoding
`write_ascii(writer)` writes k ASCII characters to any `W: Write` using the shared `DEC4` table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.
`to_ascii()` is a convenience wrapper that allocates and returns a `Vec<u8>`; intended for tests and display only.
## Reverse complement
Computed as pure arithmetic — no lookup table, no memory access:
```rust
let x = !self.0; // complement
let x = x.swap_bytes(); // reverse bytes
let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2); // swap 2-bit groups
KmerOf(x << (64 - 2 * k), PhantomData)
```
After complementing, bytes are reversed (`swap_bytes`), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.
## Canonical form and `CanonicalKmerOf`
`canonical()` returns a `CanonicalKmerOf<L>` — a distinct newtype that carries the same `u64` layout but enforces the invariant that the stored value equals `min(kmer, revcomp)`:
```rust
pub fn canonical(&self) -> CanonicalKmerOf<L> {
let rc = self.revcomp();
CanonicalKmerOf(if self.0 <= rc.0 { self.0 } else { rc.0 }, PhantomData)
}
```
Lexicographic minimum of forward and reverse-complement, comparing the raw `u64` values directly (left-aligned encoding makes this equivalent to nucleotide-wise comparison). Zero allocation — result lives on the stack.
`CanonicalKmerOf::from_raw_unchecked(raw)` is the only other public constructor, for trusted paths such as deserialisation.
## Sliding window helpers
`push_right(nuc)` / `push_left(nuc)` shift the window by one base in O(1). `is_overlapping(other)` checks whether the last k1 nucleotides of `self` equal the first k1 of `other`.
## Hashing
`hash_kmer(raw: u64) -> u64` computes `mix64(raw ^ 0x9e3779b97f4a7c15)`, the seeded splitmix64 finalizer. `CanonicalKmerOf::seq_hash()` delegates to `hash_kmer`.