refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
@@ -1,38 +1,57 @@
 # Kmer — implementation

-## Memory layout
+## Types and layout

-`Kmer` is a `#[repr(transparent)]` newtype over `u64`:
+`KmerOf<L>` is a `#[repr(transparent)]` newtype over `u64` parameterized by a `KmerLength` marker:

 ```rust
 #[repr(transparent)]
-pub struct Kmer(u64);
+pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);
 ```

-Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2k bits are always zero. k is **not stored** — it is a parameter of every operation that needs it, and will be owned by the future collection-level indexer.
+Three marker types implement `KmerLength`:
+
+| Marker | `len()` source | Used for |
+|--------|---------------|---------|
+| `KLen` | `params::k()` | k-mers |
+| `MLen` | `params::m()` | minimizers |
+| `ConstLen<N>` | const generic `N` | tests |
+
+Public aliases:
+
+```rust
+pub type Kmer     = KmerOf<KLen>;         // k-mer, global k
+pub type Minimizer = CanonicalKmerOf<MLen>; // canonical m-mer, global m
+```
+
+Nucleotides are packed 2 bits each, **left-aligned**, MSB-first. Nucleotide 0 occupies bits 63–62; nucleotide i occupies bits 63−2i and 62−2i. The low 64−2·len bits are always zero. The length is **not stored** — every operation reads it from `L::len()`.

 | 63–62 | 61–60 | … | 63−2(k−1) to 62−2(k−1) | 63−2k down to 0 |
 |-------|-------|---|--------------------------|-----------------|
 | nt 0  | nt 1  | … | nt k−1                   | zero padding    |
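
As a hedged illustration of the marker-type scheme and the bit layout above, the sketch below rebuilds a minimal `KmerOf<L>` using only `ConstLen<N>`. The `KmerLength::len()` signature is inferred from the `L::len()` call in the text; `from_raw` and `nucleotide` are hypothetical helpers added for this example and may not exist in the crate.

```rust
use std::marker::PhantomData;

/// Marker trait supplying the sequence length (signature assumed from `L::len()`).
pub trait KmerLength {
    fn len() -> usize;
}

/// Compile-time length marker, as used for tests.
pub struct ConstLen<const N: usize>;

impl<const N: usize> KmerLength for ConstLen<N> {
    fn len() -> usize {
        N
    }
}

#[repr(transparent)]
pub struct KmerOf<L: KmerLength>(u64, PhantomData<L>);

impl<L: KmerLength> KmerOf<L> {
    /// Hypothetical constructor from an already left-aligned value.
    pub fn from_raw(raw: u64) -> Self {
        KmerOf(raw, PhantomData)
    }

    /// Hypothetical accessor: nucleotide i sits in bits 63-2i and 62-2i.
    pub fn nucleotide(&self, i: usize) -> u8 {
        assert!(i < L::len());
        ((self.0 >> (62 - 2 * i)) & 0b11) as u8
    }
}
```

The marker adds no storage: `std::mem::size_of::<KmerOf<ConstLen<4>>>()` is 8 bytes, the same as a bare `u64`.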

+## Global parameters
+
+`params::set_k(k)` / `params::k()` and `params::set_m(m)` / `params::m()` are backed by `OnceLock<usize>` in production (write-once, panic on conflict) and by `thread_local! { Cell<usize> }` in test builds (per-thread, freely writable). `params::init(k, m)` sets both in one call.
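
The write-once production path could look roughly like the sketch below. The static name `K` is hypothetical, and treating a repeated call with the same value as a no-op is an assumption; the source only states that a conflict panics.

```rust
use std::sync::OnceLock;

// Hypothetical backing store; the real module's static name is not shown in the source.
static K: OnceLock<usize> = OnceLock::new();

/// Set k once. Re-setting the same value is accepted here (an assumption);
/// setting a different value panics.
pub fn set_k(k: usize) {
    let stored = *K.get_or_init(|| k);
    assert_eq!(stored, k, "k already set to {}, cannot change to {}", stored, k);
}

/// Read k; panics if it was never set.
pub fn k() -> usize {
    *K.get().expect("params::k() called before params::set_k()")
}
```

A per-thread `Cell<usize>` in test builds avoids this write-once restriction, which is presumably why tests can use several values of k in one process.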
+
 ## Encoding

-`Kmer::from_ascii(ascii, k)` encodes the first k bytes of an ASCII slice using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)):
+`KmerOf::<L>::from_ascii(ascii)` encodes the first `L::len()` bytes using the shared `ENC` table (see [SuperKmer — ASCII encoding](superkmer.md#ascii-encoding-and-decoding)):

 ```rust
 for i in 0..k {
    val = (val << 2) | encode_base(ascii[i]) as u64;
 }
-Kmer(val << (64 - 2 * k))
+KmerOf(val << (64 - 2 * k), PhantomData)
 ```

 Zero allocation — result lives on the stack.
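
For reference, the loop above can be expanded into a standalone function. This is a sketch only: `encode_base` here is a hypothetical stand-in for the shared `ENC` table, the mapping A=0, C=1, G=2, T=3 is an assumption (it is consistent with complementing by bitwise NOT), and `pack_left_aligned` is an illustrative name.

```rust
/// Hypothetical stand-in for the shared `ENC` table: A=0, C=1, G=2, T=3 (assumed mapping).
fn encode_base(b: u8) -> u8 {
    match b.to_ascii_uppercase() {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        _ => 3, // T; the real table may treat other bytes differently
    }
}

/// Pack the first `k` bytes of `ascii` into a left-aligned u64 (1 <= k <= 32).
fn pack_left_aligned(ascii: &[u8], k: usize) -> u64 {
    assert!((1..=32).contains(&k) && ascii.len() >= k);
    let mut val = 0u64;
    for &b in &ascii[..k] {
        // Shift earlier bases up and append the new 2-bit code at the bottom.
        val = (val << 2) | encode_base(b) as u64;
    }
    // Move the 2k payload bits to the top; the low 64 - 2k bits stay zero.
    val << (64 - 2 * k)
}
```

Under these assumptions, `pack_left_aligned(b"ACGT", 4)` yields `0x1B00_0000_0000_0000`: the bit pairs `00 01 10 11` followed by 56 zero bits.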

 ## Decoding

-`write_ascii(k, buf)` appends k ASCII characters to a caller-supplied `Vec<u8>` using the shared `DEC4` table: one lookup per 4 nucleotides, two partial-byte lookups for the remainder. No allocation in the hot path.
+`write_ascii(writer)` writes k ASCII characters to any `W: Write` using the shared `DEC4` table: one lookup per 4 nucleotides, one partial lookup for the remainder. No allocation in the hot path.

-`to_ascii(k)` is a convenience wrapper that allocates and returns a `Vec<u8>`; intended for tests and display only.
+`to_ascii()` is a convenience wrapper that allocates and returns a `Vec<u8>`; intended for tests and display only.
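
The table-driven decode can be sketched as follows. The real `DEC4` table lives in the SuperKmer module and is not shown here, so this version builds its own 256-entry table under the assumed mapping 0=A, 1=C, 2=G, 3=T; the free-function signature taking `raw` and `k` explicitly is also an assumption made to keep the example self-contained.

```rust
use std::io::{self, Write};

const BASES: [u8; 4] = *b"ACGT"; // assumed 2-bit code to ASCII mapping

/// Build a table mapping each byte (4 packed nucleotides, MSB-first) to 4 ASCII bytes.
const fn build_dec4() -> [[u8; 4]; 256] {
    let mut table = [[0u8; 4]; 256];
    let mut byte = 0;
    while byte < 256 {
        let mut j = 0;
        while j < 4 {
            table[byte][j] = BASES[(byte >> (6 - 2 * j)) & 3];
            j += 1;
        }
        byte += 1;
    }
    table
}

static DEC4: [[u8; 4]; 256] = build_dec4();

/// Write the first `k` nucleotides of a left-aligned k-mer as ASCII.
fn write_ascii<W: Write>(raw: u64, k: usize, writer: &mut W) -> io::Result<()> {
    assert!(k <= 32);
    let bytes = raw.to_be_bytes(); // byte 0 holds nucleotides 0..4
    let full = k / 4;
    for &b in &bytes[..full] {
        writer.write_all(&DEC4[b as usize])?; // one lookup per 4 nucleotides
    }
    let rem = k % 4;
    if rem > 0 {
        // One partial lookup: keep only the first `rem` characters.
        writer.write_all(&DEC4[bytes[full] as usize][..rem])?;
    }
    Ok(())
}
```

Because `Vec<u8>` implements `Write`, the same function serves both the streaming path and an allocating wrapper in the style of `to_ascii()`.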

 ## Reverse complement

@@ -43,18 +62,30 @@ let x = !self.0;                                                               /
 let x = x.swap_bytes();                                                        // reverse bytes
 let x = ((x >> 4) & 0x0F0F0F0F0F0F0F0F) | ((x & 0x0F0F0F0F0F0F0F0F) << 4); // swap nibbles
 let x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2);   // swap 2-bit groups
-Kmer(x << (64 - 2 * k))
+KmerOf(x << (64 - 2 * k), PhantomData)
 ```

 After complementing, bytes are reversed (`swap_bytes`), then nibbles, then 2-bit groups — restoring 2-bit nucleotides to their correct positions in reverse order. A final left-shift realigns to MSB. Zero allocation — result lives on the stack.
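
The same steps can be exercised on a bare `u64`. The sketch below is a standalone restatement of the snippet above, with `k` passed explicitly instead of read from `L::len()`; the function name `revcomp_raw` is illustrative.

```rust
/// Reverse-complement a left-aligned 2-bit-packed k-mer (1 <= k <= 32).
fn revcomp_raw(raw: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    let x = !raw; // complement every base (padding becomes all ones)
    let x = x.swap_bytes(); // reverse the order of the 8 bytes
    let x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4); // swap nibbles
    let x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2); // swap 2-bit groups
    // The reversed payload now sits in the low 2k bits; shifting left
    // discards the complemented padding and restores MSB alignment.
    x << (64 - 2 * k)
}
```

Assuming the encoding A=0, C=1, G=2, T=3, the 3-mer `AAC` (`0x0400_0000_0000_0000`) maps to `GTT` (`0xBC00_0000_0000_0000`), and applying the function twice returns the input.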

-## Canonical form
+## Canonical form and `CanonicalKmerOf`
+
+`canonical()` returns a `CanonicalKmerOf<L>` — a distinct newtype that carries the same `u64` layout but enforces the invariant that the stored value equals `min(kmer, revcomp)`:

 ```rust
-pub fn canonical(&self, k: usize) -> Self {
-    let rc = self.revcomp(k);
-    if self.0 <= rc.0 { *self } else { rc }
+pub fn canonical(&self) -> CanonicalKmerOf<L> {
+    let rc = self.revcomp();
+    CanonicalKmerOf(if self.0 <= rc.0 { self.0 } else { rc.0 }, PhantomData)
 }
 ```

 The canonical form is the lexicographic minimum of the forward and reverse-complement sequences. Comparing the raw `u64` values directly is valid because the left-aligned, MSB-first encoding makes integer order coincide with nucleotide-wise order for sequences of equal length. Zero allocation: the result lives on the stack.
+
+`CanonicalKmerOf::from_raw_unchecked(raw)` is the only other public constructor, for trusted paths such as deserialisation.
+
+## Sliding window helpers
+
+`push_right(nuc)` / `push_left(nuc)` shift the window by one base in O(1). `is_overlapping(other)` checks whether the last k−1 nucleotides of `self` equal the first k−1 of `other`.
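
One way these helpers could be implemented on the left-aligned layout is sketched below. The semantics are assumptions: `push_right` is taken to drop nucleotide 0 and append at position k−1, `push_left` to drop nucleotide k−1 and prepend at position 0. The free-function signatures with explicit `k` and the helper `payload_mask` are hypothetical.

```rust
/// Mask covering the 2k payload bits of a left-aligned k-mer (1 <= k <= 32).
fn payload_mask(k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    !0u64 << (64 - 2 * k)
}

/// Drop nucleotide 0 and append `nuc` (a 2-bit code) at position k-1.
fn push_right(raw: u64, nuc: u64, k: usize) -> u64 {
    assert!((1..=32).contains(&k));
    (raw << 2) | ((nuc & 0b11) << (64 - 2 * k))
}

/// Drop nucleotide k-1 and prepend `nuc` (a 2-bit code) at position 0.
fn push_left(raw: u64, nuc: u64, k: usize) -> u64 {
    ((raw >> 2) | ((nuc & 0b11) << 62)) & payload_mask(k)
}

/// True if the last k-1 nucleotides of `a` equal the first k-1 of `b`.
fn is_overlapping(a: u64, b: u64, k: usize) -> bool {
    // `a << 2` aligns a's suffix with b's prefix; `payload_mask(k) << 2`
    // keeps only the first k-1 positions in the comparison.
    (((a << 2) ^ b) & (payload_mask(k) << 2)) == 0
}
```

Each helper is a constant number of shifts and masks, which matches the O(1) claim. Under the assumed encoding A=0, C=1, G=2, T=3, pushing `A` to the right of `ACGT` gives `CGTA`, and `ACGT` overlaps `CGTA`.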
+
+## Hashing
+
+`hash_kmer(raw: u64) -> u64` computes `mix64(raw ^ 0x9e3779b97f4a7c15)`, the seeded splitmix64 finalizer. `CanonicalKmerOf::seq_hash()` delegates to `hash_kmer`.
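
A minimal sketch of this hash, assuming `mix64` is the standard splitmix64 finalizer with its usual shift and multiplier constants (the source names the function but does not show its body):

```rust
/// Standard splitmix64 finalizer (assumed to match the crate's `mix64`).
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Seeded hash of a raw k-mer value, as described in the text.
fn hash_kmer(raw: u64) -> u64 {
    mix64(raw ^ 0x9e37_79b9_7f4a_7c15)
}
```

With this finalizer, `mix64(0)` is `0`, so an unseeded hash would map the all-zero k-mer to zero; XORing with the constant first avoids that. Whether this is the author's motivation is not stated in the source. Each step of the finalizer is invertible, so distinct raw values always produce distinct hashes.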