refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and the chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside the existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
@@ -17,18 +17,22 @@ The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complemen

 A kmer fits in a single `u64`. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.
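
The layout above can be illustrated with a minimal Rust sketch. The base codes (A=00, C=01, G=10, T=11) are an assumption chosen so that the complement is the 2-bit NOT. The function names `encode_base`, `pack_kmer`, and `nucleotide` are hypothetical and not taken from the codebase.

```rust
/// Assumed 2-bit encoding: A=00, C=01, G=10, T=11, so that the
/// complement of a base is the bitwise NOT of its 2-bit code.
fn encode_base(b: u8) -> u64 {
    match b {
        b'A' => 0b00,
        b'C' => 0b01,
        b'G' => 0b10,
        b'T' => 0b11,
        _ => panic!("unsupported base"),
    }
}

/// Pack 1 to 32 bases into a u64, left-aligned: nucleotide 0 lands in
/// bits 63-62 and the low 64 - 2k bits stay zero.
fn pack_kmer(seq: &[u8]) -> u64 {
    assert!(!seq.is_empty() && seq.len() <= 32);
    let mut kmer = 0u64;
    for (i, &b) in seq.iter().enumerate() {
        kmer |= encode_base(b) << (62 - 2 * i);
    }
    kmer
}

/// Extract the 2-bit code of nucleotide i (0 <= i < k).
fn nucleotide(kmer: u64, i: usize) -> u64 {
    (kmer >> (62 - 2 * i)) & 0b11
}
```

Under these assumptions, `pack_kmer(b"ACGT")` sets only the top byte of the word, and `nucleotide` reads each base back by shifting its 2-bit field down to the low bits.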

-Reverse complement is computed via a **16-bit lookup table** (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.
+Reverse complement is computed by **bit manipulation** (one complement, three reorder passes, and a final shift), with no lookup table:

 !!! abstract "Algorithm — Kmer reverse complement"
    ```text
    procedure KmerRevcomp(kmer, k):
-        raw ←   TABLE16[kmer & 0xFFFF]         << 48
-              | TABLE16[(kmer >> 16) & 0xFFFF] << 32
-              | TABLE16[(kmer >> 32) & 0xFFFF] << 16
-              | TABLE16[(kmer >> 48) & 0xFFFF]
-        return raw << (64 - 2*k)
+        x ← ~kmer                                           -- complement all bases
+        x ← swap_bytes(x)                                   -- reverse byte order
+        x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
+           | ((x & 0x0F0F0F0F0F0F0F0F) << 4)               -- swap nibbles within each byte
+        x ← ((x >> 2) & 0x3333333333333333)
+           | ((x & 0x3333333333333333) << 2)                -- swap 2-bit pairs within each nibble
+        return x << (64 - 2*k)                              -- re-align to MSB
    ```

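+The bitwise NOT in the first step complements each base (A↔T, C↔G) and turns the zero padding into ones. The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word, which moves that padding into the high 64−2k bits. The final left shift discards it and re-aligns the result to the MSB, leaving the low 64−2k bits zero.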
+
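
As a rough illustration, the pseudocode could be transcribed to Rust as follows. This is a sketch, not the code from the commit: the function name `kmer_revcomp`, the `u32` type for `k`, and the range check are assumptions. It relies on the standard `u64::swap_bytes` method and on a left-aligned kmer with 1 ≤ k ≤ 32.

```rust
/// Reverse complement of a left-aligned 2-bit-encoded kmer of length k.
/// Hypothetical signature; assumes 1 <= k <= 32 and zero low padding.
fn kmer_revcomp(kmer: u64, k: u32) -> u64 {
    assert!((1..=32).contains(&k));
    // Complement every base; the zero padding becomes ones.
    let mut x = !kmer;
    // Reverse the byte order.
    x = x.swap_bytes();
    // Swap the two nibbles within each byte.
    x = ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F) | ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4);
    // Swap the two 2-bit pairs within each nibble.
    x = ((x >> 2) & 0x3333_3333_3333_3333) | ((x & 0x3333_3333_3333_3333) << 2);
    // Drop the reversed padding from the high bits and re-align to the MSB.
    x << (64 - 2 * k)
}
```

With the encoding A=00, C=01, G=10, T=11 (an assumption consistent with complement being a 2-bit NOT), the palindromic 4-mer ACGT maps to itself, and applying the function twice returns the original kmer.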
 The **canonical form** is the lexicographic minimum of the kmer and its reverse complement:

 ```
@@ -40,7 +40,7 @@ The filter computes $\hat{H}(ws)$ for each word size ws from 1 to ws_max and ret

 $$\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$

-A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) \leq \theta$, where $\theta$ is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
+A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) < \theta$, where $\theta$ is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.

 ## Interpretation as an effective number of classes