refactor: update core types and add approximate evidence support

Refactor `Kmer`, `SuperKmer`, and chunk reader into optimized, generic representations with compile-time length parameters and bitwise operations. Update the pipeline and scheduler to support batch processing, 1→N flat transformations, and multi-source merging. Introduce an approximate evidence mode using b-bit fingerprints and `.idx` files, alongside existing exact mode. Update CLI documentation, minimizer selection, and query output schema accordingly.
This commit is contained in:
Eric Coissac
2026-05-26 09:12:41 +02:00
parent 88365e444c
commit 036d044291
13 changed files with 488 additions and 216 deletions
+10 -6
View File
@@ -17,18 +17,22 @@ The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complemen
A kmer fits in a single `u64`. Nucleotide 0 occupies bits 6362, nucleotide i occupies bits 632i and 622i, and the low 642k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.
Reverse complement is computed via a **16-bit lookup table** (65 536 entries × 2 bytes = 128 KB, fits in L2 cache) storing the reverse-complement of every 8-base chunk.
Reverse complement is computed by **bit manipulation in four steps**, with no lookup table:
!!! abstract "Algorithm — Kmer reverse complement"
```text
procedure KmerRevcomp(kmer, k):
raw ← TABLE16[kmer & 0xFFFF] << 48
| TABLE16[(kmer >> 16) & 0xFFFF] << 32
| TABLE16[(kmer >> 32) & 0xFFFF] << 16
| TABLE16[(kmer >> 48) & 0xFFFF]
return raw << (64 - 2*k)
x ← ~kmer -- complement all bases
x ← swap_bytes(x) -- reverse byte order
x ← ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
| ((x & 0x0F0F0F0F0F0F0F0F) << 4) -- swap nibbles within each byte
x ← ((x >> 2) & 0x3333333333333333)
| ((x & 0x3333333333333333) << 2) -- swap 2-bit pairs within each nibble
return x << (64 - 2*k) -- re-align to MSB
```
The three reorder passes together reverse the order of all 2-bit base codes across the 64-bit word. The bitwise NOT in the first step complements each base (A↔T, C↔G). The final left shift clears the low 642k padding bits.
The **canonical form** is the lexicographic minimum of the kmer and its reverse complement:
```
+1 -1
View File
@@ -40,7 +40,7 @@ The filter computes $\hat{H}(ws)$ for each word size ws from 1 to ws_max and ret
$$\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) \leq \theta$, where $\theta$ is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) < \theta$, where $\theta$ is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
## Interpretation as an effective number of classes