First implementation, still far from optimal

2026-04-16 22:38:20 +02:00
commit de3f9b16cf
19336 changed files with 380276 additions and 0 deletions
@@ -0,0 +1,38 @@
+# DNA encoding
+
+## 2-bit nucleotide encoding
+
+Every nucleotide is encoded on 2 bits, packed MSB-first within each word. Nucleotides are numbered from 0 starting at the 5′ end, for all sequence types. The codes are:
+
+| Base | Encoding |
+|------|----------|
+| A    | `00`     |
+| C    | `01`     |
+| G    | `10`     |
+| T    | `11`     |
+
+The Watson-Crick complement of any base is its bitwise NOT on 2 bits: `complement(base) = ~base & 0b11`.
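
A minimal sketch of the table and the complement rule above. The names `ENCODE`, `DECODE` and `complement` are illustrative and do not come from the project's code.

```python
# 2-bit codes copied from the table above; dictionary names are illustrative.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {code: base for base, code in ENCODE.items()}


def complement(code: int) -> int:
    """Watson-Crick complement: bitwise NOT restricted to the low 2 bits."""
    return ~code & 0b11


for base in "ACGT":
    print(base, "->", DECODE[complement(ENCODE[base])])
```

The loop prints the pairs A/T and C/G in both directions, which is the property the encoding was chosen for.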
+
+## Kmer encoding
+
+A kmer fits in a single `u64`. Nucleotide 0 occupies bits 63–62, nucleotide i occupies bits 63−2i and 62−2i, and the low 64−2k bits are zero. Extraction of nucleotide i (0 ≤ i < k): `(kmer >> (62 - 2*i)) & 0b11`.
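
The layout can be sketched as follows, using a Python integer as a stand-in for the `u64`. `pack_kmer` and `nucleotide` are hypothetical helper names chosen for this example.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_kmer(seq: str) -> int:
    """Left-align a kmer in 64 bits: base 0 in bits 63-62, unused low bits zero."""
    if not 1 <= len(seq) <= 32:
        raise ValueError("a 64-bit word holds between 1 and 32 bases")
    value = 0
    for i, base in enumerate(seq):
        value |= ENCODE[base] << (62 - 2 * i)
    return value


def nucleotide(kmer: int, i: int) -> int:
    """Extract the 2-bit code of nucleotide i (0-based from the 5' end)."""
    return (kmer >> (62 - 2 * i)) & 0b11


kmer = pack_kmer("ACGT")
print(hex(kmer))                              # 0x1b00000000000000
print([nucleotide(kmer, i) for i in range(4)])  # [0, 1, 2, 3]
```

Because kmers are left-aligned and the codes follow A < C < G < T, comparing two packed kmers of the same length as integers gives the same order as comparing their sequences lexicographically.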
+
+Reverse complement is computed via a **16-bit lookup table** (65 536 entries × 2 bytes = 128 KiB, small enough for a typical L2 cache) that maps every 16-bit chunk (8 bases) to its reverse complement.
+
+!!! abstract "Algorithm — Kmer reverse complement"
+    ```text
+    procedure KmerRevcomp(kmer, k):
+        raw ←   TABLE16[kmer & 0xFFFF]         << 48
+              | TABLE16[(kmer >> 16) & 0xFFFF] << 32
+              | TABLE16[(kmer >> 32) & 0xFFFF] << 16
+              | TABLE16[(kmer >> 48) & 0xFFFF]
+        return raw << (64 - 2*k)
+    ```
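
The procedure above translates almost line for line into Python. This is a sketch rather than the project's implementation: the table is built here with a simple per-base loop (`revcomp16` is a hypothetical helper), and the explicit mask emulates 64-bit truncation, which a `u64` shift would provide for free.

```python
MASK64 = (1 << 64) - 1


def revcomp16(chunk: int) -> int:
    """Reverse-complement one 16-bit chunk (8 bases), one base at a time."""
    out = 0
    for _ in range(8):
        out = (out << 2) | (~chunk & 0b11)  # complement the last base, append it
        chunk >>= 2
    return out


TABLE16 = [revcomp16(c) for c in range(1 << 16)]


def kmer_revcomp(kmer: int, k: int) -> int:
    """Reverse complement of a left-aligned kmer of length k."""
    raw = (
        (TABLE16[kmer & 0xFFFF] << 48)
        | (TABLE16[(kmer >> 16) & 0xFFFF] << 32)
        | (TABLE16[(kmer >> 32) & 0xFFFF] << 16)
        | TABLE16[(kmer >> 48) & 0xFFFF]
    )
    # The zero padding of the input became T's at the top of raw;
    # shifting left drops them and left-aligns the result again.
    return (raw << (64 - 2 * k)) & MASK64


acg = 0b000110 << 58                 # "ACG", left-aligned
print(bin(kmer_revcomp(acg, 3) >> 58))  # 0b11011, i.e. "CGT"
```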
+
+The **canonical form** is the lexicographic minimum of the kmer and its reverse complement:
+
+```
+canonical(kmer) = min(kmer, revcomp(kmer))
+```
+
+This halves the kmer space and ensures strand-independent counting.
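
A short sketch of the canonical form, assuming the left-aligned packing described earlier. It uses a string-based reverse complement for brevity instead of the lookup table; all function names are illustrative.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def pack_kmer(seq: str) -> int:
    """Left-aligned 2-bit packing (base 0 in bits 63-62)."""
    value = 0
    for i, base in enumerate(seq):
        value |= ENCODE[base] << (62 - 2 * i)
    return value


def revcomp_str(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq: str) -> int:
    """Integer minimum of the two orientations of a kmer."""
    return min(pack_kmer(seq), pack_kmer(revcomp_str(seq)))


# Both strands of the same kmer yield the same canonical value.
print(canonical("GATTACA") == canonical("TGTAATC"))  # True
```

The integer `min` agrees with the lexicographic minimum only because both orientations have the same length and are left-aligned.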
@@ -0,0 +1,68 @@
+# Kmer entropy filter
+
+Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a **normalized Shannon entropy** over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.
+
+## Sub-word frequencies
+
+For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the $n_{\text{words}} = k - ws + 1$ overlapping sub-words by sliding a window of length ws:
+
+$$w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$
+
+Each sub-word is mapped to its **circular canonical form**: the lexicographic minimum among all cyclic rotations of the word **and all cyclic rotations of its reverse complement**. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let $s_j$ be the size of equivalence class $j$ (number of distinct raw words mapping to canonical form $j$), and $f_j$ the count of canonical form $j$ among the $n_{\text{words}}$ sub-words ($\sum_j f_j = n_{\text{words}}$).
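
The circular canonical form and the class sizes $s_j$ can be sketched on plain strings. The helper names (`orbit`, `circular_canonical`, `class_size`, `subword_counts`) are hypothetical; a production version would presumably work on the 2-bit encoding and precompute these tables per word size.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(word: str) -> str:
    return word.translate(COMPLEMENT)[::-1]


def orbit(word: str) -> set:
    """All cyclic rotations of the word and of its reverse complement."""
    return {w[i:] + w[:i] for w in (word, revcomp(word)) for i in range(len(word))}


def circular_canonical(word: str) -> str:
    return min(orbit(word))


def class_size(word: str) -> int:
    """s_j: number of distinct raw words sharing this word's canonical form."""
    return len(orbit(word))


def subword_counts(kmer: str, ws: int) -> Counter:
    """f_j: count of each canonical form among the k - ws + 1 sub-words."""
    n_words = len(kmer) - ws + 1
    return Counter(circular_canonical(kmer[i:i + ws]) for i in range(n_words))


print(circular_canonical("TG"), class_size("TG"))  # AC 4
print(circular_canonical("TA"), class_size("TA"))  # AT 2
```

For ws = 2 the 16 raw words fall into six classes of sizes 2, 4, 4, 2, 2 and 2, which illustrates why the classes cannot be treated as equiprobable.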
+
+## Corrected Shannon entropy
+
+The circular equivalence classes have unequal sizes: under a uniform distribution over all $4^{ws}$ raw words, class $j$ is visited with probability $s_j / 4^{ws}$, not $1/n_a$, where $n_a$ denotes the number of canonical classes. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.
+
+The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class $j$ equally among its $s_j$ members:
+
+$$H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$
+
+The last term is the correction for unequal class sizes. For a uniformly random sequence ($f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}$), this gives $H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2$, the maximum entropy over raw words.
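
The formula can be checked by writing the unfolding out explicitly. Assuming each of the $s_j$ raw words of class $j$ receives probability $p_j = f_j / (n_{\text{words}}\, s_j)$, and using $\sum_j f_j = n_{\text{words}}$:

$$\begin{aligned}
H_{\text{corr}} &= -\sum_j s_j \, p_j \log p_j \\
&= -\sum_j \frac{f_j}{n_{\text{words}}} \left( \log f_j - \log n_{\text{words}} - \log s_j \right) \\
&= \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j
\end{aligned}$$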
+
+## Maximum entropy correction for small samples
+
+With only $n_{\text{words}}$ observations over $4^{ws}$ possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over $4^{ws}$ categories.
+
+Let $c = \lfloor n_{\text{words}} / 4^{ws} \rfloor$ and $r = n_{\text{words}} \bmod 4^{ws}$. The most uniform integer distribution assigns frequency $c+1$ to $r$ categories and $c$ to the remaining $4^{ws} - r$, with the convention $0 \log 0 = 0$:
+
+$$H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$
+
+When $n_{\text{words}} < 4^{ws}$: $c=0$, $r=n_{\text{words}}$, and the formula reduces to $H_{\max} = \log(n_{\text{words}})$ — a single unified expression covers both regimes. A truly random sequence achieves $H_{\text{corr}} \approx H_{\max}$.
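
A direct transcription of the $H_{\max}$ formula, as a sketch. The function name `h_max` is illustrative; terms with zero frequency or zero multiplicity are skipped, which implements the convention $0 \log 0 = 0$.

```python
import math


def h_max(n_words: int, ws: int) -> float:
    """Entropy of the most uniform integer distribution of n_words
    observations over 4**ws categories (natural logarithm)."""
    categories = 4 ** ws
    c, r = divmod(n_words, categories)
    total = 0.0
    # (frequency, number of categories with that frequency)
    for freq, count in ((c, categories - r), (c + 1, r)):
        if freq > 0 and count > 0:
            p = freq / n_words
            total -= count * p * math.log(p)
    return total


print(math.isclose(h_max(26, 6), math.log(26)))  # True: n_words < 4**ws
print(h_max(31, 1) < math.log(4))                # True: 31 is not a multiple of 4
```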
+
+## Normalized entropy
+
+$$\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$
+
+## Final score
+
+The filter computes $\hat{H}(ws)$ for each word size ws from 1 to ws_max and returns the **minimum**:
+
+$$\text{entropy}(kmer) = \min_{1 \le ws \le ws_{\max}} \hat{H}(ws)$$
+
+A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) \leq \theta$, where $\theta$ is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
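
The whole score can be assembled from the formulas above in a few lines. This is a string-based sketch, not the project's implementation, and it makes one assumption that the text does not state: the ratio $H_{\text{corr}} / H_{\max}$ is clamped to 1, because with the formulas as written the unfolded entropy can exceed $H_{\max}$ when $n_{\text{words}} < 4^{ws}$. The names `entropy` and `rejected` are illustrative, and `theta` is left as a parameter.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orbit(word):
    """Rotations of the word and of its reverse complement."""
    rc = word.translate(COMPLEMENT)[::-1]
    return {w[i:] + w[:i] for w in (word, rc) for i in range(len(word))}


def h_corr(kmer, ws):
    n = len(kmer) - ws + 1
    freq = Counter()  # canonical form -> f_j
    size = {}         # canonical form -> s_j
    for i in range(n):
        members = orbit(kmer[i:i + ws])
        canon = min(members)
        freq[canon] += 1
        size[canon] = len(members)
    return (math.log(n)
            - sum(f * math.log(f) for f in freq.values()) / n
            + sum(f * math.log(size[c]) for c, f in freq.items()) / n)


def h_max(n, ws):
    cats = 4 ** ws
    c, r = divmod(n, cats)
    terms = ((c, cats - r), (c + 1, r))
    return -sum(cnt * (f / n) * math.log(f / n) for f, cnt in terms if f and cnt)


def entropy(kmer, ws_max=6):
    """Minimum normalized entropy over word sizes 1..ws_max."""
    scores = []
    for ws in range(1, ws_max + 1):
        n = len(kmer) - ws + 1
        # Clamping to 1.0 is an assumption of this sketch (see lead-in).
        scores.append(min(1.0, h_corr(kmer, ws) / h_max(n, ws)))
    return min(scores)


def rejected(kmer, theta):
    return entropy(kmer) <= theta


print(entropy("A" * 31) < entropy("ACGTTGCATGGACCTAGCTAAGTCCGATAGC"))  # True
```

With these formulas taken literally, a homopolymer yields $H_{\text{corr}} = \log 2$ (a single class of size 2, unfolded over A and T), so its score is low but not exactly 0.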
+
+## Interpretation as an effective number of classes
+
+$H_{\text{corr}}$ is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: $N_{\text{eff}} = e^{H_{\text{corr}}}$ is the number of equiprobable classes that would yield the same entropy.
+
+For the normalized score $\hat{H}$, dividing by $H_{\max} = \log N_{\text{max}}$ (that is, $N_{\text{max}} = e^{H_{\max}}$) changes the logarithm base:
+
+$$\hat{H} = \frac{\log N_{\text{eff}}}{\log N_{\text{max}}} = \log_{N_{\text{max}}} N_{\text{eff}} \quad \Longleftrightarrow \quad N_{\text{eff}} = N_{\text{max}}^{\,\hat{H}}$$
+
+The perplexity interpretation is preserved: $\hat{H}$ is the logarithm (in base $N_{\text{max}}$) of the effective number of equi-represented classes.
+
+In the large-sample limit ($n_{\text{words}} \gg 4^{ws}$), $N_{\text{max}} \approx 4^{ws}$, giving:
+
+$$N_{\text{eff}} \approx 4^{ws \cdot \hat{H}}$$
+
+This has a clean interpretation: $ws \cdot \hat{H}$ is the **effective word length** (in bases) of a perfectly uniform distribution that would produce the same entropy. At $\hat{H} = 1$ the full space of $4^{ws}$ words is used; at $\hat{H} = 0.5$ with ws=2, only $4^1 = 4$ effective classes out of 16 are occupied.
+
+In our actual regime, $n_{\text{words}}$ is small and $4^{ws}$ can exceed $n_{\text{words}}$, so $H_{\text{max}} < \log(4^{ws})$ due to the small-sample correction. The exact effective count is $N_{\text{max}}^{\hat{H}}$, not $4^{ws \cdot \hat{H}}$.
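
A worked example of the difference, assuming $k = 31$ and $ws = 6$ (values taken from the ranges quoted elsewhere in this documentation). Then $n_{\text{words}} = 26 < 4^{6} = 4096$, so $H_{\max} = \log 26$ and $N_{\text{max}} = 26$:

$$\hat{H} = 0.5 \;\Longrightarrow\; N_{\text{eff}} = 26^{0.5} \approx 5.1, \qquad \text{whereas} \qquad 4^{ws \cdot \hat{H}} = 4^{3} = 64$$

In this case the large-sample approximation would overstate the effective count by more than a factor of ten.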
+
+## Properties
+
+The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:
+
+- **Orientation invariance**: $\text{entropy}(K) = \text{entropy}(\text{revcomp}(K))$, guaranteed by the strand-symmetric canonical form.
+- **Context independence**: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.
@@ -0,0 +1,28 @@
+# Partitioning and indexing architecture
+
+The canonical minimizer of a super-kmer is hashed to produce a **p-bit routing value** (p is a collection-level parameter):
+
+```
+canonical minimizer → hash(minimizer) → p-bit value → PART → partition directory
+```
+
+PART is computed once at phase 1 to open the correct partition file, then discarded. It is recomputed on the fly at query time. It is never stored in the super-kmer header.
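
A sketch of the routing step. The text does not name the hash function, so the splitmix64 mixing step is used here purely as a stand-in, and taking the top p bits of the hash is likewise an assumption. `mix64` and `partition` are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """splitmix64 step, used here only as an example of a well-mixing hash."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def partition(minimizer: int, p: int) -> int:
    """PART: a p-bit routing value derived from the canonical minimizer."""
    return mix64(minimizer) >> (64 - p)


# PART is a pure function of the minimizer, so it can be recomputed
# at query time instead of being stored in the super-kmer header.
part = partition(0b0001101100, p=10)
print(0 <= part < 1024)  # True
```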
+
+Each partition holds one MPHF instance (phase 6) that indexes kmers as plain u64 values — the minimizer plays no role inside the partition.
+
+## Why hashing is necessary
+
+The canonical minimizer is an m-mer (m ∈ {9, 11, 13, 15}), encoded in 2m bits (18 to 30 bits). Its distribution over the $4^m$ possible values is **not uniform**: because the minimizer is the lexicographic minimum of a window of m-mers, small values are systematically over-represented [@Zheng2020-ji; @Zheng2021-cc; @Pan2024-hb; @Kille2023-px; @Golan2025-xf]. Routing directly by the raw minimizer value would produce severely unbalanced partitions.
+
+A hash function with good avalanche properties redistributes this skewed distribution uniformly over the $2^p$ partition slots. The key reason this works well is the **entropy gap**: p is chosen to be much smaller than 2m, so the hash compresses many distinct minimizer values into each partition slot. Even under strong bias in the minimizer distribution, as long as its effective entropy exceeds p bits — which holds comfortably since the set of distinct minimizers in any real dataset is far larger than $2^p$ — the load imbalance across partitions is negligible.
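
The effect can be illustrated with a small simulation, under stated assumptions: a seeded random sequence stands in for real data, k = 31 and m = 13, only p = 4 bits are used so that the skew is visible on a short sequence, and BLAKE2b from the standard library stands in for the hash. The sketch compares the share of kmers landing in the fullest partition when routing by the top p bits of the raw minimizer versus the top p bits of its hash.

```python
import hashlib
import random

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_mmer(word):
    """2-bit value of min(word, revcomp(word))."""
    rc = word.translate(COMPLEMENT)[::-1]
    value = 0
    for base in min(word, rc):
        value = (value << 2) | CODE[base]
    return value


def hash64(value):
    digest = hashlib.blake2b(value.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def max_load_fraction(slots):
    counts = {}
    for s in slots:
        counts[s] = counts.get(s, 0) + 1
    return max(counts.values()) / len(slots)


k, m, p = 31, 13, 4
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(5000))
mmers = [canonical_mmer(seq[i:i + m]) for i in range(len(seq) - m + 1)]
window = k - m + 1  # number of m-mers inside one kmer
minimizers = [min(mmers[i:i + window]) for i in range(len(seq) - k + 1)]

raw_imbalance = max_load_fraction([v >> (2 * m - p) for v in minimizers])
hashed_imbalance = max_load_fraction([hash64(v) >> (64 - p) for v in minimizers])
print(f"raw: {raw_imbalance:.2f}  hashed: {hashed_imbalance:.2f}")
```

With raw routing, most kmers fall into the single slot whose minimizers start with `AA`; with hashed routing the fullest slot stays close to the ideal share of 1/16.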
+
+## Parameter choices
+
+| m  | 2m (bits) | Typical p | Partitions   |
+|----|-----------|-----------|--------------|
+| 9  | 18        | 6–8       | 64–256       |
+| 11 | 22        | 8–10      | 256–1 024    |
+| 13 | 26        | 10–12     | 1 024–4 096  |
+| 15 | 30        | 10–14     | 1 024–16 384 |
+
+The hard constraint is p ≤ 2m: one cannot extract more bits of uniform randomness from a source than it contains. In practice p is chosen well below 2m, leaving a large entropy margin that absorbs the minimizer bias. For k=31, m=13, p=10: 1 024 partitions with comfortable balance.
@@ -0,0 +1,32 @@
+# Kmers and super-kmers
+
+## Kmers
+
+A **kmer** is a DNA subsequence of fixed length k. Two constraints govern the choice of k:
+
+- **k ∈ [11, 31]**: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.
+- **k is odd**: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form `min(kmer, revcomp(kmer))` is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.
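
The parity argument can be checked by brute force for small k. In an odd-length sequence the middle base would have to be its own complement, which no base is. The helper name `self_complementary` is illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def self_complementary(k: int) -> list:
    """All k-mers that are equal to their own reverse complement."""
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return [w for w in words if w == revcomp(w)]


for k in (3, 4, 5):
    print(k, len(self_complementary(k)))
# 3 0
# 4 16
# 5 0
```

For even k the first half of the sequence determines the second half, so there are $4^{k/2}$ such kmers (for example `ACGT` when k = 4).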
+
+## Super-kmers
+
+A **super-kmer** is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides, in which every kmer carries the same **canonical minimizer**. The canonical minimizer of a kmer is the smallest value of `min(m-mer, revcomp(m-mer))` over all m-mers within the kmer (m < k, m odd).
+
+### Canonical super-kmers
+
+A **canonical super-kmer** is the lexicographic minimum of a super-kmer and its reverse complement:
+
+```
+canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
+```
+
+When the same genomic region is sequenced from both strands, the two reads produce super-kmers that are reverse complements of each other. Both map to the same canonical form, so the region is represented by a single canonical super-kmer regardless of which strand was read.
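
The definitions in this section can be sketched on plain strings. This is a naive quadratic version for illustration only: it recomputes the minimizer of every kmer from scratch and groups kmers by minimizer value. All function names are hypothetical, and the read is assumed to be at least k nucleotides long.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical_minimizer(kmer, m):
    """Smallest canonical m-mer (as a string) inside the kmer."""
    mmers = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(min(w, revcomp(w)) for w in mmers)


def super_kmers(read, k, m):
    """Split a read into maximal runs of kmers sharing a canonical minimizer."""
    runs, start = [], 0
    current = canonical_minimizer(read[:k], m)
    for i in range(1, len(read) - k + 1):
        minimizer = canonical_minimizer(read[i:i + k], m)
        if minimizer != current:
            runs.append(read[start:i + k - 1])  # covers kmers start .. i-1
            start, current = i, minimizer
    runs.append(read[start:])
    return runs


def canonical_super_kmer(s):
    return min(s, revcomp(s))


read = "ACGTTGCATGGACCTAGCTAAGTCCGATAGC"
forward = sorted(canonical_super_kmer(s) for s in super_kmers(read, 11, 5))
reverse = sorted(canonical_super_kmer(s) for s in super_kmers(revcomp(read), 11, 5))
print(forward == reverse)  # True
```

The final comparison holds because the canonical minimizer of a kmer equals that of its reverse complement, so reading the opposite strand reverses the order of the runs without changing their boundaries.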
+
+### Expected length of a super-kmer
+
+For a random minimizer of length m over k-mers of length k, the density of minimizer positions is approximately 2/(k−m+2) [@Zheng2020-ji; @Golan2025-xf], so the expected number of consecutive k-mers per super-kmer is (k−m+2)/2. A run of n k-mers spans n + k − 1 nucleotides, giving:
+
+$$L_{\text{nt}} = \frac{k-m+2}{2} + k - 1$$
+
+For k=31, m=13: expected ≈ 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.[^superkmer_length]
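
The arithmetic behind the 40 nt figure, as a small sketch. It takes the density approximation 2/(k−m+2) at face value, which the footnote marks as still to be verified; the function name is illustrative.

```python
def expected_superkmer_nt(k: int, m: int) -> float:
    """Expected super-kmer length in nucleotides under the
    density approximation 2 / (k - m + 2)."""
    kmers_per_run = (k - m + 2) / 2   # inverse of the minimizer density
    return kmers_per_run + k - 1      # a run of n kmers spans n + k - 1 nt


print(expected_superkmer_nt(31, 13))  # 40.0
```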
+
+[^superkmer_length]: The expected length formula and the density approximation 2/(k−m+2) should be verified against the values reported in [@Zheng2020-ji] and [@Golan2025-xf].