Files

69 lines
5.5 KiB
Markdown
Raw Permalink Normal View History

2026-04-16 22:38:20 +02:00
# Kmer entropy filter
Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a **normalized Shannon entropy** over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.
## Sub-word frequencies
For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the $n_{\text{words}} = k - ws + 1$ overlapping sub-words by sliding a window of length ws:
$$w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$
Each sub-word is mapped to its **circular canonical form**: the lexicographic minimum among all cyclic rotations of the word **and all cyclic rotations of its reverse complement**. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let $s_j$ be the size of equivalence class $j$ (number of distinct raw words mapping to canonical form $j$), and $f_j$ the count of canonical form $j$ among the $n_{\text{words}}$ sub-words ($\sum_j f_j = n_{\text{words}}$).
## Corrected Shannon entropy
The circular equivalence classes have unequal sizes: under a uniform distribution over all $4^{ws}$ raw words, class $j$ is visited with probability $s_j / 4^{ws}$, not $1/n_a$. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.
The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class $j$ equally among its $s_j$ members:
$$H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$
The last term is the correction for unequal class sizes. For a uniformly random sequence ($f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}$), this gives $H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2$, the maximum entropy over raw words.
## Maximum entropy correction for small samples
With only $n_{\text{words}}$ observations over $4^{ws}$ possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over $4^{ws}$ categories.
Let $c = \lfloor n_{\text{words}} / 4^{ws} \rfloor$ and $r = n_{\text{words}} \bmod 4^{ws}$. The most uniform integer distribution assigns frequency $c+1$ to $r$ categories and $c$ to the remaining $4^{ws} - r$, with the convention $0 \log 0 = 0$:
$$H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$
When $n_{\text{words}} < 4^{ws}$: $c=0$, $r=n_{\text{words}}$, and the formula reduces to $H_{\max} = \log(n_{\text{words}})$ — a single unified expression covers both regimes. A truly random sequence achieves $H_{\text{corr}} \approx H_{\max}$.
## Normalized entropy
$$\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$
## Final score
The filter computes $\hat{H}(ws)$ for each word size ws from 1 to ws_max and returns the **minimum**:
$$\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) < \theta$, where $\theta$ is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
2026-04-16 22:38:20 +02:00
## Interpretation as an effective number of classes
$H_{\text{corr}}$ is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: $N_{\text{eff}} = e^{H_{\text{corr}}}$ is the number of equiprobable classes that would yield the same entropy.
For the normalised score $\hat{H}$, dividing by $H_{\text{max}}$ changes the logarithm base:
$$\hat{H} = \frac{\log N_{\text{eff}}}{\log N_{\text{max}}} = \log_{N_{\text{max}}} N_{\text{eff}} \quad \Longleftrightarrow \quad N_{\text{eff}} = N_{\text{max}}^{\,\hat{H}}$$
The property is preserved: $\hat{H}$ is the logarithm (in base $N_{\text{max}}$) of the effective number of equi-represented classes.
In the large-sample limit ($n_{\text{words}} \gg 4^{ws}$), $N_{\text{max}} \approx 4^{ws}$, giving:
$$N_{\text{eff}} \approx 4^{ws \cdot \hat{H}}$$
This has a clean interpretation: $ws \cdot \hat{H}$ is the **effective word length** (in bases) of a perfectly uniform distribution that would produce the same entropy. At $\hat{H} = 1$ the full space of $4^{ws}$ words is used; at $\hat{H} = 0.5$ with ws=2, only $4^1 = 4$ effective classes out of 16 are occupied.
In our actual regime, $n_{\text{words}}$ is small and $4^{ws}$ can exceed $n_{\text{words}}$, so $H_{\text{max}} < \log(4^{ws})$ due to the small-sample correction. The exact effective count is $N_{\text{max}}^{\hat{H}}$, not $4^{ws \cdot \hat{H}}$.
## Properties
The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:
- **Orientation invariance**: $\text{entropy}(K) = \text{entropy}(\text{revcomp}(K))$, guaranteed by the strand-symmetric canonical form.
- **Context independence**: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.