Kmer entropy filter
Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a normalized Shannon entropy over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.
Sub-word frequencies
For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the n_{\text{words}} = k - ws + 1 overlapping sub-words by sliding a window of length ws:
w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1
Each sub-word is mapped to its circular canonical form: the lexicographic minimum among all cyclic rotations of the word and all cyclic rotations of its reverse complement. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let s_j be the size of equivalence class j (number of distinct raw words mapping to canonical form j), and f_j the count of canonical form j among the n_{\text{words}} sub-words (\sum_j f_j = n_{\text{words}}).
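The canonicalization and counting step can be sketched as follows. This is a minimal illustration, assuming an uppercase ACGT alphabet; the function names `revcomp`, `circular_canonical` and `subword_counts` are hypothetical and not taken from the project's code.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(word: str) -> str:
    """Reverse complement of a DNA word over the alphabet ACGT."""
    return word.translate(_COMPLEMENT)[::-1]


def circular_canonical(word: str) -> str:
    """Lexicographic minimum over all rotations of the word and of its reverse complement."""
    candidates = (w[i:] + w[:i] for w in (word, revcomp(word)) for i in range(len(word)))
    return min(candidates)


def subword_counts(kmer: str, ws: int) -> Counter:
    """Counts f_j of each canonical form among the k - ws + 1 overlapping sub-words."""
    n_words = len(kmer) - ws + 1
    return Counter(circular_canonical(kmer[i:i + ws]) for i in range(n_words))


# Sub-words of ACGTAC at ws = 2 are AC, CG, GT, TA, AC.
# GT is the reverse complement of AC, and TA is a rotation of AT.
counts = subword_counts("ACGTAC", 2)
```

Here `counts` holds AC: 3, CG: 1, AT: 1, and the same counts are obtained for the reverse complement GTACGT, which is what makes the filter strand-symmetric.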
Corrected Shannon entropy
The circular equivalence classes have unequal sizes: under a uniform distribution over all 4^{ws} raw words, class j is visited with probability s_j / 4^{ws}, not 1/n_a, where n_a is the number of canonical classes. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.
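The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class j equally among its s_j members, so that each raw word in class j receives probability p_j = f_j / (n_{\text{words}} \cdot s_j). The Shannon entropy over raw words, -\sum_j s_j \, p_j \log p_j, then expands to: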
H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j
The last term is the correction for unequal class sizes. For a uniformly random sequence (f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}), this gives H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2, the maximum entropy over raw words.
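A direct transcription of the formula, as a sketch: the function name `corrected_entropy` and its dictionary-based inputs (counts f_j and class sizes s_j keyed by canonical form) are assumptions made for illustration. Entropy is in nats.

```python
import math
from typing import Mapping


def corrected_entropy(counts: Mapping[str, int], sizes: Mapping[str, int]) -> float:
    """H_corr from class counts f_j and class sizes s_j."""
    n_words = sum(counts.values())
    sum_f_log_f = sum(f * math.log(f) for f in counts.values() if f > 0)
    sum_f_log_s = sum(f * math.log(sizes[w]) for w, f in counts.items() if f > 0)
    return math.log(n_words) - sum_f_log_f / n_words + sum_f_log_s / n_words


# ws = 1: the two canonical classes are "A" = {A, T} and "C" = {C, G}, each of size 2.
sizes_ws1 = {"A": 2, "C": 2}
balanced = corrected_entropy({"A": 8, "C": 8}, sizes_ws1)  # log(4)
single = corrected_entropy({"A": 16}, sizes_ws1)           # log(2)
```

The balanced case recovers the raw-word maximum log(4^{ws}) for ws = 1. The single-class case gives log 2 rather than 0, because the one observed class is unfolded onto its two member words.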
Maximum entropy correction for small samples
With only n_{\text{words}} observations over 4^{ws} possible raw words, the achievable maximum entropy is that of the most uniform integer distribution of n_{\text{words}} counts over 4^{ws} categories.
Let c = \lfloor n_{\text{words}} / 4^{ws} \rfloor and r = n_{\text{words}} \bmod 4^{ws}. The most uniform integer distribution assigns frequency c+1 to r categories and c to the remaining 4^{ws} - r, with the convention 0 \log 0 = 0:
H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]
When n_{\text{words}} < 4^{ws}: c=0, r=n_{\text{words}}, and the formula reduces to H_{\max} = \log(n_{\text{words}}) — a single unified expression covers both regimes. A truly random sequence achieves H_{\text{corr}} \approx H_{\max}.
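The unified expression can be checked numerically with a short sketch. The function name `max_entropy` is hypothetical; the code only restates the formula above, with the 0 log 0 = 0 convention handled explicitly.

```python
import math


def max_entropy(n_words: int, ws: int) -> float:
    """H_max: entropy of the most uniform split of n_words counts over 4**ws raw words."""
    n_raw = 4 ** ws
    c, r = divmod(n_words, n_raw)

    def p_log_p(count: int) -> float:
        # Convention: 0 * log(0) = 0.
        if count == 0:
            return 0.0
        p = count / n_words
        return p * math.log(p)

    return -((n_raw - r) * p_log_p(c) + r * p_log_p(c + 1))


small_sample = max_entropy(26, 6)  # n_words < 4**ws, so this is log(26)
exact_fit = max_entropy(16, 1)     # 16 counts over 4 words, so this is log(4)
uneven = max_entropy(31, 1)        # 31 is not a multiple of 4: slightly below log(4)
```

For example, with k = 31 and ws = 6 there are 26 sub-words for 4096 possible raw words, so the bound is log 26 rather than log 4096.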
Normalized entropy
\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]
Final score
The filter computes \hat{H}(ws) for each word size ws from 1 to ws_max and returns the minimum:
\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if \text{entropy}(kmer) \leq \theta, where \theta is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
Interpretation as an effective number of classes
H_{\text{corr}} is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: N_{\text{eff}} = e^{H_{\text{corr}}} is the number of equiprobable classes that would yield the same entropy.
For the normalized score \hat{H}, dividing by H_{\text{max}} changes the logarithm base. With N_{\text{max}} = e^{H_{\text{max}}}:
\hat{H} = \frac{\log N_{\text{eff}}}{\log N_{\text{max}}} = \log_{N_{\text{max}}} N_{\text{eff}} \quad \Longleftrightarrow \quad N_{\text{eff}} = N_{\text{max}}^{\,\hat{H}}
The perplexity interpretation is therefore preserved: \hat{H} is the logarithm, in base N_{\text{max}}, of the effective number of equiprobable classes.
In the large-sample limit (n_{\text{words}} \gg 4^{ws}), N_{\text{max}} \approx 4^{ws}, giving:
N_{\text{eff}} \approx 4^{ws \cdot \hat{H}}
This has a clean interpretation: ws \cdot \hat{H} is the effective word length (in bases) of a perfectly uniform distribution that would produce the same entropy. At \hat{H} = 1 the full space of 4^{ws} words is used; at \hat{H} = 0.5 with ws=2, only 4^1 = 4 effective classes out of 16 are occupied.
In our actual regime, n_{\text{words}} is small and 4^{ws} can exceed n_{\text{words}}, so H_{\text{max}} < \log(4^{ws}) due to the small-sample correction. The exact effective count is N_{\text{max}}^{\hat{H}}, not 4^{ws \cdot \hat{H}}.
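The gap between the two expressions is easy to see numerically. The sketch below uses a hypothetical helper `effective_classes` and assumes H_{\text{max}} is given in nats; the values k = 31 and ws = 6 (26 sub-words) are chosen only as an example.

```python
import math


def effective_classes(h_hat: float, h_max: float) -> float:
    """N_eff = N_max ** h_hat, with N_max = exp(h_max)."""
    return math.exp(h_hat * h_max)


# Large-sample limit at ws = 2: H_max = log(16), so h_hat = 0.5 gives 4 classes.
large_sample = effective_classes(0.5, math.log(16))

# Small-sample regime at ws = 6 with 26 sub-words: H_max = log(26), so N_max = 26.
small_sample = effective_classes(0.5, math.log(26))  # sqrt(26), about 5.1
naive = 4 ** (6 * 0.5)                               # 64, the large-sample formula
```

With only 26 observations the effective count at \hat{H} = 0.5 is about 5, whereas the large-sample formula 4^{ws \cdot \hat{H}} would report 64.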
Properties
The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:
- Orientation invariance: \text{entropy}(K) = \text{entropy}(\text{revcomp}(K)), guaranteed by the strand-symmetric canonical form.
- Context independence: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.