Kmer entropy filter
Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a normalized Shannon entropy over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.
Sub-word frequencies
For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the \(n_{\text{words}} = k - ws + 1\) overlapping sub-words by sliding a window of length ws:
Each sub-word is mapped to its circular canonical form: the lexicographic minimum among all cyclic rotations of the word and all cyclic rotations of its reverse complement. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let \(s_j\) be the size of equivalence class \(j\) (number of distinct raw words mapping to canonical form \(j\)), and \(f_j\) the count of canonical form \(j\) among the \(n_{\text{words}}\) sub-words (\(\sum_j f_j = n_{\text{words}}\)).
Corrected Shannon entropy
The circular equivalence classes have unequal sizes: under a uniform distribution over all \(4^{ws}\) raw words, class \(j\) is visited with probability \(s_j / 4^{ws}\), not \(1/n_a\). Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.
The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class \(j\) equally among its \(s_j\) members:
The last term is the correction for unequal class sizes. For a uniformly random sequence (\(f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}\)), this gives \(H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2\), the maximum entropy over raw words.
Maximum entropy correction for small samples
With only \(n_{\text{words}}\) observations over \(4^{ws}\) possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over \(4^{ws}\) categories.
Let \(c = \lfloor n_{\text{words}} / 4^{ws} \rfloor\) and \(r = n_{\text{words}} \bmod 4^{ws}\). The most uniform integer distribution assigns frequency \(c+1\) to \(r\) categories and \(c\) to the remaining \(4^{ws} - r\), with the convention \(0 \log 0 = 0\):
When \(n_{\text{words}} < 4^{ws}\): \(c=0\), \(r=n_{\text{words}}\), and the formula reduces to \(H_{\max} = \log(n_{\text{words}})\) — a single unified expression covers both regimes. A truly random sequence achieves \(H_{\text{corr}} \approx H_{\max}\).
Normalized entropy
Final score
The filter computes \(\hat{H}(ws)\) for each word size ws from 1 to ws_max and returns the minimum:
A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if \(\text{entropy}(kmer) \leq \theta\), where \(\theta\) is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.
Interpretation as an effective number of classes
\(H_{\text{corr}}\) is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: \(N_{\text{eff}} = e^{H_{\text{corr}}}\) is the number of equiprobable classes that would yield the same entropy.
For the normalised score \(\hat{H}\), dividing by \(H_{\text{max}}\) changes the logarithm base:
The property is preserved: \(\hat{H}\) is the logarithm (in base \(N_{\text{max}}\)) of the effective number of equi-represented classes.
In the large-sample limit (\(n_{\text{words}} \gg 4^{ws}\)), \(N_{\text{max}} \approx 4^{ws}\), giving:
This has a clean interpretation: \(ws \cdot \hat{H}\) is the effective word length (in bases) of a perfectly uniform distribution that would produce the same entropy. At \(\hat{H} = 1\) the full space of \(4^{ws}\) words is used; at \(\hat{H} = 0.5\) with ws=2, only \(4^1 = 4\) effective classes out of 16 are occupied.
In our actual regime, \(n_{\text{words}}\) is small and \(4^{ws}\) can exceed \(n_{\text{words}}\), so \(H_{\text{max}} < \log(4^{ws})\) due to the small-sample correction. The exact effective count is \(N_{\text{max}}^{\hat{H}}\), not \(4^{ws \cdot \hat{H}}\).
Properties
The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:
- Orientation invariance: \(\text{entropy}(K) = \text{entropy}(\text{revcomp}(K))\), guaranteed by the strand-symmetric canonical form.
- Context independence: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.