Skip to content

Kmer entropy filter

Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a normalized Shannon entropy over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.

Sub-word frequencies

For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the \(n_{\text{words}} = k - ws + 1\) overlapping sub-words by sliding a window of length ws:

\[w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1\]

Each sub-word is mapped to its circular canonical form: the lexicographic minimum among all cyclic rotations of the word and all cyclic rotations of its reverse complement. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let \(s_j\) be the size of equivalence class \(j\) (number of distinct raw words mapping to canonical form \(j\)), and \(f_j\) the count of canonical form \(j\) among the \(n_{\text{words}}\) sub-words (\(\sum_j f_j = n_{\text{words}}\)).

Corrected Shannon entropy

The circular equivalence classes have unequal sizes: under a uniform distribution over all \(4^{ws}\) raw words, class \(j\) is visited with probability \(s_j / 4^{ws}\), not \(1/n_a\). Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.

The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class \(j\) equally among its \(s_j\) members:

\[H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j\]

The last term is the correction for unequal class sizes. For a uniformly random sequence (\(f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}\)), this gives \(H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2\), the maximum entropy over raw words.

Maximum entropy correction for small samples

With only \(n_{\text{words}}\) observations over \(4^{ws}\) possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over \(4^{ws}\) categories.

Let \(c = \lfloor n_{\text{words}} / 4^{ws} \rfloor\) and \(r = n_{\text{words}} \bmod 4^{ws}\). The most uniform integer distribution assigns frequency \(c+1\) to \(r\) categories and \(c\) to the remaining \(4^{ws} - r\), with the convention \(0 \log 0 = 0\):

\[H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]\]

When \(n_{\text{words}} < 4^{ws}\): \(c=0\), \(r=n_{\text{words}}\), and the formula reduces to \(H_{\max} = \log(n_{\text{words}})\) — a single unified expression covers both regimes. A truly random sequence achieves \(H_{\text{corr}} \approx H_{\max}\).

Normalized entropy

\[\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]\]

Final score

The filter computes \(\hat{H}(ws)\) for each word size ws from 1 to ws_max and returns the minimum:

\[\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)\]

A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if \(\text{entropy}(kmer) \leq \theta\), where \(\theta\) is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.

Interpretation as an effective number of classes

\(H_{\text{corr}}\) is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: \(N_{\text{eff}} = e^{H_{\text{corr}}}\) is the number of equiprobable classes that would yield the same entropy.

For the normalised score \(\hat{H}\), dividing by \(H_{\text{max}}\) changes the logarithm base:

\[\hat{H} = \frac{\log N_{\text{eff}}}{\log N_{\text{max}}} = \log_{N_{\text{max}}} N_{\text{eff}} \quad \Longleftrightarrow \quad N_{\text{eff}} = N_{\text{max}}^{\,\hat{H}}\]

The property is preserved: \(\hat{H}\) is the logarithm (in base \(N_{\text{max}}\)) of the effective number of equi-represented classes.

In the large-sample limit (\(n_{\text{words}} \gg 4^{ws}\)), \(N_{\text{max}} \approx 4^{ws}\), giving:

\[N_{\text{eff}} \approx 4^{ws \cdot \hat{H}}\]

This has a clean interpretation: \(ws \cdot \hat{H}\) is the effective word length (in bases) of a perfectly uniform distribution that would produce the same entropy. At \(\hat{H} = 1\) the full space of \(4^{ws}\) words is used; at \(\hat{H} = 0.5\) with ws=2, only \(4^1 = 4\) effective classes out of 16 are occupied.

In our actual regime, \(n_{\text{words}}\) is small and \(4^{ws}\) can exceed \(n_{\text{words}}\), so \(H_{\text{max}} < \log(4^{ws})\) due to the small-sample correction. The exact effective count is \(N_{\text{max}}^{\hat{H}}\), not \(4^{ws \cdot \hat{H}}\).

Properties

The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:

  • Orientation invariance: \(\text{entropy}(K) = \text{entropy}(\text{revcomp}(K))\), guaranteed by the strand-symmetric canonical form.
  • Context independence: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.