Kmer entropy filter

Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a normalized Shannon entropy over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.

Sub-word frequencies

For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the n_{\text{words}} = k - ws + 1 overlapping sub-words by sliding a window of length ws:

w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1

Each sub-word is mapped to its circular canonical form: the lexicographic minimum among all cyclic rotations of the word and all cyclic rotations of its reverse complement. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let s_j be the size of equivalence class j (number of distinct raw words mapping to canonical form j), and f_j the count of canonical form j among the n_{\text{words}} sub-words (\sum_j f_j = n_{\text{words}}).

Corrected Shannon entropy

The circular equivalence classes have unequal sizes: under a uniform distribution over all 4^{ws} raw words, class j is visited with probability s_j / 4^{ws}, not 1/n_a. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.

The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class j equally among its s_j members:

H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j

The last term is the correction for unequal class sizes. For a uniformly random sequence (f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}), this gives H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2, the maximum entropy over raw words.

Maximum entropy correction for small samples

With only n_{\text{words}} observations over 4^{ws} possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over 4^{ws} categories.

Let c = \lfloor n_{\text{words}} / 4^{ws} \rfloor and r = n_{\text{words}} \bmod 4^{ws}. The most uniform integer distribution assigns frequency c+1 to r categories and c to the remaining 4^{ws} - r, with the convention 0 \log 0 = 0:

H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]

When n_{\text{words}} < 4^{ws}: c=0, r=n_{\text{words}}, and the formula reduces to H_{\max} = \log(n_{\text{words}}) — a single unified expression covers both regimes. A truly random sequence achieves H_{\text{corr}} \approx H_{\max}.

Normalized entropy

\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]

Final score

The filter computes \hat{H}(ws) for each word size ws from 1 to ws_max and returns the minimum:

\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)

A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if \text{entropy}(kmer) \leq \theta, where \theta is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.

Interpretation as an effective number of classes

H_{\text{corr}} is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: N_{\text{eff}} = e^{H_{\text{corr}}} is the number of equiprobable classes that would yield the same entropy.

For the normalised score \hat{H}, dividing by H_{\text{max}} changes the logarithm base:

\hat{H} = \frac{\log N_{\text{eff}}}{\log N_{\text{max}}} = \log_{N_{\text{max}}} N_{\text{eff}} \quad \Longleftrightarrow \quad N_{\text{eff}} = N_{\text{max}}^{\,\hat{H}}

The property is preserved: \hat{H} is the logarithm (in base N_{\text{max}}) of the effective number of equi-represented classes.

In the large-sample limit (n_{\text{words}} \gg 4^{ws}), N_{\text{max}} \approx 4^{ws}, giving:

N_{\text{eff}} \approx 4^{ws \cdot \hat{H}}

This has a clean interpretation: ws \cdot \hat{H} is the effective word length (in bases) of a perfectly uniform distribution that would produce the same entropy. At \hat{H} = 1 the full space of 4^{ws} words is used; at \hat{H} = 0.5 with ws=2, only 4^1 = 4 effective classes out of 16 are occupied.

In our actual regime, n_{\text{words}} is small and 4^{ws} can exceed n_{\text{words}}, so H_{\text{max}} < \log(4^{ws}) due to the small-sample correction. The exact effective count is N_{\text{max}}^{\hat{H}}, not 4^{ws \cdot \hat{H}}.

Properties

The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:

Orientation invariance: \text{entropy}(K) = \text{entropy}(\text{revcomp}(K)), guaranteed by the strand-symmetric canonical form.
Context independence: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.

5.5 KiB Raw Blame History