<p>Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a <strong>normalized Shannon entropy</strong> over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.</p>
<p>For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the <span class="arithmatex">\(n_{\text{words}} = k - ws + 1\)</span> overlapping sub-words by sliding a window of length ws along the kmer, one base at a time.</p>
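<p>As a minimal sketch of the sliding window (the function name <code>subwords</code> is illustrative and not taken from the implementation):</p>

```python
def subwords(kmer: str, ws: int) -> list[str]:
    """Return the k - ws + 1 overlapping sub-words of length ws."""
    if not 1 <= ws <= len(kmer):
        raise ValueError("ws must satisfy 1 <= ws <= len(kmer)")
    # Window start positions run from 0 to k - ws inclusive.
    return [kmer[i:i + ws] for i in range(len(kmer) - ws + 1)]


print(subwords("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```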
<p>Each sub-word is mapped to its <strong>circular canonical form</strong>: the lexicographic minimum among all cyclic rotations of the word <strong>and all cyclic rotations of its reverse complement</strong>. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)), so the filter is strand-symmetric. Let <span class="arithmatex">\(s_j\)</span> be the size of equivalence class <span class="arithmatex">\(j\)</span> (the number of distinct raw words mapping to canonical form <span class="arithmatex">\(j\)</span>), and <span class="arithmatex">\(f_j\)</span> the count of canonical form <span class="arithmatex">\(j\)</span> among the <span class="arithmatex">\(n_{\text{words}}\)</span> sub-words (<span class="arithmatex">\(\sum_j f_j = n_{\text{words}}\)</span>).</p>
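<p>One way to obtain the canonical form and the class size <span class="arithmatex">\(s_j\)</span> is to enumerate the class explicitly. The sketch below assumes an uppercase ACGT alphabet; all helper names are hypothetical, and a real implementation may prefer precomputed tables.</p>

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(word: str) -> str:
    """Reverse complement of an uppercase ACGT word."""
    return word.translate(_COMPLEMENT)[::-1]


def rotations(word: str) -> set[str]:
    """All cyclic rotations of a word."""
    return {word[i:] + word[:i] for i in range(len(word))}


def equivalence_class(word: str) -> set[str]:
    """Raw words equivalent to `word`: its rotations and those of its reverse complement."""
    return rotations(word) | rotations(revcomp(word))


def canonical(word: str) -> str:
    """Lexicographic minimum of the equivalence class."""
    return min(equivalence_class(word))


def class_size(word: str) -> int:
    """Number of distinct raw words in the class (s_j in the text)."""
    return len(equivalence_class(word))


print(canonical("GT"), class_size("GT"))  # AC 4
print(canonical("TA"), class_size("TA"))  # AT 2
```

<p>For ws = 2 this enumeration yields six classes whose sizes (2, 2, 2, 2, 4, 4) sum to 16, which illustrates why the classes cannot be treated as equiprobable.</p>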
<p>The circular equivalence classes have unequal sizes: under a uniform distribution over all <span class="arithmatex">\(4^{ws}\)</span> raw words, class <span class="arithmatex">\(j\)</span> is visited with probability <span class="arithmatex">\(s_j / 4^{ws}\)</span>, not <span class="arithmatex">\(1/n_a\)</span>, where <span class="arithmatex">\(n_a\)</span> is the number of canonical classes. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.</p>
<p>The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class <span class="arithmatex">\(j\)</span> equally among its <span class="arithmatex">\(s_j\)</span> members and taking the Shannon entropy of the resulting distribution over raw words.</p>
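<p>Written out, unfolding gives each raw word of class <span class="arithmatex">\(j\)</span> the probability <span class="arithmatex">\(f_j / (n_{\text{words}} \, s_j)\)</span>. The expression below is a reconstruction from that description; the grouping into two sums is a presentation choice:</p>
<div class="arithmatex">\[
H_{\text{corr}}
= -\sum_j s_j \cdot \frac{f_j}{n_{\text{words}}\, s_j} \log \frac{f_j}{n_{\text{words}}\, s_j}
= -\sum_j \frac{f_j}{n_{\text{words}}} \log \frac{f_j}{n_{\text{words}}}
\;+\; \sum_j \frac{f_j}{n_{\text{words}}} \log s_j
\]</div>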
<p>The last term is the correction for unequal class sizes. For a uniformly random sequence (<span class="arithmatex">\(f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}\)</span>), this gives <span class="arithmatex">\(H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2\)</span>, the maximum entropy over raw words.</p>
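<p>The uniform case can be checked in a few lines. Writing <span class="arithmatex">\(p_j = f_j / n_{\text{words}}\)</span> (a shorthand introduced here) and taking <span class="arithmatex">\(H_{\text{corr}}\)</span> as the entropy of the unfolded distribution, in which each of the <span class="arithmatex">\(s_j\)</span> members of class <span class="arithmatex">\(j\)</span> has probability <span class="arithmatex">\(p_j / s_j\)</span>, the substitution <span class="arithmatex">\(p_j = s_j / 4^{ws}\)</span> together with <span class="arithmatex">\(\sum_j s_j = 4^{ws}\)</span> gives:</p>
<div class="arithmatex">\[
H_{\text{corr}}
= -\sum_j s_j \cdot \frac{p_j}{s_j} \log \frac{p_j}{s_j}
= -\sum_j \frac{s_j}{4^{ws}} \log \frac{1}{4^{ws}}
= \log(4^{ws}) \sum_j \frac{s_j}{4^{ws}}
= \log(4^{ws})
\]</div>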
<h2 id="maximum-entropy-correction-for-small-samples">Maximum entropy correction for small samples</h2>
<p>With only <span class="arithmatex">\(n_{\text{words}}\)</span> observations over <span class="arithmatex">\(4^{ws}\)</span> possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over <span class="arithmatex">\(4^{ws}\)</span> categories.</p>
<p>Let <span class="arithmatex">\(c = \lfloor n_{\text{words}} / 4^{ws} \rfloor\)</span> and <span class="arithmatex">\(r = n_{\text{words}} \bmod 4^{ws}\)</span>. The most uniform integer distribution assigns frequency <span class="arithmatex">\(c+1\)</span> to <span class="arithmatex">\(r\)</span> categories and <span class="arithmatex">\(c\)</span> to the remaining <span class="arithmatex">\(4^{ws} - r\)</span>. <span class="arithmatex">\(H_{\max}\)</span> is the Shannon entropy of that distribution, with the convention <span class="arithmatex">\(0 \log 0 = 0\)</span>.</p>
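<p>The entropy of that distribution can be written out as follows (a reconstruction from the definitions of <span class="arithmatex">\(c\)</span> and <span class="arithmatex">\(r\)</span> above, using natural logarithms):</p>
<div class="arithmatex">\[
H_{\max}
= -\, r \cdot \frac{c+1}{n_{\text{words}}} \log \frac{c+1}{n_{\text{words}}}
\;-\; (4^{ws} - r) \cdot \frac{c}{n_{\text{words}}} \log \frac{c}{n_{\text{words}}}
\]</div>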
<p>When <span class="arithmatex">\(n_{\text{words}} &lt; 4^{ws}\)</span>, we have <span class="arithmatex">\(c=0\)</span> and <span class="arithmatex">\(r=n_{\text{words}}\)</span>, and the formula reduces to <span class="arithmatex">\(H_{\max} = \log(n_{\text{words}})\)</span>, so a single expression covers both regimes. A truly random sequence achieves <span class="arithmatex">\(H_{\text{corr}} \approx H_{\max}\)</span>.</p>
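<p>A worked example of both regimes, assuming k = 31 (a value chosen here for illustration only). For ws = 1 there are 31 sub-words over 4 categories, so <span class="arithmatex">\(c = 7\)</span> and <span class="arithmatex">\(r = 3\)</span>; for ws = 6 there are 26 sub-words over 4096 categories, so <span class="arithmatex">\(c = 0\)</span> and <span class="arithmatex">\(r = 26\)</span>:</p>
<div class="arithmatex">\[
\begin{aligned}
ws = 1:\quad & H_{\max} = 3 \cdot \tfrac{8}{31} \log \tfrac{31}{8} + 1 \cdot \tfrac{7}{31} \log \tfrac{31}{7} \approx 1.3847 \quad (\log 4 \approx 1.3863) \\
ws = 6:\quad & H_{\max} = 26 \cdot \tfrac{1}{26} \log 26 = \log 26 \approx 3.2581 \quad (\log 4^{6} \approx 8.3178)
\end{aligned}
\]</div>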
<p>The filter computes the normalised score <span class="arithmatex">\(\hat{H}(ws) = H_{\text{corr}} / H_{\max}\)</span> for each word size ws from 1 to ws_max and returns the <strong>minimum</strong>.</p>
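<p>In symbols, taking the normalised score as the ratio of the two quantities defined above (consistent with the later remark that <span class="arithmatex">\(\hat{H}\)</span> is obtained by dividing by <span class="arithmatex">\(H_{\max}\)</span>):</p>
<div class="arithmatex">\[
\hat{H}(ws) = \frac{H_{\text{corr}}(ws)}{H_{\max}(ws)},
\qquad
\text{entropy}(kmer) = \min_{1 \le ws \le ws_{\max}} \hat{H}(ws)
\]</div>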
<p>A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if <span class="arithmatex">\(\text{entropy}(kmer) \leq \theta\)</span>, where <span class="arithmatex">\(\theta\)</span> is a collection parameter. The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.</p>
<h2 id="interpretation-as-an-effective-number-of-classes">Interpretation as an effective number of classes</h2>
<p><span class="arithmatex">\(H_{\text{corr}}\)</span> is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: <span class="arithmatex">\(N_{\text{eff}} = e^{H_{\text{corr}}}\)</span> is the number of equiprobable classes that would yield the same entropy.</p>
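<p>For the normalised score <span class="arithmatex">\(\hat{H}\)</span>, dividing by <span class="arithmatex">\(H_{\text{max}}\)</span> changes the logarithm base from <span class="arithmatex">\(e\)</span> to <span class="arithmatex">\(N_{\text{max}} = e^{H_{\text{max}}}\)</span>.</p>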
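<p>The change of base follows from the definitions of <span class="arithmatex">\(N_{\text{eff}}\)</span> and <span class="arithmatex">\(\hat{H}\)</span>, taking <span class="arithmatex">\(N_{\text{max}} = e^{H_{\text{max}}}\)</span> as the effective count at maximum entropy:</p>
<div class="arithmatex">\[
\hat{H}
= \frac{H_{\text{corr}}}{H_{\text{max}}}
= \frac{\log N_{\text{eff}}}{\log N_{\text{max}}}
= \log_{N_{\text{max}}} N_{\text{eff}}
\quad\Longleftrightarrow\quad
N_{\text{eff}} = N_{\text{max}}^{\hat{H}}
\]</div>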
<p>The property is preserved: <span class="arithmatex">\(\hat{H}\)</span> is the logarithm (in base <span class="arithmatex">\(N_{\text{max}}\)</span>) of the effective number of equi-represented classes.</p>
<p>When <span class="arithmatex">\(N_{\text{max}} = 4^{ws}\)</span>, this has a clean interpretation: <span class="arithmatex">\(ws \cdot \hat{H}\)</span> is the <strong>effective word length</strong> (in bases) of a perfectly uniform distribution that would produce the same entropy. At <span class="arithmatex">\(\hat{H} = 1\)</span> the full space of <span class="arithmatex">\(4^{ws}\)</span> words is used; at <span class="arithmatex">\(\hat{H} = 0.5\)</span> with ws=2, only <span class="arithmatex">\(4^1 = 4\)</span> effective classes out of 16 are occupied.</p>
<p>In our actual regime, <span class="arithmatex">\(n_{\text{words}}\)</span> is small and <span class="arithmatex">\(4^{ws}\)</span> can exceed <span class="arithmatex">\(n_{\text{words}}\)</span>, so <span class="arithmatex">\(H_{\text{max}} &lt; \log(4^{ws})\)</span> due to the small-sample correction. The exact effective count is <span class="arithmatex">\(N_{\text{max}}^{\hat{H}}\)</span>, not <span class="arithmatex">\(4^{ws \cdot \hat{H}}\)</span>.</p>
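<p>To see the size of the gap, take k = 31 and ws = 6 (values assumed here purely for illustration). Then <span class="arithmatex">\(n_{\text{words}} = 26\)</span>, <span class="arithmatex">\(H_{\text{max}} = \log 26\)</span> and <span class="arithmatex">\(N_{\text{max}} = 26\)</span>, so at <span class="arithmatex">\(\hat{H} = 0.5\)</span>:</p>
<div class="arithmatex">\[
N_{\text{max}}^{\hat{H}} = 26^{0.5} \approx 5.1
\qquad\text{whereas}\qquad
4^{ws \cdot \hat{H}} = 4^{3} = 64
\]</div>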
<h2 id="properties">Properties</h2>
<p>The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:</p>
<ul>
<li><strong>Orientation invariance</strong>: <span class="arithmatex">\(\text{entropy}(K) = \text{entropy}(\text{revcomp}(K))\)</span>, guaranteed by the strand-symmetric canonical form.</li>
<li><strong>Context independence</strong>: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.</li>