docmd/theory/entropy.md

# Kmer entropy filter

Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a **normalized Shannon entropy** over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.

## Sub-word frequencies

For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the $n_{\text{words}} = k - ws + 1$ overlapping sub-words by sliding a window of length ws:

$$w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$

Each sub-word is mapped to its **circular canonical form**: the lexicographic minimum among all cyclic rotations of the word **and all cyclic rotations of its reverse complement**. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let $s_j$ be the size of equivalence class $j$ (number of distinct raw words mapping to canonical form $j$), and $f_j$ the count of canonical form $j$ among the $n_{\text{words}}$ sub-words ($\sum_j f_j = n_{\text{words}}$).

## Corrected Shannon entropy

The circular equivalence classes have unequal sizes: under a uniform distribution over all $4^{ws}$ raw words, class $j$ is visited with probability $s_j / 4^{ws}$, not $1/n_a$. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.

The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class $j$ equally among its $s_j$ members:

$$H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$

The last term is the correction for unequal class sizes. For a uniformly random sequence ($f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}$), this gives $H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2$, the maximum entropy over raw words.

## Maximum entropy correction for small samples

With only $n_{\text{words}}$ observations over $4^{ws}$ possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over $4^{ws}$ categories.

Let $c = \lfloor n_{\text{words}} / 4^{ws} \rfloor$ and $r = n_{\text{words}} \bmod 4^{ws}$. The most uniform integer distribution assigns frequency $c+1$ to $r$ categories and $c$ to the remaining $4^{ws} - r$, with the convention $0 \log 0 = 0$:

$$H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$

When $n_{\text{words}} < 4^{ws}$: $c=0$, $r=n_{\text{words}}$, and the formula reduces to $H_{\max} = \log(n_{\text{words}})$ — a single unified expression covers both regimes. A truly random sequence achieves $H_{\text{corr}} \approx H_{\max}$.

## Normalized entropy

$$\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$

## Final score

The filter computes $\hat{H}(ws)$ for each word size ws from 1 to ws_max and returns the **minimum**:

$$\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$

A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) < \theta$, where $\theta$ is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.

## Interpretation as an effective number of classes

$H_{\text{corr}}$ is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: $N_{\text{eff}} = e^{H_{\text{corr}}}$ is the number of equiprobable classes that would yield the same entropy.

For the normalised score $\hat{H}$, dividing by $H_{\text{max}}$ changes the logarithm base:

$$\hat{H} = \frac{\log N_{\text{eff}}}{\log N_{\text{max}}} = \log_{N_{\text{max}}} N_{\text{eff}} \quad \Longleftrightarrow \quad N_{\text{eff}} = N_{\text{max}}^{\,\hat{H}}$$

The property is preserved: $\hat{H}$ is the logarithm (in base $N_{\text{max}}$) of the effective number of equi-represented classes.

In the large-sample limit ($n_{\text{words}} \gg 4^{ws}$), $N_{\text{max}} \approx 4^{ws}$, giving:

$$N_{\text{eff}} \approx 4^{ws \cdot \hat{H}}$$

This has a clean interpretation: $ws \cdot \hat{H}$ is the **effective word length** (in bases) of a perfectly uniform distribution that would produce the same entropy. At $\hat{H} = 1$ the full space of $4^{ws}$ words is used; at $\hat{H} = 0.5$ with ws=2, only $4^1 = 4$ effective classes out of 16 are occupied.

In our actual regime, $n_{\text{words}}$ is small and $4^{ws}$ can exceed $n_{\text{words}}$, so $H_{\text{max}} < \log(4^{ws})$ due to the small-sample correction. The exact effective count is $N_{\text{max}}^{\hat{H}}$, not $4^{ws \cdot \hat{H}}$.

## Properties

The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:

- **Orientation invariance**: $\text{entropy}(K) = \text{entropy}(\text{revcomp}(K))$, guaranteed by the strand-symmetric canonical form.
- **Context independence**: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.
first implementation but far to be optimal 2026-04-16 22:38:20 +02:00			`# Kmer entropy filter`

			`Low-complexity kmers (polyA, polyT, tandem repeats) are detected and excluded during phase 1. The filter computes a normalized Shannon entropy over sub-words of multiple sizes, corrected for two sources of bias: the small number of observations within a single kmer, and the unequal sizes of circular equivalence classes.`

			`## Sub-word frequencies`

			`For a kmer of length k and a sub-word size ws (1 ≤ ws ≤ ws_max, typically ws_max = 6), extract the $n_{\text{words}} = k - ws + 1$ overlapping sub-words by sliding a window of length ws:`

			`$$w_i = \text{kmer}[i \mathinner{..} i+ws-1], \quad i = 0, \ldots, n_{\text{words}}-1$$`

			Each sub-word is mapped to its circular canonical form: the lexicographic minimum among all cyclic rotations of the word and all cyclic rotations of its reverse complement. This extended equivalence relation ensures that entropy(K) = entropy(revcomp(K)) — the filter is strand-symmetric. Let $s_j$ be the size of equivalence class $j$ (number of distinct raw words mapping to canonical form $j$), and $f_j$ the count of canonical form $j$ among the $n_{\text{words}}$ sub-words ($\sum_j f_j = n_{\text{words}}$).

			`## Corrected Shannon entropy`

			`The circular equivalence classes have unequal sizes: under a uniform distribution over all $4^{ws}$ raw words, class $j$ is visited with probability $s_j / 4^{ws}$, not $1/n_a$. Computing entropy directly over canonical classes therefore underestimates the entropy of a random sequence.`

			`The correction "unfolds" each canonical class back to its member raw words, redistributing each observation of class $j$ equally among its $s_j$ members:`

			`$$H_{\text{corr}} = \log(n_{\text{words}}) - \frac{1}{n_{\text{words}}} \sum_j f_j \log f_j + \frac{1}{n_{\text{words}}} \sum_j f_j \log s_j$$`

			`The last term is the correction for unequal class sizes. For a uniformly random sequence ($f_j \approx n_{\text{words}} \cdot s_j / 4^{ws}$), this gives $H_{\text{corr}} \approx \log(4^{ws}) = 2 \cdot ws \cdot \log 2$, the maximum entropy over raw words.`

			`## Maximum entropy correction for small samples`

			`With only $n_{\text{words}}$ observations over $4^{ws}$ possible raw words, the achievable maximum entropy is bounded by the most uniform integer distribution over $4^{ws}$ categories.`

			`Let $c = \lfloor n_{\text{words}} / 4^{ws} \rfloor$ and $r = n_{\text{words}} \bmod 4^{ws}$. The most uniform integer distribution assigns frequency $c+1$ to $r$ categories and $c$ to the remaining $4^{ws} - r$, with the convention $0 \log 0 = 0$:`

			`$$H_{\max} = -\left[(4^{ws} - r)\,\frac{c}{n_{\text{words}}}\log\frac{c}{n_{\text{words}}} + r\,\frac{c+1}{n_{\text{words}}}\log\frac{c+1}{n_{\text{words}}}\right]$$`

			`When $n_{\text{words}} < 4^{ws}$: $c=0$, $r=n_{\text{words}}$, and the formula reduces to $H_{\max} = \log(n_{\text{words}})$ — a single unified expression covers both regimes. A truly random sequence achieves $H_{\text{corr}} \approx H_{\max}$.`

			`## Normalized entropy`

			`$$\hat{H}(ws) = \frac{H_{\text{corr}}}{H_{\max}} \in [0, 1]$$`

			`## Final score`

			`The filter computes $\hat{H}(ws)$ for each word size ws from 1 to ws_max and returns the minimum:`

			`$$\text{entropy}(kmer) = \min_{ws=1}^{ws_{\max}} \hat{H}(ws)$$`

refactor: update core types and add approximate evidence support 2026-05-26 09:12:41 +02:00			`A value near 0 indicates low complexity (e.g. AAAA…); near 1 indicates high complexity. A kmer is rejected if $\text{entropy}(kmer) < \theta$, where $\theta$ is a collection parameter (default 0.7). The minimum across word sizes ensures that any scale of repetition is detected independently: polyA is caught at ws=1, dinucleotide repeats at ws=2, etc.`
first implementation but far to be optimal 2026-04-16 22:38:20 +02:00
			`## Interpretation as an effective number of classes`

			`$H_{\text{corr}}$ is a standard Shannon entropy over raw words (after unfolding the equivalence classes), so the classical perplexity interpretation holds directly: $N_{\text{eff}} = e^{H_{\text{corr}}}$ is the number of equiprobable classes that would yield the same entropy.`

			`For the normalised score $\hat{H}$, dividing by $H_{\text{max}}$ changes the logarithm base:`

			`$$\hat{H} = \frac{\log N_{\text{eff}}}{\log N_{\text{max}}} = \log_{N_{\text{max}}} N_{\text{eff}} \quad \Longleftrightarrow \quad N_{\text{eff}} = N_{\text{max}}^{\,\hat{H}}$$`

			`The property is preserved: $\hat{H}$ is the logarithm (in base $N_{\text{max}}$) of the effective number of equi-represented classes.`

			`In the large-sample limit ($n_{\text{words}} \gg 4^{ws}$), $N_{\text{max}} \approx 4^{ws}$, giving:`

			`$$N_{\text{eff}} \approx 4^{ws \cdot \hat{H}}$$`

			`This has a clean interpretation: $ws \cdot \hat{H}$ is the effective word length (in bases) of a perfectly uniform distribution that would produce the same entropy. At $\hat{H} = 1$ the full space of $4^{ws}$ words is used; at $\hat{H} = 0.5$ with ws=2, only $4^1 = 4$ effective classes out of 16 are occupied.`

			`In our actual regime, $n_{\text{words}}$ is small and $4^{ws}$ can exceed $n_{\text{words}}$, so $H_{\text{max}} < \log(4^{ws})$ due to the small-sample correction. The exact effective count is $N_{\text{max}}^{\hat{H}}$, not $4^{ws \cdot \hat{H}}$.`

			`## Properties`

			`The entropy score is a function of the kmer sequence alone — it does not depend on the surrounding context or on the position within any genome. Two consequences:`

			`- Orientation invariance: $\text{entropy}(K) = \text{entropy}(\text{revcomp}(K))$, guaranteed by the strand-symmetric canonical form.`
			`- Context independence: the same kmer is always rejected or always kept, regardless of which genome it occurs in, where in that genome it appears, or which strand is considered. The filter defines a fixed partition of the kmer space into low-complexity and valid kmers.`