# Semantic Description of `obikmer` Entropy Functions

The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.

## Core Functionality

- **`KmerEntropy(kmer, k, levelMax)`**:
  Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
  - Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
  - For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
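  - Normalized entropy = `(log(N) − Σ(nᵢ log nᵢ)/N) / emax`, where `N` is the number of sub-words of size `ws` extracted from the *k*-mer, `nᵢ` is the count of the *i*-th distinct canonical word, and `emax` is the theoretical maximum entropy given the sequence length and alphabet constraints.
  - Returns the minimum normalized entropy over `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.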

- **`KmerEntropyFilter`**:
  A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
  - Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
  - Avoids repeated allocations, which is critical for performance in pipelines (e.g., read filtering).
  - **Not goroutine-safe**: each goroutine must instantiate its own filter.

- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
  Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.

- **`Accept(kmer)` / `Entropy(kmer)`**:
  - `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).
  - `Entropy()` computes the same minimum normalized entropy as `KmerEntropy`, but uses the precomputed tables; it is about 10× faster than standalone `KmerEntropy` calls.

## Design Highlights

- **Circular canonical normalization** counts all rotations of a word as the same word (e.g., `AT` ≡ `TA`), so the entropy does not depend on the phase of a repeat.
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
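
To illustrate how a normalization table of the kind the filter precomputes might look, here is a small sketch that maps every 2-bit encoded word of size `ws` to the smallest of its rotations. The encoding (`A=0, C=1, G=2, T=3`, first base in the most significant bits) and the names `buildCircularTable`, `encode`, and `distinctClasses` are assumptions made for this example; the actual table layout in `obikmer` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// buildCircularTable returns a lookup table that maps each 2-bit encoded
// word of size ws to the numerically smallest of its rotations.
func buildCircularTable(ws int) []uint32 {
	n := 1 << (2 * ws)
	mask := uint32(n - 1)
	table := make([]uint32, n)
	for w := 0; w < n; w++ {
		best, rot := uint32(w), uint32(w)
		for i := 1; i < ws; i++ {
			// Rotate left by one base: the leading base moves to the end.
			rot = ((rot << 2) | (rot >> (2 * (ws - 1)))) & mask
			if rot < best {
				best = rot
			}
		}
		table[w] = best
	}
	return table
}

// encode packs a DNA word into 2 bits per base (assumed encoding).
// It expects only the characters A, C, G, T.
func encode(s string) uint32 {
	var w uint32
	for i := 0; i < len(s); i++ {
		w = w<<2 | uint32(strings.IndexByte("ACGT", s[i]))
	}
	return w
}

// distinctClasses counts the distinct canonical values in a table.
func distinctClasses(table []uint32) int {
	seen := map[uint32]bool{}
	for _, v := range table {
		seen[v] = true
	}
	return len(seen)
}

func main() {
	table := buildCircularTable(2)
	// AT and TA are rotations of each other, so they share one table entry.
	fmt.Println(table[encode("AT")] == table[encode("TA")])
	fmt.Println(distinctClasses(table))
}
```

Once such a table exists, normalizing a sub-word during filtering is a single array lookup, which is consistent with the speed and memory-reuse goals listed above.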