# Semantic Description of `obikmer` Entropy Functions
The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
## Core Functionality
- **`KmerEntropy(kmer, k, levelMax)`**:
Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
- Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
- For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
- Normalized entropy = `(log(N) - Σ(nᵢ log nᵢ)/N) / emax`, where `N` is the total number of sub-words counted, `nᵢ` is the count of the *i*-th distinct canonical form, and `emax` is the theoretical maximum entropy given sequence length and alphabet constraints.
- Returns the minimum entropy over `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
- **`KmerEntropyFilter`**:
A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
- Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
- Avoids repeated allocations, which is critical for performance in pipelines (e.g., read filtering).
- **Not goroutine-safe**: each goroutine must instantiate its own filter.
- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.
- **`Accept(kmer)` / `Entropy(kmer)`**:
- `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).
- `Entropy()` computes entropy using precomputed tables, roughly 10× faster than standalone `KmerEntropy` calls.
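
The entropy computation described above can be illustrated with a minimal sketch. It is not the package's implementation: it works on an already decoded DNA string rather than a 2-bit encoded *k*-mer, all function names (`canonicalRotation`, `classCount`, `normalizedEntropy`, `minEntropy`) are hypothetical, the circular canonical form is assumed to be the lexicographically smallest rotation, and `emax` is approximated as `log(min(N, number of canonical classes))`, which may differ from the value `obikmer` actually uses.

```go
package main

import (
	"fmt"
	"math"
)

// canonicalRotation returns the lexicographically smallest rotation of w,
// used here as a stand-in for the circular canonical form (assumption).
func canonicalRotation(w string) string {
	best := w
	for i := 1; i < len(w); i++ {
		if r := w[i:] + w[:i]; r < best {
			best = r
		}
	}
	return best
}

// classCount enumerates all 4^ws DNA words and counts distinct canonical forms.
func classCount(ws int) int {
	const alphabet = "ACGT"
	seen := make(map[string]struct{})
	buf := make([]byte, ws)
	for code := 0; code < 1<<(2*ws); code++ {
		for i, c := ws-1, code; i >= 0; i, c = i-1, c>>2 {
			buf[i] = alphabet[c&3]
		}
		seen[canonicalRotation(string(buf))] = struct{}{}
	}
	return len(seen)
}

// normalizedEntropy evaluates (log N - Σ nᵢ log nᵢ / N) / emax for one word size.
// Assumption: emax = log(min(N, number of canonical classes)), a simplification.
func normalizedEntropy(seq string, ws int) float64 {
	n := len(seq) - ws + 1 // number of overlapping words; caller ensures n >= 2
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[canonicalRotation(seq[i:i+ws])]++
	}
	sum := 0.0
	for _, c := range counts {
		sum += float64(c) * math.Log(float64(c))
	}
	total := float64(n)
	h := math.Log(total) - sum/total
	emax := math.Log(math.Min(total, float64(classCount(ws))))
	return math.Max(0, h/emax) // clamp tiny negative rounding errors
}

// minEntropy returns the minimum normalized entropy over ws in [1, levelMax].
func minEntropy(seq string, levelMax int) float64 {
	best := 1.0
	for ws := 1; ws <= levelMax && ws < len(seq); ws++ {
		best = math.Min(best, normalizedEntropy(seq, ws))
	}
	return best
}

func main() {
	for _, s := range []string{"AAAAAAAA", "ACGTACGT", "ACGTTGCA"} {
		fmt.Printf("%s  %.3f\n", s, minEntropy(s, 3))
	}
}
```

In this sketch a homopolymer scores 0 at every word size, while a short tandem repeat such as `ACGTACGT` looks fully complex at `ws = 1` and is only penalized at larger word sizes, which is why the minimum over all levels is taken.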
## Design Highlights
- **Circular canonical normalization** ensures symmetry: rotations of the same word are counted together (e.g., `AT` ≡ `TA`).
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
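
To make the circular normalization and table precomputation more concrete, here is a hedged sketch of how a normalization table could be built once and reused for every lookup. The encoding (`A=0, C=1, G=2, T=3`, first base in the highest bits), the choice of the smallest rotation as canonical form, and the names `rotate`, `canonical`, and `buildTable` are all assumptions made for illustration, not the actual `obikmer` internals.

```go
package main

import "fmt"

// rotate moves the leading base of a ws-base word (2 bits per base) to the end.
func rotate(code uint32, ws int) uint32 {
	bits := uint(2 * ws)
	mask := uint32(1)<<bits - 1
	return ((code << 2) | (code >> (bits - 2))) & mask
}

// canonical returns the smallest code among all rotations of the word.
func canonical(code uint32, ws int) uint32 {
	best, r := code, code
	for i := 1; i < ws; i++ {
		r = rotate(r, ws)
		if r < best {
			best = r
		}
	}
	return best
}

// buildTable precomputes the canonical code of every word of size ws,
// so that later normalization is a single slice lookup.
func buildTable(ws int) []uint32 {
	table := make([]uint32, 1<<(2*ws))
	for code := range table {
		table[code] = canonical(uint32(code), ws)
	}
	return table
}

func main() {
	table := buildTable(2)
	classes := make(map[uint32]struct{})
	for _, c := range table {
		classes[c] = struct{}{}
	}
	// With A=0, C=1, G=2, T=3: AT = 0b0011 and TA = 0b1100.
	fmt.Println(table[0b0011] == table[0b1100], len(classes)) // prints: true 10
}
```

Because the table depends only on the word size, it can be allocated once per filter and shared across all *k*-mers processed by that filter, which matches the memory-reuse goal stated above.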