obitools4/autodoc/docmd/pkg/obikmer/entropy.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# Semantic Description of `obikmer` Entropy Functions
The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
## Core Functionality
- **`KmerEntropy(kmer, k, levelMax)`**:
  Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
  - Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
  - For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
  - Normalized entropy = `(log(N) − Σ(nᵢ log nᵢ)/N) / emax`, where `N` is the number of sub-words of size `ws`, `nᵢ` is the count of the i-th distinct canonical word, and `emax` is the theoretical maximum entropy given the sequence length and alphabet constraints.
  - Returns the minimum entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
- **`KmerEntropyFilter`**:
  A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
  - Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
  - Avoids repeated allocations, which is critical for performance in pipelines (e.g., read filtering).
  - **Not goroutine-safe**: each goroutine must instantiate its own filter.
- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
  Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.
- **`Accept(kmer)` / `Entropy(kmer)`**:
  - `Accept()` returns `true` if entropy > threshold (i.e., the *k*-mer is complex enough to pass).
  - `Entropy()` computes entropy using the precomputed tables, roughly 10× faster than a standalone `KmerEntropy` call.
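
The steps listed for `KmerEntropy` can be sketched in plain Go as follows. This is a minimal, unoptimized illustration and not the package's implementation: the names `encodeKmer`, `decodeKmer`, `circularCanonical`, `countClasses`, `maxEntropy` and `sketchKmerEntropy` are hypothetical, the bit layout (A=0, C=1, G=2, T=3, first base in the high-order bits) is an assumption, and `emax` is assumed here to be the entropy of the most even possible spread of the `N` sub-words over the available canonical classes.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// encodeKmer packs a DNA string at 2 bits/base.
// Assumed encoding (not confirmed by the source): A=0, C=1, G=2, T=3,
// first base in the highest-order bits. Input must contain only ACGT.
func encodeKmer(seq string) uint64 {
	var kmer uint64
	for i := 0; i < len(seq); i++ {
		kmer = kmer<<2 | uint64(strings.IndexByte("ACGT", seq[i]))
	}
	return kmer
}

// decodeKmer is the inverse of encodeKmer for a k-mer of length k.
func decodeKmer(kmer uint64, k int) string {
	seq := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		seq[i] = "ACGT"[kmer&3]
		kmer >>= 2
	}
	return string(seq)
}

// circularCanonical returns the lexicographically smallest rotation of w.
func circularCanonical(w string) string {
	best := w
	for i := 1; i < len(w); i++ {
		if r := w[i:] + w[:i]; r < best {
			best = r
		}
	}
	return best
}

// countClasses counts the distinct circular canonical words of size ws
// by brute-force enumeration of all 4^ws words.
func countClasses(ws int) int {
	seen := map[string]struct{}{}
	for code := 0; code < 1<<(2*ws); code++ {
		seen[circularCanonical(decodeKmer(uint64(code), ws))] = struct{}{}
	}
	return len(seen)
}

// maxEntropy is the entropy of the most even spread of n words over a
// classes (assumed definition of emax): r classes hold q+1 words and
// a-r classes hold q words.
func maxEntropy(n, a int) float64 {
	if n <= a {
		return math.Log(float64(n))
	}
	q, r := float64(n/a), float64(n%a)
	sum := (float64(a)-r)*q*math.Log(q) + r*(q+1)*math.Log(q+1)
	return math.Log(float64(n)) - sum/float64(n)
}

// sketchKmerEntropy returns the minimum normalized entropy over word
// sizes 1..levelMax (word sizes >= k are skipped).
func sketchKmerEntropy(kmer uint64, k, levelMax int) float64 {
	seq := decodeKmer(kmer, k)
	minE := 1.0
	for ws := 1; ws <= levelMax && ws < k; ws++ {
		n := k - ws + 1 // number of overlapping sub-words
		counts := map[string]int{}
		for i := 0; i+ws <= k; i++ {
			counts[circularCanonical(seq[i:i+ws])]++
		}
		sum := 0.0
		for _, c := range counts {
			sum += float64(c) * math.Log(float64(c))
		}
		e := (math.Log(float64(n)) - sum/float64(n)) / maxEntropy(n, countClasses(ws))
		if e < 0 {
			e = 0 // guard against floating-point round-off
		}
		minE = math.Min(minE, e)
	}
	return minE
}

func main() {
	for _, s := range []string{"AAAAAAAA", "ACGTACGT", "ACGGTCAT"} {
		fmt.Printf("%s levelMax=1: %.3f  levelMax=4: %.3f\n",
			s, sketchKmerEntropy(encodeKmer(s), 8, 1), sketchKmerEntropy(encodeKmer(s), 8, 4))
	}
}
```

In this sketch a homopolymer scores 0 at every word size, while a period-4 repeat such as `ACGTACGT` looks maximally diverse at `ws = 1` and only collapses to 0 once `ws` reaches 4, which is why the minimum is taken across word sizes.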
## Design Highlights
- **Circular canonical normalization** treats all rotations of a sub-word as the same word (e.g., `AT` ≡ `TA`).
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
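
To illustrate the table-based design and the one-filter-per-goroutine rule, here is a hedged sketch of a filter that precomputes a canonical-form lookup table and reuses its count buffer between calls. It is not the `KmerEntropyFilter` source: `sketchFilter`, `newSketchFilter`, `canonicalCode`, `entropy` and `accept` are hypothetical names, the sketch handles a single word size `ws` instead of a range up to `levelMax`, it assumes `k > ws`, and it assumes the same 2-bit encoding (A=0, C=1, G=2, T=3) and the same "most even spread" definition of `emax`.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// canonicalCode returns the smallest rotation of a ws-base word
// packed at 2 bits/base.
func canonicalCode(code uint32, ws int) uint32 {
	mask := uint32(1)<<(2*ws) - 1
	best := code
	for i := 1; i < ws; i++ {
		code = ((code << 2) | (code >> (2 * (ws - 1)))) & mask // rotate by one base
		if code < best {
			best = code
		}
	}
	return best
}

// sketchFilter holds tables built once and buffers reused on every call.
type sketchFilter struct {
	k, ws      int
	threshold  float64
	canon      []uint32 // word code -> canonical code
	counts     []int    // frequency table, reset after each k-mer
	touched    []uint32 // canonical codes seen in the current k-mer
	logN, emax float64
}

func newSketchFilter(k, ws int, threshold float64) *sketchFilter {
	size, n := 1<<(2*ws), k-ws+1
	f := &sketchFilter{
		k: k, ws: ws, threshold: threshold,
		canon:  make([]uint32, size),
		counts: make([]int, size),
		logN:   math.Log(float64(n)),
	}
	classes := 0
	for c := range f.canon {
		f.canon[c] = canonicalCode(uint32(c), ws)
		if f.canon[c] == uint32(c) {
			classes++ // each class has exactly one self-mapping member
		}
	}
	// Assumed emax: entropy of the most even spread of n words over the classes.
	f.emax = f.logN
	if n > classes {
		q, r := float64(n/classes), float64(n%classes)
		f.emax -= ((float64(classes)-r)*q*math.Log(q) + r*(q+1)*math.Log(q+1)) / float64(n)
	}
	return f
}

// entropy mutates f.counts and f.touched, so a filter must not be shared
// between goroutines.
func (f *sketchFilter) entropy(kmer uint64) float64 {
	mask := uint64(len(f.canon) - 1)
	n := f.k - f.ws + 1
	f.touched = f.touched[:0]
	for i := 0; i < n; i++ {
		c := f.canon[(kmer>>(2*i))&mask] // i-th overlapping word, from the low end
		if f.counts[c] == 0 {
			f.touched = append(f.touched, c)
		}
		f.counts[c]++
	}
	sum := 0.0
	for _, c := range f.touched {
		v := float64(f.counts[c])
		sum += v * math.Log(v)
		f.counts[c] = 0 // reset only the touched entries for the next call
	}
	return math.Max(0, math.Min(1, (f.logN-sum/float64(n))/f.emax))
}

func (f *sketchFilter) accept(kmer uint64) bool { return f.entropy(kmer) > f.threshold }

func main() {
	kmers := []uint64{0x0000, 0x1B1B, 0x1E4B, 0xFFFF} // 8-mers, 2 bits/base
	results := make([]bool, len(kmers))
	var wg sync.WaitGroup
	for w := 0; w < 2; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			f := newSketchFilter(8, 2, 0.5) // one filter per goroutine
			for i := w; i < len(kmers); i += 2 {
				results[i] = f.accept(kmers[i])
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(results)
}
```

Resetting only the entries recorded in `touched` keeps the per-call cost proportional to the number of sub-words instead of the table size, which is one plausible way to get the allocation-free behavior described above.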