mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
32 lines
2.0 KiB
Markdown
32 lines
2.0 KiB
Markdown
|
|
# Semantic Description of `obikmer` Entropy Functions
|
|||
|
|
|
|||
|
|
The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
|
|||
|
|
|
|||
|
|
## Core Functionality
|
|||
|
|
|
|||
|
|
- **`KmerEntropy(kmer, k, levelMax)`**:
|
|||
|
|
Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
|
|||
|
|
- Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
|
|||
|
|
- For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
|
|||
|
|
- Normalized entropy = `(log(N) − Σ(nᵢ log nᵢ)/N) / emax`, where `emax` is the theoretical max entropy given sequence length and alphabet constraints.
|
|||
|
|
- Returns min entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
|
|||
|
|
|
|||
|
|
- **`KmerEntropyFilter`**:
|
|||
|
|
A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
|
|||
|
|
- Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
|
|||
|
|
- Avoids repeated allocations — critical for performance in pipelines (e.g., read filtering).
|
|||
|
|
- **Not goroutine-safe** — each thread must instantiate its own filter.
|
|||
|
|
|
|||
|
|
- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
|
|||
|
|
Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.
|
|||
|
|
|
|||
|
|
- **`Accept(kmer)` / `Entropy(kmer)`**:
|
|||
|
|
- `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).
|
|||
|
|
- `Entropy()` computes entropy using precomputed tables — ~10× faster than standalone calls.
|
|||
|
|
|
|||
|
|
## Design Highlights
|
|||
|
|
|
|||
|
|
- **Circular canonical normalization** ensures symmetry (e.g., `AT` ≡ `TA`).
|
|||
|
|
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
|
|||
|
|
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
|