mirror of
https://github.com/metabarcoding/obitools4.git
Semantic Description of obikmer Entropy Functions
The obikmer package provides high-performance tools to compute Shannon entropy for DNA k-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
Core Functionality
- KmerEntropy(kmer, k, levelMax):
  Computes the minimum normalized Shannon entropy across all sub-word sizes from 1 to levelMax.
  - Decodes the encoded k-mer (2 bits/base) into a DNA string.
  - For each word size ws, extracts all overlapping substrings, normalizes them to their circular canonical form, and counts frequencies.
  - Normalized entropy = (log(N) − Σ(nᵢ log nᵢ)/N) / emax, where N is the number of sub-words of size ws, nᵢ is the count of the i-th canonical word, and emax is the theoretical maximum entropy given sequence length and alphabet constraints.
  - Returns the minimum entropy across ws ∈ [1, levelMax]. Values near 0 indicate repeats (e.g., AAAAA…); values near 1 suggest high complexity.
- KmerEntropyFilter:
  A reusable, precomputed filter for efficient batch processing of millions of k-mers:
  - Pre-builds normalization tables (for circular canonical forms), entropy lookup values (emax, logNwords), and frequency tables.
  - Avoids repeated allocations, which is critical for performance in pipelines (e.g., read filtering).
  - Not goroutine-safe: each goroutine must instantiate its own filter.
- NewKmerEntropyFilter(k, levelMax, threshold):
  Initializes a filter with precomputed tables and sets the entropy rejection threshold.
- Accept(kmer) / Entropy(kmer):
  - Accept() returns true if entropy > threshold (i.e., the k-mer is complex enough to pass).
  - Entropy() computes entropy using the precomputed tables, about 10× faster than standalone calls to KmerEntropy.
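
The entropy computation described for KmerEntropy can be sketched in plain Go as follows. This is a minimal illustration, not the obikmer code: the names decodeKmer, minRotation and sketchEntropy are hypothetical, the 2-bit mapping (A=0, C=1, G=2, T=3, first base in the highest bits) is assumed, the circular canonical form is approximated by the lexicographically smallest rotation, and emax is simplified to log(min(N, 4^ws)), which may differ from the value the package actually uses.

```go
package main

import (
	"fmt"
	"math"
)

// decodeKmer unpacks a 2-bit-per-base k-mer into a DNA string.
// Assumption: A=0, C=1, G=2, T=3, with the first base in the highest bits.
func decodeKmer(kmer uint64, k int) string {
	const bases = "ACGT"
	buf := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		buf[i] = bases[kmer&3]
		kmer >>= 2
	}
	return string(buf)
}

// minRotation returns the lexicographically smallest rotation of w,
// used here as a stand-in for the circular canonical form.
func minRotation(w string) string {
	best := w
	for i := 1; i < len(w); i++ {
		if r := w[i:] + w[:i]; r < best {
			best = r
		}
	}
	return best
}

// sketchEntropy returns the minimum normalized entropy over word sizes
// 1..levelMax (word sizes that would yield a single word are skipped).
func sketchEntropy(kmer uint64, k, levelMax int) float64 {
	seq := decodeKmer(kmer, k)
	minH := 1.0
	for ws := 1; ws <= levelMax && ws < k; ws++ {
		n := k - ws + 1 // number of overlapping words of size ws
		counts := make(map[string]int)
		for i := 0; i < n; i++ {
			counts[minRotation(seq[i:i+ws])]++
		}
		sum := 0.0
		for _, c := range counts {
			sum += float64(c) * math.Log(float64(c))
		}
		h := math.Log(float64(n)) - sum/float64(n)
		if h < 0 {
			h = 0 // guard against floating-point rounding
		}
		// Simplified emax (assumption): at most min(n, 4^ws) distinct words.
		emax := math.Log(math.Min(float64(n), math.Pow(4, float64(ws))))
		if e := h / emax; e < minH {
			minH = e
		}
	}
	return minH
}

func main() {
	polyA := uint64(0)       // AAAAAAAA under the assumed encoding
	mixed := uint64(0x1B1B)  // ACGTACGT under the assumed encoding
	fmt.Printf("%s %.3f\n", decodeKmer(polyA, 8), sketchEntropy(polyA, 8, 3))
	fmt.Printf("%s %.3f\n", decodeKmer(mixed, 8), sketchEntropy(mixed, 8, 3))
}
```

Taking the minimum over word sizes means that a k-mer which looks balanced at the single-base level, such as ACGTACGT, is still penalized once its periodicity shows up at a larger word size.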
Design Highlights
- Circular canonical normalization ensures symmetry (e.g., AT ≡ TA).
- Sub-word-level entropy captures local repetitiveness better than global k-mer uniqueness.
- Optimized for speed and memory reuse, suitable for large-scale genomic data filtering.
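
To illustrate how precomputed tables and buffer reuse might fit together, here is a hedged sketch of a filter restricted to a single word size. Everything in it is hypothetical (wordFilter, newWordFilter, countAccepted, the simplified emax, and the 2-bit encoding A=0, C=1, G=2, T=3); it does not reproduce the obikmer implementation, which handles all word sizes up to levelMax. It shows a rotation-based normalization table in which AT and TA share one entry, a frequency table that is reset in place after each call, and one filter instance per goroutine.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// wordFilter is a hypothetical single-word-size analogue of a
// precomputed entropy filter. It is not safe for concurrent use.
type wordFilter struct {
	k, ws     int
	threshold float64
	canon     []int     // canon[w]: smallest rotation of the 2-bit encoded word w
	counts    []int     // reusable frequency table indexed by canonical code
	nLogN     []float64 // nLogN[n] = n*log(n) for n in 0..k
	emax      float64   // simplified maximum entropy (assumption)
}

func newWordFilter(k, ws int, threshold float64) *wordFilter {
	size := 1 << uint(2*ws)
	f := &wordFilter{k: k, ws: ws, threshold: threshold,
		canon: make([]int, size), counts: make([]int, size),
		nLogN: make([]float64, k+1)}
	mask := size - 1
	for w := 0; w < size; w++ {
		best, r := w, w
		for i := 1; i < ws; i++ {
			// Rotate left by one base (2 bits) within a ws-base word.
			r = ((r << 2) & mask) | (r >> uint(2*(ws-1)))
			if r < best {
				best = r
			}
		}
		f.canon[w] = best
	}
	for n := 1; n <= k; n++ {
		f.nLogN[n] = float64(n) * math.Log(float64(n))
	}
	f.emax = math.Log(math.Min(float64(k-ws+1), float64(size)))
	return f
}

// Entropy computes the normalized entropy without allocating.
func (f *wordFilter) Entropy(kmer uint64) float64 {
	n := f.k - f.ws + 1
	mask := uint64(len(f.canon) - 1)
	for i := 0; i < n; i++ {
		f.counts[f.canon[(kmer>>(2*uint(i)))&mask]]++
	}
	sum := 0.0
	for i := 0; i < n; i++ {
		c := f.canon[(kmer>>(2*uint(i)))&mask]
		sum += f.nLogN[f.counts[c]] // each class contributes once
		f.counts[c] = 0             // reset in place for the next call
	}
	return (f.nLogN[n] - sum) / float64(n) / f.emax
}

// Accept reports whether the k-mer is complex enough to pass.
func (f *wordFilter) Accept(kmer uint64) bool {
	return f.Entropy(kmer) > f.threshold
}

// countAccepted gives each goroutine its own filter instance.
func countAccepted(kmers []uint64, k, ws int, threshold float64, workers int) int {
	var wg sync.WaitGroup
	accepted := make([]int, workers)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			f := newWordFilter(k, ws, threshold) // private to this goroutine
			for i := id; i < len(kmers); i += workers {
				if f.Accept(kmers[i]) {
					accepted[id]++
				}
			}
		}(w)
	}
	wg.Wait()
	total := 0
	for _, n := range accepted {
		total += n
	}
	return total
}

func main() {
	kmers := []uint64{0, 0x1B1B, 0, 0x1B1B} // AAAAAAAA and ACGTACGT, twice each
	fmt.Println(countAccepted(kmers, 8, 2, 0.5, 2))
}
```

Because the frequency table is mutated during every call, sharing one instance between goroutines would corrupt the counts, which is why the sketch builds a separate filter inside each worker.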