Semantic Description of obikmer Package
This Go package provides utilities for k-mer (specifically 4-mer) counting and comparison of biological sequences.
Core Functionalities
- Count4Mer(seq, buffer, counts)
  Counts occurrences of each of the 256 possible 4-mers (4-nucleotide subsequences) in a BioSequence.
  - Encodes each 4-mer into an integer (0–255) using Encode4mer.
  - Populates a fixed-size [256]uint16 table (Table4mer) with counts.
  - Reuses or allocates the counts buffer as needed.
- Common4Mer(count1, count2)
  Computes the intersection of two 4-mer frequency profiles: the sum over all 4-mers k of min(count1[k], count2[k]).
  Used to measure shared content between sequences.
- Sum4Mer(count)
  Returns the total number of 4-mers in a profile (i.e., the sum over all entries).
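The counting and intersection steps above can be sketched in a few lines of Go. This is a minimal illustration, not the package's code: the lowercase names (table4mer, baseCode, count4mers, common4mer, sum4mer) are hypothetical, the sketch works on plain byte slices rather than a BioSequence, and it assumes a 2-bit encoding (A=0, C=1, G=2, T=3, first nucleotide most significant) and that windows containing any other symbol are skipped. The real Encode4mer may make different choices.

```go
package main

import "fmt"

// table4mer mirrors the [256]uint16 layout described above:
// one counter per possible 4-mer code.
type table4mer [256]uint16

// baseCode maps a nucleotide to a 2-bit code.
// Assumed encoding: A=0, C=1, G=2, T=3; anything else is reported as invalid.
func baseCode(b byte) (uint8, bool) {
	switch b {
	case 'a', 'A':
		return 0, true
	case 'c', 'C':
		return 1, true
	case 'g', 'G':
		return 2, true
	case 't', 'T':
		return 3, true
	}
	return 0, false
}

// count4mers slides a 4-nucleotide window over seq and increments the
// counter of each window's code. A nil counts table is allocated;
// a non-nil one is reset and reused.
func count4mers(seq []byte, counts *table4mer) *table4mer {
	if counts == nil {
		counts = new(table4mer)
	} else {
		*counts = table4mer{}
	}
	var code uint8
	valid := 0 // consecutive valid bases seen so far
	for _, b := range seq {
		c, ok := baseCode(b)
		if !ok {
			valid, code = 0, 0 // assumption: skip windows with non-ACGT symbols
			continue
		}
		// Shifting a uint8 left by 2 drops the oldest base automatically.
		code = code<<2 | c
		valid++
		if valid >= 4 {
			counts[code]++
		}
	}
	return counts
}

// common4mer sums min(a[k], b[k]) over all 256 codes.
func common4mer(a, b *table4mer) int {
	sum := 0
	for k := range a {
		if a[k] < b[k] {
			sum += int(a[k])
		} else {
			sum += int(b[k])
		}
	}
	return sum
}

// sum4mer returns the total number of 4-mers recorded in a profile.
func sum4mer(a *table4mer) int {
	sum := 0
	for _, n := range a {
		sum += int(n)
	}
	return sum
}

func main() {
	c1 := count4mers([]byte("ACGTACGT"), nil)
	c2 := count4mers([]byte("ACGTTTTT"), nil)
	fmt.Println(sum4mer(c1), sum4mer(c2), common4mer(c1, c2)) // 5 5 1
}
```

Under this assumed encoding, ACGT maps to 0·64 + 1·16 + 2·4 + 3 = 27. The two example sequences each contain five 4-mers and share a single ACGT, since the intersection takes the smaller of the two counts (2 and 1).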
Distance & Similarity Bounds
- LCS4MerBounds(count1, count2)
  Estimates bounds for the Longest Common Subsequence (LCS) length between two sequences based on their 4-mer profiles, where common_kmers = Common4Mer(count1, count2) and total1, total2 are the Sum4Mer values of the two profiles:
  - Lower bound: common_kmers + (3 if common_kmers > 0 else 0)
  - Upper bound: min(total1, total2) + 3 − ceil((min(total1, total2) − common_kmers)/4)
  Leverages the fact that overlapping k-mers constrain possible alignments.
- Error4MerBounds(count1, count2)
  Estimates bounds for the number of alignment errors (e.g., mismatches + indels):
  - Upper bound: max(total1, total2) − common_kmers + 2 * floor((common_kmers + 5)/8)
  - Lower bound: ceil(upper_bound / 4)
  Provides fast, approximate error estimates without full alignment.
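The four formulas above can be worked through with plain integer arithmetic. The following is a sketch under stated assumptions rather than the package's implementation: the function names (ceilDiv4, lcsBounds, errorBounds) are hypothetical, and the functions take the two profile totals and the shared 4-mer count directly as integers instead of Table4mer values.

```go
package main

import "fmt"

// ceilDiv4 returns ceil(x/4) for non-negative x using integer arithmetic.
func ceilDiv4(x int) int {
	return (x + 3) / 4
}

// lcsBounds applies the LCS formulas listed above.
// total1 and total2 are the 4-mer totals of the two profiles,
// common is the number of shared 4-mers (common <= min(total1, total2)).
func lcsBounds(total1, total2, common int) (lower, upper int) {
	minTotal := total1
	if total2 < minTotal {
		minTotal = total2
	}
	lower = common
	if common > 0 {
		lower += 3
	}
	upper = minTotal + 3 - ceilDiv4(minTotal-common)
	return lower, upper
}

// errorBounds applies the error formulas listed above.
func errorBounds(total1, total2, common int) (lower, upper int) {
	maxTotal := total1
	if total2 > maxTotal {
		maxTotal = total2
	}
	// Integer division already floors for non-negative operands.
	upper = maxTotal - common + 2*((common+5)/8)
	lower = ceilDiv4(upper)
	return lower, upper
}

func main() {
	lcsLo, lcsHi := lcsBounds(10, 8, 6)
	errLo, errHi := errorBounds(10, 8, 6)
	fmt.Println(lcsLo, lcsHi) // 9 10
	fmt.Println(errLo, errHi) // 2 6
}
```

With totals 10 and 8 and 6 shared 4-mers, the LCS bounds are 6 + 3 = 9 and 8 + 3 − ceil(2/4) = 10, and the error bounds are ceil(6/4) = 2 and 10 − 6 + 2·floor(11/8) = 6. The recurring +3 reflects that a sequence of n nucleotides contains n − 3 overlapping 4-mers.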
Use Case
Designed for high-performance comparison of NGS reads (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.