obitools4/autodoc/docmd/pkg/obikmer/counting.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

# Semantic Description of `obikmer` Package
This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.
## Core Functionalities
1. **`Count4Mer(seq, buffer, counts)`**
Counts occurrences of every 4-mer (4-nucleotide subsequence) in a `BioSequence`; there are 4^4 = 256 possible 4-mers.
- Encodes each 4-mer into an integer (0 to 255) using `Encode4mer`.
- Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.
- Reuses or allocates the `counts` buffer as needed.
2. **`Common4Mer(count1, count2)`**
Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.
Used to measure shared content between sequences.
3. **`Sum4Mer(count)`**
Returns the total number of 4-mers in a profile (i.e., sum over all entries).
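
The three operations above can be sketched in a few lines of Go. This is a minimal illustration, not the package's actual code: the lowercase names (`baseCode`, `count4mer`, `common4mer`, `sum4mer`) are hypothetical stand-ins, the sketch works on raw bytes rather than a `BioSequence`, and both the 2-bit encoding (A=0, C=1, G=2, T=3) and the skipping of windows that contain a non-ACGT symbol are assumptions.

```go
package main

import "fmt"

// Table4mer mirrors the fixed-size [256]uint16 table named in the text.
type Table4mer [256]uint16

// baseCode maps a nucleotide to a 2-bit code (assumed: A=0, C=1, G=2, T=3).
// The boolean is false for any other symbol.
func baseCode(b byte) (int, bool) {
	switch b {
	case 'a', 'A':
		return 0, true
	case 'c', 'C':
		return 1, true
	case 'g', 'G':
		return 2, true
	case 't', 'T':
		return 3, true
	}
	return 0, false
}

// count4mer is a hypothetical stand-in for Count4Mer. It reuses the
// counts table when one is given and allocates one otherwise.
func count4mer(seq []byte, counts *Table4mer) *Table4mer {
	if counts == nil {
		counts = new(Table4mer)
	} else {
		*counts = Table4mer{} // reset the reused table
	}
	code, valid := 0, 0
	for _, b := range seq {
		c, ok := baseCode(b)
		if !ok {
			// Assumption: a non-ACGT symbol invalidates the current window.
			code, valid = 0, 0
			continue
		}
		code = ((code << 2) | c) & 0xFF // keep only the last 4 bases (8 bits)
		valid++
		if valid >= 4 {
			counts[code]++
		}
	}
	return counts
}

// common4mer sums min(c1[k], c2[k]) over all 256 codes.
func common4mer(c1, c2 *Table4mer) int {
	total := 0
	for i := 0; i < 256; i++ {
		if c1[i] < c2[i] {
			total += int(c1[i])
		} else {
			total += int(c2[i])
		}
	}
	return total
}

// sum4mer returns the total number of 4-mers recorded in a table.
func sum4mer(c *Table4mer) int {
	total := 0
	for i := 0; i < 256; i++ {
		total += int(c[i])
	}
	return total
}

func main() {
	a := count4mer([]byte("ACGTACGT"), nil)
	b := count4mer([]byte("ACGTTTTT"), nil)
	fmt.Println("4-mers in a:", sum4mer(a))       // 5 windows
	fmt.Println("ACGT count in a:", a[27])        // ACGT encodes to 27
	fmt.Println("common 4-mers:", common4mer(a, b)) // only ACGT is shared
}
```

A sequence of length n yields at most n - 3 windows, so `sum4mer` of an 8-base sequence is 5. Under the assumed encoding, `ACGT` maps to 27 (binary `00 01 10 11`).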
## Distance & Similarity Bounds
4. **`LCS4MerBounds(count1, count2)`**
Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:
- **Lower bound**: `common_kmers + (3 if common > 0 else 0)`
- **Upper bound**: `min(total1, total2) + 3 - ceil((min_total - common)/4)`
Leverages the fact that overlapping k-mers constrain possible alignments.
5. **`Error4MerBounds(count1, count2)`**
Estimates bounds for *alignment errors* (e.g., mismatches + indels):
- **Upper bound**: `max_total - common_kmers + 2 * floor((common_kmers + 5)/8)`
- **Lower bound**: `ceil(upper_bound / 4)`
Provides fast, approximate error estimates without full alignment.
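
As a worked illustration of both bound computations, here is a minimal Go sketch. It assumes the LCS upper bound reads `min_total + 3 - ceil((min_total - common)/4)` and the error upper bound reads `max_total - common + 2 * floor((common + 5)/8)`. All function names are hypothetical stand-ins rather than the package's API, and the return order (lower, upper) is also an assumption.

```go
package main

import "fmt"

// Table4mer mirrors the fixed-size [256]uint16 table named in the text.
type Table4mer [256]uint16

// sum4mer returns the total number of 4-mers recorded in a table.
func sum4mer(c *Table4mer) int {
	total := 0
	for i := 0; i < 256; i++ {
		total += int(c[i])
	}
	return total
}

// common4mer sums min(c1[k], c2[k]) over all 256 codes.
func common4mer(c1, c2 *Table4mer) int {
	total := 0
	for i := 0; i < 256; i++ {
		if c1[i] < c2[i] {
			total += int(c1[i])
		} else {
			total += int(c2[i])
		}
	}
	return total
}

// ceilDiv computes ceil(a/b) for a >= 0 and b > 0 using integer arithmetic.
func ceilDiv(a, b int) int {
	return (a + b - 1) / b
}

// lcs4merBounds sketches the LCS bounds (hypothetical name and return order).
func lcs4merBounds(c1, c2 *Table4mer) (lower, upper int) {
	s1, s2 := sum4mer(c1), sum4mer(c2)
	minTotal := s1
	if s2 < minTotal {
		minTotal = s2
	}
	common := common4mer(c1, c2)

	lower = common
	if common > 0 {
		lower += 3 // one shared 4-mer already implies 4 matching bases
	}
	upper = minTotal + 3 - ceilDiv(minTotal-common, 4)
	return lower, upper
}

// error4merBounds sketches the error bounds (hypothetical name and return order).
func error4merBounds(c1, c2 *Table4mer) (lower, upper int) {
	s1, s2 := sum4mer(c1), sum4mer(c2)
	maxTotal := s1
	if s2 > maxTotal {
		maxTotal = s2
	}
	common := common4mer(c1, c2)

	upper = maxTotal - common + 2*((common+5)/8) // integer division is floor here
	lower = ceilDiv(upper, 4)
	return lower, upper
}

func main() {
	// Hand-filled profiles: 5 four-mers each, exactly 1 of them shared.
	var a, b Table4mer
	a[27], a[108], a[177], a[198] = 2, 1, 1, 1
	b[27], b[111], b[191], b[255] = 1, 1, 1, 2

	lcsLo, lcsHi := lcs4merBounds(&a, &b)
	errLo, errHi := error4merBounds(&a, &b)
	fmt.Println("LCS bounds:", lcsLo, lcsHi)   // 4 7
	fmt.Println("error bounds:", errLo, errHi) // 1 4
}
```

With totals of 5 and 5 and one shared 4-mer, the LCS bounds are 1 + 3 = 4 and 5 + 3 - ceil(4/4) = 7, and the error bounds are ceil(4/4) = 1 and 5 - 1 + 2 * floor(6/8) = 4. Only integer arithmetic is needed, which keeps the check cheap enough to run before any alignment.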
## Use Case
Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.