
# Semantic Description of `obikmer` Package

This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.

## Core Functionalities

1. **`Count4Mer(seq, buffer, counts)`**
   Counts occurrences of each of the 256 possible 4-mers (4-nucleotide subsequences) in a `BioSequence`.
   - Encodes each 4-mer into an integer (0–255) using `Encode4mer`.
   - Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.
   - Reuses the `counts` table when one is supplied; otherwise allocates a new one.

2. **`Common4Mer(count1, count2)`**
   Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.
   Used to measure shared content between sequences.

3. **`Sum4Mer(count)`**
   Returns the total number of 4-mers in a profile (i.e., sum over all entries).
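
The three operations above can be sketched in a few lines of standalone Go. This is an illustration, not the package's implementation: the lowercase names (`nucCode`, `count4mer`, `common4mer`, `sum4mer`) are hypothetical, the 2-bit nucleotide coding (A=0, C=1, G=2, T=3) is an assumption, and so is the choice to skip any window that contains a non-ACGT symbol. Only the `[256]uint16` table layout comes from the description.

```go
package main

import "fmt"

// Table4mer mirrors the [256]uint16 table described above:
// one counter per possible 4-mer code (4^4 = 256).
type Table4mer [256]uint16

// nucCode maps a nucleotide to a 2-bit code.
// The A=0, C=1, G=2, T=3 assignment is an assumption of this sketch.
func nucCode(b byte) (uint8, bool) {
	switch b {
	case 'a', 'A':
		return 0, true
	case 'c', 'C':
		return 1, true
	case 'g', 'G':
		return 2, true
	case 't', 'T':
		return 3, true
	}
	return 0, false
}

// count4mer counts every overlapping 4-mer of seq into a fresh table.
// Windows containing a non-ACGT symbol are skipped (assumption).
func count4mer(seq []byte) *Table4mer {
	var counts Table4mer
	var code uint8
	valid := 0 // consecutive valid nucleotides seen so far
	for _, b := range seq {
		c, ok := nucCode(b)
		if !ok {
			valid, code = 0, 0
			continue
		}
		// Shifting a uint8 by 2 bits drops the oldest nucleotide,
		// so code always holds the last four nucleotides.
		code = code<<2 | c
		valid++
		if valid >= 4 {
			counts[code]++
		}
	}
	return &counts
}

// common4mer sums min(c1[k], c2[k]) over all 256 codes.
func common4mer(c1, c2 *Table4mer) int {
	sum := 0
	for i := range c1 {
		if c1[i] < c2[i] {
			sum += int(c1[i])
		} else {
			sum += int(c2[i])
		}
	}
	return sum
}

// sum4mer returns the total number of 4-mers recorded in a table.
func sum4mer(c *Table4mer) int {
	sum := 0
	for _, n := range c {
		sum += int(n)
	}
	return sum
}

func main() {
	a := count4mer([]byte("ACGTACGT"))
	b := count4mer([]byte("ACGTAAAA"))
	fmt.Println(sum4mer(a), sum4mer(b), common4mer(a, b)) // prints "5 5 2"
}
```

Both 8-nucleotide sequences contain 8 - 3 = 5 overlapping 4-mers, and they share `ACGT` and `CGTA` once each, which gives a common count of 2.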

## Distance & Similarity Bounds

4. **`LCS4MerBounds(count1, count2)`**
   Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:
   - **Lower bound**: `common_kmers + (3 if common_kmers > 0 else 0)`
   - **Upper bound**: `min_total + 3 - ceil((min_total - common_kmers) / 4)`, where `min_total = min(total1, total2)`
   Both bounds rely on the overlap between consecutive 4-mers: a run of `n` overlapping 4-mers spans `n + 3` nucleotides, hence the `+ 3` terms.

5. **`Error4MerBounds(count1, count2)`**
   Estimates bounds for *alignment errors* (e.g., mismatches + indels):
   - **Upper bound**: `max_total - common_kmers + 2 * floor((common_kmers + 5) / 8)`, where `max_total = max(total1, total2)`
   - **Lower bound**: `ceil(upper_bound / 4)`
   Provides fast, approximate error estimates without full alignment.
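
To make the arithmetic concrete, the four formulas above can be written with integer operations only. This is a minimal sketch under stated assumptions: the function names `lcs4merBounds` and `error4merBounds` are hypothetical, and the functions take the two 4-mer totals and the common count as plain integers instead of profile tables. The ceiling is computed as `(x + 3) / 4`, which is valid here because `x` is never negative (the common count cannot exceed either total).

```go
package main

import "fmt"

// lcs4merBounds applies the two LCS formulas listed above.
// total1, total2: number of 4-mers in each sequence.
// common: sum of per-4-mer minima of the two profiles.
func lcs4merBounds(total1, total2, common int) (lower, upper int) {
	minTotal := total1
	if total2 < minTotal {
		minTotal = total2
	}
	// ceil((minTotal - common) / 4) in integer arithmetic.
	upper = minTotal + 3 - (minTotal-common+3)/4
	lower = common
	if common > 0 {
		lower += 3
	}
	return lower, upper
}

// error4merBounds applies the two error formulas listed above.
func error4merBounds(total1, total2, common int) (lower, upper int) {
	maxTotal := total1
	if total2 > maxTotal {
		maxTotal = total2
	}
	// Integer division of non-negative values is floor division.
	upper = maxTotal - common + 2*((common+5)/8)
	// ceil(upper / 4) in integer arithmetic.
	lower = (upper + 3) / 4
	return lower, upper
}

func main() {
	// Two sequences with 5 four-mers each, 2 of them shared.
	lcsLo, lcsHi := lcs4merBounds(5, 5, 2)
	errLo, errHi := error4merBounds(5, 5, 2)
	fmt.Println(lcsLo, lcsHi) // prints "5 7"
	fmt.Println(errLo, errHi) // prints "1 3"
}
```

With totals of 5 and 5 and 2 shared 4-mers, the LCS bounds are 2 + 3 = 5 and 5 + 3 - ceil(3/4) = 7, and the error bounds are ceil(3/4) = 1 and 5 - 2 + 2 * floor(7/8) = 3.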

## Use Case

Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.