2026-04-07 08:36:50 +02:00
# Semantic Description of `obikmer` Package
The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
## Core Encoding & Decoding
- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact uint64 representations that use 2 bits per base (at most 62 bits, so k ≤ 31), reserving the top 2 bits for optional error markers.
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
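The 2-bit packing described above can be illustrated with a minimal sketch. The function names (`baseCode`, `encodeSketch`, `decodeSketch`), the code assignment A=0, C=1, G=2, T/U=3, and the boolean error return are assumptions made for illustration; they are not taken from the `obikmer` source, which may use a different mapping and panics on invalid input.

```go
package main

// baseCode maps a nucleotide to an ASSUMED 2-bit code (A=0, C=1, G=2, T/U=3).
// The second result is false for any other byte (for example N).
func baseCode(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't', 'U', 'u':
		return 3, true
	}
	return 0, false
}

// encodeSketch (hypothetical name) packs 1 to 31 bases into the low
// 2*len(seq) bits, first base in the most significant position.
// The top 2 bits of the result are always left at zero.
func encodeSketch(seq string) (uint64, bool) {
	if len(seq) == 0 || len(seq) > 31 {
		return 0, false
	}
	var kmer uint64
	for i := 0; i < len(seq); i++ {
		code, ok := baseCode(seq[i])
		if !ok {
			return 0, false
		}
		kmer = kmer<<2 | code
	}
	return kmer, true
}

// decodeSketch (hypothetical name) unpacks k bases from the low 2*k bits.
func decodeSketch(kmer uint64, k int) string {
	const alphabet = "ACGT"
	out := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		out[i] = alphabet[kmer&3]
		kmer >>= 2
	}
	return string(out)
}
```

Under these assumptions, `encodeSketch("ACGT")` yields the bit pattern `00 01 10 11`, and `decodeSketch` applied to that value with `k = 4` returns `"ACGT"`.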
## Iterators (Memory-Efficient Streaming)
- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
- **`IterCanonicalKmersWithErrors`**: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0 to 3). Only valid for **odd k ≤ 31**.
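One common way to stream overlapping k-mers without building a slice is a rolling 2-bit window passed to a callback. The sketch below assumes this technique; the name `forEachKmerSketch`, the callback signature, the A=0, C=1, G=2, T/U=3 coding, and the choice to skip any window containing an ambiguous base are illustrative assumptions, not the documented behavior of `IterKmers`.

```go
package main

// forEachKmerSketch (hypothetical name) calls yield once per overlapping
// k-mer of seq, in order, and stops early if yield returns false.
// Windows that contain a base other than A/C/G/T/U are skipped, which is
// an assumption of this sketch.
func forEachKmerSketch(seq string, k int, yield func(kmer uint64) bool) {
	if k < 1 || k > 31 || len(seq) < k {
		return
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // keeps the low 2*k bits
	var kmer uint64
	valid := 0 // consecutive unambiguous bases ending at position i
	for i := 0; i < len(seq); i++ {
		var code uint64
		switch seq[i] {
		case 'A', 'a':
			code = 0
		case 'C', 'c':
			code = 1
		case 'G', 'g':
			code = 2
		case 'T', 't', 'U', 'u':
			code = 3
		default:
			// Ambiguous base: restart the window after it.
			valid, kmer = 0, 0
			continue
		}
		// Shift the new base in and drop the oldest one.
		kmer = (kmer<<2 | code) & mask
		valid++
		if valid >= k && !yield(kmer) {
			return
		}
	}
}
```

Each step costs one shift, one OR, and one mask, so the loop itself does not allocate; whatever the callback does with the value (for example, inserting it into a bitmap) is outside this sketch.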
## Error Handling & Markers
- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
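Storing a 2-bit marker in the top of a uint64 reduces to a few mask operations. The following is a minimal sketch of that idea; the names `setErrSketch`, `getErrSketch`, and `clearErrSketch` and their signatures are hypothetical and may differ from the package's `SetKmerError`, `GetKmerError`, and `ClearKmerError`.

```go
package main

const (
	errShift = 62                    // position of the 2 marker bits
	errMask  = uint64(3) << errShift // selects bits 62 and 63
)

// setErrSketch (hypothetical name) replaces the marker bits of kmer with
// the low 2 bits of code, leaving the 62 sequence bits untouched.
func setErrSketch(kmer, code uint64) uint64 {
	return kmer&^errMask | (code&3)<<errShift
}

// getErrSketch (hypothetical name) returns the marker as a value in 0..3.
func getErrSketch(kmer uint64) uint64 {
	return kmer >> errShift
}

// clearErrSketch (hypothetical name) zeroes the marker bits.
func clearErrSketch(kmer uint64) uint64 {
	return kmer &^ errMask
}
```

A downstream filter could then keep only values whose marker is zero, or clear the marker before using the value as a plain k-mer.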
## Reverse Complement & Circular Normalization
- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
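The difference between the two normal forms can be made concrete on encoded values. This sketch assumes the coding A=0, C=1, G=2, T=3 with the first base in the most significant position, so that numeric order matches lexicographic order and complementing a base is `3 - code`. It also assumes the marker bits are already cleared and 1 ≤ k ≤ 31. All three function names are hypothetical.

```go
package main

// revCompSketch (hypothetical name) reverses the k bases of kmer and
// complements each one. With A=0, C=1, G=2, T=3 the complement of a
// 2-bit code is its bitwise negation restricted to 2 bits.
func revCompSketch(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (^kmer & 3) // complement the current last base
		kmer >>= 2
	}
	return rc
}

// canonicalSketch (hypothetical name) returns the smaller of kmer and
// its reverse complement.
func canonicalSketch(kmer uint64, k int) uint64 {
	if rc := revCompSketch(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

// normalizeCircularSketch (hypothetical name) returns the smallest of
// the k rotations of kmer. No reverse complement is involved.
func normalizeCircularSketch(kmer uint64, k int) uint64 {
	mask := uint64(1)<<(2*uint(k)) - 1
	best, rot := kmer, kmer
	for i := 1; i < k; i++ {
		// Move the leading base to the end of the word.
		rot = (rot<<2 | rot>>(2*uint(k-1))) & mask
		if rot < best {
			best = rot
		}
	}
	return best
}
```

For example, under these assumptions `GAT` has reverse complement `ATC`, so its canonical form is `ATC`. The rotations of `TCA` are `TCA`, `CAT`, and `ATC`, so its circular form is also `ATC`, but it was reached by rotation rather than by complementing.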
## Counting & Math Utilities
- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau's necklace formula**, with Euler's totient function and divisor enumeration.
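Moreau's formula counts the rotation classes of words of length k over an alphabet of size a as (1/k) · Σ φ(d) · a^(k/d), summed over the divisors d of k. The sketch below evaluates it for a = 4. It counts classes under rotation only; whether the package's `CanonicalCircularKmerCount` also merges reverse complements is not stated here, so this should be read as an illustration of the formula rather than a reimplementation. Both function names are hypothetical.

```go
package main

// totientSketch (hypothetical name) computes Euler's totient of n >= 1
// by trial division.
func totientSketch(n int) int {
	result := n
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 {
		result -= result / n
	}
	return result
}

// necklaceCountSketch (hypothetical name) returns the number of distinct
// rotation classes of DNA words of length k, for 1 <= k <= 31.
// 4^(k/d) is computed as a left shift by 2*(k/d) bits.
func necklaceCountSketch(k int) uint64 {
	var sum uint64
	for d := 1; d <= k; d++ {
		if k%d == 0 {
			sum += uint64(totientSketch(d)) << (2 * uint(k/d))
		}
	}
	return sum / uint64(k)
}
```

For k = 3 the divisors are 1 and 3, giving (1·64 + 2·4) / 3 = 24 classes; for k = 4 the sum is (256 + 16 + 8) / 4 = 70.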
## Performance & Safety
- All functions avoid heap allocations where possible (reusing buffers).
- Functions panic on invalid `k` or on length mismatches instead of returning incorrect results.
- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
## Use Cases
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
- Error-aware k-mer filtering in sequencing pipelines
- Low-complexity region detection via circular entropy normalization