obitools4/autodoc/docmd/pkg/obikmer/encodekmer.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00
# Semantic Description of `obikmer` Package
The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
## Core Encoding & Decoding
- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact 62-bit uint64 representations (2 bits/base), preserving top 2 bits for optional error markers.
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
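
The 2-bit packing and the canonical form described here can be sketched as follows. This is a minimal illustration, not the package's actual code: the lowercase helper names, their signatures, and the base-to-code mapping (A=0, C=1, G=2, T/U=3) are assumptions chosen so that numeric order of equal-length k-mers matches lexicographic order.

```go
package main

import "fmt"

// baseCode maps a nucleotide to a 2-bit code (assumed mapping: A=0, C=1, G=2, T/U=3).
func baseCode(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't', 'U', 'u':
		return 3, true
	}
	return 0, false
}

// encodeKmer (hypothetical helper) packs 1 to 31 bases into the low
// 2*len(seq) bits of a uint64, first base in the most significant position.
func encodeKmer(seq string) (uint64, bool) {
	if len(seq) == 0 || len(seq) > 31 {
		return 0, false
	}
	var kmer uint64
	for i := 0; i < len(seq); i++ {
		code, ok := baseCode(seq[i])
		if !ok {
			return 0, false
		}
		kmer = (kmer << 2) | code
	}
	return kmer, true
}

// decodeKmer (hypothetical helper) unpacks k bases from the low 2*k bits.
func decodeKmer(kmer uint64, k int) string {
	const alphabet = "ACGT"
	buf := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		buf[i] = alphabet[kmer&3]
		kmer >>= 2
	}
	return string(buf)
}

// reverseComplement reads bases from the low end and appends their
// complement (3 - code under the assumed mapping) to the result.
func reverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = (rc << 2) | (3 - (kmer & 3))
		kmer >>= 2
	}
	return rc
}

// canonicalKmer returns the smaller of a k-mer and its reverse complement.
func canonicalKmer(kmer uint64, k int) uint64 {
	if rc := reverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	kmer, _ := encodeKmer("TTTG")
	canon := canonicalKmer(kmer, 4)
	fmt.Println(decodeKmer(kmer, 4), decodeKmer(canon, 4)) // TTTG CAAA
}
```

With at most 31 bases, only the low 62 bits are used, which leaves the top 2 bits free for the error markers.
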
## Iterators (Memory-Efficient Streaming)
- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
- **`IterCanonicalKmersWithErrors`**: Same as `IterCanonicalKmers`, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for **odd k ≤ 31**.
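
Streaming k-mers without intermediate slices typically relies on a rolling update of the forward and reverse-complement encodings. The sketch below illustrates that idea with a callback; the function name, the callback signature, and the choice to skip any window containing a non-ACGT base are assumptions for illustration and may differ from what the package actually does.

```go
package main

import "fmt"

// iterCanonicalKmers (hypothetical helper) calls yield for each canonical
// k-mer of seq, left to right, and stops early when yield returns false.
// Windows that contain a base other than A/C/G/T/U are skipped
// (a simplification assumed here).
func iterCanonicalKmers(seq string, k int, yield func(uint64) bool) {
	if k < 1 || k > 31 {
		panic("k must be in [1, 31]")
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // keeps the low 2*k bits
	shift := 2 * uint(k-1)             // position of the first base
	var fwd, rev uint64
	valid := 0 // number of consecutive unambiguous bases seen
	for i := 0; i < len(seq); i++ {
		var code uint64
		switch seq[i] {
		case 'A', 'a':
			code = 0
		case 'C', 'c':
			code = 1
		case 'G', 'g':
			code = 2
		case 'T', 't', 'U', 'u':
			code = 3
		default:
			valid = 0 // restart the window after an ambiguous base
			continue
		}
		// Forward strand: push the new base in at the low end.
		fwd = (fwd<<2 | code) & mask
		// Reverse strand: push the complement in at the high end.
		rev = rev>>2 | (3-code)<<shift
		valid++
		if valid >= k {
			canon := fwd
			if rev < canon {
				canon = rev
			}
			if !yield(canon) {
				return
			}
		}
	}
}

func main() {
	var kmers []uint64
	iterCanonicalKmers("ACGTA", 3, func(km uint64) bool {
		kmers = append(kmers, km)
		return true
	})
	fmt.Println(kmers) // [6 6 44]
}
```

Each base costs a constant number of shifts and masks, and nothing is allocated inside the loop; the caller decides what to do with each value (for example, adding it to a bitmap).
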
## Error Handling & Markers
- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
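
As an illustration of the bit layout, the sketch below stores a 2-bit code in bits 62 and 63 of a uint64 and leaves the 62-bit k-mer payload untouched. The lowercase function names and signatures are hypothetical stand-ins, not the package's exported API.

```go
package main

import "fmt"

const (
	errorShift = 62                      // error code lives in bits 62-63
	errorMask  = uint64(3) << errorShift // mask selecting those two bits
)

// setKmerError (hypothetical) overwrites the top 2 bits with code (0 to 3).
func setKmerError(kmer uint64, code uint64) uint64 {
	return (kmer &^ errorMask) | ((code & 3) << errorShift)
}

// getKmerError (hypothetical) extracts the 2-bit error code.
func getKmerError(kmer uint64) uint64 {
	return kmer >> errorShift
}

// clearKmerError (hypothetical) zeroes the top 2 bits, leaving the payload.
func clearKmerError(kmer uint64) uint64 {
	return kmer &^ errorMask
}

func main() {
	tagged := setKmerError(27, 2)
	fmt.Println(getKmerError(tagged), clearKmerError(tagged)) // 2 27
}
```

Because the code occupies the most significant bits, any tagged k-mer compares greater than every untagged one, so a downstream filter could separate them with a single comparison.
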
## Reverse Complement & Circular Normalization
- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
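
To make the rotation-based form concrete, here is a minimal sketch that tries every rotation of a packed k-mer and keeps the smallest one. It assumes the same 2-bit packing as described earlier (A=0, C=1, G=2, T=3, first base in the most significant position); the helper name and the brute-force strategy are illustrative assumptions, not the package's implementation.

```go
package main

import "fmt"

// normalizeCircular (hypothetical helper) returns the smallest of the k
// rotations of a 2-bit packed k-mer, for 1 <= k <= 31.
func normalizeCircular(kmer uint64, k int) uint64 {
	if k < 1 || k > 31 {
		panic("k must be in [1, 31]")
	}
	mask := uint64(1)<<(2*uint(k)) - 1
	top := 2 * uint(k-1) // bit offset of the first base
	best := kmer
	cur := kmer
	for i := 1; i < k; i++ {
		// Rotate left by one base: the first base moves to the last position.
		cur = ((cur << 2) | (cur >> top)) & mask
		if cur < best {
			best = cur
		}
	}
	return best
}

func main() {
	// GATA packs to 0b10001100 = 140; its rotations are GATA, ATAG, TAGA, AGAT.
	fmt.Println(normalizeCircular(140, 4)) // 35, which is AGAT
}
```

All rotations of a k-mer map to the same value, which is what lets periodic, low-complexity words collapse into a small number of classes.
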
## Counting & Math Utilities
- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau's necklace formula**, with Euler's totient function and divisor enumeration.
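
For reference, Moreau's formula gives the number of rotation classes of words of length k over an alphabet of size 4 as N(k) = (1/k) · Σ φ(d) · 4^(k/d), summed over the divisors d of k. The sketch below evaluates it directly; the function signatures are assumptions, and it counts classes under rotation only, so it may not correspond exactly to what `CanonicalCircularKmerCount` returns.

```go
package main

import "fmt"

// totient returns Euler's phi(n) using trial division.
func totient(n uint64) uint64 {
	result := n
	for p := uint64(2); p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 {
		result -= result / n
	}
	return result
}

// necklaceCount (hypothetical signature) returns the number of distinct
// rotation classes of DNA words of length k, for 1 <= k <= 31.
func necklaceCount(k uint64) uint64 {
	if k < 1 || k > 31 {
		panic("k must be in [1, 31]")
	}
	var sum uint64
	for d := uint64(1); d <= k; d++ {
		if k%d == 0 {
			// phi(d) * 4^(k/d), with 4^(k/d) computed as a shift.
			sum += totient(d) << (2 * (k / d))
		}
	}
	return sum / k
}

func main() {
	for _, k := range []uint64{1, 2, 3, 4, 6} {
		fmt.Println(k, necklaceCount(k))
	}
}
```

For k = 4 the sum is 256 + 16 + 8 = 280, giving 70 classes, compared with 256 distinct 4-mers before normalization.
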
## Performance & Safety
- All functions avoid heap allocations where possible (reusing buffers).
- Functions panic on an invalid `k` or on length mismatches instead of returning incorrect results.
- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
## Use Cases
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
- Error-aware k-mer filtering in sequencing pipelines
- Low-complexity region detection via circular entropy normalization