mirror of
https://github.com/metabarcoding/obitools4.git
Semantic Description of obikmer Package
The obikmer package provides high-performance, zero-allocation utilities for k-mer manipulation in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
Core Encoding & Decoding
- EncodeKmer, DecodeKmer: Convert between DNA sequences and compact uint64 representations (2 bits/base, at most 62 bits), preserving the top 2 bits for optional error markers.
- EncodeCanonicalKmer, CanonicalKmer: Encode or normalize k-mers to their biological canonical form, the lexicographically smaller of a k-mer and its reverse complement.
Iterators (Memory-Efficient Streaming)
- IterKmers, IterCanonicalKmers: Stream all overlapping k-mers from a sequence without allocating intermediate slices, which is ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
- IterCanonicalKmersWithErrors: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for odd k ≤ 31.
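One common way to stream canonical k-mers without building a slice is a rolling window that updates the forward and reverse-complement codes one base at a time. The sketch below illustrates that idea only; `forEachCanonicalKmer` and `baseCode` are hypothetical names, and resetting the window on an ambiguous base is a simplification chosen for the example (the package's error-aware iterator is described as recording such bases in the top bits instead).

```go
package main

import "fmt"

// baseCode maps a nucleotide to a 2-bit code; ok is false for anything
// other than A/C/G/T/U (either case). Hypothetical helper.
func baseCode(b byte) (code uint64, ok bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't', 'U', 'u':
		return 3, true
	}
	return 0, false
}

// forEachCanonicalKmer calls yield for every overlapping canonical k-mer of
// seq, in order, and stops early if yield returns false. No slice is built.
// Assumption of this sketch: an ambiguous base resets the window, so k-mers
// overlapping it are skipped.
func forEachCanonicalKmer(seq []byte, k int, yield func(kmer uint64) bool) {
	if k < 1 || k > 31 {
		panic("k must be in 1..31")
	}
	mask := (uint64(1) << (2 * uint(k))) - 1
	shift := 2 * uint(k-1)
	var fwd, rev uint64
	valid := 0 // number of consecutive unambiguous bases seen
	for _, b := range seq {
		c, ok := baseCode(b)
		if !ok {
			fwd, rev, valid = 0, 0, 0
			continue
		}
		fwd = ((fwd << 2) | c) & mask     // append base on the right
		rev = (rev >> 2) | ((3 - c) << shift) // prepend complement on the left
		valid++
		if valid >= k {
			kmer := fwd
			if rev < kmer {
				kmer = rev
			}
			if !yield(kmer) {
				return
			}
		}
	}
}

func main() {
	forEachCanonicalKmer([]byte("ACGTT"), 3, func(kmer uint64) bool {
		fmt.Println(kmer)
		return true
	})
}
```

The callback shape `func(uint64) bool` matches what Go's range-over-func iterators expect, so the same body could be wrapped in an `iter.Seq[uint64]` on recent Go versions.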
Error Handling & Markers
- SetKmerError, GetKmerError, and ClearKmerError manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
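Storing a 2-bit marker in the top of a uint64 comes down to a few mask operations. The following is a hedged sketch with hypothetical lower-case names and signatures; the package's actual functions may differ in argument types or validation.

```go
package main

import "fmt"

const (
	errorShift = 62                      // the marker occupies bits 62 and 63
	errorMask  = uint64(3) << errorShift // 0b11 in the top two bits
)

// setKmerError stores a 2-bit error code (0..3) in the top bits of kmer,
// replacing any marker already present. Hypothetical signature.
func setKmerError(kmer uint64, code uint64) uint64 {
	return (kmer &^ errorMask) | ((code & 3) << errorShift)
}

// getKmerError extracts the 2-bit error code.
func getKmerError(kmer uint64) uint64 {
	return kmer >> errorShift
}

// clearKmerError returns the k-mer with its marker bits zeroed.
func clearKmerError(kmer uint64) uint64 {
	return kmer &^ errorMask
}

func main() {
	kmer := uint64(27) // ACGT under an A=0, C=1, G=2, T=3 encoding
	tagged := setKmerError(kmer, 2)
	fmt.Println(getKmerError(tagged), clearKmerError(tagged) == kmer)
}
```

A downstream filter can then keep only k-mers whose `getKmerError` result is zero without decoding them.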
Reverse Complement & Circular Normalization
- ReverseComplement, CanonicalKmer: Compute the biological reverse complement and canonical form.
- NormalizeCircular, EncodeCircularCanonicalKmer: Compute the circular canonical form, i.e. the lexicographically smallest rotation (used for low-complexity masking).
- Distinction: CanonicalKmer uses reverse complement, while NormalizeCircular uses rotation.
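To make the rotation-based form concrete, here is a minimal sketch that tries all k rotations of a 2-bit encoded k-mer and keeps the smallest. The names `encode` and `normalizeCircular` and the A=0, C=1, G=2, T=3 mapping are assumptions for the example; the package may use a faster algorithm.

```go
package main

import "fmt"

// encode packs an upper-case A/C/G/T string, 2 bits per base.
// Hypothetical helper; panics on any other character.
func encode(seq string) uint64 {
	var code uint64
	for i := 0; i < len(seq); i++ {
		var b uint64
		switch seq[i] {
		case 'A':
			b = 0
		case 'C':
			b = 1
		case 'G':
			b = 2
		case 'T':
			b = 3
		default:
			panic("unsupported base")
		}
		code = (code << 2) | b
	}
	return code
}

// normalizeCircular returns the numerically smallest rotation of a k-mer.
// It assumes the bits above position 2k are zero (no error marker set).
func normalizeCircular(code uint64, k int) uint64 {
	mask := (uint64(1) << (2 * uint(k))) - 1
	best, rot := code, code
	for i := 1; i < k; i++ {
		// Rotate left by one base: the leading base moves to the end.
		top := rot >> (2 * uint(k-1))
		rot = ((rot << 2) | top) & mask
		if rot < best {
			best = rot
		}
	}
	return best
}

func main() {
	// GTA, TAG and AGT are rotations of each other and share one class.
	fmt.Println(normalizeCircular(encode("GTA"), 3) == encode("AGT"))
}
```

Note that this considers rotations only; it never takes a reverse complement, which is exactly the distinction drawn in the list above.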
Counting & Math Utilities
- CanonicalCircularKmerCount, necklaceCount, etc.: Compute exact counts of unique circular k-mer equivalence classes using Moreau's necklace formula, with Euler's totient function and divisor enumeration.
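Moreau's formula counts the rotation classes of length-k words over an alphabet of size a as (1/k) · Σ φ(d) · a^(k/d), summed over the divisors d of k. The sketch below evaluates it directly; the function names and signatures are illustrative assumptions, and it covers rotation classes only, not any additional merging with reverse complements.

```go
package main

import "fmt"

// totient returns Euler's phi(n) by trial-division factorisation.
func totient(n int) int {
	result := n
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 {
		result -= result / n
	}
	return result
}

// ipow computes base^exp with plain integer arithmetic.
func ipow(base uint64, exp int) uint64 {
	out := uint64(1)
	for i := 0; i < exp; i++ {
		out *= base
	}
	return out
}

// necklaceCount returns the number of distinct rotation classes of
// length-k words over an alphabet of the given size:
//   (1/k) * sum over d | k of phi(d) * alphabet^(k/d)
// For alphabet = 4 the sum fits in a uint64 for k <= 31.
func necklaceCount(k, alphabet int) uint64 {
	var sum uint64
	for d := 1; d <= k; d++ {
		if k%d == 0 {
			sum += uint64(totient(d)) * ipow(uint64(alphabet), k/d)
		}
	}
	return sum / uint64(k)
}

func main() {
	for k := 1; k <= 6; k++ {
		fmt.Println(k, necklaceCount(k, 4))
	}
}
```

For example, with k = 2 the 16 dinucleotides fall into (16 + 4) / 2 = 10 rotation classes: the 4 homodimers plus 6 pairs such as AC/CA.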
Performance & Safety
- All functions avoid heap allocations where possible (reusing buffers).
- Panics on invalid k or length mismatches for correctness.
- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via __single_base_code_err__.
Use Cases
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
- Error-aware k-mer filtering in sequencing pipelines
- Low-complexity region detection via circular entropy normalization