mirror of
https://github.com/metabarcoding/obitools4.git
# Semantic Description of `obikmer` Package
The `obikmer` package provides high-performance, low-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications such as genome indexing, assembly, and error correction.
## Core Encoding & Decoding
- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact `uint64` representations (2 bits per base, so up to 31 bases fit in the low 62 bits), reserving the top 2 bits for optional error markers.
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
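
The 2-bit packing and the canonical form described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function names (`encode2bit`, `decode2bit`, `revComp2bit`, `canonical2bit`) are hypothetical, and the code assignment A=0, C=1, G=2, T/U=3 is an assumption chosen so that numeric order on the encoded value matches lexicographic order on the bases.

```go
package main

import "fmt"

// baseCode maps one nucleotide to a 2-bit code.
// Assumed mapping for this sketch: A=0, C=1, G=2, T/U=3.
func baseCode(b byte) (uint64, bool) {
	switch b | 0x20 { // fold ASCII upper case to lower case
	case 'a':
		return 0, true
	case 'c':
		return 1, true
	case 'g':
		return 2, true
	case 't', 'u':
		return 3, true
	}
	return 0, false
}

// encode2bit packs a k-mer of 1..31 bases into the low 2k bits of a uint64.
// It reports false on an invalid length or a non-ACGTU base.
func encode2bit(seq string) (uint64, bool) {
	if len(seq) == 0 || len(seq) > 31 {
		return 0, false
	}
	var kmer uint64
	for i := 0; i < len(seq); i++ {
		code, ok := baseCode(seq[i])
		if !ok {
			return 0, false
		}
		kmer = kmer<<2 | code
	}
	return kmer, true
}

// decode2bit unpacks k bases from the low 2k bits of kmer.
func decode2bit(kmer uint64, k int) string {
	const alphabet = "ACGT"
	out := make([]byte, k)
	for i := k - 1; i >= 0; i-- {
		out[i] = alphabet[kmer&3]
		kmer >>= 2
	}
	return string(out)
}

// revComp2bit reverses the base order and complements each base.
// With the assumed mapping, the complement of code c is 3-c.
func revComp2bit(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - (kmer & 3))
		kmer >>= 2
	}
	return rc
}

// canonical2bit returns the smaller of a k-mer and its reverse complement.
func canonical2bit(kmer uint64, k int) uint64 {
	if rc := revComp2bit(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	kmer, _ := encode2bit("GTT")
	fmt.Println(kmer, decode2bit(kmer, 3)) // 47 GTT
	can := canonical2bit(kmer, 3)
	fmt.Println(can, decode2bit(can, 3)) // 1 AAC
}
```

Because both strands of the same locus map to one value, a canonical encoding of this kind lets a counter treat `GTT` and `AAC` as the same key.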
## Iterators (Memory-Efficient Streaming)
- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
- **`IterCanonicalKmersWithErrors`**: Same as `IterCanonicalKmers`, but also detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for **odd k ≤ 31**.
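
The streaming idea behind these iterators can be illustrated with a rolling 2-bit window that visits every overlapping k-mer without building a slice of results. This is a sketch under stated assumptions: `forEachKmer` and its callback signature are hypothetical, the A=0, C=1, G=2, T/U=3 mapping is assumed, and windows containing an ambiguous base are simply skipped here, which is a simplification and not the error-counting behaviour of `IterCanonicalKmersWithErrors`.

```go
package main

import "fmt"

// forEachKmer calls yield once per overlapping k-mer of seq, passing the
// start position and the 2-bit encoded k-mer. No intermediate slice is built.
// Any window that contains a non-ACGTU base is skipped (sketch-only choice).
func forEachKmer(seq string, k int, yield func(pos int, kmer uint64)) {
	if k < 1 || k > 31 {
		panic("k must be in [1, 31]")
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // keeps only the low 2k bits
	var kmer uint64
	valid := 0 // consecutive unambiguous bases ending at the current position
	for i := 0; i < len(seq); i++ {
		var code uint64
		switch seq[i] | 0x20 { // fold ASCII upper case to lower case
		case 'a':
			code = 0
		case 'c':
			code = 1
		case 'g':
			code = 2
		case 't', 'u':
			code = 3
		default:
			// Ambiguous or unknown base: restart the window after it.
			valid = 0
			kmer = 0
			continue
		}
		// Shift the new base in and drop the base that left the window.
		kmer = (kmer<<2 | code) & mask
		valid++
		if valid >= k {
			yield(i-k+1, kmer)
		}
	}
}

func main() {
	forEachKmer("ACGTA", 3, func(pos int, kmer uint64) {
		fmt.Println(pos, kmer)
	})
	// Prints:
	// 0 6
	// 1 27
	// 2 44
}
```

Each step costs one shift, one OR, and one mask, so the cost per k-mer does not depend on k.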
## Error Handling & Markers
- **`SetKmerError`, `GetKmerError`, `ClearKmerError`**: Manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
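
Storing a small code in the top 2 bits of a `uint64` comes down to a shift and a mask. The helpers below (`setErr`, `getErr`, `clearErr`) are hypothetical stand-ins written for illustration; they show the bit layout described above, not the package's actual signatures.

```go
package main

import "fmt"

const (
	errShift = 62                    // error code lives in bits 62 and 63
	errMask  = uint64(3) << errShift // mask selecting those two bits
)

// setErr stores code (0..3) in the top 2 bits, leaving the k-mer bits intact.
func setErr(kmer uint64, code uint64) uint64 {
	return (kmer &^ errMask) | (code&3)<<errShift
}

// getErr reads the error code back from the top 2 bits.
func getErr(kmer uint64) uint64 {
	return kmer >> errShift
}

// clearErr zeroes the top 2 bits, returning the plain encoded k-mer.
func clearErr(kmer uint64) uint64 {
	return kmer &^ errMask
}

func main() {
	tagged := setErr(27, 2)       // 27 is ACGT under an A=0,C=1,G=2,T=3 mapping
	fmt.Println(getErr(tagged))   // 2
	fmt.Println(clearErr(tagged)) // 27
}
```

A downstream filter can then keep only k-mers whose error code is zero without decoding the sequence itself.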
## Reverse Complement & Circular Normalization
- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
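
To make the rotation-based form concrete, the sketch below finds the smallest rotation of a 2-bit encoded k-mer by rotating the low 2k bits one base at a time. The name `smallestRotation` is hypothetical, and the A=0, C=1, G=2, T=3 mapping is an assumption under which the numerically smallest value is also the lexicographically smallest rotation.

```go
package main

import "fmt"

// smallestRotation returns the minimum over all k rotations of a k-mer
// stored in the low 2k bits of kmer (1 <= k <= 31).
func smallestRotation(kmer uint64, k int) uint64 {
	if k < 1 || k > 31 {
		panic("k must be in [1, 31]")
	}
	width := 2 * uint(k)
	mask := uint64(1)<<width - 1
	best, cur := kmer, kmer
	for i := 1; i < k; i++ {
		// Rotate left by one base: the leading base moves to the end.
		cur = ((cur << 2) | (cur >> (width - 2))) & mask
		if cur < best {
			best = cur
		}
	}
	return best
}

func main() {
	// GTA = 10 11 00 = 44; its rotations are TAG (50) and AGT (11).
	fmt.Println(smallestRotation(44, 3)) // 11
}
```

All rotations of a k-mer share one representative under this scheme, whereas the reverse-complement canonical form pairs a k-mer only with its opposite strand.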
## Counting & Math Utilities
- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau’s necklace formula**, with Euler's totient function and divisor enumeration.
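
As a worked illustration of necklace counting, the number of rotation classes of length-k words over a 4-letter alphabet is (1/k) · Σ φ(d) · 4^(k/d), summed over the divisors d of k. The sketch below evaluates that sum directly. The names `totient` and `necklaces4` are hypothetical, and the code counts rotation classes only; whether the package's functions also merge reverse complements is not stated above, so that is left out.

```go
package main

import "fmt"

// totient computes Euler's totient function by trial division.
func totient(n int) int {
	result := n
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 { // one prime factor larger than sqrt(n) remains
		result -= result / n
	}
	return result
}

// necklaces4 counts the rotation classes of length-k words over {A,C,G,T}:
// (1/k) * sum over divisors d of k of totient(d) * 4^(k/d).
// k is limited to 31 so that 4^k fits in a uint64.
func necklaces4(k int) uint64 {
	if k < 1 || k > 31 {
		panic("k must be in [1, 31]")
	}
	var sum uint64
	for d := 1; d <= k; d++ {
		if k%d == 0 {
			pow := uint64(1) << (2 * uint(k/d)) // 4^(k/d)
			sum += uint64(totient(d)) * pow
		}
	}
	return sum / uint64(k)
}

func main() {
	for k := 1; k <= 4; k++ {
		fmt.Println(k, necklaces4(k))
	}
	// Prints:
	// 1 4
	// 2 10
	// 3 24
	// 4 70
}
```

For k = 2, for example, the 16 dinucleotides collapse to 10 classes: the 4 homopolymers plus 6 unordered pairs of distinct bases.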
## Performance & Safety
- All functions avoid heap allocations where possible (reusing buffers).
- Functions panic on an invalid `k` or on length mismatches instead of returning incorrect results.
- Input is case-insensitive (A/a, T/t…); ambiguous bases are handled via `__single_base_code_err__`.
## Use Cases
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
- Error-aware k-mer filtering in sequencing pipelines
- Low-complexity region detection via circular entropy normalization