⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,36 @@
+# Obikmer: Efficient K-mer Encoding and Manipulation in Go
+
+This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
+
+## Core Functionalities
+
+### K-mer Encoding (`EncodeKmers`, `IterKmers`)
+Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Handles sequences up to 31-mers (62 bits), with validation for invalid `k` values.
+
+### Reverse Complement (`ReverseComplement`)
+Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: `RC(RC(x)) = x`.
+
+### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
+Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
+
+### Super *k*-mers Extraction (`ExtractSuperKmers`)
+Groups overlapping *k*-mers sharing the same minimizer (minimal *m*-mer in sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and minimizer values, all canonicalized.
+
+### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
+Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
+
+## Key Features
+
+- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
+- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
+- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
+- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
+
+## Use Cases
+
+- Genome assembly &DBG construction  
+- Minimizer-based sketching (e.g., *Mash*, *Sourmash*)  
+- Error-aware k-mer counting & filtering  
+- Strand-unbiased sequence comparison  
+
+All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.