mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,36 @@
# Obikmer: Efficient K-mer Encoding and Manipulation in Go
This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
## Core Functionalities
### K-mer Encoding (`EncodeKmers`, `IterKmers`)
Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` values using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Supports *k* up to 31 (62 bits per *k*-mer) and validates `k` before encoding.
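The 2-bit packing described above can be sketched as a rolling update: shift the previous *k*-mer left by two bits, add the new base, and mask to `2k` bits. This is a minimal illustration, not the package's implementation. The names `encodeBase` and `encodeKmersSketch` are hypothetical, and returning an error on an invalid base is an assumption, since the README does not say how such bases are treated.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeBase maps a nucleotide to its 2-bit code (A=00, C=01, G=10, T/U=11).
// The boolean is false for any other byte.
func encodeBase(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't', 'U', 'u':
		return 3, true
	}
	return 0, false
}

// encodeKmersSketch returns every k-mer of seq packed into a uint64.
// Hypothetical helper: error handling for invalid bases is an assumption.
func encodeKmersSketch(seq []byte, k int) ([]uint64, error) {
	if k < 1 || k > 31 {
		return nil, errors.New("k must be in [1, 31]")
	}
	if len(seq) < k {
		return nil, nil // no complete k-mer fits in the sequence
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // keeps the low 2k bits
	out := make([]uint64, 0, len(seq)-k+1)
	var kmer uint64
	for i, b := range seq {
		code, ok := encodeBase(b)
		if !ok {
			return nil, fmt.Errorf("invalid base %q at position %d", b, i)
		}
		kmer = (kmer<<2 | code) & mask // slide the window by one base
		if i >= k-1 {
			out = append(out, kmer)
		}
	}
	return out, nil
}

func main() {
	kmers, err := encodeKmersSketch([]byte("ACGTA"), 4)
	if err != nil {
		panic(err)
	}
	for _, km := range kmers {
		fmt.Printf("%08b\n", km) // ACGT, then CGTA, two bits per base
	}
}
```

Because each step reuses the previous window, the cost per *k*-mer is constant regardless of `k`.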
### Reverse Complement (`ReverseComplement`)
Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and is an involution: `RC(RC(x)) = x`.
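One way to obtain a constant-time reverse complement with this encoding is sketched below. Complementing a base is a bitwise NOT of its two bits (A=00 ↔ T=11, C=01 ↔ G=10), and reversing the base order is a fixed sequence of swaps followed by a shift. The function name `reverseComplementSketch` and the assumption that the error tag sits in bits 62-63 are illustrative only; the package's own code may differ.

```go
package main

import (
	"fmt"
	"math/bits"
)

// errorBits marks the top two bits, assumed here to hold the error tag.
const errorBits = uint64(3) << 62

// reverseComplementSketch reverse-complements a k-mer (1 <= k <= 31) stored
// in the low 2k bits and carries the assumed error tag over unchanged.
func reverseComplementSketch(kmer uint64, k int) uint64 {
	tag := kmer & errorBits
	x := ^(kmer &^ errorBits) // complement every 2-bit base

	const m2 = 0x3333333333333333
	const m4 = 0x0F0F0F0F0F0F0F0F
	x = (x>>2)&m2 | (x&m2)<<2  // swap adjacent 2-bit groups
	x = (x>>4)&m4 | (x&m4)<<4  // swap nibbles within each byte
	x = bits.ReverseBytes64(x) // reverse byte order: all 32 groups reversed

	// The k reversed bases now occupy the top 2k bits; move them down.
	return x>>(64-2*uint(k)) | tag
}

func main() {
	aaac := uint64(1) // AAAC with k = 4
	rc := reverseComplementSketch(aaac, 4)
	fmt.Printf("%08b\n", rc) // GTTT = 10 11 11 11
	fmt.Println(reverseComplementSketch(rc, 4) == aaac)
}
```

The number of operations does not depend on `k`, which is what makes the computation constant time.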
### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
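As a sketch of the idea: with the codes A < C < G < T and a fixed length `k`, lexicographic order on sequences coincides with numeric order on the packed integers, so the canonical form is simply the smaller of the two values. The helpers `revCompLoop` and `canonicalSketch` are hypothetical names, and a plain loop is used for the reverse complement to keep the example short.

```go
package main

import "fmt"

// revCompLoop reverse-complements a k-mer held in the low 2k bits.
// It is a simple O(k) loop used only to keep this sketch self-contained.
func revCompLoop(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - kmer&3) // complement the last base, append it
		kmer >>= 2
	}
	return rc
}

// canonicalSketch returns the smaller of a k-mer and its reverse complement.
func canonicalSketch(kmer uint64, k int) uint64 {
	if rc := revCompLoop(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	aaac := uint64(0x01) // AAAC
	gttt := uint64(0xBF) // GTTT, the reverse complement of AAAC
	// Both strands map to the same canonical value.
	fmt.Println(canonicalSketch(aaac, 4) == canonicalSketch(gttt, 4))
}
```

Both strands of the same locus therefore produce identical keys, which is what allows strand-agnostic counting.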
### Super *k*-mer Extraction (`ExtractSuperKmers`)
Groups consecutive overlapping *k*-mers that share the same minimizer (the smallest *m*-mer in the sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and canonicalized minimizer values.
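The grouping can be illustrated with a deliberately naive sketch that recomputes the minimizer of every *k*-mer and merges consecutive *k*-mers whose minimizers are equal. Everything here is hypothetical: the `superKmer` type, its half-open `[Start, End)` convention, the function names, and the choice to compare minimizers by value rather than by position. It also assumes clean upper-case A/C/G/T input, and a production implementation would use a sliding-window minimum instead of rescanning each window.

```go
package main

import "fmt"

// superKmer is a hypothetical result type: the half-open span [Start, End)
// covered by a run of k-mers that share Minimizer.
type superKmer struct {
	Start, End int
	Minimizer  uint64
}

// base returns the 2-bit code of an upper-case nucleotide.
func base(b byte) uint64 {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0 // 'A'; the sketch assumes clean input
}

// canonicalMmer encodes seq[i:i+m] and its reverse complement in one pass
// and returns the smaller of the two.
func canonicalMmer(seq []byte, i, m int) uint64 {
	var fwd, rev uint64
	for j := 0; j < m; j++ {
		c := base(seq[i+j])
		fwd = fwd<<2 | c
		rev |= (3 - c) << (2 * uint(j)) // first base becomes the last
	}
	if rev < fwd {
		return rev
	}
	return fwd
}

// extractSuperKmersSketch merges consecutive k-mers with equal minimizers.
func extractSuperKmersSketch(seq []byte, k, m int) []superKmer {
	if m < 1 || m >= k || k > 31 || len(seq) < k {
		return nil
	}
	var out []superKmer
	for i := 0; i+k <= len(seq); i++ {
		best := canonicalMmer(seq, i, m) // minimizer of the k-mer at i
		for j := i + 1; j+m <= i+k; j++ {
			if v := canonicalMmer(seq, j, m); v < best {
				best = v
			}
		}
		if n := len(out); n > 0 && out[n-1].Minimizer == best {
			out[n-1].End = i + k // extend the current super k-mer
		} else {
			out = append(out, superKmer{Start: i, End: i + k, Minimizer: best})
		}
	}
	return out
}

func main() {
	for _, sk := range extractSuperKmersSketch([]byte("CCAACCGG"), 4, 2) {
		fmt.Printf("[%d,%d) minimizer=%d\n", sk.Start, sk.End, sk.Minimizer)
	}
}
```

Consecutive super *k*-mers overlap by up to `k-1` bases, because each one spans every *k*-mer that contributed to it.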
### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
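A minimal sketch of this tagging scheme follows. Since a 31-mer needs only 62 bits, bits 62 and 63 are free to carry a small code. The function names carry a `Sketch` suffix to mark them as illustrative stand-ins, and the meaning attached to each of the four possible codes is not specified by the README, so none is assumed here.

```go
package main

import "fmt"

const (
	errorShift = 62
	errorMask  = uint64(3) << errorShift // top two bits
	seqMask    = ^errorMask              // low 62 bits: sequence payload
)

// setKmerErrorSketch stores a 2-bit code in the top bits, replacing any
// previous code and leaving the sequence bits untouched.
func setKmerErrorSketch(kmer uint64, code uint8) uint64 {
	return kmer&seqMask | uint64(code&3)<<errorShift
}

// getKmerErrorSketch reads the 2-bit code back.
func getKmerErrorSketch(kmer uint64) uint8 {
	return uint8(kmer >> errorShift)
}

// clearKmerErrorSketch returns the bare sequence payload.
func clearKmerErrorSketch(kmer uint64) uint64 {
	return kmer & seqMask
}

func main() {
	kmer := uint64(0x1B) // ACGT
	tagged := setKmerErrorSketch(kmer, 2)
	fmt.Println(getKmerErrorSketch(tagged))           // the stored code
	fmt.Println(clearKmerErrorSketch(tagged) == kmer) // payload is intact
}
```

Keeping the tag in the same word means a tagged *k*-mer still fits in a single `uint64` slot of a slice or map key.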
## Key Features
- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
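The buffer-reuse idea from the list above can be sketched as follows: the caller passes a pointer to a slice, the function truncates it to length zero and appends into the same backing array, so repeated calls avoid fresh allocations. The function `encodeInto` and its signature are hypothetical and only illustrate the pattern; the package's actual parameter order and return types are not given in this README. The sketch assumes clean upper-case A/C/G/T input.

```go
package main

import "fmt"

// code2 returns the 2-bit code of an upper-case nucleotide (A=0, C=1, G=2, T=3).
func code2(b byte) uint64 {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0
}

// encodeInto encodes all k-mers of seq. When buffer is non-nil its backing
// array is reused; when it is nil a new slice is allocated as needed.
func encodeInto(seq []byte, k int, buffer *[]uint64) []uint64 {
	var out []uint64
	if buffer != nil {
		out = (*buffer)[:0] // keep capacity, discard old contents
	}
	if k < 1 || k > 31 || len(seq) < k {
		return out
	}
	mask := uint64(1)<<(2*uint(k)) - 1
	var kmer uint64
	for i, b := range seq {
		kmer = (kmer<<2 | code2(b)) & mask
		if i >= k-1 {
			out = append(out, kmer)
		}
	}
	if buffer != nil {
		*buffer = out // publish the (possibly grown) slice to the caller
	}
	return out
}

func main() {
	buf := make([]uint64, 0, 64)
	for _, s := range []string{"ACGTA", "TTTTT"} {
		kmers := encodeInto([]byte(s), 4, &buf)
		fmt.Println(len(kmers), cap(buf)) // capacity stays put across calls
	}
}
```

Note that the returned slice aliases the buffer, so its contents are overwritten by the next call that reuses the same buffer.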
## Use Cases
- Genome assembly and de Bruijn graph (DBG) construction
- Minimizer-based sketching (e.g., *Mash*, *Sourmash*)
- Error-aware k-mer counting & filtering
- Strand-unbiased sequence comparison

All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.