obitools4/autodoc/docmd/pkg/obikmer/encodekmer_test.md


Obikmer: Efficient K-mer Encoding and Manipulation in Go

This package provides high-performance utilities for DNA sequence analysis using k-mers—contiguous substrings of length k. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-k-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.

Core Functionalities

K-mer Encoding (EncodeKmers, IterKmers)

Encodes DNA sequences (A/C/G/T/U, case-insensitive) into uint64 values using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Supports k-mers of up to 31 bases (62 bits) and validates k.
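
The 2-bit sliding-window encoding can be sketched as follows. This is a minimal illustration, not the package's implementation: encodeKmersSketch and encodeBase are hypothetical names, the signature is an assumption, and returning an error on an invalid base is a choice made for this example only.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeBase maps a nucleotide to its 2-bit code (A=00, C=01, G=10, T/U=11).
// The second return value is false for any other byte.
func encodeBase(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't', 'U', 'u':
		return 3, true
	}
	return 0, false
}

// encodeKmersSketch is a hypothetical stand-in for the package's EncodeKmers.
// It returns every k-mer of seq as a uint64, sliding one base at a time.
func encodeKmersSketch(seq []byte, k int) ([]uint64, error) {
	if k < 1 || k > 31 {
		return nil, errors.New("k must be in [1, 31]")
	}
	if len(seq) < k {
		return nil, nil // no complete k-mer fits
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // keeps only the low 2k bits
	out := make([]uint64, 0, len(seq)-k+1)
	var kmer uint64
	for i, b := range seq {
		code, ok := encodeBase(b)
		if !ok {
			// Assumption: this sketch rejects invalid bases outright.
			return nil, fmt.Errorf("invalid base %q at position %d", b, i)
		}
		kmer = (kmer<<2 | code) & mask // push the new base, drop the oldest
		if i >= k-1 {
			out = append(out, kmer)
		}
	}
	return out, nil
}

func main() {
	kmers, err := encodeKmersSketch([]byte("ACGTA"), 3)
	if err != nil {
		panic(err)
	}
	for _, km := range kmers {
		fmt.Printf("%06b\n", km) // ACG, CGT, GTA as 6-bit patterns
	}
}
```

Each new base costs one shift, one OR, and one mask, so the whole sequence is encoded in a single pass.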

Reverse Complement (ReverseComplement)

Computes the reverse complement of a k-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: RC(RC(x)) = x.
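
One way to get a constant-time reverse complement under this encoding is sketched below. With A=00 and T=11, C=01 and G=10, complementing a base is a bitwise NOT of its two bits, and reversing the base order is a fixed sequence of swaps. The function name and signature are hypothetical, and the package may use a different bit trick; the example also assumes the top 2 bits carry the error tag described in this document.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseComplementSketch is a hypothetical illustration, not the package's
// ReverseComplement. It assumes 1 <= k <= 31 and that the top 2 bits of kmer
// hold an error tag that must survive unchanged.
func reverseComplementSketch(kmer uint64, k int) uint64 {
	const tagMask = uint64(3) << 62
	tag := kmer & tagMask

	x := ^kmer // complement every base (code XOR 3)

	// Reverse the order of the 32 two-bit groups in the word.
	x = (x>>2)&0x3333333333333333 | (x&0x3333333333333333)<<2 // swap pairs within nibbles
	x = (x>>4)&0x0F0F0F0F0F0F0F0F | (x&0x0F0F0F0F0F0F0F0F)<<4 // swap nibbles within bytes
	x = bits.ReverseBytes64(x)                                // reverse the bytes

	// The k bases now sit in the top 2k bits; move them back down.
	// The shift also discards the (complemented) tag and padding bits.
	x >>= 64 - 2*uint(k)

	return x | tag
}

func main() {
	acg := uint64(0b000110) // "ACG"
	rc := reverseComplementSketch(acg, 3)
	fmt.Printf("%06b\n", rc)                                 // "CGT"
	fmt.Println(reverseComplementSketch(rc, 3) == acg)       // involution
}
```

The number of operations is fixed regardless of k, which is what makes the computation constant time.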

Canonical K-mers (CanonicalKmer, EncodeCanonicalKmers)

Returns the lexicographically smaller of a k-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (CanonicalKmer) and full-sequence canonical encoding.
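
Because the codes are ordered A < C < G < T, comparing two encoded k-mers of the same length as integers gives the same result as comparing them lexicographically. A minimal sketch of canonicalization under that observation follows; canonicalSketch and revComp are hypothetical names, revComp is a simple loop kept only to make the example self-contained, and error-tag bits are assumed to be clear.

```go
package main

import "fmt"

// revComp is a simple loop-based reverse complement (one step per base).
// It assumes the top 2 bits of kmer are zero.
func revComp(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (^kmer & 3) // append the complement of the last base
		kmer >>= 2
	}
	return rc
}

// canonicalSketch is a hypothetical stand-in for the package's CanonicalKmer.
// It returns the smaller of the k-mer and its reverse complement.
func canonicalSketch(kmer uint64, k int) uint64 {
	if rc := revComp(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	acg := uint64(0b000110) // "ACG"
	cgt := uint64(0b011011) // "CGT", the reverse complement of "ACG"
	// Both strands map to the same canonical value.
	fmt.Println(canonicalSketch(acg, 3) == canonicalSketch(cgt, 3))
}
```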

Super K-mer Extraction (ExtractSuperKmers)

Groups consecutive overlapping k-mers that share the same minimizer (the smallest m-mer in the sliding window) into contiguous regions ("super k-mers"). The output includes start/end positions and minimizer values, all canonicalized.
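
The grouping can be sketched as below. This is a deliberately simple version written for clarity rather than speed: the struct, its fields, the half-open [Start, End) convention, and returning nil for invalid parameters are all assumptions for this example, not the package's SuperKmer type or ExtractSuperKmers behavior. Input is assumed to contain only valid bases.

```go
package main

import "fmt"

// superKmerSketch is a hypothetical result type for this example.
type superKmerSketch struct {
	Minimizer  uint64 // canonical m-mer value shared by the grouped k-mers
	Start, End int    // half-open range [Start, End) in the sequence
}

// baseCode assumes valid input; any byte not listed maps to 0.
var baseCode = [256]uint64{'C': 1, 'c': 1, 'G': 2, 'g': 2, 'T': 3, 't': 3, 'U': 3, 'u': 3}

// canonicalMmers returns the canonical value of every m-mer in seq,
// maintaining forward and reverse-complement encodings in one pass.
func canonicalMmers(seq []byte, m int) []uint64 {
	mask := uint64(1)<<(2*uint(m)) - 1
	shift := 2 * uint(m-1)
	out := make([]uint64, 0, len(seq)-m+1)
	var fwd, rev uint64
	for i, b := range seq {
		c := baseCode[b]
		fwd = (fwd<<2 | c) & mask
		rev = rev>>2 | (3-c)<<shift
		if i >= m-1 {
			if rev < fwd {
				out = append(out, rev)
			} else {
				out = append(out, fwd)
			}
		}
	}
	return out
}

// extractSuperKmersSketch merges consecutive k-mers with equal minimizers.
func extractSuperKmersSketch(seq []byte, k, m int) []superKmerSketch {
	if m < 1 || m >= k || k > 31 || len(seq) < k {
		return nil
	}
	mm := canonicalMmers(seq, m)
	w := k - m + 1 // number of m-mers inside one k-mer
	var out []superKmerSketch
	for i := 0; i+k <= len(seq); i++ {
		mini := mm[i]
		for _, v := range mm[i+1 : i+w] {
			if v < mini {
				mini = v
			}
		}
		if n := len(out); n > 0 && out[n-1].Minimizer == mini {
			out[n-1].End = i + k // extend the current super k-mer
		} else {
			out = append(out, superKmerSketch{Minimizer: mini, Start: i, End: i + k})
		}
	}
	return out
}

func main() {
	for _, sk := range extractSuperKmersSketch([]byte("AAAACCCC"), 4, 2) {
		fmt.Printf("%+v\n", sk)
	}
}
```

Rescanning each window costs O(k) per k-mer; a production implementation would more likely track the window minimum incrementally.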

Error Marking (SetKmerError, GetKmerError, etc.)

Uses the top 2 bits of a uint64 to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying k-mer and work seamlessly with canonicalization/RC.
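
The tagging scheme amounts to masking and shifting the two highest bits. The sketch below illustrates the idea only: the function names, the uint8 error code, and the signatures are assumptions, not the package's SetKmerError or GetKmerError.

```go
package main

import "fmt"

const (
	errorShift = 62
	errorMask  = uint64(3) << errorShift // top 2 bits: error tag
	seqMask    = ^errorMask              // low 62 bits: sequence data
)

// setErrorSketch stores a 2-bit code (0..3) in the top bits, leaving the
// sequence bits untouched. Hypothetical signature.
func setErrorSketch(kmer uint64, code uint8) uint64 {
	return kmer&seqMask | uint64(code&3)<<errorShift
}

// getErrorSketch reads the 2-bit code back.
func getErrorSketch(kmer uint64) uint8 {
	return uint8(kmer >> errorShift)
}

// clearErrorSketch returns the k-mer with the tag removed.
func clearErrorSketch(kmer uint64) uint64 {
	return kmer & seqMask
}

func main() {
	cgt := uint64(0b011011) // "CGT"
	tagged := setErrorSketch(cgt, 2)
	fmt.Println(getErrorSketch(tagged))          // the stored code
	fmt.Println(clearErrorSketch(tagged) == cgt) // sequence bits are intact
}
```

Since a 31-mer needs only 62 bits, the tag never collides with sequence data.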

Key Features

  • Memory Efficiency: Reusable buffers via optional *[]uint64 or *[]SuperKmer parameters.
  • Edge Case Handling: Gracefully handles empty sequences, k > len(seq), invalid parameters (m ≥ k), and max-length constraints.
  • Performance: Optimized for speed—benchmarks included for all major functions (e.g., BenchmarkEncodeKmers, BenchmarkExtractSuperKmers).
  • Comprehensive Testing: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
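
The buffer-reuse pattern mentioned under Memory Efficiency can be sketched as follows. The function name and signature are hypothetical and only illustrate how an optional *[]uint64 parameter might avoid repeated allocations; for brevity, the example treats any unrecognized byte as A.

```go
package main

import "fmt"

// codes maps bases to 2-bit values; unlisted bytes map to 0 (a simplification).
var codes = [256]uint64{'C': 1, 'c': 1, 'G': 2, 'g': 2, 'T': 3, 't': 3, 'U': 3, 'u': 3}

// encodeKmersInto is a hypothetical illustration. When buf is non-nil its
// backing array is reused; when buf is nil a fresh slice is allocated.
func encodeKmersInto(seq []byte, k int, buf *[]uint64) []uint64 {
	var out []uint64
	if buf != nil {
		out = (*buf)[:0] // keep capacity, discard old contents
	}
	if k < 1 || k > 31 || len(seq) < k {
		return out
	}
	mask := uint64(1)<<(2*uint(k)) - 1
	var kmer uint64
	for i, b := range seq {
		kmer = (kmer<<2 | codes[b]) & mask
		if i >= k-1 {
			out = append(out, kmer)
		}
	}
	if buf != nil {
		*buf = out // remember any growth for the next call
	}
	return out
}

func main() {
	buf := make([]uint64, 0, 64)
	for _, s := range []string{"ACGTACGT", "TTTTT"} {
		kmers := encodeKmersInto([]byte(s), 3, &buf)
		fmt.Println(len(kmers), cap(kmers))
	}
}
```

With this pattern, each call overwrites the results of the previous one, so callers that need to keep earlier k-mers must copy them first.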

Use Cases

  • Genome assembly & de Bruijn graph (DBG) construction
  • Minimizer-based sketching (e.g., Mash, Sourmash)
  • Error-aware k-mer counting & filtering
  • Strand-unbiased sequence comparison

All functions operate on []byte DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.