mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00


Obikmer: Efficient K-mer Encoding and Manipulation in Go

This package provides high-performance utilities for DNA sequence analysis using k-mers (contiguous substrings of length k). It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-k-mer extraction, and error tagging, all built on 64-bit integer arithmetic.

Core Functionalities

K-mer Encoding (`EncodeKmers`, `IterKmers`)

Encodes DNA sequences (A/C/G/T/U, case-insensitive) into uint64 values using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction over a whole sequence and streaming via an iterator. K-mers can be at most 31 nucleotides long (62 bits), and k is validated against this limit.
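As an illustration of the 2-bit packing described above, here is a minimal, self-contained sketch. The names `encodeBase` and `encodeKmersSketch` are hypothetical and are not the package's API. Returning an error on an invalid nucleotide is an assumption of this sketch, since the README does not say how `EncodeKmers` treats such input.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeBase maps a nucleotide to its 2-bit code (A=00, C=01, G=10, T/U=11).
// The boolean is false for any other byte.
func encodeBase(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't', 'U', 'u':
		return 3, true
	}
	return 0, false
}

// encodeKmersSketch is a hypothetical stand-in, not the package's EncodeKmers.
// It returns every k-mer of seq packed into a uint64, first base in the
// most significant of the 2*k used bits.
func encodeKmersSketch(seq []byte, k int) ([]uint64, error) {
	if k < 1 || k > 31 {
		return nil, errors.New("k must be between 1 and 31")
	}
	if len(seq) < k {
		return nil, nil // no complete k-mer fits in the sequence
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // keeps the low 2*k bits
	out := make([]uint64, 0, len(seq)-k+1)
	var kmer uint64
	for i, b := range seq {
		code, ok := encodeBase(b)
		if !ok {
			// Assumption of this sketch: invalid input is an error.
			return nil, fmt.Errorf("invalid nucleotide %q at position %d", b, i)
		}
		// Slide the window: shift in the new base, drop the oldest one.
		kmer = (kmer<<2 | code) & mask
		if i >= k-1 {
			out = append(out, kmer)
		}
	}
	return out, nil
}

func main() {
	kmers, err := encodeKmersSketch([]byte("ACGTA"), 3)
	if err != nil {
		panic(err)
	}
	for _, km := range kmers {
		fmt.Printf("%06b\n", km) // 000110 (ACG), 011011 (CGT), 101100 (GTA)
	}
}
```

Each new k-mer is derived from the previous one with a shift, an OR and a mask, so the whole sequence is encoded in a single pass.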

Reverse Complement (`ReverseComplement`)

Computes the reverse complement of a k-mer in constant time using bit manipulation. Preserves error metadata (see below) and is an involution: RC(RC(x)) = x.
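One common way to get a constant-time reverse complement with this encoding is sketched below. It relies on the fact that complementing a base is the same as flipping its two bits (A=00 and T=11, C=01 and G=10). The function `reverseComplementSketch` is hypothetical and may differ from the package's `ReverseComplement`. In particular, this sketch assumes the two error bits are zero and does not preserve them.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseComplementSketch is a hypothetical illustration, not the package's
// ReverseComplement. It assumes 1 <= k <= 31 and that the top two (error)
// bits of kmer are zero.
func reverseComplementSketch(kmer uint64, k int) uint64 {
	const pairs = 0x3333333333333333
	const nibbles = 0x0F0F0F0F0F0F0F0F

	x := ^kmer // complement every base: code XOR 3

	// Reverse the order of the 32 two-bit groups in the word.
	x = (x>>2)&pairs | (x&pairs)<<2     // swap adjacent 2-bit groups
	x = (x>>4)&nibbles | (x&nibbles)<<4 // swap nibbles within each byte
	x = bits.ReverseBytes64(x)          // reverse the byte order

	// The k-mer now sits in the top 2*k bits; move it back down.
	return x >> (64 - 2*uint(k))
}

func main() {
	acg := uint64(0b000110) // ACG
	rc := reverseComplementSketch(acg, 3)
	fmt.Printf("%06b\n", rc)                           // 011011 (CGT)
	fmt.Println(reverseComplementSketch(rc, 3) == acg) // true
}
```

The number of operations is fixed regardless of k, which is what makes the operation constant time.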

Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)

Returns the lexicographically smaller of a k-mer and its reverse complement, which enables strand-agnostic analysis. Supports both single-k-mer normalization (CanonicalKmer) and full-sequence canonical encoding (EncodeCanonicalKmers).
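Because the codes follow alphabetical order (A < C < G < T) and the first base occupies the most significant bits, comparing two encoded k-mers as integers gives the same result as comparing the strings lexicographically. The sketch below uses that property. `revComp` and `canonicalSketch` are hypothetical helpers, with a simple loop-based reverse complement chosen for readability rather than speed. Error bits are assumed to be zero.

```go
package main

import "fmt"

// revComp is a simple O(k) reverse complement used only for this sketch.
func revComp(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (^kmer & 3) // append the complement of the last base
		kmer >>= 2               // drop the last base
	}
	return rc
}

// canonicalSketch is a hypothetical illustration, not the package's
// CanonicalKmer. It returns the smaller of a k-mer and its reverse complement.
func canonicalSketch(kmer uint64, k int) uint64 {
	if rc := revComp(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	acg := uint64(0b000110) // ACG
	cgt := uint64(0b011011) // CGT, the reverse complement of ACG
	fmt.Println(canonicalSketch(acg, 3)) // 6
	fmt.Println(canonicalSketch(cgt, 3)) // 6
}
```

Both strands of the same site map to the same value, so a hash table keyed on the canonical form counts them together.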

Super k-mers Extraction (`ExtractSuperKmers`)

Groups consecutive overlapping k-mers that share the same minimizer (the smallest m-mer within the k-mer's window) into contiguous regions called super k-mers. Output includes start/end positions and minimizer values, all canonicalized.
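The grouping can be sketched naively as follows. This is an illustration under several assumptions and not the package's algorithm: the type `superKmer`, its field names, the half-open `[Start, End)` convention, and the helper names are all hypothetical. Input is assumed to be validated, k-mers are grouped by minimizer value only, and the minimizer of each k-mer is recomputed from scratch instead of being maintained incrementally.

```go
package main

import "fmt"

// superKmer is a hypothetical result type for this sketch.
type superKmer struct {
	Minimizer  uint64 // canonical m-mer shared by every k-mer in the region
	Start, End int    // half-open interval [Start, End) in the sequence
}

// code2bit returns the 2-bit code of a base. Any byte other than C, G, T or U
// is treated as A; the sketch assumes validated input.
func code2bit(b byte) uint64 {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// canonicalMmer encodes s and returns the smaller of it and its reverse complement.
func canonicalMmer(s []byte) uint64 {
	var fwd, rc uint64
	for i, b := range s {
		c := code2bit(b)
		fwd = fwd<<2 | c
		rc |= (3 - c) << (2 * uint(i)) // complement, written in reverse order
	}
	if rc < fwd {
		return rc
	}
	return fwd
}

// extractSuperKmersSketch groups consecutive k-mers that share a minimizer.
func extractSuperKmersSketch(seq []byte, k, m int) []superKmer {
	if m < 1 || m >= k || k > 31 || len(seq) < k {
		return nil
	}
	mmers := make([]uint64, len(seq)-m+1)
	for i := range mmers {
		mmers[i] = canonicalMmer(seq[i : i+m])
	}
	var out []superKmer
	for i := 0; i+k <= len(seq); i++ {
		// Minimizer of the k-mer at i: smallest of its k-m+1 m-mers.
		best := mmers[i]
		for j := i + 1; j <= i+k-m; j++ {
			if mmers[j] < best {
				best = mmers[j]
			}
		}
		if n := len(out); n > 0 && out[n-1].Minimizer == best {
			out[n-1].End = i + k // extend the current super k-mer
		} else {
			out = append(out, superKmer{Minimizer: best, Start: i, End: i + k})
		}
	}
	return out
}

func main() {
	seq := []byte("AAAACCCC")
	for _, s := range extractSuperKmersSketch(seq, 4, 2) {
		fmt.Printf("%s minimizer=%d [%d,%d)\n", seq[s.Start:s.End], s.Minimizer, s.Start, s.End)
	}
}
```

For `AAAACCCC` with k=4 and m=2, the five k-mers collapse into three regions (`AAAACC`, `ACCC`, `CCCC`). Adjacent regions overlap by k-1 bases so that no k-mer is lost at a boundary.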

Error Marking (`SetKmerError`, `GetKmerError`, etc.)

Uses the top 2 bits of a uint64 to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying k-mer and work seamlessly with canonicalization/RC.
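The tagging scheme can be sketched with two masks. The function names below mirror the idea of `SetKmerError` and `GetKmerError`, but they are hypothetical: the actual signatures and the meaning assigned to each 2-bit value are not given in this README.

```go
package main

import "fmt"

const (
	errShift = 62                    // error tag lives in bits 62 and 63
	errMask  = uint64(3) << errShift // selects the two tag bits
	seqMask  = ^errMask              // selects the 62 sequence bits
)

// setErrorSketch stores a 2-bit tag without touching the sequence bits.
func setErrorSketch(kmer uint64, code uint8) uint64 {
	return kmer&seqMask | uint64(code&3)<<errShift
}

// getErrorSketch reads the 2-bit tag.
func getErrorSketch(kmer uint64) uint8 {
	return uint8(kmer >> errShift)
}

// clearErrorSketch returns the k-mer with the tag reset to zero.
func clearErrorSketch(kmer uint64) uint64 {
	return kmer & seqMask
}

func main() {
	cgt := uint64(0b011011) // CGT
	tagged := setErrorSketch(cgt, 2)
	fmt.Println(getErrorSketch(tagged))          // 2
	fmt.Println(clearErrorSketch(tagged) == cgt) // true
}
```

Since 31 nucleotides need exactly 62 bits, the tag never collides with sequence data, which is also why k is capped at 31.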

Key Features

Memory Efficiency: Reusable buffers via optional *[]uint64 or *[]SuperKmer parameters.
Edge Case Handling: Gracefully handles empty sequences, k > len(seq), invalid parameters (m ≥ k), and max-length constraints.
Performance: Optimized for speed, with benchmarks included for all major functions (e.g., BenchmarkEncodeKmers, BenchmarkExtractSuperKmers).
Comprehensive Testing: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
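The reusable-buffer feature listed above follows a common Go idiom, sketched here with a hypothetical function `encodeInto`. The parameter order and the behaviour when the pointer is nil are assumptions of this sketch, not the package's documented contract. Input is assumed to be validated.

```go
package main

import "fmt"

// code2bit returns the 2-bit code of a base. Any byte other than C, G, T or U
// is treated as A; the sketch assumes validated input.
func code2bit(b byte) uint64 {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// encodeInto is a hypothetical illustration of an optional reusable buffer.
// With buf == nil a fresh slice is allocated; otherwise the backing array of
// *buf is reused and *buf is updated to the returned slice.
func encodeInto(seq []byte, k int, buf *[]uint64) []uint64 {
	if k < 1 || k > 31 || len(seq) < k {
		return nil
	}
	var out []uint64
	if buf != nil {
		out = (*buf)[:0] // keep the capacity, discard old contents
	}
	mask := uint64(1)<<(2*uint(k)) - 1
	var kmer uint64
	for i, b := range seq {
		kmer = (kmer<<2 | code2bit(b)) & mask
		if i >= k-1 {
			out = append(out, kmer)
		}
	}
	if buf != nil {
		*buf = out // hand the (possibly grown) slice back to the caller
	}
	return out
}

func main() {
	buf := make([]uint64, 0, 16)
	for _, s := range []string{"ACGTACGT", "ACGT"} {
		kmers := encodeInto([]byte(s), 3, &buf)
		fmt.Println(len(kmers), cap(kmers))
	}
}
```

Results from successive calls share the same backing array, so a caller that needs to keep an earlier result must copy it before the next call.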

Use Cases

Genome assembly and de Bruijn graph (DBG) construction
Minimizer-based sketching (e.g., Mash, Sourmash)
Error-aware k-mer counting & filtering
Strand-unbiased sequence comparison

The sequence-level functions operate on []byte DNA sequences and return compact uint64-based representations suitable for hashing or indexing.
