Files
obitools4/autodoc/docmd/pkg/obikmer/superkmer.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

1.9 KiB
Raw Blame History

SuperKmer and Minimizer-Based Sliding Window Analysis

This Go package provides functionality for extracting super k-mers from DNA sequences using a minimizer-based sliding window approach.

Core Concepts

  • K-mers: Substrings of length k from a DNA sequence.
  • Minimizer: The lexicographically smallest canonical m-mer (substring of length m) among all (k m + 1) overlapping m-mers in a given k-mer.
  • Super K-mer: A maximal contiguous subsequence where every consecutive k-mer shares the same minimizer.

Data Structures

SuperKmer

Represents a maximal region with uniform minimizer:

  • Minimizer: Canonical 64-bit hash of the shared m-mer.
  • Start, End: Slice-style bounds (0-indexed, exclusive end).
  • Sequence: Raw byte slice of the DNA subsequence.

dequeItem

Used internally to maintain a monotone deque:

  • position: Index of the m-mer in the sequence.
  • canonical: Canonical hash value (e.g., lexicographically smallest of forward/reverse-complement).

Main Function

ExtractSuperKmers(seq, k, m, buffer)

  • Extracts all maximal super k-mers from seq.
  • Parameters validated:
    • 1 ≤ m < k,
    • 2 ≤ k ≤ 31,
    • sequence length ≥ k.
  • Uses an efficient O(n) time algorithm via internal iteration.
  • Supports optional preallocation (buffer) to reduce memory allocations.

Algorithm Highlights

  • Maintains a sliding window of size k m + 1 over m-mers.
  • Tracks the current minimizer using a monotone deque for O(1) updates per step.
  • Detects minimizer transitions to delimit super k-mer boundaries.

Complexity

Aspect Bound
Time O(n) (linear in sequence length)
Space O(k m + 1) for deque + output size

Useful in genome compression, read clustering, and minimizer-based alignment acceleration.