obitools4/autodoc/docmd/pkg/obikmer/superkmer_iter_test.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

Semantic Description of obikmer Package Functionalities

The obikmer package provides tools for super k-mer extraction and minimizer-based sequence analysis in bioinformatics.

Core Concepts

A super k-mer is a maximal contiguous subsequence of DNA in which all embedded k-mers share the same minimizer. The minimizer of a k-mer is a compact representative chosen among the m-mers it contains (typically the lexicographically smallest one), considering both the forward and reverse-complement strands.
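The definition can be made concrete with a minimal Go sketch. It works on plain strings rather than on the package's encoded representation, assumes upper-case A/C/G/T input and 1 ≤ m ≤ len(kmer), and the names complement, revComp, canonical, and minimizer are illustrative, not part of obikmer.

```go
package main

// complement returns the complementary base (hypothetical helper).
// Any byte other than A/C/G/T is mapped to 'N' in this sketch.
func complement(b byte) byte {
	switch b {
	case 'A':
		return 'T'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'T':
		return 'A'
	}
	return 'N'
}

// revComp returns the reverse complement of s.
func revComp(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = complement(s[i])
	}
	return string(out)
}

// canonical returns the lexicographically smaller of s and its
// reverse complement, so both strands give the same answer.
func canonical(s string) string {
	if rc := revComp(s); rc < s {
		return rc
	}
	return s
}

// minimizer returns the smallest canonical m-mer contained in kmer.
// It assumes 1 <= m <= len(kmer).
func minimizer(kmer string, m int) string {
	best := canonical(kmer[:m])
	for i := 1; i+m <= len(kmer); i++ {
		if c := canonical(kmer[i : i+m]); c < best {
			best = c
		}
	}
	return best
}
```

For example, minimizer("ACGTT", 3) is "AAC": the m-mer GTT is replaced by its smaller reverse complement AAC, which then beats ACG.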

Key Functions & Features

  • IterSuperKmers(seq, k, m):
    An iterator over all super k-mers of the input sequence seq, parameterized by:

    • k: length of the embedded k-mers,
    • m: length of the minimizer (m ≤ k).

    Each yielded object carries:

    • Sequence: the super k-mer substring,
    • Start/End: position of the super k-mer in seq (0-based, half-open, so Sequence equals seq[Start:End]),
    • Minimizer: canonical encoded value of the shared minimizer.
  • ExtractSuperKmers(...):
    Eager counterpart that returns all super k-mers at once as a slice.
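The behaviour described for these two functions can be illustrated with a naive, slice-returning sketch. This is not the package's implementation: the SuperKmer type, extractSuperKmers, and the helpers are hypothetical names, the minimizer is stored as a string rather than as an encoded integer, the input is assumed to be upper-case A/C/G/T, and the minimizer is recomputed from scratch for every k-mer, which is simple but slow.

```go
package main

// SuperKmer is a hypothetical record carrying the fields listed above.
type SuperKmer struct {
	Sequence   string
	Start, End int // 0-based, half-open
	Minimizer  string
}

// canonical returns the smaller of s and its reverse complement.
func canonical(s string) string {
	rc := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		c := byte('N') // non-ACGT bytes are not handled in this sketch
		switch s[i] {
		case 'A':
			c = 'T'
		case 'C':
			c = 'G'
		case 'G':
			c = 'C'
		case 'T':
			c = 'A'
		}
		rc[len(s)-1-i] = c
	}
	if string(rc) < s {
		return string(rc)
	}
	return s
}

// minimizer returns the smallest canonical m-mer of kmer.
func minimizer(kmer string, m int) string {
	best := canonical(kmer[:m])
	for i := 1; i+m <= len(kmer); i++ {
		if c := canonical(kmer[i : i+m]); c < best {
			best = c
		}
	}
	return best
}

// extractSuperKmers groups consecutive k-mers that share a minimizer.
func extractSuperKmers(seq string, k, m int) []SuperKmer {
	var out []SuperKmer
	if m < 1 || m > k || len(seq) < k {
		return out
	}
	start, cur := 0, minimizer(seq[:k], m)
	for i := 1; i+k <= len(seq); i++ {
		mz := minimizer(seq[i:i+k], m)
		if mz != cur {
			// The previous k-mer started at i-1, so the run ends at i-1+k.
			end := i - 1 + k
			out = append(out, SuperKmer{seq[start:end], start, end, cur})
			start, cur = i, mz
		}
	}
	// The last run always extends to the end of the sequence.
	return append(out, SuperKmer{seq[start:], start, len(seq), cur})
}
```

With seq = "ACGTTGCA", k = 5 and m = 3, the sketch returns two overlapping super k-mers: "ACGTTGC" at [0, 7) with minimizer "AAC", and "TTGCA" at [3, 8) with minimizer "CAA". Consecutive super k-mers overlap by k-1 bases because every k-mer belongs to exactly one of them.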

Verified Properties (via Tests)

  1. Boundary correctness: Extracted subsequences match seq[start:end].
  2. Consistency between iterator and slice versions: Both APIs produce identical results.
  3. Bijection property:
    • Each unique super k-mer sequence maps to exactly one minimizer.
    • All embedded k-mers within a super k-mer share the same minimizer.
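Properties 1 and 3 can be expressed as a small checker. The sketch below is not the package's test code: SuperKmer and checkSuperKmers are hypothetical names, and the minimizer function is passed in by the caller so the checker stays independent of how minimizers are computed. Property 2 is not covered here, since it only requires comparing the output of the two APIs element by element.

```go
package main

import "fmt"

// SuperKmer is a hypothetical stand-in for the yielded structure.
type SuperKmer struct {
	Sequence   string
	Start, End int // 0-based, half-open
	Minimizer  string
}

// checkSuperKmers verifies boundary correctness and minimizer consistency.
// minimizerOf is supplied by the caller and returns the minimizer of one k-mer.
func checkSuperKmers(seq string, sks []SuperKmer, k int, minimizerOf func(string) string) error {
	seen := map[string]string{} // super k-mer sequence -> minimizer
	for _, sk := range sks {
		// Property 1: coordinates are valid and delimit the reported substring.
		if sk.Start < 0 || sk.End > len(seq) || sk.Start > sk.End ||
			seq[sk.Start:sk.End] != sk.Sequence {
			return fmt.Errorf("boundary mismatch at [%d,%d)", sk.Start, sk.End)
		}
		// Property 3, first part: one minimizer per distinct sequence.
		if prev, ok := seen[sk.Sequence]; ok && prev != sk.Minimizer {
			return fmt.Errorf("%q maps to both %q and %q", sk.Sequence, prev, sk.Minimizer)
		}
		seen[sk.Sequence] = sk.Minimizer
		// Property 3, second part: every embedded k-mer has that minimizer.
		for i := 0; i+k <= len(sk.Sequence); i++ {
			if mz := minimizerOf(sk.Sequence[i : i+k]); mz != sk.Minimizer {
				return fmt.Errorf("k-mer %d of %q has minimizer %q, want %q",
					i, sk.Sequence, mz, sk.Minimizer)
			}
		}
	}
	return nil
}
```

Passing the minimizer as a function also makes it easy to exercise the checker with a toy rule, such as "smallest single base", before plugging in a canonical minimizer.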

Implementation Notes

  • Minimizers are computed canonically (min of forward and reverse-complement encodings).
  • Uses base encoding via __single_base_code__ (assumed helper mapping A/C/G/T → 0/1/2/3).
  • Tests cover simple, homopolymer-rich, and complex genomic patterns.
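The canonical computation on encoded m-mers can be sketched as follows, assuming the A/C/G/T → 0/1/2/3 mapping mentioned above. The function names are hypothetical and baseCode is only a stand-in for the package's lookup table; the sketch packs two bits per base into a uint64, so it is limited to m-mers of at most 32 bases.

```go
package main

// baseCode maps a nucleotide to its 2-bit code (A=0, C=1, G=2, T=3).
// In this sketch any other byte is treated like 'A'.
func baseCode(b byte) uint64 {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't':
		return 3
	}
	return 0
}

// encode packs s into an integer, first base in the most significant bits.
func encode(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		v = v<<2 | baseCode(s[i])
	}
	return v
}

// revCompCode returns the encoding of the reverse complement of an
// m-mer. With this mapping the complement of code c is 3-c.
func revCompCode(v uint64, m int) uint64 {
	var rc uint64
	for i := 0; i < m; i++ {
		rc = rc<<2 | (3 - (v & 3)) // take the last base, complement it, append
		v >>= 2
	}
	return rc
}

// canonicalCode returns the smaller of the forward and
// reverse-complement encodings of s.
func canonicalCode(s string) uint64 {
	fwd := encode(s)
	if rc := revCompCode(fwd, len(s)); rc < fwd {
		return rc
	}
	return fwd
}
```

Because A < C < G < T maps to 0 < 1 < 2 < 3, comparing two encodings of the same length gives the same order as comparing the strings lexicographically. For instance, canonicalCode("GTT") and canonicalCode("AAC") both return 1, the encoding of AAC.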

Design Rationale

Super k-mers enable efficient compression, indexing (e.g., in minimizer spaces), and alignment-free comparisons, all of which are crucial for scalable genomic analysis.