

Semantic Description of obikmer Package

The obikmer package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.

Core Functionalities

  1. Nucleotide Encoding

    • EncodeNucleotide(b byte): Maps DNA bases (A, C, G, T/U) to 2-bit values:
      A→0, C→1, G→2, T/U→3.
      Ambiguous or non-standard characters (e.g., N, R, Y) default to A (0).
      Uses a lookup table for O(1) performance.
  2. 4-mer Encoding

    • Encode4mer(seq, buffer): Converts a biological sequence into overlapping 4-mers.
      Each 4-mer is encoded as an unsigned byte (0–255), where each nucleotide contributes 2 bits.
      Supports optional buffer reuse for memory efficiency.
  3. 4-mer Indexing

    • Index4mer(seq, index, buffer): Builds an inverted index mapping each 4-mer code (0–255) to all of its occurrence positions in the sequence.
      Returns [][]int with one row per 4-mer code; each row lists the positions at which that code occurs.
  4. Fast Sequence Comparison

    • FastShiftFourMer(...): Compares two sequences using a FASTA-like shift-scoring algorithm.
      • Uses precomputed 4-mer index of a reference sequence and encodes the query.
      • Counts co-occurring 4-mers for every possible shift (refPos - queryPos).
      • Computes raw and relative scores (normalized by alignment length).
      • Returns optimal shift, count of matching 4-mers, and maximum score (raw or relative).
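
The four steps above can be sketched end to end in a few lines of Go. This is a minimal illustration, not the package's implementation: the lowercase names (encodeNucleotide, encode4mers, index4mers, bestShift) and their signatures are hypothetical stand-ins, acceptance of lowercase bases is an assumption, a switch replaces the lookup table for brevity, and only the raw count is computed (the relative score is left out).

```go
package main

import "fmt"

// encodeNucleotide maps a base to a 2-bit code: A=0, C=1, G=2, T/U=3.
// Any other byte (N, R, Y, ...) falls back to 0, i.e. it is treated as A.
// Accepting lowercase input is an assumption of this sketch.
func encodeNucleotide(b byte) byte {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// encode4mers returns one byte per overlapping 4-mer of seq.
// buffer may be nil; when non-nil its backing array is reused.
func encode4mers(seq []byte, buffer []byte) []byte {
	codes := buffer[:0]
	var code byte
	for i, b := range seq {
		// Shifting a byte left by 2 drops the oldest base automatically.
		code = code<<2 | encodeNucleotide(b)
		if i >= 3 {
			codes = append(codes, code)
		}
	}
	return codes
}

// index4mers builds the inverted index: row c lists the positions of code c.
func index4mers(codes []byte) [][]int {
	index := make([][]int, 256)
	for pos, c := range codes {
		index[c] = append(index[c], pos)
	}
	return index
}

// bestShift counts shared 4-mers per shift (refPos - queryPos) and returns
// the shift with the highest count; ties go to the smallest shift.
func bestShift(refIndex [][]int, query []byte) (shift, count int) {
	counts := make(map[int]int)
	for qPos, c := range query {
		for _, rPos := range refIndex[c] {
			counts[rPos-qPos]++
		}
	}
	for s, n := range counts {
		if n > count || (n == count && s < shift) {
			shift, count = s, n
		}
	}
	return shift, count
}

func main() {
	ref := index4mers(encode4mers([]byte("ACGTACGTTTGA"), nil))
	query := encode4mers([]byte("TACGTT"), nil)
	shift, count := bestShift(ref, query)
	fmt.Println(shift, count) // prints "3 3"
}
```

In this example the query TACGTT occurs in the reference starting at position 3, so all three of its 4-mers vote for shift 3.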

Design Highlights

  • Memory-aware: Supports buffer reuse to minimize allocations.
  • Robustness: Non-canonical bases handled gracefully (defaulting to A).
  • Performance-oriented: O(n) encoding and indexing; efficient hash-based shift counting.
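
The buffer-reuse and lookup-table points can be illustrated with a short sketch. The names code2bit and encodeInto are hypothetical and the 64-byte capacity is an arbitrary choice for the example; the idea is only that one buffer, allocated once, is truncated and refilled for each sequence.

```go
package main

import "fmt"

// code2bit is a 256-entry lookup table; bytes not set below stay 0 (A),
// which is how non-canonical bases end up treated as A.
var code2bit = func() (t [256]byte) {
	for i, b := range []byte("ACGT") {
		t[b] = byte(i)
		t[b|0x20] = byte(i) // lowercase variant (an assumption of this sketch)
	}
	t['U'], t['u'] = 3, 3
	return t
}()

// encodeInto writes the 4-mer codes of seq into buf, reusing buf's backing
// array as long as its capacity is sufficient.
func encodeInto(seq string, buf []byte) []byte {
	buf = buf[:0]
	var code byte
	for i := 0; i < len(seq); i++ {
		code = code<<2 | code2bit[seq[i]]
		if i >= 3 {
			buf = append(buf, code)
		}
	}
	return buf
}

func main() {
	buf := make([]byte, 0, 64) // allocated once, before the loop
	for _, seq := range []string{"ACGTACGT", "GGGTTTAAACC", "TTNACG"} {
		buf = encodeInto(seq, buf)
		// The capacity stays at 64: no new allocation per sequence.
		fmt.Println(seq, len(buf), cap(buf))
	}
}
```

Because the returned slice shares the caller's backing array, the result of one call is overwritten by the next; a caller that needs to keep the codes would have to copy them first.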

Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.