Files
obitools4/autodoc/docmd/pkg/obikmer/debruijn.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

2.6 KiB
Raw Blame History

Semantic Description of the obikmer Package

This Go package implements a De Bruijn graph for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).

Core Functionalities

  • K-mer Encoding: K-mers are encoded as uint64 using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the iupac map.
  • Reverse Complement Handling: The revcompnuc table enables nucleotide-wise reverse complementation.
  • Graph Construction: The DeBruijnGraph struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (kmermask, prevc/g/t).

Graph Operations

  • Node Queries:
    • Previouses() / Nexts(): Return predecessor/successor k-mers in the graph.
    • MaxNext() / MaxHead(): Find neighbors or heads (sources) with maximum weight.
  • Path Exploration:
    • MaxPath(): Greedily traces the highest-weight path from a head.
    • LongestPath(): Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).
    • HaviestPath(): Uses Dijkstra-like priority queue to find the heaviest (sum-weight) path, with cycle detection via DFS (HasCycle()).

Consensus & Filtering

  • Consensus Generation:
    • BestConsensus() returns a sequence from the greedy max-weight path.
    • LongestConsensus(id, min_cov) trims low-coverage ends using a coverage threshold (mode-based).
  • Weight Statistics:
    • MaxWeight(), WeightMean(), WeightMode() provide distribution summaries.
    • FilterMinWeight(min) removes low-count nodes.
  • Decoding:
    • DecodeNode() converts a k-mer index to its DNA string.
    • DecodePath() reconstructs the full consensus from a path.

I/O & Diagnostics

  • GML Export: WriteGml() outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
  • Hamming Distance: HammingDistance() computes edit distance between two encoded k-mers using bit operations.
  • Sequence Insertion: Push() adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.

Dependencies & Design

  • Leverages obiseq for sequence representation and logrus/slices/heap from Gos stdlib.
  • Designed for scalability and speed, using bit-level operations to minimize memory footprint.

Overall: a robust k-mer graph engine for de novo assembly, error correction, and consensus recovery in high-throughput sequencing data.