Files
obitools4/autodoc/docmd/pkg/obikmer/debruijn.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

45 lines
2.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Semantic Description of the `obikmer` Package
This Go package implements a **De Bruijn graph** for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).
### Core Functionalities
- **K-mer Encoding**: K-mers are encoded as `uint64` using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the `iupac` map.
- **Reverse Complement Handling**: The `revcompnuc` table enables nucleotide-wise reverse complementation.
- **Graph Construction**: The `DeBruijnGraph` struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (`kmermask`, `prevc/g/t`).
### Graph Operations
- **Node Queries**:
- `Previouses()` / `Nexts()`: Return predecessor/successor k-mers in the graph.
- `MaxNext()` / `MaxHead()`: Find neighbors or heads (sources) with maximum weight.
- **Path Exploration**:
- `MaxPath()`: Greedily traces the highest-weight path from a head.
- `LongestPath()`: Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).
- `HaviestPath()`: Uses Dijkstra-like priority queue to find the *heaviest* (sum-weight) path, with cycle detection via DFS (`HasCycle()`).
### Consensus & Filtering
- **Consensus Generation**:
- `BestConsensus()` returns a sequence from the greedy max-weight path.
- `LongestConsensus(id, min_cov)` trims low-coverage ends using a coverage threshold (mode-based).
- **Weight Statistics**:
- `MaxWeight()`, `WeightMean()`, `WeightMode()` provide distribution summaries.
- `FilterMinWeight(min)` removes low-count nodes.
- **Decoding**:
- `DecodeNode()` converts a k-mer index to its DNA string.
- `DecodePath()` reconstructs the full consensus from a path.
### I/O & Diagnostics
- **GML Export**: `WriteGml()` outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
- **Hamming Distance**: `HammingDistance()` computes edit distance between two encoded k-mers using bit operations.
- **Sequence Insertion**: `Push()` adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.
### Dependencies & Design
- Leverages `obiseq` for sequence representation and `logrus`/`slices`/`heap` from Gos stdlib.
- Designed for scalability and speed, using bit-level operations to minimize memory footprint.
Overall: a robust k-mer graph engine for *de novo* assembly, error correction, and consensus recovery in high-throughput sequencing data.