Semantic Description of obialign Package

The obialign package provides high-performance functions for computing the Longest Common Subsequence (LCS) between two biological sequences, with support for error tolerance and end-gap-free alignment.

Core Algorithm

  • Implements a Needleman-Wunsch dynamic programming algorithm optimized for speed and memory efficiency.
  • Uses bit-packed encoding (uint64) to store score, path length, and gap status in a compact form.
  • Leverages diagonal banding to restrict computation to the cells that lie within the allowed error margin, reducing both time and space complexity.
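
The bit-packing idea can be illustrated with a minimal sketch. The field layout below (score in the high bits, then path length, then a one-bit gap flag) and the names pack and unpack are assumptions made for illustration only; the layout actually used inside obialign is not described here and may differ.

```go
package main

import "fmt"

// Hypothetical layout, chosen for illustration only:
//   bits 63..33  score (assumed non-negative)
//   bits 32..1   path length
//   bit  0       gap flag
const (
	gapBits    = 1
	lengthBits = 32
	lengthMask = (uint64(1) << lengthBits) - 1
)

// pack stores the three fields of one DP cell in a single uint64.
func pack(score, length int, gap bool) uint64 {
	cell := uint64(score)<<(lengthBits+gapBits) |
		(uint64(length)&lengthMask)<<gapBits
	if gap {
		cell |= 1
	}
	return cell
}

// unpack recovers the three fields from a packed cell.
func unpack(cell uint64) (score, length int, gap bool) {
	score = int(cell >> (lengthBits + gapBits))
	length = int((cell >> gapBits) & lengthMask)
	gap = cell&1 == 1
	return score, length, gap
}

func main() {
	score, length, gap := unpack(pack(42, 57, true))
	fmt.Println(score, length, gap)
}
```

With the score in the most significant bits, comparing two packed cells as plain integers ranks them by score first, which is one plausible reason for a DP kernel to favour this kind of encoding.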

Scoring Scheme

  • Match: +1 point
  • Mismatch or gap (indel): 0 points
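
To make the scoring scheme concrete, here is a minimal, unoptimized sketch of an LCS score computed by row-by-row dynamic programming. The function name lcsScore is hypothetical, and the sketch omits banding, bit-packing, and IUPAC handling, so it should not be read as the package's implementation.

```go
package main

import "fmt"

// lcsScore fills a Needleman-Wunsch style matrix two rows at a time,
// scoring +1 for a match and 0 for a mismatch or a gap.
// Hypothetical helper, not part of obialign.
func lcsScore(a, b []byte) int {
	prev := make([]int, len(b)+1) // row i-1, all zeros for the empty prefix
	curr := make([]int, len(b)+1) // row i
	for i := 1; i <= len(a); i++ {
		curr[0] = 0
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // diagonal: match or mismatch
			if a[i-1] == b[j-1] {
				best++ // match: +1
			}
			if prev[j] > best { // gap in b: +0
				best = prev[j]
			}
			if curr[j-1] > best { // gap in a: +0
				best = curr[j-1]
			}
			curr[j] = best
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	// The two sequences differ by a single substitution.
	fmt.Println(lcsScore([]byte("ACGTACGT"), []byte("ACGTTCGT")))
}
```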

Key Functions

  1. FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)

    • Computes LCS score and alignment length between raw byte sequences.
    • If endgapfree is true, ignores leading/trailing gaps (useful for read alignment).
    • Returns (score, length, end_position); end_position marks where the LCS ends in sequence A.
    • Returns -1, -1, -1 if the actual error count exceeds maxError.
  2. FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)

    • Wrapper for FastLCSEGFScoreByte with end-gap-free mode enabled by default.
    • Designed for standard biosequence inputs.
  3. FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)

    • Computes standard LCS (including end gaps). Returns (score, alignment_length).
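
The return contract described above (a score, an alignment length, and -1 values when the error bound is exceeded) can be sketched as follows. This is a simplified stand-in, not the obialign code: boundedLCS and cell are hypothetical names, the sketch covers only the standard mode with end gaps counted, it uses exact byte equality, and it assumes that the error count equals the alignment length minus the score.

```go
package main

import "fmt"

// cell holds the best score reaching a DP position and the length
// (number of alignment columns) of the path that achieves it.
type cell struct{ score, length int }

// better prefers a higher score, then a shorter alignment.
func better(x, y cell) bool {
	return x.score > y.score || (x.score == y.score && x.length < y.length)
}

// boundedLCS returns (score, alignment length), or (-1, -1) when the
// number of mismatches plus gaps exceeds maxError. maxError = -1 disables
// the bound. Hypothetical helper written for illustration.
func boundedLCS(a, b []byte, maxError int) (int, int) {
	prev := make([]cell, len(b)+1)
	curr := make([]cell, len(b)+1)
	for j := range prev {
		prev[j] = cell{0, j} // j leading gaps
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = cell{0, i} // i leading gaps
		for j := 1; j <= len(b); j++ {
			best := cell{prev[j-1].score, prev[j-1].length + 1}
			if a[i-1] == b[j-1] {
				best.score++
			}
			up := cell{prev[j].score, prev[j].length + 1}
			if better(up, best) {
				best = up
			}
			left := cell{curr[j-1].score, curr[j-1].length + 1}
			if better(left, best) {
				best = left
			}
			curr[j] = best
		}
		prev, curr = curr, prev
	}
	res := prev[len(b)]
	if maxError >= 0 && res.length-res.score > maxError {
		return -1, -1
	}
	return res.score, res.length
}

func main() {
	a, b := []byte("ACGTACGT"), []byte("ACGTTCGT")
	fmt.Println(boundedLCS(a, b, 1))  // one substitution fits the bound
	fmt.Println(boundedLCS(a, b, 0))  // bound exceeded
	fmt.Println(boundedLCS(a, b, -1)) // no bound
}
```

A caller would typically test the returned score for a negative value before using the alignment length.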

Features

  • Error-bounded: Accepts maxError = -1 (no limit) or a fixed maximum number of mismatches plus gaps.
  • Memory-efficient: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
  • IUPAC-aware: Uses obiseq.SameIUPACNuc() to handle ambiguous nucleotide codes (e.g., R, Y).
  • Optimized for short reads: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
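
As an illustration of what IUPAC-aware matching can mean, the sketch below encodes each nucleotide code as a bitmask over {A, C, G, T} and treats two codes as compatible when their masks overlap. The names iupacMask and compatible are hypothetical, and the exact rule applied by obiseq.SameIUPACNuc() is not documented here, so this is an assumption about the general technique rather than a description of that function.

```go
package main

import "fmt"

// Bit flags for the four unambiguous nucleotides.
const (
	bitA uint8 = 1 << iota
	bitC
	bitG
	bitT
)

// iupacMask maps lower-case IUPAC nucleotide codes to the set of bases
// they may represent. Hypothetical table written for illustration.
var iupacMask = map[byte]uint8{
	'a': bitA, 'c': bitC, 'g': bitG, 't': bitT, 'u': bitT,
	'r': bitA | bitG, 'y': bitC | bitT,
	's': bitC | bitG, 'w': bitA | bitT,
	'k': bitG | bitT, 'm': bitA | bitC,
	'b': bitC | bitG | bitT, 'd': bitA | bitG | bitT,
	'h': bitA | bitC | bitT, 'v': bitA | bitC | bitG,
	'n': bitA | bitC | bitG | bitT,
}

// lower converts an ASCII upper-case letter to lower case.
func lower(c byte) byte {
	if c >= 'A' && c <= 'Z' {
		return c + ('a' - 'A')
	}
	return c
}

// compatible reports whether two IUPAC codes share at least one base.
// Unknown characters have an empty mask and never match.
func compatible(x, y byte) bool {
	return iupacMask[lower(x)]&iupacMask[lower(y)] != 0
}

func main() {
	fmt.Println(compatible('R', 'A')) // R stands for A or G
	fmt.Println(compatible('R', 'C')) // R excludes C
	fmt.Println(compatible('Y', 't')) // Y stands for C or T
}
```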

Use Cases

  • Molecular barcode/UMI clustering
  • Read-to-reference alignment in amplicon sequencing
  • Similarity filtering of biological sequences