obitools4/autodoc/docmd/pkg_obialign.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


obialign Package: Semantic Overview

The obialign package delivers high-performance, memory-efficient utilities for biological sequence alignment within the OBITools4 ecosystem. It targets amplicon and metagenomic data processing, emphasizing speed, numerical stability, and scalability.


Core Functionalities

1. Sequence Encoding & Decoding

  • Encode4bits: Converts IUPAC nucleotides (including ambiguous codes like R, Y, N) into compact 4-bit representations.
  • Supports bitwise operations for rapid comparison (e.g., via _FourBitsBaseCode).
  • Encodes gaps (. and -) and invalid characters as 0b0000.
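
The encoding idea can be illustrated with a minimal Go sketch. The bit layout (A=0b0001, C=0b0010, G=0b0100, T=0b1000) and the names encode4bitsSketch and compatible are assumptions made for this example; they are not taken from the obialign source, whose actual layout may differ.

```go
package main

import "fmt"

// Hypothetical bit layout: one bit per canonical base, ambiguity codes are
// the bitwise OR of the bases they stand for.
var iupacCodes = map[byte]byte{
	'a': 0b0001, 'c': 0b0010, 'g': 0b0100, 't': 0b1000, 'u': 0b1000,
	'r': 0b0101, // A or G
	'y': 0b1010, // C or T
	's': 0b0110, // C or G
	'w': 0b1001, // A or T
	'k': 0b1100, // G or T
	'm': 0b0011, // A or C
	'b': 0b1110, // C, G or T
	'd': 0b1101, // A, G or T
	'h': 0b1011, // A, C or T
	'v': 0b0111, // A, C or G
	'n': 0b1111, // any base
}

// encode4bitsSketch maps each nucleotide to its 4-bit code.
// Gaps and unknown characters are not in the table and become 0b0000.
func encode4bitsSketch(seq string) []byte {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		c := seq[i]
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A' // fold to lower case
		}
		out[i] = iupacCodes[c]
	}
	return out
}

// compatible reports whether two encoded bases share at least one nucleotide.
func compatible(x, y byte) bool { return x&y != 0 }

func main() {
	const seq = "ARN-"
	enc := encode4bitsSketch(seq)
	for i, code := range enc {
		fmt.Printf("%c -> %04b\n", seq[i], code)
	}
	fmt.Println("A compatible with R:", compatible(enc[0], enc[1]))
}
```

With this layout, a single AND tells whether two possibly ambiguous bases can represent the same nucleotide, which is what makes the compact encoding useful for fast comparison.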

2. Alignment Scoring & Probability Models

  • _MatchRatio, _NucPartMatch: Compute match likelihoods using bitwise overlap of encoded bases.
  • Log-space helpers (_Logaddexp, _Logdiffexp) ensure numerical stability in probabilistic scoring.
  • Quality-aware scores via precomputed matrices (_NucScorePartMatch{Match,Mismatch}), incorporating Phred scores and base composition priors.

3. Dynamic Programming (DP) Backtracking

  • _Backtracking: Reconstructs optimal alignment paths from precomputed matrices.
    • Encodes diagonal runs and gap segments as alternating (offset, length) pairs.
  • Optimized for batch reuse of path buffers and minimal allocations.

4. Longest Common Subsequence (LCS) with Error Tolerance

  • FastLCSEGFScore, FastLCSScore: Compute LCS under bounded error (maxError) and optional end-gap-free mode.
    • Uses diagonal banding for efficiency.
  • Designed for rapid similarity filtering (e.g., UMI/OTU clustering).
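
Diagonal banding can be sketched as follows. This is an illustrative implementation, not the code of FastLCSScore: the function name bandedLCS, its return values, and the error definition (length of the longer sequence minus the LCS length) are assumptions made for the example, and end-gap-free mode is not covered.

```go
package main

import "fmt"

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// bandedLCS computes the LCS length of a and b using only the cells within
// maxError of the main diagonal. The boolean is true when the error count,
// defined here as max(len(a), len(b)) - LCS, is at most maxError. When it is
// false, the returned length is only a lower bound on the true LCS.
func bandedLCS(a, b string, maxError int) (int, bool) {
	n, m := len(a), len(b)
	if n-m > maxError || m-n > maxError {
		return 0, false // length difference alone exceeds the error budget
	}
	prev := make([]int, m+1)
	cur := make([]int, m+1)
	for i := 1; i <= n; i++ {
		lo := maxInt(1, i-maxError)
		hi := minInt(m, i+maxError)
		cur[lo-1] = 0 // cell just left of the band
		for j := lo; j <= hi; j++ {
			best := maxInt(prev[j], cur[j-1])
			if a[i-1] == b[j-1] {
				best = maxInt(best, prev[j-1]+1)
			}
			cur[j] = best
		}
		if hi < m {
			cur[hi+1] = 0 // cell just right of the band, read by the next row
		}
		prev, cur = cur, prev
	}
	lcs := prev[m]
	return lcs, maxInt(n, m)-lcs <= maxError
}

func main() {
	lcs, ok := bandedLCS("ACGTACGT", "ACGTTCGT", 1)
	fmt.Println(lcs, ok)
}
```

Each row touches at most 2*maxError+1 cells, so the cost grows with the sequence length times the error budget rather than with the product of both lengths. That is what makes this kind of test cheap enough for filtering many sequence pairs.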

5. Single-Edit Distance Detection

  • D1Or0: Determines if two sequences are identical or differ by exactly one edit (substitution/indel).
    • Early termination on length mismatch or multiple divergences.
    • Critical for error correction and dereplication.
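
The early-termination logic can be illustrated with a short sketch. The function atMostOneEdit is a hypothetical stand-in written for this example; the real D1Or0 may have a different signature and may return more information, such as the position of the difference.

```go
package main

import "fmt"

// atMostOneEdit reports whether a and b are identical or differ by exactly
// one substitution, insertion or deletion.
func atMostOneEdit(a, b string) bool {
	if len(a) > len(b) {
		a, b = b, a // make a the shorter sequence
	}
	if len(b)-len(a) > 1 {
		return false // two or more indels are needed
	}
	// Skip the common prefix.
	i := 0
	for i < len(a) && a[i] == b[i] {
		i++
	}
	if i == len(a) {
		return true // identical, or b has one extra trailing base
	}
	if len(a) == len(b) {
		return a[i+1:] == b[i+1:] // one substitution at position i
	}
	return a[i:] == b[i+1:] // one extra base in b at position i
}

func main() {
	fmt.Println(atMostOneEdit("ACGT", "AGGT")) // one substitution
	fmt.Println(atMostOneEdit("ACGT", "ACT"))  // one deletion
	fmt.Println(atMostOneEdit("ACGT", "AGGA")) // two substitutions
}
```

The check runs in linear time and needs no DP matrix, which is why a dedicated routine is attractive for dereplication, where most sequence pairs are rejected on the length test alone.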

6. Local Pattern Matching

  • LocatePattern: Finds the optimal approximate match of a short query (e.g., a primer) in a longer sequence.
    • Uses end-gap-free DP alignment in which each mismatch or gap is penalized by -1.
    • Returns start/end positions and error count.
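
A minimal sketch of this kind of search is given below. It uses the classic approximate-matching recurrence with unit costs, which is equivalent to maximizing a score where every mismatch or gap counts -1. The function name locatePatternSketch, the half-open [start, end) convention, and the tie-breaking rules are assumptions for the example and may not match LocatePattern.

```go
package main

import "fmt"

// locatePatternSketch finds the substring text[start:end] that matches
// pattern with the fewest errors (substitutions, insertions, deletions).
// Leading and trailing text is free, as in end-gap-free alignment.
func locatePatternSketch(pattern, text string) (start, end, errors int) {
	m, n := len(pattern), len(text)
	prevD := make([]int, n+1) // row 0 is all zeros: a match may start anywhere
	prevS := make([]int, n+1) // start position carried along the best path
	for j := range prevS {
		prevS[j] = j
	}
	curD := make([]int, n+1)
	curS := make([]int, n+1)
	for i := 1; i <= m; i++ {
		curD[0], curS[0] = i, 0 // pattern prefix aligned against nothing
		for j := 1; j <= n; j++ {
			d, s := prevD[j-1], prevS[j-1] // diagonal: match or mismatch
			if pattern[i-1] != text[j-1] {
				d++
			}
			if c := prevD[j] + 1; c < d { // pattern base without partner
				d, s = c, prevS[j]
			}
			if c := curD[j-1] + 1; c < d { // text base without partner
				d, s = c, curS[j-1]
			}
			curD[j], curS[j] = d, s
		}
		prevD, curD = curD, prevD
		prevS, curS = curS, prevS
	}
	// The match may end anywhere: keep the first column with the lowest cost.
	start, end, errors = prevS[0], 0, prevD[0]
	for j := 1; j <= n; j++ {
		if prevD[j] < errors {
			start, end, errors = prevS[j], j, prevD[j]
		}
	}
	return start, end, errors
}

func main() {
	start, end, errors := locatePatternSketch("ACGT", "TTAGGTTT")
	fmt.Println(start, end, errors)
}
```

Carrying the start position alongside the cost avoids a separate backtracking pass when only the coordinates and the error count are needed.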

7. Paired-End Read Alignment

  • PEAlign, _FillMatrixPeLeftAlign, etc.: Global alignment of paired-end reads with affine gap penalties.
    • Supports three modes: PELeftAlign, PERightAlign, and PECenterAlign (for overlaps).
    • Integrates k-mer pre-screening (obikmer.Index4mer) for fast overlap estimation.
    • Quality-aware scoring via _PairingScorePeAlign.

8. Consensus & Alignment Reconstruction

  • BuildAlignment, _BuildAlignment: Reconstruct aligned sequences from DP path, reusing buffers.
  • BuildQualityConsensus: Generates consensus with quality-aware base selection:
    • Mismatches resolved by higher-quality call or IUPAC ambiguity.
    • Optional mismatch statistics recording.
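
The base-selection rule can be sketched for a single aligned column. Everything beyond "the higher-quality call wins, otherwise an IUPAC ambiguity code is used" is an assumption of this example: the function name consensusBase, the use of ambiguity codes only on exact quality ties, and the way output qualities are combined (sum on agreement, difference on disagreement) are not taken from BuildQualityConsensus.

```go
package main

import "fmt"

// iupacByMask maps a 4-bit base set (A=1, C=2, G=4, T=8) to its IUPAC code.
const iupacByMask = "-ACMGRSVTWYHKDBN"

// baseMask returns the bit of an upper-case canonical base, or 0 otherwise.
func baseMask(b byte) int {
	switch b {
	case 'A':
		return 1
	case 'C':
		return 2
	case 'G':
		return 4
	case 'T':
		return 8
	}
	return 0
}

// consensusBase merges two aligned base calls with their Phred qualities.
// Output quality rules are assumptions of this sketch (see lead-in).
func consensusBase(b1 byte, q1 int, b2 byte, q2 int) (byte, int) {
	switch {
	case b1 == b2:
		return b1, q1 + q2 // both reads agree
	case q1 > q2:
		return b1, q1 - q2 // first read is more reliable
	case q2 > q1:
		return b2, q2 - q1 // second read is more reliable
	default:
		// Equal qualities: report the ambiguity code covering both bases.
		return iupacByMask[baseMask(b1)|baseMask(b2)], 0
	}
}

func main() {
	b, q := consensusBase('A', 30, 'G', 10)
	fmt.Printf("%c %d\n", b, q)
	b, q = consensusBase('A', 20, 'G', 20)
	fmt.Printf("%c %d\n", b, q)
}
```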

9. Memory & Performance Optimization

  • PEAlignArena: Reusable memory arena for matrices, paths, and buffers.
    • Reduces GC pressure in high-throughput pipelines.
  • Compact uint64 encoding for scores, path lengths, and flags (encodeValues, _incscore).
    • Enables fast comparisons during DP.
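
The benefit of packing several fields into one uint64 can be shown with a small sketch. The field widths and order below (32-bit score in the high bits, 31-bit path length, 1-bit flag) and the names packCell and unpackCell are hypothetical; the layout used by encodeValues is not described in this document and may differ, for example in how ties on path length are ranked.

```go
package main

import "fmt"

const (
	flagBits   = 1
	lengthBits = 31
	lengthMask = (uint64(1) << lengthBits) - 1
)

// packCell stores score, path length and a flag in a single uint64.
// Because the score occupies the most significant bits, comparing two
// packed values compares scores first and path lengths second.
func packCell(score, length uint32, flag bool) uint64 {
	v := uint64(score)<<(lengthBits+flagBits) | (uint64(length)&lengthMask)<<flagBits
	if flag {
		v |= 1
	}
	return v
}

// unpackCell is the inverse of packCell.
func unpackCell(v uint64) (score, length uint32, flag bool) {
	score = uint32(v >> (lengthBits + flagBits))
	length = uint32((v >> flagBits) & lengthMask)
	flag = v&1 == 1
	return score, length, flag
}

func main() {
	a := packCell(10, 5, false)
	b := packCell(9, 100, true)
	fmt.Println(a > b) // the higher score wins regardless of path length
	fmt.Println(unpackCell(b))
}
```

A DP cell update then reduces to one integer comparison and one store instead of comparing and copying several fields, which is the kind of saving the compact encoding is meant to provide.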

Design Principles

  • IUPAC-aware: Handles ambiguous nucleotides via obiseq.SameIUPACNuc.
  • Thread-safe initialization: _InitDNAScoreMatrix uses mutex guards.
  • No allocations in hot paths: Buffers reused across calls (arena pattern).
  • End-gap flexibility: Critical for read merging and primer trimming.

Use Cases

| Functionality | Application |
| --- | --- |
| FastLCSEGFScore, D1Or0 | OTU/ASV clustering, UMI deduplication |
| LocatePattern, PEAlign | Primer trimming, read merging in metabarcoding |
| BuildQualityConsensus, _Backtracking | Consensus generation post-merge |

The package is designed for integration into large-scale NGS pipelines, especially where speed, memory footprint, and numerical robustness are critical.