obitools4/autodoc/docmd/pkg/obialign/fourbitsencode.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


Semantic Description of obialign Package

The obialign package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.

  • Core functionality: Encodes IUPAC nucleotide symbols (including ambiguous codes like R, Y, N) into compact 4-bit binary representations.
  • Binary encoding scheme: Each of the four low-order bits of a byte corresponds to one canonical nucleotide: A (bit 0, 0b0001), C (bit 1, 0b0010), G (bit 2, 0b0100), T (bit 3, 0b1000).
  • Ambiguity support: Codes like R (A/G) set both corresponding bits (0b0101). Fully ambiguous N sets all four bits (0b1111).
  • Gap/missing handling: Symbols . and -, as well as non-nucleotide characters, map to 0b0000.
  • Memory efficiency: The encoder accepts an optional buffer; when one is supplied it is reused, which avoids a fresh allocation for each sequence.
  • Lookup tables:
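    • _FourBitsBaseCode: Maps ASCII nucleotide characters to their binary code. The table is indexed by nuc & 31, which keeps only the low five bits of the character so that upper- and lowercase letters share the same entry.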
    • _FourBitsBaseDecode: Inverse mapping for human-readable output (not exported, used internally).
  • Integration: Works with obiseq.BioSequence, a generic biological sequence container from the OBITools4 ecosystem.
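
The scheme described in the list above can be sketched in a few lines of Go. This is a minimal illustration, not the package's actual source: `fourBitsCode` and `encode4bits` are hypothetical stand-ins for `_FourBitsBaseCode` and `Encode4bits`, the function takes a raw `[]byte` rather than an `obiseq.BioSequence`, and the explicit "is this a letter?" guard is an assumption added so that gaps and other non-nucleotide characters map to 0b0000 as stated.

```go
package main

import "fmt"

// One bit per canonical nucleotide, following the layout in the text.
const (
	bitA byte = 1 << 0 // 0b0001
	bitC byte = 1 << 1 // 0b0010
	bitG byte = 1 << 2 // 0b0100
	bitT byte = 1 << 3 // 0b1000
)

// fourBitsCode is a hypothetical stand-in for _FourBitsBaseCode.
// It has 32 entries so that every value of nuc & 31 is a valid index;
// letters that are not IUPAC nucleotide codes stay at 0b0000.
var fourBitsCode = buildTable()

func buildTable() [32]byte {
	var t [32]byte
	iupac := map[byte]byte{
		'a': bitA, 'c': bitC, 'g': bitG, 't': bitT,
		'r': bitA | bitG, 'y': bitC | bitT,
		's': bitC | bitG, 'w': bitA | bitT,
		'k': bitG | bitT, 'm': bitA | bitC,
		'b': bitC | bitG | bitT, 'd': bitA | bitG | bitT,
		'h': bitA | bitC | bitT, 'v': bitA | bitC | bitG,
		'n': bitA | bitC | bitG | bitT,
	}
	for letter, code := range iupac {
		t[letter&31] = code // 'a'&31 == 'A'&31 == 1, and so on
	}
	return t
}

// encode4bits is a simplified, hypothetical counterpart of Encode4bits.
// If buffer has enough capacity it is reused; otherwise a new slice is made.
func encode4bits(seq []byte, buffer []byte) []byte {
	if cap(buffer) < len(seq) {
		buffer = make([]byte, len(seq))
	} else {
		buffer = buffer[:len(seq)]
	}
	for i, nuc := range seq {
		lower := nuc | 0x20 // fold ASCII letters to lowercase
		if lower < 'a' || lower > 'z' {
			// Assumption: '.', '-' and any other non-letter encode as 0b0000.
			buffer[i] = 0
			continue
		}
		buffer[i] = fourBitsCode[nuc&31]
	}
	return buffer
}

func main() {
	buf := make([]byte, 0, 16) // reused across calls
	for _, s := range []string{"ACGT", "acgt", "RYN", "A-C."} {
		buf = encode4bits([]byte(s), buf)
		fmt.Printf("%-4s -> %04b\n", s, buf)
	}
}
```

Passing the same `buf` back into each call is what keeps the loop allocation-free once the buffer is large enough for the longest sequence.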

The Encode4bits function enables fast, space-efficient sequence processing, which suits high-throughput sequencing data where alignment speed and memory usage are critical.
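
A general property of one-bit-per-base codes helps explain why they are convenient for alignment: two symbols can denote the same nucleotide exactly when their codes share a set bit, so a compatibility test is a single bitwise AND. The sketch below illustrates that property only. The text above does not say how obialign consumes the codes, and `compatible` and `countCompatible` are hypothetical helper names, not part of the package.

```go
package main

import "fmt"

// Codes written out by hand using the bit layout described earlier:
// A = bit 0, C = bit 1, G = bit 2, T = bit 3.
const (
	codeA   byte = 0b0001
	codeC   byte = 0b0010
	codeG   byte = 0b0100
	codeT   byte = 0b1000
	codeR   byte = codeA | codeG // A or G
	codeY   byte = codeC | codeT // C or T
	codeN   byte = 0b1111        // any base
	codeGap byte = 0b0000        // '.', '-' or non-nucleotide
)

// compatible reports whether two encoded symbols share at least one base.
// A gap (0b0000) is never compatible with anything under this definition.
func compatible(x, y byte) bool {
	return x&y != 0
}

// countCompatible counts compatible positions over the common prefix
// of two encoded sequences.
func countCompatible(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	count := 0
	for i := 0; i < n; i++ {
		if compatible(a[i], b[i]) {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("R vs A:", compatible(codeR, codeA))
	fmt.Println("R vs C:", compatible(codeR, codeC))
	fmt.Println("N vs gap:", compatible(codeN, codeGap))

	a := []byte{codeA, codeR, codeGap}
	b := []byte{codeN, codeY, codeT}
	fmt.Println("compatible positions:", countCompatible(a, b))
}
```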