mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
# Semantic Description of `obialign` Package
The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
- **Binary encoding scheme**: Each of the four low-order bits of a code corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).
- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
- **Memory efficiency**: An optional caller-supplied buffer can be reused across calls, so encoding does not have to allocate a new slice each time.
- **Lookup tables**:
  - `_FourBitsBaseCode`: Maps ASCII nucleotide characters to their binary code. The index `nuc & 31` keeps only the low five bits of the character, so upper- and lowercase letters share one entry.
  - `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
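The bit layout and table lookup described in the list above can be sketched as follows. This is a minimal illustration, not the package's actual code: the names `fourBitsCode` and `encode4bits`, the plain `[]byte` input (instead of `obiseq.BioSequence`), and the choice of one code per output byte are all assumptions made for the example.

```go
package main

import "fmt"

// One bit per canonical nucleotide, following the layout described above.
const (
	bitA byte = 1 << 0
	bitC byte = 1 << 1
	bitG byte = 1 << 2
	bitT byte = 1 << 3
)

// fourBitsCode is a hypothetical 32-entry table indexed by (nuc & 31).
// Letters without an IUPAC nucleotide meaning keep the zero value.
var fourBitsCode = func() [32]byte {
	var table [32]byte
	iupac := map[byte]byte{
		'A': bitA, 'C': bitC, 'G': bitG, 'T': bitT,
		'R': bitA | bitG, 'Y': bitC | bitT,
		'S': bitC | bitG, 'W': bitA | bitT,
		'K': bitG | bitT, 'M': bitA | bitC,
		'B': bitC | bitG | bitT, 'D': bitA | bitG | bitT,
		'H': bitA | bitC | bitT, 'V': bitA | bitC | bitG,
		'N': bitA | bitC | bitG | bitT,
	}
	for nuc, code := range iupac {
		table[nuc&31] = code // 'A' and 'a' both land on index 1
	}
	return table
}()

// encode4bits (hypothetical name) writes one 4-bit code per input byte.
// The buffer is reused when its capacity is sufficient; otherwise a new
// slice is allocated.
func encode4bits(seq []byte, buffer []byte) []byte {
	if cap(buffer) < len(seq) {
		buffer = make([]byte, len(seq))
	}
	buffer = buffer[:len(seq)]
	for i, nuc := range seq {
		if nuc == '.' || nuc == '-' {
			buffer[i] = 0 // gap symbols are handled before the table lookup
			continue
		}
		buffer[i] = fourBitsCode[nuc&31]
	}
	return buffer
}

func main() {
	for _, code := range encode4bits([]byte("ACGTRYN-acgt"), nil) {
		fmt.Printf("%04b ", code)
	}
	fmt.Println()
	// prints: 0001 0010 0100 1000 0101 1010 1111 0000 0001 0010 0100 1000
}
```

In this sketch the gap symbols are tested explicitly because the five-bit mask alone would not separate them from letters: `'-' & 31` equals `'M' & 31` (13) and `'.' & 31` equals `'N' & 31` (14).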
The `Encode4bits` function enables fast, space-efficient sequence processing, which suits high-throughput sequencing data where alignment speed and memory usage are critical.
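For completeness, the inverse lookup can be sketched in the same spirit. The table below is derived directly from the bit layout (A = bit 0, C = bit 1, G = bit 2, T = bit 3); the names `decodeTable` and `decode4bits`, the uppercase output, and the choice of `-` for the all-zero code are assumptions of this example rather than a description of `_FourBitsBaseDecode`.

```go
package main

import "fmt"

// decodeTable is a hypothetical inverse table. The index is the 4-bit code
// and the byte at that position is the IUPAC symbol for that set of bases.
// Code 0 is rendered as '-' here by choice.
const decodeTable = "-ACMGRSVTWYHKDBN"

// decode4bits (hypothetical name) turns a slice of 4-bit codes back into
// IUPAC text. Only the low four bits of each byte are used.
func decode4bits(codes []byte) string {
	out := make([]byte, len(codes))
	for i, code := range codes {
		out[i] = decodeTable[code&0b1111]
	}
	return string(out)
}

func main() {
	codes := []byte{0b0001, 0b0101, 0b1010, 0b1111, 0b0000}
	fmt.Println(decode4bits(codes)) // prints "ARYN-"
}
```

Because every subset of the four bases has exactly one IUPAC symbol, a 16-entry table is enough to make decoding a single indexed read per position.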