mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,35 @@
# Semantic Description of `obikmer` Package

The `obikmer` package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.

## Core Functionalities

1. **Nucleotide Encoding**
   - `EncodeNucleotide(b byte)`: Maps DNA bases (A, C, G, T/U) to 2-bit values:
     `A→0`, `C→1`, `G→2`, `T/U→3`.
     Ambiguous or non-standard characters (e.g., N, R, Y) default to `A` (`0`).
     Uses a lookup table, so each base is encoded in O(1) time.
2. **4-mer Encoding**
   - `Encode4mer(seq, buffer)`: Converts a biological sequence into its overlapping 4-mers.
     Each 4-mer is encoded as an unsigned byte (0–255), with each nucleotide contributing 2 bits.
     Supports optional buffer reuse for memory efficiency.
3. **4-mer Indexing**
   - `Index4mer(seq, index, buffer)`: Builds an inverted index mapping each 4-mer code (0–255) to all of its occurrence positions in the sequence.
     Returns `[][]int`, where each row corresponds to a 4-mer code and lists the positions at which that 4-mer occurs.
4. **Fast Sequence Comparison**
   - `FastShiftFourMer(...)`: Compares two sequences using a FASTA-like shift-scoring algorithm.
     - Uses a precomputed 4-mer index of the reference sequence and encodes the query.
     - Counts co-occurring 4-mers for every possible shift (`refPos − queryPos`).
     - Computes raw and relative (alignment-length-normalized) scores.
     - Returns the optimal shift, the count of matching 4-mers, and the maximum score (raw or relative).

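
The encoding and indexing steps listed above can be illustrated with a minimal Go sketch. It is not the package's actual code: the lowercase names `encodeNucleotide`, `encode4mer`, and `index4mer`, their simplified signatures, and the handling of lowercase bases are assumptions made for illustration only.

```go
package main

// nucCode is a 256-entry lookup table. Entries not set below stay 0,
// so any unlisted byte (N, R, Y, ...) is encoded like 'A'.
// Accepting lowercase bases is an assumption of this sketch.
var nucCode = func() [256]byte {
	var t [256]byte
	for b, code := range map[byte]byte{'C': 1, 'G': 2, 'T': 3, 'U': 3} {
		t[b] = code
		t[b|0x20] = code // lowercase form of the same letter
	}
	return t
}()

// encodeNucleotide maps a base to its 2-bit code (A=0, C=1, G=2, T/U=3).
func encodeNucleotide(b byte) byte { return nucCode[b] }

// encode4mer returns one byte per overlapping 4-mer of seq.
// A non-nil buffer is truncated and reused to avoid a new allocation.
func encode4mer(seq []byte, buffer []byte) []byte {
	out := buffer[:0]
	if len(seq) < 4 {
		return out
	}
	// Pack the first three bases; the loop below shifts in one base at a time.
	var code byte
	for _, b := range seq[:3] {
		code = code<<2 | encodeNucleotide(b)
	}
	for _, b := range seq[3:] {
		// Shifting a byte left by 2 drops the oldest base automatically.
		code = code<<2 | encodeNucleotide(b)
		out = append(out, code)
	}
	return out
}

// index4mer maps each 4-mer code (0..255) to the positions where it starts.
func index4mer(seq []byte) [][]int {
	index := make([][]int, 256)
	for pos, code := range encode4mer(seq, nil) {
		index[code] = append(index[code], pos)
	}
	return index
}
```

For instance, under these assumptions `encode4mer([]byte("ACGT"), nil)` yields a single code whose bits are `00 01 10 11`, and `index4mer` stores position `0` in the row for that code. Because the rolling code is a single byte, each new base costs one shift and one table lookup, which is consistent with the O(n) behavior described for the package.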
## Design Highlights

- **Memory-aware**: Supports buffer reuse to minimize allocations.
- **Robustness**: Non-canonical bases are handled gracefully (defaulting to `A`).
- **Performance-oriented**: O(n) encoding and indexing; efficient hash-based shift counting.

Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.
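
The shift-counting idea behind the fast comparison can be sketched as follows. This is a hedged illustration rather than the `FastShiftFourMer` implementation: `code2`, `fourMers`, and `bestShift` are hypothetical names, the tie-breaking rule is an arbitrary choice, and the relative (length-normalized) score is left out because the description above does not give its formula.

```go
package main

// code2 returns an assumed 2-bit code: A=0, C=1, G=2, T/U=3, anything else 0.
func code2(b byte) byte {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// fourMers returns the byte code of every overlapping 4-mer in seq.
func fourMers(seq string) []byte {
	var out []byte
	var code byte
	for i := 0; i < len(seq); i++ {
		code = code<<2 | code2(seq[i]) // byte overflow drops the oldest base
		if i >= 3 {
			out = append(out, code)
		}
	}
	return out
}

// bestShift counts shared 4-mers per shift (refPos - queryPos) and returns
// the shift with the highest raw count together with that count.
func bestShift(ref, query string) (shift, count int) {
	// Inverted index of the reference: 4-mer code -> start positions.
	var index [256][]int
	for pos, c := range fourMers(ref) {
		index[c] = append(index[c], pos)
	}
	// Hash-based tally: one counter per observed shift.
	shifts := make(map[int]int)
	for qpos, c := range fourMers(query) {
		for _, rpos := range index[c] {
			shifts[rpos-qpos]++
		}
	}
	for s, n := range shifts {
		// Ties go to the smaller shift so the result does not depend on map order.
		if n > count || (n == count && s < shift) {
			shift, count = s, n
		}
	}
	return shift, count
}
```

With `ref = "GGGGACGTCA"` and `query = "ACGTCA"`, all three query 4-mers occur in the reference four positions further along, so `bestShift` returns a shift of 4 with a count of 3. When no 4-mer is shared, the tally stays empty and the function returns `0, 0`.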