Semantic Description of obikmer Package Functionalities
The obikmer package provides disk-backed operations on k-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the KmerSetGroup type.
Core Features
- Sequence-to-k-mer Indexing: Sequences are converted into k-mers (of length k) and stored in a group of sets (KmerSetGroup), with one set per sequence. Minimizer-based sampling (parameter m) reduces redundancy.
- Set Operations on Disk: Efficient disk-resident implementations of standard set operations:
  - Union: Merges all k-mers from selected sets.
  - Intersect: Retains only k-mers present in all input sets.
  - Difference (A \ B): Keeps k-mers present in set A but not in B.
  - QuorumAtLeast(r): Returns k-mers appearing in ≥ r sets (generalizes union (r=1) and intersection (r=n)).
- Consistency Guarantees: Operations obey mathematical identities (e.g., |A ∪ B| = |A| + |B| − |A ∩ B|), validated via unit tests.
- Similarity & Distance Metrics:
  - JaccardDistanceMatrix(): Computes pairwise Jaccard distances (1 − similarity) between all sets.
  - JaccardSimilarityMatrix(): Computes pairwise Jaccard similarities (|A ∩ B| / |A ∪ B|).
  - Identical sets yield distance = 0.0, disjoint ones give 1.0; similarity is complementary.
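The semantics of the operations listed above can be illustrated with a small in-memory sketch. It is not the obikmer implementation: the names KmerSet, kmersOf, quorumAtLeast, difference, and jaccardSimilarity are hypothetical, k-mers are kept as plain strings in a Go map rather than on disk, and minimizer sampling is omitted. It only shows how quorum generalizes union and intersection, and how the inclusion-exclusion identity and the Jaccard definitions fit together.

```go
package main

import "fmt"

// KmerSet is a hypothetical in-memory stand-in for one disk-backed k-mer set.
type KmerSet map[string]struct{}

// kmersOf returns the set of all substrings of length k in seq.
func kmersOf(seq string, k int) KmerSet {
	set := KmerSet{}
	for i := 0; i+k <= len(seq); i++ {
		set[seq[i:i+k]] = struct{}{}
	}
	return set
}

// quorumAtLeast keeps the k-mers present in at least r of the given sets.
// r = 1 gives the union, r = len(sets) gives the intersection.
func quorumAtLeast(sets []KmerSet, r int) KmerSet {
	counts := map[string]int{}
	for _, s := range sets {
		for kmer := range s {
			counts[kmer]++
		}
	}
	out := KmerSet{}
	for kmer, c := range counts {
		if c >= r {
			out[kmer] = struct{}{}
		}
	}
	return out
}

// difference returns the k-mers of a that are absent from b (A \ B).
func difference(a, b KmerSet) KmerSet {
	out := KmerSet{}
	for kmer := range a {
		if _, found := b[kmer]; !found {
			out[kmer] = struct{}{}
		}
	}
	return out
}

// jaccardSimilarity returns |A ∩ B| / |A ∪ B|.
// Assumption: two empty sets are treated as identical (similarity 1).
func jaccardSimilarity(a, b KmerSet) float64 {
	pair := []KmerSet{a, b}
	union := len(quorumAtLeast(pair, 1))
	if union == 0 {
		return 1.0
	}
	inter := len(quorumAtLeast(pair, 2))
	return float64(inter) / float64(union)
}

func main() {
	a := kmersOf("ACGTAC", 3) // ACG, CGT, GTA, TAC
	b := kmersOf("CGTACC", 3) // CGT, GTA, TAC, ACC
	sets := []KmerSet{a, b}

	union := quorumAtLeast(sets, 1)
	inter := quorumAtLeast(sets, len(sets))

	fmt.Println("union:", len(union), "intersection:", len(inter), "A\\B:", len(difference(a, b)))
	fmt.Println("inclusion-exclusion holds:", len(union) == len(a)+len(b)-len(inter))
	fmt.Println("Jaccard distance:", 1-jaccardSimilarity(a, b))
}
```

Counting occurrences once and thresholding on r is one simple way to get union, intersection, and quorum from the same pass; a disk-backed version would presumably stream sorted k-mers instead of holding a map in memory.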
Design Principles
- Temporary Directory Usage: All operations use OS temp dirs for isolation and cleanup.
- Testing-Focused API: Helper functions (buildGroupFromSeqs, collectKmers) simplify test setup.
- Scalability: Disk-backed design avoids memory overflow for large sequence collections.
This package enables robust, reproducible k-mer set analysis in bioinformatics pipelines and is especially useful for metagenomic binning, error correction, or read clustering.