mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,28 @@
# Semantic Description of `obikmer` Package Functionalities

The `obikmer` package provides disk-backed operations on *k*-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the `KmerSetGroup` type.
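
To make the notion of a *k*-mer set concrete, here is a minimal in-memory sketch in Go. The helper `kmerSet` is hypothetical and is not part of `obikmer`; it ignores minimizers, reverse complements, and disk storage, and only shows how one sequence maps to a set of distinct *k*-mers.

```go
package main

import (
	"fmt"
	"sort"
)

// kmerSet returns the distinct k-mers of seq as an in-memory set.
// Hypothetical helper for illustration only; not the obikmer API.
func kmerSet(seq string, k int) map[string]struct{} {
	set := make(map[string]struct{})
	if k <= 0 {
		return set
	}
	// Slide a window of width k over the sequence.
	for i := 0; i+k <= len(seq); i++ {
		set[seq[i:i+k]] = struct{}{}
	}
	return set
}

func main() {
	set := kmerSet("ACGTACGT", 3)

	// Sort the k-mers so the printed order is deterministic.
	kmers := make([]string, 0, len(set))
	for km := range set {
		kmers = append(kmers, km)
	}
	sort.Strings(kmers)
	fmt.Println(kmers) // prints [ACG CGT GTA TAC]
}
```

A sequence shorter than `k` yields an empty set in this sketch.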

## Core Features

- **Sequence-to-*k*-mer Indexing**: Sequences are converted into *k*-mers (of length `k`) and stored in a group of sets (`KmerSetGroup`), with one set per sequence. Minimizer-based sampling (parameter `m`) reduces redundancy.

- **Set Operations on Disk**: Efficient disk-resident implementations of standard set operations:
  - `Union`: Merges all *k*-mers from the selected sets.
  - `Intersect`: Retains only *k*-mers present in all input sets.
  - `Difference` (`A \ B`): Keeps *k*-mers present in set A but not in set B.
  - `QuorumAtLeast(r)`: Returns *k*-mers appearing in at least `r` of the `n` input sets; this generalizes union (`r = 1`) and intersection (`r = n`).

- **Consistency Guarantees**: Operations obey mathematical identities (e.g., `|A ∪ B| = |A| + |B| − |A ∩ B|`), validated via unit tests.

- **Similarity & Distance Metrics**:
  - `JaccardDistanceMatrix()`: Computes pairwise Jaccard *distances* (1 − similarity) between all sets.
  - `JaccardSimilarityMatrix()`: Computes pairwise Jaccard *similarities* (`|A ∩ B| / |A ∪ B|`).
  - Identical sets yield a distance of `0.0` and disjoint sets a distance of `1.0`; similarity is the complement (`1 − distance`).
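
The semantics of the operations listed above can be sketched with plain in-memory Go maps. This is an illustration under stated assumptions, not the package's disk-backed implementation: the `Set` type and the lowercase function names are hypothetical stand-ins, and treating two empty sets as having similarity `1.0` is a convention chosen here, not something the source specifies.

```go
package main

import "fmt"

// Set is an in-memory stand-in for one k-mer set (illustration only).
type Set map[string]struct{}

func newSet(items ...string) Set {
	s := make(Set, len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

// quorumAtLeast keeps the elements present in at least r of the given sets.
func quorumAtLeast(r int, sets ...Set) Set {
	counts := make(map[string]int)
	for _, s := range sets {
		for km := range s {
			counts[km]++
		}
	}
	out := make(Set)
	for km, c := range counts {
		if c >= r {
			out[km] = struct{}{}
		}
	}
	return out
}

// Union and intersection are the two extremes of the quorum operation.
func union(sets ...Set) Set     { return quorumAtLeast(1, sets...) }
func intersect(sets ...Set) Set { return quorumAtLeast(len(sets), sets...) }

// difference keeps the elements of a that are absent from b (A \ B).
func difference(a, b Set) Set {
	out := make(Set)
	for km := range a {
		if _, found := b[km]; !found {
			out[km] = struct{}{}
		}
	}
	return out
}

// jaccardSimilarity returns |A ∩ B| / |A ∪ B|.
// Assumption: two empty sets are treated as identical (similarity 1).
func jaccardSimilarity(a, b Set) float64 {
	u := len(union(a, b))
	if u == 0 {
		return 1.0
	}
	return float64(len(intersect(a, b))) / float64(u)
}

func jaccardDistance(a, b Set) float64 { return 1.0 - jaccardSimilarity(a, b) }

func main() {
	a := newSet("ACG", "CGT", "GTA")
	b := newSet("CGT", "GTA", "TAC")
	c := newSet("GTA", "TTT")

	fmt.Println(len(union(a, b)), len(intersect(a, b)), len(difference(a, b))) // prints 4 2 1
	fmt.Println(len(quorumAtLeast(2, a, b, c)))                                // prints 2
	fmt.Println(jaccardSimilarity(a, b), jaccardDistance(a, a))                // prints 0.5 0
}
```

With these example sets, `|A ∪ B| = 4 = 3 + 3 − 2`, which matches the inclusion-exclusion identity quoted in the list.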

## Design Principles

- **Temporary Directory Usage**: All operations use OS temporary directories for isolation and cleanup.
- **Testing-Focused API**: Helper functions (`buildGroupFromSeqs`, `collectKmers`) simplify test setup.
- **Scalability**: The disk-backed design avoids exhausting memory on large sequence collections.

This package enables robust, reproducible *k*-mer set analysis in bioinformatics pipelines, and is especially useful for metagenomic binning, error correction, or read clustering.