mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
29 lines
1.8 KiB
Markdown
# Semantic Description of `obikmer` Package Functionalities

The `obikmer` package provides disk-backed operations on *k*-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the `KmerSetGroup` type.
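
To make the notion of a *k*-mer set concrete, here is a minimal in-memory sketch in Go. It is not the package's implementation: `kmerSet` is a hypothetical helper, it ignores reverse complements and minimizers, and it keeps everything in a map rather than on disk.

```go
package main

import (
	"fmt"
	"sort"
)

// kmerSet returns the distinct substrings of length k found in seq.
// Hypothetical helper for illustration only; it is not part of obikmer.
func kmerSet(seq string, k int) map[string]struct{} {
	set := make(map[string]struct{})
	if k <= 0 {
		return set
	}
	for i := 0; i+k <= len(seq); i++ {
		set[seq[i:i+k]] = struct{}{}
	}
	return set
}

func main() {
	set := kmerSet("ACGTACGT", 4)

	// Sort the k-mers so the output order is deterministic.
	kmers := make([]string, 0, len(set))
	for kmer := range set {
		kmers = append(kmers, kmer)
	}
	sort.Strings(kmers)

	fmt.Println(kmers) // prints [ACGT CGTA GTAC TACG]
}
```

The sequence contains five overlapping 4-mers, but `ACGT` occurs twice, so the set holds four elements.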
## Core Features

- **Sequence-to-*k*-mer Indexing**: Sequences are converted into *k*-mers (of length `k`) and stored in a group of sets (`KmerSetGroup`), with one set per sequence. Minimizer-based sampling (parameter `m`) reduces redundancy.

- **Set Operations on Disk**: Efficient disk-resident implementations of standard set operations:
  - `Union`: Merges all *k*-mers from selected sets.
  - `Intersect`: Retains only *k*-mers present in all input sets.
  - `Difference` (`A \ B`): Keeps *k*-mers present in set A but not in B.
  - `QuorumAtLeast(r)`: Returns *k*-mers appearing in at least `r` of the `n` input sets; `r = 1` gives the union and `r = n` the intersection.

- **Consistency Guarantees**: Operations obey mathematical identities (e.g., `|A ∪ B| = |A| + |B| − |A ∩ B|`), validated via unit tests.

- **Similarity & Distance Metrics**:
  - `JaccardDistanceMatrix()`: Computes pairwise Jaccard *distances* (1 − similarity) between all sets.
  - `JaccardSimilarityMatrix()`: Computes pairwise Jaccard *similarities* (`|A ∩ B| / |A ∪ B|`).
  - Identical sets yield a distance of `0.0` and disjoint sets a distance of `1.0`; similarity is the complement (`1 − distance`).
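
The relationship between `QuorumAtLeast(r)`, union, and intersection, together with the inclusion-exclusion identity listed above, can be checked on small in-memory sets. The sketch below is an illustration under stated assumptions: `set`, `newSet`, and `quorumAtLeast` are hypothetical names operating on Go maps, not the disk-backed methods of `KmerSetGroup`, whose signatures are not shown in this document.

```go
package main

import "fmt"

// set is a plain in-memory set of k-mers (an assumption of this sketch).
type set map[string]struct{}

// newSet builds a set from a list of k-mers.
func newSet(kmers ...string) set {
	s := make(set)
	for _, k := range kmers {
		s[k] = struct{}{}
	}
	return s
}

// quorumAtLeast returns the k-mers present in at least r of the given sets.
func quorumAtLeast(r int, sets ...set) set {
	counts := make(map[string]int)
	for _, s := range sets {
		for k := range s {
			counts[k]++ // a set contributes at most once per k-mer
		}
	}
	out := make(set)
	for k, c := range counts {
		if c >= r {
			out[k] = struct{}{}
		}
	}
	return out
}

func main() {
	a := newSet("ACG", "CGT", "GTA")
	b := newSet("CGT", "GTA", "TAC")

	union := quorumAtLeast(1, a, b) // r = 1: union
	inter := quorumAtLeast(2, a, b) // r = n: intersection

	// Inclusion-exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|
	fmt.Println(len(union), len(a)+len(b)-len(inter)) // prints 4 4
}
```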
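
The Jaccard definitions above translate directly into a pairwise distance matrix. The following Go sketch works on in-memory maps and uses hypothetical names (`jaccardSimilarity`, `jaccardDistanceMatrix`); it does not reproduce the package's disk-backed `JaccardDistanceMatrix()`. Treating two empty sets as identical is an assumption of the sketch.

```go
package main

import "fmt"

// set is a plain in-memory set of k-mers (an assumption of this sketch).
type set map[string]struct{}

// jaccardSimilarity computes |A ∩ B| / |A ∪ B|.
// Two empty sets are treated as identical (similarity 1) by convention here.
func jaccardSimilarity(a, b set) float64 {
	inter := 0
	for k := range a {
		if _, ok := b[k]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1.0
	}
	return float64(inter) / float64(union)
}

// jaccardDistanceMatrix returns the symmetric matrix of 1 − similarity,
// with zeros on the diagonal.
func jaccardDistanceMatrix(sets []set) [][]float64 {
	n := len(sets)
	m := make([][]float64, n)
	for i := range m {
		m[i] = make([]float64, n)
	}
	for i := 0; i < n; i++ {
		for j := i + 1; j < n; j++ {
			d := 1.0 - jaccardSimilarity(sets[i], sets[j])
			m[i][j], m[j][i] = d, d
		}
	}
	return m
}

func main() {
	sets := []set{
		{"ACG": {}, "CGT": {}},
		{"ACG": {}, "CGT": {}}, // identical to the first set
		{"TTT": {}, "GGG": {}}, // disjoint from the first set
	}
	m := jaccardDistanceMatrix(sets)
	fmt.Println(m[0][1], m[0][2]) // prints 0 1
}
```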
## Design Principles

- **Temporary Directory Usage**: All operations use OS temporary directories for isolation and cleanup.
- **Testing-Focused API**: Helper functions (`buildGroupFromSeqs`, `collectKmers`) simplify test setup.
- **Scalability**: The disk-backed design avoids exhausting memory on large sequence collections.

This package enables robust, reproducible *k*-mer set analysis in bioinformatics pipelines and is especially useful for metagenomic binning, error correction, or read clustering.
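
The temporary-directory principle listed under Design Principles can be illustrated with the Go standard library alone. This sketch assumes nothing about how `obikmer` lays out its files: the `withTempDir` and `exists` helpers and the `kmers.bin` file name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// withTempDir creates a fresh directory under the OS temp location, runs fn
// with its path, and removes the directory afterwards, even if fn fails.
func withTempDir(fn func(dir string) error) error {
	dir, err := os.MkdirTemp("", "kmer-sketch-*")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	return fn(dir)
}

// exists reports whether path is present on disk.
func exists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	var used string
	err := withTempDir(func(dir string) error {
		used = dir
		// Stand-in for a disk-backed result file.
		return os.WriteFile(filepath.Join(dir, "kmers.bin"), []byte("ACGT"), 0o600)
	})
	fmt.Println(err, exists(used)) // prints <nil> false
}
```

In test code, `t.TempDir()` from the standard `testing` package offers the same isolation, with the directory removed automatically when the test finishes.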