# Semantic Description of `obikmer` Package Functionalities

The `obikmer` package provides disk-backed operations on *k*-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the `KmerSetGroup` type.

## Core Features

- **Sequence-to-*k*-mer Indexing**: Sequences are converted into *k*-mers (of length `k`) and stored in a group of sets (`KmerSetGroup`), with one set per sequence. Minimizer-based sampling (parameter `m`) reduces redundancy.
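
As a rough illustration of the terms used above, the following Python sketch enumerates the *k*-mers of a sequence and computes a minimizer for each one. The helper names `kmers` and `minimizer`, and the choice of the lexicographically smallest `m`-mer as the minimizer, are assumptions made for this example. How the package applies `m` internally is not described here.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return every overlapping k-mer of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest m-mer contained in kmer.

    Assumption: lexicographic order is used here for simplicity; real
    implementations often order m-mers by a hash instead.
    """
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


seq = "ACGTACGGT"
k, m = 5, 3

# One set per sequence: duplicates within a sequence collapse.
kmer_set = set(kmers(seq, k))

for km in kmers(seq, k):
    print(km, minimizer(km, m))
```

Neighbouring *k*-mers frequently share a minimizer, which is why minimizers are commonly used to group or subsample *k*-mers.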

- **Set Operations on Disk**: Efficient disk-resident implementations of standard set operations:
  - `Union`: Merges all *k*-mers from selected sets.
  - `Intersect`: Retains only *k*-mers present in all input sets.
  - `Difference` (`A \ B`): Keeps *k*-mers present in set A but not in B.
  - `QuorumAtLeast(r)`: Returns *k*-mers appearing in at least `r` of the `n` input sets. It generalizes both union (`r = 1`) and intersection (`r = n`).
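
The quorum semantics can be sketched with ordinary in-memory Python sets. The function `quorum_at_least` below is a hypothetical stand-in that only mirrors the behaviour described above. It is not the package's disk-backed implementation.

```python
from collections import Counter


def quorum_at_least(sets: list[set[str]], r: int) -> set[str]:
    """Return the elements that occur in at least r of the given sets."""
    counts = Counter()
    for s in sets:
        counts.update(s)  # a set contributes at most 1 per element
    return {x for x, c in counts.items() if c >= r}


a = {"ACG", "CGT", "GTA"}
b = {"CGT", "GTA", "TAC"}
c = {"GTA", "TAC", "ACC"}
groups = [a, b, c]

union = quorum_at_least(groups, 1)                   # r = 1: union
majority = quorum_at_least(groups, 2)                # in at least 2 of 3 sets
intersection = quorum_at_least(groups, len(groups))  # r = n: intersection
difference = a - b                                   # A \ B
```

Counting set memberships per element makes the two limiting cases fall out directly: every element has a count of at least 1, and only elements shared by all sets reach a count of `n`.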

- **Consistency Guarantees**: Operations obey mathematical identities (e.g., `|A ∪ B| = |A| + |B| − |A ∩ B|`), validated via unit tests.
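
A check of this kind might look like the following randomized test, written here against plain Python sets. The helpers `random_kmer_set` and `check_identities` are hypothetical. A real test would run the same comparisons against the cardinalities reported by the disk-backed operations.

```python
import random


def random_kmer_set(rng: random.Random, k: int, size: int) -> set[str]:
    """Draw `size` random DNA k-mers; duplicates collapse in the set."""
    return {"".join(rng.choice("ACGT") for _ in range(k)) for _ in range(size)}


def check_identities(a: set[str], b: set[str]) -> bool:
    """Check inclusion-exclusion and the matching identity for A \\ B."""
    inclusion_exclusion = len(a | b) == len(a) + len(b) - len(a & b)
    difference = len(a - b) == len(a) - len(a & b)
    return inclusion_exclusion and difference


rng = random.Random(42)  # fixed seed so the check is reproducible
for _ in range(100):
    a = random_kmer_set(rng, k=3, size=20)
    b = random_kmer_set(rng, k=3, size=20)
    assert check_identities(a, b)
```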

- **Similarity & Distance Metrics**:
  - `JaccardDistanceMatrix()`: Computes pairwise Jaccard *distances* (1 − similarity) between all sets.
  - `JaccardSimilarityMatrix()`: Computes pairwise Jaccard *similarities* (`|A ∩ B| / |A ∪ B|`).
  - Identical sets yield a distance of `0.0` and disjoint sets a distance of `1.0`; the corresponding similarities are `1.0` and `0.0`.
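
The two matrices can be illustrated with a small in-memory sketch. The function names below are hypothetical Python counterparts of the methods listed above, and treating two empty sets as identical is an assumption of this example rather than documented package behaviour.

```python
def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / union


def jaccard_similarity_matrix(sets: list[set[str]]) -> list[list[float]]:
    """Pairwise Jaccard similarities between all sets."""
    return [[jaccard_similarity(x, y) for y in sets] for x in sets]


def jaccard_distance_matrix(sets: list[set[str]]) -> list[list[float]]:
    """Pairwise Jaccard distances, defined as 1 - similarity."""
    return [[1.0 - s for s in row] for row in jaccard_similarity_matrix(sets)]


a = {"ACG", "CGT", "GTA"}
b = {"CGT", "GTA", "TAC"}  # shares 2 of 4 distinct k-mers with a
c = {"TTT"}                # disjoint from both a and b

dist = jaccard_distance_matrix([a, b, c])
sim = jaccard_similarity_matrix([a, b, c])
```

Both matrices are symmetric. The distance matrix has zeros on its diagonal, and the similarity matrix has ones.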

## Design Principles

- **Temporary Directory Usage**: All operations write to OS temporary directories, which isolates runs from one another and simplifies cleanup.
- **Testing-Focused API**: Helper functions (`buildGroupFromSeqs`, `collectKmers`) simplify test setup.
- **Scalability**: The disk-backed design avoids exhausting memory on large sequence collections.

This package supports reproducible *k*-mer set analysis in bioinformatics pipelines, for example in metagenomic binning, error correction, or read clustering.