autodoc/docmd/pkg/obikmer/kmer_set_disk_ops_test.md

# Semantic Description of `obikmer` Package Functionalities

The `obikmer` package provides disk-backed operations on *k*-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the `KmerSetGroup` type.

## Core Features

- **Sequence-to-*k*-mer Indexing**: Sequences are converted into *k*-mers (of length `k`) and stored in a group of sets (`KmerSetGroup`), with one set per sequence. Minimizer-based sampling (parameter `m`) reduces redundancy.

- **Set Operations on Disk**: Efficient disk-resident implementations of standard set operations:
  - `Union`: Merges all *k*-mers from selected sets.
  - `Intersect`: Retains only *k*-mers present in all input sets.
  - `Difference` (`A \ B`): Keeps *k*-mers present in set A but not in B.
  - `QuorumAtLeast(r)`: Returns *k*-mers appearing in at least `r` of the `n` input sets; `r=1` gives the union and `r=n` the intersection.

- **Consistency Guarantees**: Operations obey mathematical identities (e.g., `|A ∪ B| = |A| + |B| − |A ∩ B|`), validated via unit tests.

- **Similarity & Distance Metrics**:
  - `JaccardDistanceMatrix()`: Computes pairwise Jaccard *distances* (1 − similarity) between all sets.
  - `JaccardSimilarityMatrix()`: Computes pairwise Jaccard *similarities* (`|A ∩ B| / |A ∪ B|`).
  - Identical sets yield a distance of `0.0` and disjoint sets a distance of `1.0`; the corresponding similarities are `1.0` and `0.0`.
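
The relationships listed above (quorum as a generalization of union and intersection, the inclusion-exclusion identity, and the Jaccard distance) can be checked on a small in-memory model. The following is a minimal sketch, not the package's implementation: `KmerSet`, `kmersOf`, `quorumAtLeast`, and `jaccardDistance` are hypothetical names introduced for illustration, plain Go maps stand in for the on-disk sets, and minimizer sampling is ignored.

```go
package main

import "fmt"

// KmerSet is a hypothetical in-memory stand-in for one on-disk k-mer set.
type KmerSet map[string]struct{}

// kmersOf returns the set of all length-k substrings of seq.
func kmersOf(seq string, k int) KmerSet {
	set := KmerSet{}
	for i := 0; i+k <= len(seq); i++ {
		set[seq[i:i+k]] = struct{}{}
	}
	return set
}

// quorumAtLeast returns the k-mers present in at least r of the given sets.
// With r = 1 this is the union; with r = len(sets) it is the intersection.
func quorumAtLeast(sets []KmerSet, r int) KmerSet {
	counts := map[string]int{}
	for _, s := range sets {
		for kmer := range s {
			counts[kmer]++
		}
	}
	out := KmerSet{}
	for kmer, c := range counts {
		if c >= r {
			out[kmer] = struct{}{}
		}
	}
	return out
}

// jaccardDistance returns 1 - |A ∩ B| / |A ∪ B|.
// Two empty sets are given distance 0 here; that convention is an assumption.
func jaccardDistance(a, b KmerSet) float64 {
	pair := []KmerSet{a, b}
	union := len(quorumAtLeast(pair, 1))
	if union == 0 {
		return 0
	}
	inter := len(quorumAtLeast(pair, 2))
	return 1 - float64(inter)/float64(union)
}

func main() {
	a := kmersOf("ACGTA", 3) // ACG, CGT, GTA
	b := kmersOf("CGTAC", 3) // CGT, GTA, TAC
	pair := []KmerSet{a, b}

	union := quorumAtLeast(pair, 1)
	inter := quorumAtLeast(pair, 2)

	// Inclusion-exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|.
	fmt.Println(len(union) == len(a)+len(b)-len(inter)) // prints true
	fmt.Println(len(union), len(inter))                 // prints 4 2
	fmt.Println(jaccardDistance(a, b))                  // prints 0.5
}
```

Counting occurrences once and thresholding keeps union, intersection, and any intermediate quorum in a single code path, which is why the sketch derives the Jaccard distance from `quorumAtLeast` rather than from separate set routines.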

## Design Principles

- **Temporary Directory Usage**: All operations run in OS temporary directories, which isolates them from one another and simplifies cleanup.
- **Testing-Focused API**: Helper functions (`buildGroupFromSeqs`, `collectKmers`) simplify test setup.
- **Scalability**: Disk-backed design avoids memory overflow for large sequence collections.

This package enables robust, reproducible *k*-mer set analysis in bioinformatics pipelines and is especially useful for metagenomic binning, error correction, and read clustering.