Semantic Description of obikmer Package Functionalities

The obikmer package provides disk-backed operations on k-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the KmerSetGroup type.

Core Features

  • Sequence-to-k-mer Indexing: Sequences are converted into k-mers (of length k) and stored in a group of sets (KmerSetGroup), with one set per sequence. Minimizer-based sampling (parameter m) reduces redundancy.

  • Set Operations on Disk: Efficient disk-resident implementations of standard set operations:

    • Union: Merges all k-mers from selected sets.
    • Intersect: Retains only k-mers present in all input sets.
    • Difference (A \ B): Keeps k-mers present in set A but not in B.
    • QuorumAtLeast(r): Returns k-mers appearing in at least r of the n input sets; r = 1 gives the union and r = n gives the intersection.
  • Consistency Guarantees: Operations obey mathematical identities (e.g., |A ∪ B| = |A| + |B| − |A ∩ B|), validated via unit tests.

  • Similarity & Distance Metrics:

    • JaccardDistanceMatrix(): Computes pairwise Jaccard distances (1 − similarity) between all sets.
    • JaccardSimilarityMatrix(): Computes pairwise Jaccard similarities (|A ∩ B| / |A ∪ B|).
    • Identical sets yield a distance of 0.0 and a similarity of 1.0; disjoint sets yield a distance of 1.0 and a similarity of 0.0.
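
The set operations and Jaccard metrics listed above can be illustrated with a minimal in-memory sketch. It is not the package's disk-backed implementation and does not use the KmerSetGroup API: the Set type and every function below are hypothetical stand-ins, k-mers are kept as plain strings, minimizer sampling is omitted, and treating two empty sets as identical is an assumption made here.

```go
package main

import "fmt"

// Set is a hypothetical in-memory stand-in for one k-mer set.
type Set map[string]struct{}

// Kmers returns the distinct substrings of length k found in seq.
func Kmers(seq string, k int) Set {
	s := Set{}
	if k <= 0 {
		return s
	}
	for i := 0; i+k <= len(seq); i++ {
		s[seq[i:i+k]] = struct{}{}
	}
	return s
}

// QuorumAtLeast returns the k-mers present in at least r of the given sets.
func QuorumAtLeast(r int, sets ...Set) Set {
	counts := map[string]int{}
	for _, s := range sets {
		for kmer := range s {
			counts[kmer]++
		}
	}
	out := Set{}
	for kmer, c := range counts {
		if c >= r {
			out[kmer] = struct{}{}
		}
	}
	return out
}

// Union is the quorum with r = 1.
func Union(sets ...Set) Set { return QuorumAtLeast(1, sets...) }

// Intersect is the quorum with r = n, the number of input sets.
func Intersect(sets ...Set) Set { return QuorumAtLeast(len(sets), sets...) }

// Difference returns the k-mers of a that are absent from b (A \ B).
func Difference(a, b Set) Set {
	out := Set{}
	for kmer := range a {
		if _, found := b[kmer]; !found {
			out[kmer] = struct{}{}
		}
	}
	return out
}

// JaccardSimilarity returns |A ∩ B| / |A ∪ B|.
// Assumption: two empty sets are treated as identical (similarity 1.0).
func JaccardSimilarity(a, b Set) float64 {
	u := len(Union(a, b))
	if u == 0 {
		return 1.0
	}
	return float64(len(Intersect(a, b))) / float64(u)
}

// JaccardDistance returns 1 − similarity.
func JaccardDistance(a, b Set) float64 { return 1.0 - JaccardSimilarity(a, b) }

func main() {
	a := Kmers("ACGTAC", 3) // ACG, CGT, GTA, TAC
	b := Kmers("GTACCA", 3) // GTA, TAC, ACC, CCA

	// Inclusion-exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|
	fmt.Println(len(Union(a, b)), len(a)+len(b)-len(Intersect(a, b)))
	fmt.Printf("%.3f\n", JaccardDistance(a, b))
}
```

With these two example sequences the sets share two of six distinct 3-mers, so both sides of the inclusion-exclusion identity equal 6 and the distance is about 0.667. Expressing union and intersection through the quorum function mirrors the r = 1 and r = n special cases described in the list.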

Design Principles

  • Temporary Directory Usage: All operations run in OS temporary directories, which isolates runs from one another and simplifies cleanup.
  • Testing-Focused API: Helper functions (buildGroupFromSeqs, collectKmers) simplify test setup.
  • Scalability: Disk-backed design avoids memory overflow for large sequence collections.
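
The temporary-directory principle can be sketched as follows, using only the Go standard library. The helpers writeSortedKmers, readKmers, roundTrip, and dirExists, as well as the one-k-mer-per-line text layout, are hypothetical illustrations and are not the on-disk format or helper functions of obikmer.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// writeSortedKmers writes the k-mers to path, one per line, in sorted order.
func writeSortedKmers(path string, kmers []string) error {
	sorted := append([]string(nil), kmers...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	w := bufio.NewWriter(f)
	for _, k := range sorted {
		if _, err := fmt.Fprintln(w, k); err != nil {
			return err
		}
	}
	return w.Flush()
}

// readKmers reads back one k-mer per line from path.
func readKmers(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var out []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		out = append(out, sc.Text())
	}
	return out, sc.Err()
}

// dirExists reports whether path can be stat'ed.
func dirExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// roundTrip stores the k-mers in a fresh temporary directory, reads them
// back, and removes the directory before returning its (now stale) path.
func roundTrip(kmers []string) ([]string, string, error) {
	dir, err := os.MkdirTemp("", "kmerset-*")
	if err != nil {
		return nil, "", err
	}
	defer os.RemoveAll(dir) // cleanup runs even if a later step fails

	path := filepath.Join(dir, "set0.txt")
	if err := writeSortedKmers(path, kmers); err != nil {
		return nil, dir, err
	}
	got, err := readKmers(path)
	return got, dir, err
}

func main() {
	got, dir, err := roundTrip([]string{"TAC", "ACG", "CGT"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got, dirExists(dir))
}
```

Inside a Go test, `t.TempDir()` from the standard `testing` package gives the same isolation and removes the directory automatically when the test finishes.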

This package enables robust, reproducible k-mer set analysis in bioinformatics pipelines, and is especially useful for metagenomic binning, error correction, or read clustering.