Semantic Description of obikmer Set Operations

This Go package implements scalable set operations over collections of k-mers stored in disk-backed, sorted structures (.kdi files). A KmerSetGroup represents a group of N separate sets (e.g., per-sample or per-partition), each containing sorted unique k-mers. The sets may share k-mers with one another, which is what makes the operations below meaningful.

Core Set Operations

  • Union(): Computes the union across all N sets — a k-mer appears in output if present in ≥1 input set.
  • Intersect(): Computes the intersection — a k-mer appears only if present in all sets.
  • Difference(): Computes set₀ \ (set₁ ∪ … ∪ setₙ₋₁), keeping only the k-mers unique to the first set.
  • QuorumAtLeast(q): Returns k-mers present in ≥ q sets.
  • QuorumExactly(q): Returns k-mers present in exactly q sets.
  • QuorumAtMost(q): Returns k-mers present in ≤ q sets.
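
The six operations above can all be expressed in terms of one quantity: the number of sets that contain a given k-mer. The following is a minimal in-memory sketch of those semantics only, assuming k-mers are encoded as uint64 values and each set is a sorted slice of unique values. The function names are hypothetical and do not reproduce the package's API, which works on disk-backed .kdi files rather than on maps.

```go
package main

import (
	"fmt"
	"sort"
)

// membershipCounts returns, for each k-mer, how many sets contain it.
// Each set holds unique k-mers, so a set contributes at most one hit.
func membershipCounts(sets [][]uint64) map[uint64]int {
	counts := make(map[uint64]int)
	for _, set := range sets {
		for _, kmer := range set {
			counts[kmer]++
		}
	}
	return counts
}

// filterByCount returns, in ascending order, the k-mers whose
// membership count satisfies keep.
func filterByCount(sets [][]uint64, keep func(count int) bool) []uint64 {
	out := []uint64{}
	for kmer, count := range membershipCounts(sets) {
		if keep(count) {
			out = append(out, kmer)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func union(sets [][]uint64) []uint64 {
	return filterByCount(sets, func(c int) bool { return c >= 1 })
}

func intersect(sets [][]uint64) []uint64 {
	n := len(sets)
	return filterByCount(sets, func(c int) bool { return c == n })
}

func quorumAtLeast(sets [][]uint64, q int) []uint64 {
	return filterByCount(sets, func(c int) bool { return c >= q })
}

func quorumExactly(sets [][]uint64, q int) []uint64 {
	return filterByCount(sets, func(c int) bool { return c == q })
}

func quorumAtMost(sets [][]uint64, q int) []uint64 {
	return filterByCount(sets, func(c int) bool { return c <= q })
}

// difference keeps the k-mers of the first set that occur in no other set:
// they belong to set 0 and their membership count is exactly 1.
func difference(sets [][]uint64) []uint64 {
	out := []uint64{}
	if len(sets) == 0 {
		return out
	}
	counts := membershipCounts(sets)
	for _, kmer := range sets[0] {
		if counts[kmer] == 1 {
			out = append(out, kmer)
		}
	}
	return out
}

func main() {
	sets := [][]uint64{{1, 2, 3, 5}, {2, 3, 4}, {3, 5, 6}}
	fmt.Println(union(sets))            // [1 2 3 4 5 6]
	fmt.Println(intersect(sets))        // [3]
	fmt.Println(difference(sets))       // [1]
	fmt.Println(quorumAtLeast(sets, 2)) // [2 3 5]
	fmt.Println(quorumExactly(sets, 2)) // [2 5]
	fmt.Println(quorumAtMost(sets, 1))  // [1 4 6]
}
```

In this sketch, Union is the same as a quorum of at least 1 and Intersect is a quorum of exactly N, which is why a single counting pass is enough to serve all of them.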

Pairwise Group Operations

  • UnionWith(other) / IntersectWith(other): Perform per-set binary operations between two compatible groups (same k, m, number of partitions, and number of sets). The result has N sets: setᵢ = this.setᵢ ⊕ other.setᵢ, where ⊕ is union or intersection.
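
As an illustration of the per-set pairing, here is a small sketch that applies a binary operation to matching sets of two groups. It assumes each set is an ascending slice of unique uint64 k-mers held in memory. The names unionSorted, intersectSorted, and pairwise are hypothetical helpers for this example, not functions of the package, and the size check stands in for the fuller compatibility check on k, m, and partitions.

```go
package main

import (
	"errors"
	"fmt"
)

// unionSorted merges two ascending, duplicate-free slices into their union.
func unionSorted(a, b []uint64) []uint64 {
	out := make([]uint64, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // present in both: emit once
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

// intersectSorted returns the values present in both ascending slices.
func intersectSorted(a, b []uint64) []uint64 {
	out := []uint64{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// pairwise applies op to the i-th set of each group and returns N result sets.
func pairwise(this, other [][]uint64, op func(a, b []uint64) []uint64) ([][]uint64, error) {
	if len(this) != len(other) {
		return nil, errors.New("groups hold a different number of sets")
	}
	out := make([][]uint64, len(this))
	for i := range this {
		out[i] = op(this[i], other[i])
	}
	return out, nil
}

func main() {
	this := [][]uint64{{1, 3, 5}, {2, 4}}
	other := [][]uint64{{3, 4, 5}, {4, 6}}

	u, _ := pairwise(this, other, unionSorted)
	x, _ := pairwise(this, other, intersectSorted)
	fmt.Println(u) // [[1 3 4 5] [2 4 6]]
	fmt.Println(x) // [[3 5] [4]]
}
```

Both helpers make a single linear pass over their inputs and emit values in ascending order, so the results stay sorted without a separate sort step.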

Implementation Highlights

  • Partitioned & Parallelized: Each operation processes partitions in parallel using runtime.NumCPU() workers.
  • Streaming K-way Merge: Uses efficient sorted-stream merging (via KWayMerge) to avoid loading full sets into memory.
  • Quorum Filtering: Counts, for each k-mer, how many sets contain it by merging the sets' sorted streams and tallying hits.
  • Compatibility Check: Ensures groups share metadata (k, m, partitions) before pairwise operations.
  • Disk Output: All results are materialized as a new KmerSetGroup in a directory, with per-partition .kdi files and metadata.
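
The streaming merge and quorum tally described above can be sketched with a min-heap over the heads of the input streams. This is an assumption-laden illustration rather than the package's KWayMerge: sorted slices stand in for .kdi readers, and the types cursor and cursorHeap and the functions mergeCount and quorumAtLeast are hypothetical names introduced for the example.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor is the current read position within one sorted input stream.
type cursor struct {
	values []uint64
	pos    int
}

// cursorHeap is a min-heap of cursors ordered by their current k-mer.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return h[i].values[h[i].pos] < h[j].values[h[j].pos]
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// mergeCount visits every distinct k-mer in ascending order and calls emit
// with the number of streams that contain it. Only the head of each stream
// is examined at any time.
func mergeCount(streams [][]uint64, emit func(kmer uint64, count int)) {
	h := &cursorHeap{}
	for _, s := range streams {
		if len(s) > 0 {
			*h = append(*h, &cursor{values: s})
		}
	}
	heap.Init(h)
	for h.Len() > 0 {
		kmer := (*h)[0].values[(*h)[0].pos]
		count := 0
		// Advance every stream whose head equals the current minimum.
		for h.Len() > 0 && (*h)[0].values[(*h)[0].pos] == kmer {
			c := (*h)[0]
			count++
			c.pos++
			if c.pos < len(c.values) {
				heap.Fix(h, 0) // head changed: restore heap order
			} else {
				heap.Pop(h) // stream exhausted: drop it
			}
		}
		emit(kmer, count)
	}
}

// quorumAtLeast keeps the k-mers found in at least q streams.
func quorumAtLeast(streams [][]uint64, q int) []uint64 {
	out := []uint64{}
	mergeCount(streams, func(kmer uint64, count int) {
		if count >= q {
			out = append(out, kmer)
		}
	})
	return out
}

func main() {
	streams := [][]uint64{{1, 2, 3, 5}, {2, 3, 4}, {3, 5, 6}}
	fmt.Println(quorumAtLeast(streams, 1)) // [1 2 3 4 5 6]
	fmt.Println(quorumAtLeast(streams, 2)) // [2 3 5]
	fmt.Println(quorumAtLeast(streams, 3)) // [3]
}
```

Because the heap holds one cursor per stream, working memory in this sketch grows with the number of streams rather than with the number of k-mers, and the output comes out already sorted.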

All operations preserve sorted order and support large-scale genomic datasets via streaming, partitioning, and minimal memory footprint.
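
The partition-level parallelism mentioned in the implementation highlights could look roughly like the worker pool below. It is a sketch under stated assumptions: processPartitions and its work callback are hypothetical, each partition is reduced to a single integer for brevity, and the only detail taken from the description is that the number of workers is bounded by runtime.NumCPU().

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processPartitions calls work once per partition index, using at most
// runtime.NumCPU() goroutines, and returns the results indexed by partition.
func processPartitions(nParts int, work func(part int) int) []int {
	results := make([]int, nParts)
	jobs := make(chan int)

	workers := runtime.NumCPU()
	if workers > nParts {
		workers = nParts
	}

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for part := range jobs {
				// Each partition index is handed out once, so workers
				// never write to the same slot.
				results[part] = work(part)
			}
		}()
	}

	for part := 0; part < nParts; part++ {
		jobs <- part
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Three in-memory partitions standing in for per-partition .kdi files.
	partitions := [][]uint64{{1, 2, 3}, {}, {7, 9}}
	sizes := processPartitions(len(partitions), func(part int) int {
		return len(partitions[part])
	})
	fmt.Println(sizes) // [3 0 2]
}
```

Since partitions do not depend on one another, each worker can run its own merge to completion and write its own output without locking, which is what makes this simple fan-out pattern sufficient.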