obikmer Package: Disk-Based K-mer Set Group Management

The obikmer package provides a streaming, disk-backed implementation for managing collections of k-mer sets (called K-mer Set Groups), optimized for large-scale metagenomic or genomic analyses.

Core Concepts

  • A KmerSetGroup stores N separate sets of sorted k-mers, each partitioned into P files.
  • Each group is defined by three immutable parameters: k (k-mer size), m (minimizer size), and P (number of partitions).
  • Data is stored on disk as .kdi files (sorted k-mers) with optional sparse indices (.kdx) for fast lookup.
  • Metadata is serialized in TOML format (metadata.toml), supporting both group-level and per-set attributes.
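
The role of a sparse index next to a sorted k-mer file can be illustrated in memory. The following is a minimal sketch, not the package's on-disk format: the names sparseIndex, buildSparseIndex, contains, and the stride parameter are hypothetical, and k-mers are assumed to be encoded as uint64 values. A binary search over the sampled entries selects one block, and a short linear scan inside that block settles membership.

```go
package main

import (
	"fmt"
	"sort"
)

// sparseIndex keeps every stride-th k-mer of a sorted slice.
// Hypothetical in-memory stand-in for a sparse index file; not the real layout.
type sparseIndex struct {
	stride  int
	samples []uint64 // samples[i] == kmers[i*stride]
}

// buildSparseIndex samples the sorted k-mers every stride entries (stride > 0).
func buildSparseIndex(kmers []uint64, stride int) sparseIndex {
	idx := sparseIndex{stride: stride}
	for i := 0; i < len(kmers); i += stride {
		idx.samples = append(idx.samples, kmers[i])
	}
	return idx
}

// contains locates the candidate block with a binary search on the samples,
// then scans at most stride entries of the sorted data.
func contains(kmers []uint64, idx sparseIndex, kmer uint64) bool {
	// j is the first sample strictly greater than kmer.
	j := sort.Search(len(idx.samples), func(i int) bool {
		return idx.samples[i] > kmer
	})
	if j == 0 {
		return false // smaller than every stored k-mer, or empty data
	}
	start := (j - 1) * idx.stride
	end := start + idx.stride
	if end > len(kmers) {
		end = len(kmers)
	}
	for _, v := range kmers[start:end] {
		if v == kmer {
			return true
		}
		if v > kmer {
			return false // sorted data: no later match is possible
		}
	}
	return false
}

func main() {
	kmers := []uint64{3, 8, 15, 21, 34, 55, 89, 144, 233}
	idx := buildSparseIndex(kmers, 4)
	fmt.Println(contains(kmers, idx, 55), contains(kmers, idx, 56)) // true false
}
```

A larger stride shrinks the index but lengthens the linear scan, which is the usual trade-off for this kind of structure.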

Key Functionalities

1. Lifecycle Management

  • OpenKmerSetGroup(directory) loads an existing index in read-only mode.
  • NewFilteredKmerSetGroup(...) constructs a new group (e.g., after filtering).
  • SaveMetadata() persists metadata changes to disk.

2. Accessors & Metadata

  • Basic properties: K(), M(), Partitions(), Size() (i.e., N), and group ID.
  • Attribute API: get/set/delete user-defined metadata (group-level or per-set).
    • Supports type coercion (GetIntAttribute, GetStringAttribute).
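
Type coercion matters because values decoded from a metadata file may arrive as different Go types than the caller expects. The sketch below shows one plausible way to coerce them. It is an assumption-laden illustration: the attributes type and the getInt and getString methods are hypothetical and do not reproduce the signatures of GetIntAttribute or GetStringAttribute.

```go
package main

import (
	"fmt"
	"strconv"
)

// attributes is a hypothetical stand-in for decoded metadata values.
type attributes map[string]any

// getInt returns the attribute as an int, coercing common decoded types.
// The boolean is false when the key is missing or cannot be coerced.
func (a attributes) getInt(key string) (int, bool) {
	switch v := a[key].(type) {
	case int:
		return v, true
	case int64:
		return int(v), true
	case float64:
		if v == float64(int(v)) { // accept whole numbers only
			return int(v), true
		}
		return 0, false
	case string:
		n, err := strconv.Atoi(v)
		return n, err == nil
	}
	return 0, false
}

// getString returns the attribute formatted as a string.
func (a attributes) getString(key string) (string, bool) {
	v, ok := a[key]
	if !ok {
		return "", false
	}
	return fmt.Sprint(v), true
}

func main() {
	attrs := attributes{"reads": int64(1200), "depth": "42", "label": "soil"}
	reads, _ := attrs.getInt("reads")
	depth, _ := attrs.getInt("depth")
	label, _ := attrs.getString("label")
	fmt.Println(reads, depth, label)
}
```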

3. Membership & Iteration

  • Contains(setIndex, kmer) tests membership with an indexed binary search followed by a short linear scan, applied across all partitions in parallel.
  • Iterator(setIndex) yields sorted k-mers via k-way merge of partition readers.
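
A k-way merge of sorted partitions can be sketched with a min-heap from the standard library. This is an illustrative in-memory version, assuming uint64-encoded k-mers and assuming that each k-mer occurs in only one partition of a set; the names cursor, cursorHeap, and mergeSorted are hypothetical and the real iterator reads from disk.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the read position inside one sorted partition.
type cursor struct {
	kmers []uint64
	pos   int
}

// cursorHeap orders cursors by the k-mer they currently point at.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return h[i].kmers[h[i].pos] < h[j].kmers[h[j].pos]
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted calls yield on every k-mer of every partition in ascending order.
func mergeSorted(partitions [][]uint64, yield func(uint64)) {
	h := &cursorHeap{}
	for _, p := range partitions {
		if len(p) > 0 {
			*h = append(*h, &cursor{kmers: p})
		}
	}
	heap.Init(h)
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the smallest current k-mer
		yield(c.kmers[c.pos])
		c.pos++
		if c.pos < len(c.kmers) {
			heap.Fix(h, 0) // restore heap order after advancing
		} else {
			heap.Pop(h) // partition exhausted
		}
	}
}

func main() {
	parts := [][]uint64{{1, 4, 9}, {2, 3}, {}, {5}}
	var merged []uint64
	mergeSorted(parts, func(k uint64) { merged = append(merged, k) })
	fmt.Println(merged) // [1 2 3 4 5 9]
}
```

Only one k-mer per partition is held in the heap at any time, which is what keeps this style of iteration compatible with streaming from disk.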

4. Similarity & Distance Metrics

  • JaccardDistanceMatrix() and JaccardSimilarityMatrix(): compute pairwise metrics in a streaming fashion.
    • Per-partition processing with parallel goroutines and sorted merge for accurate set intersection/union counts.
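
The per-partition approach can be sketched as follows for a single pair of sets. This is a simplified in-memory illustration under stated assumptions: both sets use the same number of partitions, a given k-mer always lands in the same partition index in both sets (so per-partition counts can simply be summed), and each partition is sorted without duplicates. The names countPair and jaccard are hypothetical, and returning 0 for two empty sets is an arbitrary convention chosen for the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// countPair returns the intersection and union sizes of two sorted,
// duplicate-free slices using a single merge pass.
func countPair(a, b []uint64) (inter, union int) {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			inter++
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
		union++ // each loop step consumes exactly one distinct k-mer
	}
	union += (len(a) - i) + (len(b) - j) // leftovers are unique to one side
	return inter, union
}

// jaccard computes |A∩B| / |A∪B|, where setA[p] and setB[p] hold partition p.
// Both arguments are assumed to have the same number of partitions.
func jaccard(setA, setB [][]uint64) float64 {
	var (
		mu           sync.Mutex
		wg           sync.WaitGroup
		inter, union int
	)
	for p := range setA {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			i, u := countPair(setA[p], setB[p])
			mu.Lock() // protect the shared totals
			inter += i
			union += u
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	if union == 0 {
		return 0 // convention for two empty sets (assumption of this sketch)
	}
	return float64(inter) / float64(union)
}

func main() {
	a := [][]uint64{{1, 2, 3}, {10, 20}}
	b := [][]uint64{{2, 3, 4}, {20, 30}}
	sim := jaccard(a, b)
	fmt.Printf("similarity=%.4f distance=%.4f\n", sim, 1-sim)
}
```

The distance is taken here as one minus the similarity, which is the standard definition of the Jaccard distance.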

5. Set Management

  • CopySetsByIDTo(ids, destDir) copies selected sets (with metadata) to another group.
    • Supports compatibility checks and optional overwriting (force).
  • RemoveSetByID(id) deletes a set and renumbers the remaining sets so that their indices stay contiguous.
  • Glob pattern matching: MatchSetIDs(patterns) resolves IDs like "sample_*".
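
Glob resolution of set IDs can be approximated with path.Match from the Go standard library. The sketch below is only an illustration of the idea: matchIDs is a hypothetical helper, and whether MatchSetIDs uses the same pattern syntax, ordering, or error handling is an assumption.

```go
package main

import (
	"fmt"
	"path"
)

// matchIDs returns the IDs that match at least one pattern,
// in their original order and without duplicates.
func matchIDs(ids, patterns []string) ([]string, error) {
	var out []string
	for _, id := range ids {
		for _, pat := range patterns {
			ok, err := path.Match(pat, id)
			if err != nil {
				return nil, fmt.Errorf("bad pattern %q: %w", pat, err)
			}
			if ok {
				out = append(out, id)
				break // avoid listing the same ID twice
			}
		}
	}
	return out, nil
}

func main() {
	ids := []string{"sample_1", "sample_2", "control"}
	matched, err := matchIDs(ids, []string{"sample_*"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(matched) // [sample_1 sample_2]
}
```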

6. Compatibility & Utility

  • IsCompatibleWith(other) verifies same (k, m, partitions).
  • Helper methods: PartitionPath, Spectrum(...), and spectrum file I/O.

Design Highlights

  • Streaming: Operations avoid loading full datasets into memory.
  • Immutability: the core parameters (k, m, partitions) are fixed at creation, which ensures consistency; metadata changes are persisted only by an explicit save.
  • Thread safety: partitions are processed concurrently, coordinated with sync.Mutex and sync.WaitGroup.