Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


# `obikmer` Package: Disk-Based K-mer Set Group Management
The `obikmer` package provides a streaming, disk-backed implementation for managing collections of *k*-mer sets (called **K-mer Set Groups**), optimized for large-scale metagenomic or genomic analyses.
### Core Concepts
- A **KmerSetGroup** stores *N* separate sets of sorted *k*-mers, each partitioned into *P* files.
- Each group is defined by immutable parameters: `k` (*k*-mer size), `m` (minimizer size), and *P* (number of partitions).
- Data is stored on disk as `.kdi` files (sorted k-mers) with optional sparse indices (`.kdx`) for fast lookup.
- Metadata is serialized in TOML format (`metadata.toml`), supporting both group-level and per-set attributes.
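
The lookup that a sparse index makes possible can be sketched as follows. This is a minimal in-memory illustration, not the package's actual `.kdi`/`.kdx` layout: the `sparseIndex` type, the sampling `stride`, and the `buildSparseIndex` and `contains` functions are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// sparseIndex is a hypothetical in-memory stand-in for a sparse index:
// it keeps every stride-th k-mer of a sorted list.
type sparseIndex struct {
	stride  int
	samples []uint64 // samples[i] == kmers[i*stride]
}

// buildSparseIndex samples a sorted k-mer list; stride must be > 0.
func buildSparseIndex(kmers []uint64, stride int) sparseIndex {
	idx := sparseIndex{stride: stride}
	for i := 0; i < len(kmers); i += stride {
		idx.samples = append(idx.samples, kmers[i])
	}
	return idx
}

// contains binary-searches the samples for the block that could hold
// query, then scans that single block linearly.
func contains(kmers []uint64, idx sparseIndex, query uint64) bool {
	// j is the position of the first sample strictly greater than query.
	j := sort.Search(len(idx.samples), func(i int) bool {
		return idx.samples[i] > query
	})
	if j == 0 {
		return false // query is smaller than every stored k-mer
	}
	start := (j - 1) * idx.stride
	end := start + idx.stride
	if end > len(kmers) {
		end = len(kmers)
	}
	for _, km := range kmers[start:end] {
		if km == query {
			return true
		}
		if km > query {
			return false // sorted data: query cannot appear later
		}
	}
	return false
}

func main() {
	kmers := []uint64{3, 8, 15, 21, 42, 57, 63, 90}
	idx := buildSparseIndex(kmers, 3)
	fmt.Println(contains(kmers, idx, 42), contains(kmers, idx, 43))
}
```

Only one block of at most `stride` entries is scanned per query, so a larger stride gives a smaller index at the cost of a longer scan.
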
### Key Functionalities
#### 1. **Lifecycle Management**
- `OpenKmerSetGroup(directory)` loads an existing index in read-only mode.
- `NewFilteredKmerSetGroup(...)` constructs a new group (e.g., after filtering).
- `SaveMetadata()` persists metadata changes to disk.
#### 2. **Accessors & Metadata**
- Basic properties: `K()`, `M()`, `Partitions()`, `Size()` (i.e., *N*), and group ID.
- Attribute API: get/set/delete user-defined metadata (group-level or per-set).
- Supports type coercion (`GetIntAttribute`, `GetStringAttribute`).
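
Type coercion of loosely typed metadata might look like the sketch below. It assumes attributes are held in a `map[string]any`, as is common after decoding TOML; the `getIntAttribute` helper is hypothetical and is not the signature of the package's `GetIntAttribute`.

```go
package main

import (
	"fmt"
	"strconv"
)

// getIntAttribute is a hypothetical helper that coerces a loosely typed
// metadata value to int. The boolean reports whether coercion succeeded.
func getIntAttribute(attrs map[string]any, key string) (int, bool) {
	switch v := attrs[key].(type) {
	case int:
		return v, true
	case int64:
		return int(v), true
	case float64:
		return int(v), true // truncates toward zero
	case string:
		n, err := strconv.Atoi(v)
		return n, err == nil
	default:
		return 0, false // missing key or unsupported type
	}
}

func main() {
	attrs := map[string]any{"k": int64(31), "m": "13", "label": "gut"}
	fmt.Println(getIntAttribute(attrs, "k"))
	fmt.Println(getIntAttribute(attrs, "m"))
	fmt.Println(getIntAttribute(attrs, "label"))
}
```
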
#### 3. **Membership & Iteration**
- `Contains(setIndex, kmer)` checks membership with an indexed binary search followed by a linear scan; the search runs across all partitions in parallel.
- `Iterator(setIndex)` yields sorted *k*-mers via k-way merge of partition readers.
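
A k-way merge of sorted partition streams can be sketched with the standard `container/heap` package. The example works on in-memory slices rather than on-disk readers, and the `cursor`, `cursorHeap`, and `mergeSorted` names are assumptions made for illustration, not the package's types.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the read position in one sorted partition.
type cursor struct {
	kmers []uint64
	pos   int
}

// cursorHeap is a min-heap of cursors ordered by their current k-mer.
type cursorHeap []*cursor

func (h cursorHeap) Len() int           { return len(h) }
func (h cursorHeap) Less(i, j int) bool { return h[i].kmers[h[i].pos] < h[j].kmers[h[j].pos] }
func (h cursorHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted calls yield on every k-mer of the sorted partitions,
// in ascending order. Duplicates across partitions are not removed.
func mergeSorted(partitions [][]uint64, yield func(uint64)) {
	h := &cursorHeap{}
	for _, p := range partitions {
		if len(p) > 0 {
			*h = append(*h, &cursor{kmers: p})
		}
	}
	heap.Init(h)
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the smallest current k-mer
		yield(c.kmers[c.pos])
		c.pos++
		if c.pos < len(c.kmers) {
			heap.Fix(h, 0) // restore heap order after advancing
		} else {
			heap.Pop(h) // partition exhausted
		}
	}
}

func main() {
	parts := [][]uint64{{2, 9, 14}, {1, 7}, {}, {3, 20}}
	mergeSorted(parts, func(k uint64) { fmt.Print(k, " ") })
	fmt.Println()
}
```

Each step costs O(log P) for P partitions, and only one k-mer per partition needs to be held in memory at a time.
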
#### 4. **Similarity & Distance Metrics**
- `JaccardDistanceMatrix()` and `JaccardSimilarityMatrix()`: compute pairwise metrics in a streaming fashion.
- Partitions are processed in parallel goroutines; within each partition, a sorted merge yields accurate set intersection and union counts.
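
The Jaccard similarity of two sets is |A ∩ B| / |A ∪ B|, and the distance is one minus that value. Because the k-mers are sorted, both counts can be obtained in a single merge pass, sketched below for two in-memory slices. The `intersectUnion` and `jaccardSimilarity` functions are hypothetical and assume sorted, duplicate-free input; in a partitioned layout, the per-partition counts would be summed before dividing.

```go
package main

import "fmt"

// intersectUnion counts |A ∩ B| and |A ∪ B| for two sorted,
// duplicate-free slices in one merge pass.
func intersectUnion(a, b []uint64) (inter, union int) {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			inter++
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
		union++ // each loop step consumes exactly one distinct k-mer
	}
	// Whatever remains in either slice belongs to the union only.
	union += (len(a) - i) + (len(b) - j)
	return inter, union
}

// jaccardSimilarity returns |A ∩ B| / |A ∪ B|.
func jaccardSimilarity(a, b []uint64) float64 {
	inter, union := intersectUnion(a, b)
	if union == 0 {
		return 1 // convention chosen here for two empty sets
	}
	return float64(inter) / float64(union)
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{3, 4, 5, 6}
	sim := jaccardSimilarity(a, b)
	fmt.Printf("similarity=%.3f distance=%.3f\n", sim, 1-sim)
}
```
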
#### 5. **Set Management**
- `CopySetsByIDTo(ids, destDir)` copies selected sets (with metadata) to another group.
- Supports compatibility checks and optional overwriting (`force`).
- `RemoveSetByID(id)` deletes a set and renumbers the remaining sets so that indices stay contiguous.
- Glob pattern matching: `MatchSetIDs(patterns)` resolves IDs like `"sample_*"`.
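
Glob resolution of set IDs can be approximated with the standard library's `path.Match`. The sketch below is an assumption about how such matching could work, not the package's implementation: the free function `matchSetIDs` and its signature are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// matchSetIDs is a hypothetical stand-in: it returns the IDs that match
// at least one glob pattern, preserving the original ID order.
func matchSetIDs(ids, patterns []string) ([]string, error) {
	var matched []string
	for _, id := range ids {
		for _, p := range patterns {
			ok, err := path.Match(p, id)
			if err != nil {
				return nil, err // malformed pattern
			}
			if ok {
				matched = append(matched, id)
				break // avoid listing an ID twice
			}
		}
	}
	return matched, nil
}

func main() {
	ids := []string{"sample_A", "sample_B", "control_1"}
	got, err := matchSetIDs(ids, []string{"sample_*"})
	fmt.Println(got, err)
}
```
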
#### 6. **Compatibility & Utility**
- `IsCompatibleWith(other)` verifies same `(k, m, partitions)`.
- Helper methods: `PartitionPath`, `Spectrum(...)`, and spectrum file I/O.
### Design Highlights
- **Streaming**: Operations avoid loading full datasets into memory.
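- **Immutability after creation**: the core parameters never change once a group exists, which keeps it consistent; metadata modifications are persisted only by an explicit save operation such as `SaveMetadata()`.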
- **Thread safety**: concurrent partition processing is synchronized via `sync.Mutex` and `sync.WaitGroup`.
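
The concurrency pattern named above can be illustrated with a small sketch: one goroutine per partition, a `sync.WaitGroup` to wait for completion, and a `sync.Mutex` guarding the shared accumulator. The `countAll` function and the use of in-memory slices are assumptions for illustration; the real per-partition work would read from disk.

```go
package main

import (
	"fmt"
	"sync"
)

// countAll sums the k-mer counts of all partitions, processing each
// partition in its own goroutine.
func countAll(partitions [][]uint64) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for _, p := range partitions {
		wg.Add(1)
		go func(part []uint64) {
			defer wg.Done()
			n := len(part) // stand-in for real per-partition work
			mu.Lock()      // protect the shared total
			total += n
			mu.Unlock()
		}(p)
	}
	wg.Wait() // block until every partition is done
	return total
}

func main() {
	parts := [][]uint64{{1, 2, 3}, {4}, {}, {5, 6}}
	fmt.Println(countAll(parts))
}
```

Doing the expensive work outside the critical section and locking only for the final update keeps contention low.
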