mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
obikmer Package: Disk-Based K-mer Set Group Management
The obikmer package provides a streaming, disk-backed implementation for managing collections of k-mer sets (called K-mer Set Groups), optimized for large-scale metagenomic or genomic analyses.
Core Concepts
- A KmerSetGroup stores N independent sets of sorted k-mers, each set partitioned into P files.
- Each group is defined by immutable parameters: k (k-mer size), m (minimizer size), and P (number of partitions).
- Data is stored on disk as .kdi files (sorted k-mers) with optional sparse indices (.kdx) for fast lookup.
- Metadata is serialized in TOML format (metadata.toml), supporting both group-level and per-set attributes.
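The text above names a minimizer size m and P partitions but does not spell out how a k-mer is routed to a partition. The following is a minimal sketch of one common scheme, under stated assumptions: k-mers are 2-bit encoded, the minimizer is the numerically smallest m-mer, and the partition is the minimizer modulo P. The helpers encode, minimizer, and partitionOf are hypothetical and are not taken from obikmer; canonical (reverse-complement) handling is ignored.

```go
package main

import "fmt"

// encode packs a DNA string (A, C, G, T only, at most 32 bases) into
// 2 bits per base. Hypothetical helper, not the obikmer encoding.
func encode(seq string) (uint64, bool) {
	var v uint64
	for i := 0; i < len(seq); i++ {
		var code uint64
		switch seq[i] {
		case 'A':
			code = 0
		case 'C':
			code = 1
		case 'G':
			code = 2
		case 'T':
			code = 3
		default:
			return 0, false // ambiguous or invalid base
		}
		v = v<<2 | code
	}
	return v, true
}

// minimizer returns the smallest m-mer (by 2-bit value) found inside an
// encoded k-mer. Assumes 0 < m <= k and m < 32.
func minimizer(kmer uint64, k, m int) uint64 {
	mask := uint64(1)<<(2*uint(m)) - 1
	best := mask // largest possible m-mer value
	for i := 0; i <= k-m; i++ {
		cand := (kmer >> (2 * uint(i))) & mask
		if cand < best {
			best = cand
		}
	}
	return best
}

// partitionOf maps a k-mer to one of p partitions through its minimizer.
// The modulo step is an assumption made for this sketch.
func partitionOf(kmer uint64, k, m, p int) int {
	return int(minimizer(kmer, k, m) % uint64(p))
}

func main() {
	kmer, ok := encode("ACGTAC")
	if !ok {
		panic("invalid base")
	}
	fmt.Println(minimizer(kmer, 6, 3))      // 6, the encoding of "ACG"
	fmt.Println(partitionOf(kmer, 6, 3, 4)) // 2
}
```

Because every k-mer with the same minimizer lands in the same file, two sets built with identical (k, m, P) can be compared partition by partition.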
Key Functionalities
1. Lifecycle Management
- OpenKmerSetGroup(directory) loads an existing index in read-only mode.
- NewFilteredKmerSetGroup(...) constructs a new group (e.g., after filtering).
- SaveMetadata() persists metadata changes to disk.
2. Accessors & Metadata
- Basic properties: K(), M(), Partitions(), Size() (i.e., N), and group ID.
- Attribute API: get/set/delete user-defined metadata (group-level or per-set).
  - Supports type coercion (GetIntAttribute, GetStringAttribute).
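To illustrate what type coercion on metadata attributes can look like, here is a small sketch. It assumes attributes are held in a map of loosely typed values, as a TOML decoder would typically produce. The attrs type and its intAttribute and stringAttribute methods are hypothetical stand-ins; the actual signatures of GetIntAttribute and GetStringAttribute are not given in this document.

```go
package main

import (
	"fmt"
	"strconv"
)

// attrs stands in for a decoded metadata table (hypothetical layout).
type attrs map[string]any

// intAttribute returns the value stored under key as an int, coercing the
// types a decoder commonly yields. A missing key or a non-integral value
// reports false.
func (a attrs) intAttribute(key string) (int, bool) {
	switch v := a[key].(type) {
	case int:
		return v, true
	case int64:
		return int(v), true
	case float64:
		if v == float64(int(v)) { // accept only whole numbers
			return int(v), true
		}
		return 0, false
	case string:
		n, err := strconv.Atoi(v)
		return n, err == nil
	default:
		return 0, false
	}
}

// stringAttribute returns the value stored under key formatted as text.
func (a attrs) stringAttribute(key string) (string, bool) {
	v, ok := a[key]
	if !ok {
		return "", false
	}
	return fmt.Sprint(v), true
}

func main() {
	a := attrs{"depth": int64(12), "label": "42", "ratio": 1.5}
	fmt.Println(a.intAttribute("depth"))    // 12 true
	fmt.Println(a.intAttribute("label"))    // 42 true
	fmt.Println(a.intAttribute("ratio"))    // 0 false
	fmt.Println(a.stringAttribute("depth")) // 12 true
}
```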
3. Membership & Iteration
- Contains(setIndex, kmer) checks presence using indexed binary search + linear scan across all partitions (parallelized).
- Iterator(setIndex) yields sorted k-mers via k-way merge of partition readers.
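The "indexed binary search + linear scan" lookup can be sketched for a single partition as follows. This is an in-memory illustration only: the real .kdi/.kdx on-disk layout is not described here, so the sparse index is assumed to hold every step-th k-mer of the sorted list, and buildSparseIndex and contains are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// buildSparseIndex keeps every step-th k-mer of a sorted slice.
// Entry j of the index is the k-mer at position j*step.
func buildSparseIndex(kmers []uint64, step int) []uint64 {
	var idx []uint64
	for i := 0; i < len(kmers); i += step {
		idx = append(idx, kmers[i])
	}
	return idx
}

// contains binary-searches the sparse index for the block that could hold
// target, then scans at most step k-mers linearly inside that block.
func contains(kmers, idx []uint64, step int, target uint64) bool {
	// j is the first index entry strictly greater than target.
	j := sort.Search(len(idx), func(i int) bool { return idx[i] > target })
	if j == 0 {
		return false // target is below the first k-mer, or the set is empty
	}
	start := (j - 1) * step
	end := start + step
	if end > len(kmers) {
		end = len(kmers)
	}
	for _, km := range kmers[start:end] {
		if km == target {
			return true
		}
		if km > target {
			break // sorted order: target cannot appear later
		}
	}
	return false
}

func main() {
	kmers := []uint64{3, 8, 15, 21, 42, 57, 90}
	idx := buildSparseIndex(kmers, 3) // holds 3, 21, 90
	fmt.Println(contains(kmers, idx, 3, 42)) // true
	fmt.Println(contains(kmers, idx, 3, 4))  // false
}
```

The index stays small enough to keep in memory while the scan touches only one short block of the sorted data, which is the usual motivation for a sparse index over a sorted file.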
4. Similarity & Distance Metrics
- JaccardDistanceMatrix() and JaccardSimilarityMatrix(): compute pairwise metrics in a streaming fashion.
- Per-partition processing with parallel goroutines and sorted merge for accurate set intersection/union counts.
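The sorted-merge counting behind a Jaccard computation can be sketched as below for one pair of sets. It is a serial, in-memory illustration under assumptions: each partition is a sorted, duplicate-free slice, and both sets use the same partitioning so that a given k-mer always falls in the same partition number. The functions intersectUnion and jaccard are hypothetical, and returning 1 for two empty sets is a convention chosen for this sketch.

```go
package main

import "fmt"

// intersectUnion counts |A∩B| and |A∪B| for two sorted, duplicate-free
// slices in a single merge pass.
func intersectUnion(a, b []uint64) (inter, union int) {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			inter++
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
		union++ // each step consumes exactly one distinct k-mer
	}
	union += (len(a) - i) + (len(b) - j) // leftovers belong to one set only
	return inter, union
}

// jaccard sums per-partition counts and returns the similarity |A∩B|/|A∪B|.
// setA and setB must have the same number of partitions.
func jaccard(setA, setB [][]uint64) float64 {
	inter, union := 0, 0
	for p := range setA {
		i, u := intersectUnion(setA[p], setB[p])
		inter += i
		union += u
	}
	if union == 0 {
		return 1.0 // convention for two empty sets (assumption)
	}
	return float64(inter) / float64(union)
}

func main() {
	a := [][]uint64{{1, 4, 9}, {20, 30}}
	b := [][]uint64{{4, 9, 12}, {30, 40}}
	sim := jaccard(a, b)
	fmt.Printf("similarity=%.4f distance=%.4f\n", sim, 1-sim)
	// similarity=0.4286 distance=0.5714
}
```

Summing integer counts per partition before dividing keeps the result exact regardless of how the work is split.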
5. Set Management
- CopySetsByIDTo(ids, destDir) copies selected sets (with metadata) to another group.
  - Supports compatibility checks and optional overwriting (force).
- RemoveSetByID(id) deletes a set and renumbers the remaining sets so that indices stay contiguous.
- Glob pattern matching: MatchSetIDs(patterns) resolves IDs like "sample_*".
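Glob resolution of set IDs can be sketched with the standard library's path.Match. Whether MatchSetIDs uses this function internally is not stated here, so treat this as an illustration; matchIDs and its signature are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// matchIDs returns the IDs that match at least one glob pattern, keeping
// the original order and listing each ID once.
func matchIDs(ids, patterns []string) ([]string, error) {
	var out []string
	for _, id := range ids {
		for _, pat := range patterns {
			ok, err := path.Match(pat, id)
			if err != nil {
				return nil, fmt.Errorf("bad pattern %q: %w", pat, err)
			}
			if ok {
				out = append(out, id)
				break // avoid duplicates when several patterns match
			}
		}
	}
	return out, nil
}

func main() {
	ids := []string{"sample_A", "sample_B", "control_1"}
	got, err := matchIDs(ids, []string{"sample_*"})
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // [sample_A sample_B]
}
```

Note that with path.Match the wildcard does not match a slash, which only matters if set IDs may contain one.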
6. Compatibility & Utility
- IsCompatibleWith(other) verifies that two groups share the same (k, m, partitions).
- Helper methods: PartitionPath, Spectrum(...), and spectrum file I/O.
Design Highlights
- Streaming: Operations avoid loading full datasets into memory.
- Immutability: the defining parameters (k, m, P) are fixed after creation, which keeps a group consistent; metadata changes reach disk only through an explicit save (SaveMetadata()).
- Thread-safe for concurrent partition processing (via sync.Mutex/WaitGroup).
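The sync.Mutex/WaitGroup pattern mentioned above can be sketched as one goroutine per partition with a mutex guarding the shared accumulator. This is a generic illustration of the pattern, not code from obikmer; countAcrossPartitions and the work callback are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// countAcrossPartitions runs work on every partition in its own goroutine
// and sums the results under a mutex.
func countAcrossPartitions(partitions [][]uint64, work func([]uint64) int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for _, part := range partitions {
		wg.Add(1)
		go func(p []uint64) {
			defer wg.Done()
			n := work(p) // independent per-partition computation
			mu.Lock()
			total += n // only the shared accumulator needs the lock
			mu.Unlock()
		}(part)
	}
	wg.Wait() // all partitions are done before total is read
	return total
}

func main() {
	parts := [][]uint64{{1, 2, 3}, {4}, {}}
	size := countAcrossPartitions(parts, func(p []uint64) int { return len(p) })
	fmt.Println(size) // 4
}
```

Since partitions hold non-overlapping k-mers, the per-partition work needs no coordination; the lock is held only for the brief final addition.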