mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
45 lines
2.5 KiB
Markdown
45 lines
2.5 KiB
Markdown
# `obikmer` Package: Disk-Based K-mer Set Group Management
|
|
|
|
The `obikmer` package provides a streaming, disk-backed implementation for managing collections of *k*-mer sets (called **K-mer Set Groups**), optimized for large-scale metagenomic or genomic analyses.
|
|
|
|
### Core Concepts
|
|
- A **KmerSetGroup** stores *N* disjoint sets of sorted *k*-mers, partitioned into *P* files per set.
|
|
- Each group is defined by immutable parameters: `k` (*mer size), `m* (minimizer size), and *P* partitions.
|
|
- Data is stored on disk as `.kdi` files (sorted k-mers) with optional sparse indices (`.kdx`) for fast lookup.
|
|
- Metadata is serialized in TOML format (`metadata.toml`), supporting both group-level and per-set attributes.
|
|
|
|
### Key Functionalities
|
|
|
|
#### 1. **Lifecycle Management**
|
|
- `OpenKmerSetGroup(directory)` loads an existing index in read-only mode.
|
|
- `NewFilteredKmerSetGroup(...)` constructs a new group (e.g., after filtering).
|
|
- `SaveMetadata()` persists metadata changes to disk.
|
|
|
|
#### 2. **Accessors & Metadata**
|
|
- Basic properties: `K()`, `M()`, `Partitions()`, `Size()` (i.e., *N*), and group ID.
|
|
- Attribute API: get/set/delete user-defined metadata (group-level or per-set).
|
|
- Supports type coercion (`GetIntAttribute`, `GetStringAttribute`).
|
|
|
|
#### 3. **Membership & Iteration**
|
|
- `Contains(setIndex, kmer)` checks presence using indexed binary search + linear scan across all partitions (parallelized).
|
|
- `Iterator(setIndex)` yields sorted *k*-mers via k-way merge of partition readers.
|
|
|
|
#### 4. **Similarity & Distance Metrics**
|
|
- `JaccardDistanceMatrix()` and `JaccardSimilarityMatrix()`: compute pairwise metrics in a streaming fashion.
|
|
- Per-partition processing with parallel goroutines and sorted merge for accurate set intersection/union counts.
|
|
|
|
#### 5. **Set Management**
|
|
- `CopySetsByIDTo(ids, destDir)` copies selected sets (with metadata) to another group.
|
|
- Supports compatibility checks and optional overwriting (`force`).
|
|
- `RemoveSetByID(id)` deletes a set, renumbers remaining sets for contiguous indices.
|
|
- Glob pattern matching: `MatchSetIDs(patterns)` resolves IDs like `"sample_*"`.
|
|
|
|
#### 6. **Compatibility & Utility**
|
|
- `IsCompatibleWith(other)` verifies same `(k, m, partitions)`.
|
|
- Helper methods: `PartitionPath`, `Spectrum(...)`, and spectrum file I/O.
|
|
|
|
### Design Highlights
|
|
- **Streaming**: Operations avoid loading full datasets into memory.
|
|
- **Immutability after creation** ensures consistency; modifications require explicit save operations.
|
|
- Thread-safe for concurrent partition processing (via `sync.Mutex`/`WaitGroup`).
|