- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
2.2 KiB
K-mer Set Group Builder — Semantic Description
This Go module (obikmer) provides a disk-backed builder and accessor for managing k-mer sets across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical k-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
Core Functionalities
-
K-mer Set Group Construction:
NewKmerSetGroupBuildercreates a builder configured with:- k (k-mer length),
- m (minimal unique substring for partitioning),
- number of sets (
nSets), - and optional parameters like
WithMinFrequency.
-
Sequence Ingestion:
Sequences are added per set viaAddSequence(setID, bioseq). Internally:- Canonical k-mers are extracted (using
IterCanonicalKmers), - Deduplicated and optionally filtered by occurrence frequency.
- Canonical k-mers are extracted (using
-
Persistence & Round-Trip:
builder.Close()materializes the k-mer sets to disk (in temp or specified directory).
OpenKmerSetGroup(dir)reloads them — preserving all metadata and structure. -
Metadata & Attributes:
Supports custom identifiers (SetId) and key-value attributes (e.g.,"organism": "test"), saved to disk viaSaveMetadata. -
Efficient Iteration:
The iterator (ksg.Iterator(setID)) yields sorted, deduplicated canonical k-mers — using a k-way merge across internal partitions. -
Frequency Filtering:
WithMinFrequency(n)ensures only k-mers appearing ≥n times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering). -
Multi-set Support:
Handles multiple independent k-mer sets (e.g., per sample or taxonomic group), verified viaSize()and indexed access (Len(setID)).
Testing Coverage
Comprehensive unit tests validate:
- Basic construction & correctness,
- Multi-sequence ingestion and deduplication,
- Frequency-based inclusion/exclusion logic,
- Cross-set isolation (
nSets > 1), - Metadata round-trip integrity.
This module is designed for scalable, reproducible k-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).