Files
obitools4/autodoc/docmd/pkg/obikmer/kmer_set_builder_test.md
T

45 lines
2.2 KiB
Markdown
Raw Normal View History

2026-04-07 08:36:50 +02:00
# K-mer Set Group Builder — Semantic Description
This Go module (`obikmer`) provides a **disk-backed builder and accessor** for managing *k-mer sets* across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical *k*-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
### Core Functionalities
- **K-mer Set Group Construction**:
`NewKmerSetGroupBuilder` creates a builder configured with:
- *k* (k-mer length),
- *m* (minimal unique substring for partitioning),
- number of sets (`nSets`),
- and optional parameters like `WithMinFrequency`.
- **Sequence Ingestion**:
Sequences are added per set via `AddSequence(setID, bioseq)`. Internally:
- Canonical *k*-mers are extracted (using `IterCanonicalKmers`),
- Deduplicated and optionally filtered by occurrence frequency.
- **Persistence & Round-Trip**:
`builder.Close()` materializes the *k*-mer sets to disk (in temp or specified directory).
`OpenKmerSetGroup(dir)` reloads them — preserving all metadata and structure.
- **Metadata & Attributes**:
Supports custom identifiers (`SetId`) and key-value attributes (e.g., `"organism": "test"`), saved to disk via `SaveMetadata`.
- **Efficient Iteration**:
The iterator (`ksg.Iterator(setID)`) yields *sorted*, deduplicated canonical *k*-mers — using a k-way merge across internal partitions.
- **Frequency Filtering**:
`WithMinFrequency(n)` ensures only *k*-mers appearing ≥*n* times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering).
- **Multi-set Support**:
Handles multiple independent *k*-mer sets (e.g., per sample or taxonomic group), verified via `Size()` and indexed access (`Len(setID)`).
### Testing Coverage
Comprehensive unit tests validate:
- Basic construction & correctness,
- Multi-sequence ingestion and deduplication,
- Frequency-based inclusion/exclusion logic,
- Cross-set isolation (`nSets > 1`),
- Metadata round-trip integrity.
This module is designed for scalable, reproducible *k*-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).