mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
45 lines
2.2 KiB
Markdown
45 lines
2.2 KiB
Markdown
|
|
# K-mer Set Group Builder — Semantic Description
|
||
|
|
|
||
|
|
This Go module (`obikmer`) provides a **disk-backed builder and accessor** for managing *k-mer sets* across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical *k*-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
|
||
|
|
|
||
|
|
### Core Functionalities
|
||
|
|
|
||
|
|
- **K-mer Set Group Construction**:
|
||
|
|
`NewKmerSetGroupBuilder` creates a builder configured with:
|
||
|
|
- *k* (k-mer length),
|
||
|
|
- *m* (minimal unique substring for partitioning),
|
||
|
|
- number of sets (`nSets`),
|
||
|
|
- and optional parameters like `WithMinFrequency`.
|
||
|
|
|
||
|
|
- **Sequence Ingestion**:
|
||
|
|
Sequences are added per set via `AddSequence(setID, bioseq)`. Internally:
|
||
|
|
- Canonical *k*-mers are extracted (using `IterCanonicalKmers`),
|
||
|
|
- Deduplicated and optionally filtered by occurrence frequency.
|
||
|
|
|
||
|
|
- **Persistence & Round-Trip**:
|
||
|
|
`builder.Close()` materializes the *k*-mer sets to disk (in temp or specified directory).
|
||
|
|
`OpenKmerSetGroup(dir)` reloads them — preserving all metadata and structure.
|
||
|
|
|
||
|
|
- **Metadata & Attributes**:
|
||
|
|
Supports custom identifiers (`SetId`) and key-value attributes (e.g., `"organism": "test"`), saved to disk via `SaveMetadata`.
|
||
|
|
|
||
|
|
- **Efficient Iteration**:
|
||
|
|
The iterator (`ksg.Iterator(setID)`) yields *sorted*, deduplicated canonical *k*-mers — using a k-way merge across internal partitions.
|
||
|
|
|
||
|
|
- **Frequency Filtering**:
|
||
|
|
`WithMinFrequency(n)` ensures only *k*-mers appearing ≥*n* times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering).
|
||
|
|
|
||
|
|
- **Multi-set Support**:
|
||
|
|
Handles multiple independent *k*-mer sets (e.g., per sample or taxonomic group), verified via `Size()` and indexed access (`Len(setID)`).
|
||
|
|
|
||
|
|
### Testing Coverage
|
||
|
|
|
||
|
|
Comprehensive unit tests validate:
|
||
|
|
- Basic construction & correctness,
|
||
|
|
- Multi-sequence ingestion and deduplication,
|
||
|
|
- Frequency-based inclusion/exclusion logic,
|
||
|
|
- Cross-set isolation (`nSets > 1`),
|
||
|
|
- Metadata round-trip integrity.
|
||
|
|
|
||
|
|
This module is designed for scalable, reproducible *k*-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).
|