Files
obitools4/autodoc/docmd/pkg/obikmer/kmer_set_builder_test.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

2.2 KiB

K-mer Set Group Builder — Semantic Description

This Go module (obikmer) provides a disk-backed builder and accessor for managing k-mer sets across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical k-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.

Core Functionalities

  • K-mer Set Group Construction:
    NewKmerSetGroupBuilder creates a builder configured with:

    • k (k-mer length),
    • m (minimal unique substring for partitioning),
    • number of sets (nSets),
    • and optional parameters like WithMinFrequency.
  • Sequence Ingestion:
    Sequences are added per set via AddSequence(setID, bioseq). Internally:

    • Canonical k-mers are extracted (using IterCanonicalKmers),
    • Deduplicated and optionally filtered by occurrence frequency.
  • Persistence & Round-Trip:
    builder.Close() materializes the k-mer sets to disk (in temp or specified directory).
    OpenKmerSetGroup(dir) reloads them — preserving all metadata and structure.

  • Metadata & Attributes:
    Supports custom identifiers (SetId) and key-value attributes (e.g., "organism": "test"), saved to disk via SaveMetadata.

  • Efficient Iteration:
    The iterator (ksg.Iterator(setID)) yields sorted, deduplicated canonical k-mers — using a k-way merge across internal partitions.

  • Frequency Filtering:
    WithMinFrequency(n) ensures only k-mers appearing ≥n times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering).

  • Multi-set Support:
    Handles multiple independent k-mer sets (e.g., per sample or taxonomic group), verified via Size() and indexed access (Len(setID)).

Testing Coverage

Comprehensive unit tests validate:

  • Basic construction & correctness,
  • Multi-sequence ingestion and deduplication,
  • Frequency-based inclusion/exclusion logic,
  • Cross-set isolation (nSets > 1),
  • Metadata round-trip integrity.

This module is designed for scalable, reproducible k-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).