# `obistats` Package: K-Means Clustering Implementation
The `obistats` package provides a concurrent, type-generic implementation of the **K-means clustering algorithm** for numerical datasets.
## Core Utilities
- `SquareDist` / `EuclideanDist`: Compute squared and Euclidean distances between vectors (generic over `float64` or `int`).
- `DefaultRG`: Returns a seeded random number generator (`*rand.Rand`) for reproducibility control.
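
As an illustration of how distance helpers can be made generic over `int` and `float64`, here is a minimal sketch. The `Number` constraint and the lowercase function names are assumptions made for this example and are not taken from the package source.

```go
package main

import (
	"fmt"
	"math"
)

// Number is a hypothetical constraint for this sketch; the package's own
// constraint may be named and defined differently.
type Number interface {
	~int | ~float64
}

// squareDist returns the squared Euclidean distance between a and b.
// It panics if the vectors differ in length.
func squareDist[T Number](a, b []T) T {
	if len(a) != len(b) {
		panic("squareDist: vectors must have the same length")
	}
	var sum T
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// euclideanDist returns the Euclidean distance as a float64, whatever T is.
func euclideanDist[T Number](a, b []T) float64 {
	return math.Sqrt(float64(squareDist(a, b)))
}

func main() {
	fmt.Println(squareDist([]int{0, 0}, []int{3, 4}))            // 25
	fmt.Println(euclideanDist([]float64{0, 0}, []float64{3, 4})) // 5
}
```

Comparing squared distances is enough to find the nearest center, so the square root is only needed when an actual distance value is reported.
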
## Data Structure
- `KmeansClustering`: Encapsulates dataset (`*obiutils.Matrix[float64]`), cluster assignments, centers, and metadata (sizes, distances to nearest center).
- Supports dynamic addition of clusters via `AddACenter()`.
## Initialization & Management
- `MakeKmeansClustering`: Initializes the structure with data, number of clusters *k*, and RNG.
- `SetCenterTo`, `AddACenter`: Set an existing center or add a new one; new centers are drawn with **k-means++**-inspired weighted sampling.
- `ResetEmptyCenters`: Reinitializes empty clusters using distance-weighted sampling.
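
The distance-weighted sampling mentioned above can be sketched as follows. This is a generic k-means++-style draw, assuming non-negative weights equal to each point's squared distance to its nearest existing center; `pickWeighted` and `newRNG` are hypothetical names, not the package's functions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRNG is a hypothetical stand-in for a seeded generator.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// pickWeighted returns index i with probability weights[i]/sum(weights),
// or -1 when every weight is zero. Weights are assumed non-negative.
func pickWeighted(weights []float64, rng *rand.Rand) int {
	total := 0.0
	for _, w := range weights {
		total += w
	}
	if total <= 0 {
		return -1
	}
	target := rng.Float64() * total
	acc, last := 0.0, -1
	for i, w := range weights {
		if w <= 0 {
			continue // zero-weight points can never be drawn
		}
		acc += w
		last = i
		if target < acc {
			return i
		}
	}
	return last // reached only through floating-point round-off
}

func main() {
	// Squared distance from each point to its nearest existing center.
	// Points 0 and 1 already are centers, so their weight is zero.
	weights := []float64{0, 0, 4, 12}
	rng := newRNG(42)
	counts := make([]int, len(weights))
	for n := 0; n < 10000; n++ {
		counts[pickWeighted(weights, rng)]++
	}
	// Indices 0 and 1 are never drawn; index 3 is drawn roughly
	// three times as often as index 2.
	fmt.Println(counts)
}
```

Because points far from every current center carry larger weights, new or reset centers tend to land in regions that are poorly covered.
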
## Core Algorithm Steps
- `AssignToClass`: Parallel assignment of points to nearest centers (uses goroutines + mutex).
- `ComputeCenters`: Sets each cluster's new center to the *original data point closest* to the cluster's arithmetic mean, so every center is an actual observation (intended as more robust in non-Euclidean spaces).
- `Run`: Repeats the refinement steps until the drop in inertia between iterations is ≤ a threshold or `max_cycle` iterations have been reached.
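
The three steps above can be sketched as one small loop. This is a simplified illustration on plain `[][]float64` slices, not the package's code: all function names are hypothetical, the nearest data point is searched over the whole dataset (an assumption, since the search could also be limited to cluster members), and empty clusters simply keep their previous center.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// nearest returns the index of the row closest to p and the squared distance.
func nearest(p []float64, rows [][]float64) (int, float64) {
	best, bestD := -1, math.Inf(1)
	for i, r := range rows {
		d := 0.0
		for j := range p {
			d += (p[j] - r[j]) * (p[j] - r[j])
		}
		if d < bestD {
			best, bestD = i, d
		}
	}
	return best, bestD
}

// assign labels each point with its nearest center using `workers` goroutines
// (workers must be >= 1) and returns the inertia; the mutex guards the total.
func assign(data, centers [][]float64, classes []int, workers int) float64 {
	var wg sync.WaitGroup
	var mu sync.Mutex
	inertia := 0.0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			local := 0.0
			for i := w; i < len(data); i += workers {
				c, d := nearest(data[i], centers)
				classes[i] = c // goroutines write disjoint indices
				local += d
			}
			mu.Lock()
			inertia += local
			mu.Unlock()
		}(w)
	}
	wg.Wait()
	return inertia
}

// updateCenters snaps each center to the data point closest to the mean of
// its members. Empty clusters keep their previous center in this sketch.
func updateCenters(data, centers [][]float64, classes []int) {
	sums := make([][]float64, len(centers))
	counts := make([]int, len(centers))
	for i, p := range data {
		k := classes[i]
		if sums[k] == nil {
			sums[k] = make([]float64, len(p))
		}
		counts[k]++
		for j, v := range p {
			sums[k][j] += v
		}
	}
	for k, s := range sums {
		if counts[k] == 0 {
			continue
		}
		for j := range s {
			s[j] /= float64(counts[k])
		}
		idx, _ := nearest(s, data)
		centers[k] = data[idx]
	}
}

// run repeats both steps until maxCycle updates have run or the inertia
// drops by no more than threshold; it returns the final inertia.
func run(data, centers [][]float64, classes []int, maxCycle int, threshold float64) float64 {
	prev := math.Inf(1)
	inertia := assign(data, centers, classes, 4)
	for cycle := 0; cycle < maxCycle && prev-inertia > threshold; cycle++ {
		updateCenters(data, centers, classes)
		prev = inertia
		inertia = assign(data, centers, classes, 4)
	}
	return inertia
}

func main() {
	data := [][]float64{{0}, {1}, {2}, {10}, {11}, {12}}
	centers := [][]float64{{0}, {1}} // arbitrary starting centers
	classes := make([]int, len(data))
	// Prints: 4 [0 0 0 1 1 1] [[1] [11]]
	fmt.Println(run(data, centers, classes, 10, 0), classes, centers)
}
```

In this sketch each goroutine accumulates a local partial sum and takes the lock once, which keeps contention on the mutex low compared with locking for every point.
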
## Accessors & Diagnostics
- `K()`, `N()`, `Dimension()`: Return number of clusters, dataset size, and feature dimension.
- `Inertia()`: Sum of squared distances to assigned centers (convergence metric).
- `Centers`, `Classes`, `Sizes`: Expose internal clustering state.
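
For reference, the diagnostics listed above can be computed from the clustering state as in the following sketch. It uses plain slices and hypothetical function names rather than the package's accessors.

```go
package main

import "fmt"

// inertia sums, over all points, the squared distance to the assigned center.
func inertia(data, centers [][]float64, classes []int) float64 {
	total := 0.0
	for i, p := range data {
		c := centers[classes[i]]
		for j := range p {
			d := p[j] - c[j]
			total += d * d
		}
	}
	return total
}

// sizes counts how many points are assigned to each of the k clusters.
func sizes(classes []int, k int) []int {
	out := make([]int, k)
	for _, c := range classes {
		out[c]++
	}
	return out
}

func main() {
	data := [][]float64{{0, 0}, {0, 2}, {10, 0}}
	centers := [][]float64{{0, 1}, {10, 0}}
	classes := []int{0, 0, 1}
	fmt.Println(inertia(data, centers, classes)) // 2
	fmt.Println(sizes(classes, len(centers)))    // [2 1]
}
```

A lower inertia means points sit closer to their centers, which is why the change in inertia between iterations is a natural stopping signal.
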
## Design Highlights
- Goroutine-based concurrency for performance.
- Generic distance functions support both `int` and `float64`.
- Explicit handling of edge cases (empty clusters, convergence).
- Logging via `logrus` for debugging (`obilog.Warnf`).
> *Note: High-level wrapper functions (e.g., standalone `Kmeans`) are commented out but outline intended API usage.*