mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
obistats Package: K-Means Clustering Implementation
The obistats package provides a concurrent, type-generic implementation of the K-means clustering algorithm for numerical datasets.
Core Utilities
- SquareDist / EuclideanDist: Compute squared and Euclidean distances between vectors (generic over float64 or int).
- DefaultRG: Returns a seeded random number generator (*rand.Rand) for reproducibility control.
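As an illustration of what generic distance helpers of this kind can look like, here is a minimal sketch. It is not the package's source: the Number constraint, the exact signatures, and the panic on mismatched lengths are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// Number is a hypothetical constraint covering the two element types the
// README mentions; the real package may name or define it differently.
type Number interface {
	~int | ~float64
}

// SquareDist returns the squared Euclidean distance between a and b.
// Panicking on a length mismatch is an assumption of this sketch.
func SquareDist[T Number](a, b []T) T {
	if len(a) != len(b) {
		panic("SquareDist: vectors must have the same length")
	}
	var sum T
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// EuclideanDist returns the square root of SquareDist as a float64.
func EuclideanDist[T Number](a, b []T) float64 {
	return math.Sqrt(float64(SquareDist(a, b)))
}

func main() {
	fmt.Println(SquareDist([]int{0, 0}, []int{3, 4}))            // 25
	fmt.Println(EuclideanDist([]float64{0, 0}, []float64{3, 4})) // 5
}
```

Working with the squared distance avoids a square root in inner loops, which is why nearest-center searches usually compare squared values and only take the root when a true distance is needed.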
Data Structure
- KmeansClustering: Encapsulates the dataset (*obiutils.Matrix[float64]), cluster assignments, centers, and metadata (cluster sizes, distances to the nearest center).
- Supports dynamic addition of clusters via AddACenter().
Initialization & Management
- MakeKmeansClustering: Initializes the structure with the data, the number of clusters k, and an RNG.
- SetCenterTo, AddACenter: Assign or grow centers; new centers are chosen by k-means++-inspired weighted sampling.
- ResetEmptyCenters: Reinitializes empty clusters using distance-weighted sampling.
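The distance-weighted sampling behind the k-means++-style seeding can be sketched as follows. This is an illustration under assumptions, not the package's code: sampleNewCenter and newRNG are hypothetical names, and the weights are taken to be each point's squared distance to its nearest existing center.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRNG is a hypothetical stand-in for a seeded generator such as the one
// DefaultRG is described as returning.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// sampleNewCenter returns the index of a point drawn with probability
// proportional to weights[i]. It returns -1 when no weight is positive.
func sampleNewCenter(weights []float64, rng *rand.Rand) int {
	total := 0.0
	for _, w := range weights {
		total += w
	}
	if total <= 0 {
		return -1
	}
	target := rng.Float64() * total
	acc := 0.0
	for i, w := range weights {
		acc += w
		if target < acc {
			return i
		}
	}
	// Floating-point rounding fallback: last point with a positive weight.
	for i := len(weights) - 1; i >= 0; i-- {
		if weights[i] > 0 {
			return i
		}
	}
	return -1
}

func main() {
	// Points 0 and 2 coincide with existing centers (weight 0), so only
	// points 1 and 3 can be drawn, with point 1 four times as likely.
	weights := []float64{0, 4, 0, 1}
	fmt.Println(sampleNewCenter(weights, newRNG(42)))
}
```

Because a point that already sits on a center has weight zero, it can never be picked again, which is what makes this scheme suitable both for adding a center and for reseeding an empty cluster.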
Core Algorithm Steps
- AssignToClass: Parallel assignment of points to their nearest centers (uses goroutines and a mutex).
- ComputeCenters: Computes each new cluster center as the original data point closest to the cluster's arithmetic mean (robust for non-Euclidean spaces).
- Run: Repeats assignment and center updates until max_cycle iterations are reached or the drop in inertia is ≤ threshold.
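The three steps above can be sketched together on plain slices. This is a simplified illustration, not the package's implementation: all names are hypothetical, one goroutine is started per point for brevity, and the new center is searched over the whole dataset, which is one possible reading of "closest original data point".

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// nearest returns the index of the candidate closest to p and the squared
// distance to it.
func nearest(p []float64, candidates [][]float64) (int, float64) {
	best, bestD := -1, math.Inf(1)
	for i, c := range candidates {
		d := 0.0
		for j := range p {
			d += (p[j] - c[j]) * (p[j] - c[j])
		}
		if d < bestD {
			best, bestD = i, d
		}
	}
	return best, bestD
}

// assign labels every point with its nearest center, one goroutine per
// point, and returns the labels together with the inertia.
func assign(data, centers [][]float64) ([]int, float64) {
	classes := make([]int, len(data))
	inertia := 0.0
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i, p := range data {
		wg.Add(1)
		go func(i int, p []float64) {
			defer wg.Done()
			c, d := nearest(p, centers)
			classes[i] = c // distinct index per goroutine: no lock needed
			mu.Lock()      // the shared accumulator is guarded by the mutex
			inertia += d
			mu.Unlock()
		}(i, p)
	}
	wg.Wait()
	return classes, inertia
}

// computeCenters replaces each center with the data point closest to the
// cluster mean; an empty cluster keeps its previous center.
func computeCenters(data [][]float64, classes []int, centers [][]float64) [][]float64 {
	next := make([][]float64, len(centers))
	for c := range centers {
		mean := make([]float64, len(data[0]))
		size := 0
		for i, p := range data {
			if classes[i] != c {
				continue
			}
			size++
			for j, v := range p {
				mean[j] += v
			}
		}
		if size == 0 {
			next[c] = centers[c]
			continue
		}
		for j := range mean {
			mean[j] /= float64(size)
		}
		idx, _ := nearest(mean, data)
		next[c] = data[idx]
	}
	return next
}

// run alternates both steps until maxCycle is reached or the inertia
// drops by no more than threshold.
func run(data, centers [][]float64, maxCycle int, threshold float64) ([]int, [][]float64, float64) {
	classes, inertia := assign(data, centers)
	for cycle := 0; cycle < maxCycle; cycle++ {
		centers = computeCenters(data, classes, centers)
		newClasses, newInertia := assign(data, centers)
		drop := inertia - newInertia
		classes, inertia = newClasses, newInertia
		if drop <= threshold {
			break
		}
	}
	return classes, centers, inertia
}

func main() {
	data := [][]float64{{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}}
	classes, centers, inertia := run(data, [][]float64{{0, 0}, {0, 1}}, 10, 0)
	fmt.Println(classes, centers, inertia)
}
```

Snapping each center to an actual data point means the inertia is not guaranteed to decrease at every cycle, so the sketch also stops when the drop is zero or negative. A production version would typically split the points into chunks per worker instead of starting one goroutine per point.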
Accessors & Diagnostics
- K(), N(), Dimension(): Return the number of clusters, the dataset size, and the feature dimension.
- Inertia(): Sum of squared distances from each point to its assigned center (convergence metric).
- Centers, Classes, Sizes: Expose the internal clustering state.
Design Highlights
- Fully concurrent (goroutine-based) for performance.
- Generic distance functions support both int and float64.
- Explicit handling of edge cases (empty clusters, convergence).
- Logging via logrus for debugging (obilog.Warnf).
Note: High-level wrapper functions (e.g., a standalone Kmeans) are commented out but outline the intended API usage.