mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
35 lines
2.0 KiB
Markdown
35 lines
2.0 KiB
Markdown
# `obistats` Package: K-Means Clustering Implementation
|
|
|
|
The `obistats` package provides a concurrent, type-generic implementation of the **K-means clustering algorithm** for numerical datasets.
|
|
|
|
## Core Utilities
|
|
- `SquareDist` / `EuclideanDist`: Compute squared and Euclidean distances between vectors (generic over `float64` or `int`).
|
|
- `DefaultRG`: Returns a seeded random number generator (`*rand.Rand`) for reproducibility control.
|
|
|
|
## Data Structure
|
|
- `KmeansClustering`: Encapsulates dataset (`*obiutils.Matrix[float64]`), cluster assignments, centers, and metadata (sizes, distances to nearest center).
|
|
- Supports dynamic addition of clusters via `AddACenter()`.
|
|
|
|
## Initialization & Management
|
|
- `MakeKmeansClustering`: Initializes the structure with data, number of clusters *k*, and RNG.
|
|
- `SetCenterTo`, `AddACenter`: Assign or grow centers; uses **k-means++**-inspired weighted sampling for new centers.
|
|
- `ResetEmptyCenters`: Reinitializes empty clusters using distance-weighted sampling.
|
|
|
|
## Core Algorithm Steps
|
|
- `AssignToClass`: Parallel assignment of points to nearest centers (uses goroutines + mutex).
|
|
- `ComputeCenters`: Computes new cluster centroids *as the closest original data point* to the arithmetic mean (robust for non-Euclidean spaces).
|
|
- `Run`: Executes iterative refinement until convergence (`max_cycle` iterations or inertia drop ≤ threshold).
|
|
|
|
## Accessors & Diagnostics
|
|
- `K()`, `N()`, `Dimension()`: Return number of clusters, dataset size, and feature dimension.
|
|
- `Inertia()`: Sum of squared distances to assigned centers (convergence metric).
|
|
- `Centers`, `Classes`, `Sizes`: Expose internal clustering state.
|
|
|
|
## Design Highlights
|
|
- Fully concurrent (goroutine-based) for performance.
|
|
- Generic distance functions support both `int` and `float64`.
|
|
- Explicit handling of edge cases (empty clusters, convergence).
|
|
- Logging via `logrus` for debugging (`obilog.Warnf`).
|
|
|
|
> *Note: High-level wrapper functions (e.g., standalone `Kmeans`) are commented out but outline intended API usage.*
|