
Features and functionality described on this page are available with Prism Enterprise.

One of the most important steps of a K-means analysis is defining the initial locations of the cluster centers. K-means clustering starts by assigning k data points to be the initial cluster centers, then uses the specified K-means algorithm to iteratively assign the remaining data points to these clusters and update the cluster centers. This iterative process continues until no data point changes its cluster assignment and no cluster center changes location (this is called “convergence”).
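The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration of the general K-means iteration, not Prism's implementation; the function and variable names are chosen for clarity, and it assumes no cluster ever ends up empty:

```python
import numpy as np

def kmeans(data, centers, max_iter=100):
    """Minimal K-means loop: alternate assignment and update steps until convergence.

    data:    (n, d) array of points
    centers: (k, d) array of initial cluster centers
    """
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([data[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        # Convergence: stop when no center changes location
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example: two well-separated pairs of points, k = 2
data = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centers, labels = kmeans(data, np.array([[0., 0.], [10., 10.]]))
# labels → [0, 0, 1, 1]; centers → [[0, 0.5], [10, 10.5]]
```

Note how the quality of the final result depends entirely on the `centers` passed in, which is exactly the initialization problem discussed next.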

However, the selection of the initial clusters may strongly influence how well the data are clustered. This is because the K-means algorithm can get “trapped” by local minima when trying to optimize cluster assignments. Classical approaches to address this problem involve selecting k data points at random to serve as the cluster centers, performing the clustering analysis, then repeating the entire analysis numerous times, choosing a new set of points to use as the initial k cluster centers each time. Once all of the analyses have been performed, the “best” set of initial clusters is selected. But this means that the analysis must be performed numerous times (common values range anywhere from 10 to 100 repeats of the entire clustering analysis). This can be computationally intensive, and can take quite a long time depending on the complexity of the data.

The K-means++ approach of initialization was designed to help address this issue of poor initialization coupled with the need to repeat the entire analysis numerous times. The way K-means++ initialization works is the following:

1. Define the number of clusters (k) to use for the analysis

2. Select one data point at random from the input data as the first cluster center

3. Calculate D(x), the distance of every other point in the input data from its nearest existing cluster center

4. Compute a probability distribution, P(x), for every data point that is proportional to the square of its distance from its nearest existing cluster center. Thus, points farther from the existing cluster centers will have a higher probability of being selected than points closer to them (proportional to D(x)²)

5. Select the next cluster center at random using this probability distribution

6. Repeat steps 3-5 until all k cluster centers have been specified
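The steps above can be sketched directly in NumPy. This is an illustrative implementation of the standard K-means++ initialization, not Prism's own code; the function name and the `rng` parameter are assumptions made for the example:

```python
import numpy as np

def kmeans_pp_init(data, k, rng=None):
    """K-means++ initialization: spread the k initial centers using D(x)² weighting.

    data: (n, d) array of points; k: number of clusters.
    """
    rng = np.random.default_rng(rng)
    # Step 2: pick the first center uniformly at random from the data
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Step 3: D(x)² = squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        # Step 4: probability proportional to D(x)² (already-chosen points get p = 0)
        p = d2 / d2.sum()
        # Step 5: sample the next center from that distribution
        centers.append(data[rng.choice(len(data), p=p)])
    return np.array(centers)

# Example: the second center is almost certain to land in the far-away pair
data = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centers = kmeans_pp_init(data, k=2, rng=0)
```

Because an already-chosen point has D(x) = 0, it can never be selected again, and distant points dominate the distribution, which is what produces the well-spread starting centers described below.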

Looking at these steps, it may be obvious that this initialization process requires somewhat more computation than simply selecting k cluster centers at random from the data. However, the tradeoff is that the analysis does not have to be repeated in its entirety when K-means++ initialization is used! Moreover, the K-means clustering analysis typically converges much faster when K-means++ initialization is used, resulting in faster overall analysis times. Finally, recall that each cluster center after the first is selected based on its distance from the nearest existing cluster center. This means that the initial cluster centers are already well spread out, which often results in better clustering results with lower within-cluster variance.

Because of these advantages, Prism uses only the K-means++ initialization approach when performing K-means clustering.

© 1995-2019 GraphPad Software, LLC. All rights reserved.