Features and functionality described on this page are available with Prism Enterprise.
K-means clustering analysis works by starting with a set of k cluster centers (where k is a number that you must specify). An algorithm then iteratively assigns each data point to its closest cluster center (using a distance metric that you must specify) and updates the locations of the cluster centers. The algorithm repeats this process of assignment and cluster center updating until no more data points change their cluster assignment. Once this happens, the analysis is said to have reached “convergence”.
The previous section of this guide discussed the process by which Prism determines the initial locations of the cluster centers for the analysis. This section describes the different algorithms that the analysis can use to conduct the iterative process of assignment and cluster location updating. There are actually three algorithms available within Prism:
1. Forgy-Lloyd
2. MacQueen
3. Hartigan-Wong
Note that this is not the order in which they’re presented in the analysis parameters dialog. There, the Hartigan-Wong algorithm is listed first, as it is the most modern and most reliable option (it is also the default for Prism). However, to best understand each of these options, it’s easiest to start with the original Forgy-Lloyd algorithm.
The steps of the Forgy-Lloyd algorithm are as follows:
1. Assignment: assign each data point to the nearest cluster center
2. Update: calculate the centroid of each cluster using the data assigned to it
3. Repeat Steps 1 and 2 until convergence or the maximum number of iterations is reached
It’s actually that simple. First, all data points are assigned to their nearest cluster center, and then the cluster centers are updated. Then the data points are all assigned again to their nearest cluster center. Because the cluster centers were updated, some data points may change their assignment. If this happens, the locations of the cluster centers are updated again (because the data in each cluster would have changed). This simply repeats until no data points change cluster assignment or the maximum number of iterations is reached.
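To make these steps concrete, here is a minimal Python sketch of the Forgy-Lloyd loop. This is an illustration only, not Prism's implementation: it assumes NumPy, Euclidean distance as the metric, and initial centers that were already chosen (for example, by K-means++), and the function name lloyd_kmeans is hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iter=100):
    """Forgy-Lloyd: assign ALL points, then update ALL centers, and repeat."""
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Convergence: stop once no point changes its cluster assignment
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: recompute each centroid from the points assigned to it
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers
```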
The challenge with this algorithm is that it does not update cluster centers until all points have been assigned. As a result, even though the algorithm is simpler, the Forgy-Lloyd algorithm can take many more iterations to reach convergence than the others, because every point must be reassigned in each iteration. This is especially true if the initial clusters are far from optimal, though combining the Forgy-Lloyd algorithm with K-means++ initialization somewhat reduces this issue. However, even with K-means++ initialization, it’s still possible for an algorithm to get “stuck” in a local minimum rather than finding the true optimal solution, and the Forgy-Lloyd algorithm is more prone to this problem than the other two alternatives that Prism provides.
The major improvement of the MacQueen algorithm over the Forgy-Lloyd algorithm is that it introduces a cluster center update step after each data point assignment rather than waiting for all points to be assigned. Here are the general steps:
1. For each data point
a. Assignment: assign this point to its nearest cluster center
b. Update: calculate the new center of the cluster the data point was assigned to
2. Repeat Step 1 for every data point
3. Repeat Steps 1 and 2 until convergence or the maximum number of iterations is reached
This is a modest improvement over the Forgy-Lloyd algorithm, as it allows the cluster centers to be dynamically updated after each data point assignment. This often translates to faster convergence than the Forgy-Lloyd algorithm. However, like the Forgy-Lloyd algorithm, the MacQueen algorithm can also more readily get stuck in local minima than the Hartigan-Wong algorithm.
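The difference from Forgy-Lloyd is easiest to see in code. Below is a sketch of the same loop with MacQueen-style immediate updates, under the same assumptions as before (NumPy, Euclidean distance, pre-chosen initial centers, hypothetical function name). Note that only the two centroids affected by a reassignment need to be recomputed:

```python
import numpy as np

def macqueen_kmeans(X, centers, max_iter=100):
    """MacQueen: update the affected cluster centers after EVERY assignment."""
    # Initial assignment of every point to its nearest center
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            # Assignment: nearest center for this single point
            j = np.linalg.norm(centers - x, axis=1).argmin()
            if j != labels[i]:
                old, labels[i], changed = labels[i], j, True
                # Update: immediately recompute the two centroids this move affects
                for c in (old, j):
                    members = X[labels == c]
                    if len(members) > 0:
                        centers[c] = members.mean(axis=0)
        if not changed:
            break
    return labels, centers
```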
The final algorithm (and the default used by Prism) is the Hartigan-Wong algorithm. This is the newest algorithm of the three that Prism offers, and is generally considered to be one of the most robust algorithms available for K-means clustering. The steps for the Hartigan-Wong algorithm are as follows:
1. For each data point
a. For each cluster center
i. Assignment: assign this point to this cluster center
ii. Calculate the total sum of squared distances for all data points to their assigned cluster centers
b. Assignment: assign the data point to the cluster center that resulted in the smallest total sum of squared distances
2. Repeat Step 1 for every data point
3. Repeat Steps 1 and 2 until convergence or the maximum number of iterations is reached
Immediately, you may notice that this algorithm is much more computationally intensive. It constantly calculates the (squared) distances from all data points to their assigned cluster centers, and uses the sum of these values to determine the best possible cluster assignment for each data point.
However, unlike the other two common algorithms, the Hartigan-Wong algorithm directly optimizes the total sum of squared distances across the entire dataset at every step of the process. This means that any given data point might be assigned to a cluster center that is NOT the closest cluster center to that data point! This can happen if assigning the data point to a cluster whose center is farther away means that the total sum of squared distances of all points to their cluster centers is reduced. One way to understand how this might happen: assigning the data point to the closer cluster center would pull that center away from the other points in its cluster, increasing their squared distances by more than the nearer assignment would save.
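A deliberately naive sketch can show how this plays out: for each candidate cluster, temporarily reassign the point, recompute the centroids, and measure the resulting total sum of squared distances, keeping whichever assignment makes that total smallest. This brute-force version is an illustration of the steps above, not Prism's implementation (production Hartigan-Wong code uses an efficient incremental update formula rather than full recomputation), and the function names are hypothetical:

```python
import numpy as np

def total_sse(X, labels, k):
    """Total within-cluster sum of squared distances for a given assignment."""
    sse = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            centroid = members.mean(axis=0)  # centroid shifts with the trial assignment
            sse += ((members - centroid) ** 2).sum()
    return sse

def hartigan_wong_kmeans(X, centers, max_iter=100):
    """Hartigan-Wong (naive): each point goes wherever the TOTAL sum of
    squared distances is smallest, which may not be its nearest center."""
    k = len(centers)
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    for _ in range(max_iter):
        changed = False
        for i in range(len(X)):
            # Try every cluster for this point; keep the lowest-SSE assignment
            best_j, best_sse = labels[i], None
            for j in range(k):
                trial = labels.copy()
                trial[i] = j
                sse = total_sse(X, trial, k)
                if best_sse is None or sse < best_sse:
                    best_j, best_sse = j, sse
            if best_j != labels[i]:
                labels[i] = best_j
                changed = True
        if not changed:
            break
    # Final centroids from the converged assignment
    for j in range(k):
        if np.any(labels == j):
            centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Because total_sse recomputes the centroids for each trial assignment, the cost of pulling a centroid away from its other members is captured automatically, which is exactly why a point can end up assigned to a center that is not its nearest.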