K-means is an unsupervised learning algorithm that partitions data points into k clusters. It consists of an initialization step followed by iterative Expectation-Maximization (EM) steps that repeat until convergence or until a maximum number of iterations is reached.
During initialization, k centroids are chosen, typically at random from the data points or with a smarter seeding scheme such as k-means++. In the Expectation (E) step, each data point is assigned to the cluster of its closest centroid (by Euclidean distance). In the Maximization (M) step, each centroid is updated to the mean of its assigned points, which minimizes the inertia, or within-cluster sum-of-squares (WCSS). The E and M steps are repeated until convergence to a local minimum, where the cluster assignments and centroids no longer change.
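The EM loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes random initialization from the data points (rather than k-means++) and stops when assignments no longer change.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initialization, then repeated E/M steps."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # E step: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: cluster assignments no longer change
        labels = new_labels
        # M step: move each centroid to the mean of its assigned points,
        # which minimizes the within-cluster sum-of-squares (WCSS)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; in general it only finds a local minimum, which is why libraries typically rerun it from several random initializations.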
When to use K-means
1. You want interpretability: K-means is easy to understand and interpret.
2. Clusters are even-sized and globular-shaped: K-means works well when clusters are well-separated globular shapes, but performs poorly when clusters are elongated or irregularly shaped.
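In practice, you rarely implement K-means yourself. A short sketch of the typical workflow, assuming scikit-learn is available, on synthetic globular clusters (the setting where K-means shines):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated, globular clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)  # one centroid per cluster
print(km.inertia_)          # WCSS of the fitted assignment
```

The `labels` array gives each point's cluster index, and the fitted centroids are easy to inspect, which is what makes K-means so interpretable.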
When NOT to use K-means
1. You are unsure about the number of clusters: K-means requires the number of clusters to be specified in advance. The Elbow method, which plots WCSS against the number of clusters, is commonly used to choose a reasonable value.
2. You have outliers in the data: every data point is assigned to a cluster, so outliers can skew the centroids and should be removed or transformed beforehand.
3. You want computational efficiency: the cost of K-means grows with the size of the data, as it runs in O(tkn) time, where t is the number of iterations, k is the number of clusters, and n is the number of data points. Dimensionality-reduction techniques such as PCA can speed up computation.
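The Elbow method mentioned in point 1 can be sketched as follows, assuming scikit-learn: fit K-means for a range of k values, record the WCSS (`inertia_`) for each, and look for the k where the curve flattens out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 underlying clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

# WCSS (inertia) for k = 1..8; the "elbow" where the curve
# stops dropping sharply suggests a reasonable number of clusters
ks = list(range(1, 9))
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

for k, w in zip(ks, wcss):
    print(f"k={k}: WCSS={w:.1f}")
```

WCSS always decreases as k grows (more clusters can only reduce within-cluster variance), so the point of diminishing returns, rather than the minimum, is what you look for.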
QUESTION I: Is K-median a viable option for handling outliers while clustering?
QUESTION II: How should the value of k be set prior to the analysis?
REFERENCE: 6 Types of Clustering Methods — An Overview Blog