Global AI and Data Science


K Means Clustering

By Moloy De posted Fri October 27, 2023 09:54 PM

  
K-means is an unsupervised learning algorithm that partitions data points into k clusters. It involves an initialization step followed by iterative Expectation Maximization (EM) steps, repeated until convergence or until the maximum number of iterations is reached.
 
During initialization, k centroids are chosen, for example at random from the data or with a seeding scheme such as k-means++. In the Expectation (E) step, each data point is assigned to the cluster of its closest centroid by Euclidean distance. In the Maximization (M) step, each centroid is updated to the mean of its assigned points, which minimizes the inertia, or within-cluster sum of squares (WCSS). The E and M steps are repeated until convergence to a local minimum, where the cluster assignments and centroids no longer change.
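As a concrete illustration, the initialize/E-step/M-step loop described above might look like the following minimal NumPy sketch (function and variable names are illustrative, not from any particular library; the empty-cluster guard is a simplification):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm): the E-step assigns points to their
    nearest centroid, the M-step moves each centroid to its cluster mean."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # E-step: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as its cluster mean (minimizes WCSS);
        # keep the old centroid if a cluster happens to become empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged to a local minimum
            break
        centroids = new
    return labels, centroids
```

In practice, scikit-learn's KMeans implements the same loop with k-means++ seeding and multiple random restarts to reduce the risk of a poor local minimum.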
 
When to use K-means
 
1. You want interpretability: K-means is easy to understand and interpret.
2. Clusters are even-sized and globular-shaped: K-means works well when clusters are well-separated, roughly spherical, and similar in size, but performs poorly when clusters are elongated or irregularly shaped.
 
When to NOT use K-means
 
1. You are unsure about the number of clusters: K-means requires the number of clusters to be specified in advance. Usually, the Elbow method, which plots WCSS against the number of clusters, is used to choose a reasonable value.
2. You have outliers in the data: every data point is assigned to a cluster, so outliers can skew the centroids and should be removed or transformed first.
3. You want computational efficiency: the computation cost of K-means increases with the size of the data, as it runs in O(tkn) time, where t is the number of iterations, k is the number of clusters, and n is the number of data points. Dimensionality-reduction algorithms such as PCA can speed up computation.
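The Elbow method from point 1 can be sketched as follows (an illustrative example, not a prescribed recipe: WCSS is computed here with a plain K-means loop, and the data and names are made up for the demo):

```python
import numpy as np

def wcss(X, k, max_iter=50, seed=0):
    """Within-cluster sum of squares (inertia) after running plain K-means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Inertia: sum of squared distances to the nearest final centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

# Elbow method: WCSS always decreases as k grows, so look for the k where
# the marginal improvement drops off sharply (the "elbow").
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])  # 3 true clusters
for k in range(1, 7):
    print(k, round(float(wcss(X, k)), 1))
```

On data like this, the WCSS curve drops steeply up to k = 3 and flattens afterwards, which is the elbow pointing at the true cluster count.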


QUESTION I: Is K-median a viable option to handle outliers while clustering?

QUESTION II: How to set the value of K prior to the analysis?

REFERENCE: 6 Types of Clustering Methods — An Overview Blog


Comments

Mon April 08, 2024 05:54 PM

To help with your first question:

Yes, K-median clustering can be a viable option for handling outliers to some extent, but it might not be the best choice in all scenarios. Here's how K-median clustering can address outliers:

  1. Robustness to outliers: K-median clustering tends to be more robust to outliers compared to K-means clustering. This is because K-median uses medians instead of means to calculate cluster centroids. Since medians are less sensitive to outliers than means, K-median can produce better cluster assignments when dealing with datasets that contain outliers.

  2. Distance metric choice: The choice of distance metric in K-median clustering can also affect its robustness to outliers. Using robust distance metrics such as Manhattan distance or Mahalanobis distance can further improve the ability of K-median to handle outliers.

  3. Parameter tuning: Adjusting the value of K (the number of clusters) can also help in dealing with outliers. Increasing the value of K can allow outliers to form their own clusters instead of being lumped together with other data points.
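To make point 1 concrete, here is a minimal K-medians sketch (illustrative names and data, not a library implementation): assignment uses Manhattan (L1) distance and the update step takes the coordinate-wise median, which a single extreme outlier barely moves, whereas the mean is dragged far off.

```python
import numpy as np

def kmedians(X, k, max_iter=100, seed=0):
    """K-medians sketch: assign points by Manhattan (L1) distance, then
    update each center to the coordinate-wise median of its cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: L1 (Manhattan) distance to each center
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: medians are robust to outliers; keep old center if empty
        new = np.array([np.median(X[labels == j], axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# One extreme outlier barely moves the median center, while the mean is dragged:
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), [[1000.0, 1000.0]]])
labels, centers = kmedians(X, 1)
print("median center:", centers[0], "  mean:", X.mean(axis=0))
```

The median center stays near the origin where the bulk of the data lies, while the mean of the same data is pulled far toward the outlier.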

However, despite these advantages, K-median clustering may still struggle with very high-dimensional data or datasets with extremely skewed distributions of points. In such cases, other clustering algorithms specifically designed to handle outliers, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) or OPTICS (Ordering Points To Identify the Clustering Structure), might be more suitable.
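A minimal sketch of the DBSCAN idea follows (illustrative only, with made-up parameter names; for real use, scikit-learn's DBSCAN is the practical choice): core points have at least min_pts neighbours within radius eps, clusters grow outward from core points, and anything unreachable is labelled -1 as noise instead of being forced into a cluster.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Tiny DBSCAN sketch: grow clusters from core points; points not
    density-reachable from any core point are labelled -1 (noise)."""
    n = len(X)
    # Pairwise Euclidean distances; neighbors[i] includes i itself (distance 0)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        # Skip already-labelled points and non-core points
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        # Start a new cluster from this unvisited core point and expand it
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:  # expand only through core points
                    stack.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),
               rng.normal(5, 0.1, (20, 2)),
               [[50.0, 50.0]]])          # two tight blobs plus one far outlier
labels = dbscan(X, eps=0.5, min_pts=4)
print(labels)
```

Unlike K-means, the far-away point ends up with label -1 (noise) rather than skewing a cluster's centroid, which is exactly the property that makes density-based methods attractive for outlier-heavy data.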

In summary, while K-median clustering can offer some degree of robustness to outliers, its effectiveness depends on the specific characteristics of the dataset and the clustering task at hand. It's essential to consider the nature of the data and possibly experiment with different clustering algorithms to find the most suitable one for handling outliers.