Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters in which each cluster is distinct from the others, and the objects within each cluster are broadly similar to one another.
Hierarchical clustering can be performed either on a precomputed distance matrix or on raw data. When raw data is provided, the software automatically computes a distance matrix in the background. Common distance metrics include Euclidean, Manhattan (city block), and cosine distance.
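As a minimal sketch of the distance-matrix step, the snippet below uses SciPy's `pdist` on a small, made-up 2-D dataset (the points are purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: five points in 2-D.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])

# pdist returns the condensed (upper-triangle) distance vector;
# squareform expands it into the full symmetric distance matrix.
condensed = pdist(X, metric="euclidean")  # also: "cityblock", "cosine", ...
D = squareform(condensed)

print(D.shape)   # (5, 5)
print(D[0, 1])   # 1.0 — Euclidean distance between the first two points
```

Swapping the `metric` argument is all that is needed to try a different distance.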
Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps:
- Identify the two clusters that are closest together
- Merge the two most similar clusters.

This iterative process continues until all the clusters are merged together. This is illustrated in the diagrams below.
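The merge sequence above can be sketched with SciPy's `linkage`, which runs agglomerative clustering and records every merge (the data here is the same hypothetical toy set, and "average" linkage is just one of several possible choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: five points in 2-D.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])

# Each row of Z records one merge: the indices of the two clusters joined,
# the distance at which they merged, and the size of the new cluster.
Z = linkage(X, method="average", metric="euclidean")

print(Z.shape)  # (4, 4): n - 1 merges for n observations
```

For n observations there are always n - 1 merges, so `Z` has n - 1 rows, and the final row's cluster size equals n.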
The main output of Hierarchical Clustering is a dendrogram, which shows the hierarchical relationship between the clusters:
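A dendrogram can be produced from the linkage result with SciPy's `dendrogram`; with `no_plot=True` it returns the tree's layout data rather than drawing it (with matplotlib installed, `dendrogram(Z)` would render the figure directly). The data is again a hypothetical toy set:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data: five points in 2-D.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
Z = linkage(X, method="average")

# no_plot=True returns the dendrogram's structure instead of plotting it.
tree = dendrogram(Z, no_plot=True)

print(tree["ivl"])  # leaf labels in left-to-right dendrogram order
```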
Hierarchical clustering typically works by sequentially merging similar clusters, as shown above. This is known as agglomerative hierarchical clustering. In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters. This is known as divisive hierarchical clustering. Divisive clustering is rarely done in practice.
QUESTION I: Could hierarchical clustering be used to choose the value of K in k-means clustering?
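One common heuristic, sketched below under the assumption that the data is small enough to cluster hierarchically first: look for the largest jump between successive merge distances in the dendrogram, cut the tree just below that jump, and use the resulting number of clusters as a candidate K for k-means. (The toy data is hypothetical.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
Z = linkage(X, method="average")

# Merge distances increase as clusters grow; the biggest gap between
# consecutive merges suggests where to cut the dendrogram.
merge_dists = Z[:, 2]
gaps = np.diff(merge_dists)
n = X.shape[0]
k = n - (np.argmax(gaps) + 1)  # merges remaining after the largest jump

labels = fcluster(Z, t=k, criterion="maxclust")
print(k)       # 2 — a candidate K to feed into k-means
print(labels)  # flat cluster assignment at that cut
```

This is only a heuristic; the candidate K should still be sanity-checked (e.g. with silhouette scores) before committing to it.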
QUESTION II: How can hierarchical clustering be implemented on massive datasets?
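Standard agglomerative clustering needs the full pairwise distance matrix, which is infeasible for very large n. One widely used alternative is BIRCH, which builds a compact hierarchical summary (a CF-tree) in a single pass and then clusters the summaries. A minimal sketch with scikit-learn, on synthetic data generated for illustration:

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic "large" dataset: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(10_000, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(10_000, 2)),
])

# BIRCH never materializes the full n x n distance matrix; it streams the
# data into a CF-tree and clusters the tree's leaf summaries.
model = Birch(threshold=0.5, n_clusters=2)
labels = model.fit_predict(X)

print(np.unique(labels))  # two cluster labels for 20,000 points
```

The `threshold` parameter controls how coarse the CF-tree summaries are, trading memory for fidelity; other scalable options include mini-batch k-means or clustering a random subsample hierarchically.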