Global AI and Data Science

The Datum - Machine Learning : Unsupervised – Hierarchical Clustering and Bootstrapping

By Neeraj Jangid posted Tue June 11, 2019 02:17 PM



This article covers an unsupervised learning algorithm: hierarchical clustering. It gives a brief illustration, with a practical working example, of forming unsupervised hierarchical clusters and testing them to check that the clusters you have formed are sound. It is a real-world data example that you can study and evaluate yourself, since the data is provided for personal use and practice. Every topic in data science has variations, but there is a basic pattern you can follow to build models. “The Datum” gives you access to these basic patterns so that you can build upon them as you progress. Consider “The Datum” blogs your cookbook handout, which will help you learn, refer to, and contribute to the relevant topics. All listings and models are implemented in the R language using RStudio, and images of my work are embedded in this article for your reference.

~ The Datum

The Audience

This piece of work has something for everyone, and given its length, here is what you can expect depending on your needs. Read the complete article if you want a grounding in what data science and machine learning are, along with the hierarchical clustering implementation and bootstrapping. If you have been in data science for a while, already have the theoretical knowledge, and are here only for the algorithm implementation and how it works, you can skip sections 1, 2, and 3 and go straight to section 4, Hierarchical Clustering: Approach. Lastly, if you have nothing to do with data science, you can still help people in this field by sharing this article, which I am confident will help them learn and gain a conceptual understanding of data science. Have a good read!

1. The Datum

A ‘datum’ is a single entry, element, or instance of a very large data set: the smallest possible part of the largest data you can think of. These individual datums combine to form big data. In this data-driven world, we trust data to the extent that we change what we do according to its behavior. Small datums can do wonders when combined, so imagine what you and I can bring to the data world when we combine our efforts. This is the idea behind ‘The Datum’, a platform where I will blog about the simplest as well as the most complex concepts in data science and machine learning. It aims to give you the tools you need to analyze real-world data. My intent is to bring you the best of the concepts and algorithms (with data, written code, and output) once per week, so that you can learn the basic ideas of how to approach them. There is more to this space: if you are new to data science, this blog should make you curious to know more, and if you are a pro in the data world, your contributions and suggestions are highly appreciated.

2. Why Data Science?

Over 2.5 quintillion bytes of data are created every single day (some of it is being created by me right now), and the volume is only going to grow from there. By 2020, it is estimated that 1.7 MB of data will be created every second for every person on earth[1]. Can you picture what is going on out there? The amount of data we are creating in this digital world is hard to believe. We need tools and algorithms to analyze this massive data, so that we can make the data speak, draw insights from it, and steer things in a direction where everyone gains. A brief example: you are browsing and you tap or click on a certain product in your Amazon shopping window. That click is used to analyze your likes and dislikes, and the next time you browse, your shopping recommendations change accordingly. This is the smallest example of how data can work wonders. Every click, tap, and search creates data for the apps you are using, and this data is used to improve your personalized experience. Just think of a technology today and try to imagine it without data. Impossible!

3. What is Machine Learning?

Humans have evolved over the years to become what we call the smartest species on planet Earth. Going back to ancient days, what do you think was the basis of our evolution into what we are today? ‘Learning’ was the key to this advancement, from the very first man to the most mindful species on the planet. For example, consider how an early human worked out that striking two stones together could light a fire. This one source of energy, discovered out of need, became a gateway to many uses: preparing food, protection against animals and insects, and light in the dark nights of the forest. Fire has come so far that today its applications are part of daily life. This is a classic example of ‘learning’ and of how it grows to better ourselves and yield useful applications. The 1940s was the first decade in which we saw electronic general-purpose computers. Since then, just as with fire, there has been a massive expansion in the computer world, up to the smart computers and supercomputers of today. What was the cause of this advancement? Yes, you are right: ‘learning’. Humans researched and learned new technologies, kept adding building blocks to the system, and today we see smart computers.

‘Growth’ is an ongoing process that advances when plotted against time. It is an integral part of our ecosystem, as things keep getting bigger and better on the basis of learning. In recent times, we humans have been giving computers a new dimension, called ‘machine learning’. Machine learning is a method in which a system is fed with data, and the machine then interprets the data, finds trends, and builds models that yield insights we can put to use. These trends help us get better in fields like sales, precision medicine, location tracking, fraud detection and handling, advertising, and of course entertainment media[2]. We keep bettering ourselves by learning, and by now giving this power of ‘learning’ to machines, we are knocking on the door of another leap in technology.

Basically, there are two methods (discovered so far; we never know what’s next) by which machines can learn: supervised and unsupervised learning. This article focuses on the latter, unsupervised learning. Let’s go back to the first human discovery of fire and ask: was the first human supervised in doing so? Did he see it happen somewhere and simply replicate it? Was there any source he could refer to for the procedure to light a fire? The answer clearly is ‘no’. What he had was two stones that could set dry grass alight with a spark, an idea that may have occurred to him from observing things around him every day, and finally the instinct to try it. That is the spirit of unsupervised learning. After the first discovery of fire, the later applications and evolution of fire (preparing food, a light source in dark forests, protection against animals and insects, and so on) are comparable to supervised learning: the human already had the tools to light a fire, and only the applications differed. These applications also required learning, but since he already had the basic procedure to light a fire, they fall under supervised learning. In machine learning terms, supervised learning trains on examples that come with known answers (labels), while unsupervised learning looks for structure in data that has no labels.

3.1. Definitions

Unsupervised Learning: In unsupervised learning, we try to relate the input data points to one another in some way, so that we can find relationships in the data and build our service around the trends or relations that emerge.

Example: Based on people’s ‘likes’ in an online music library, we can cluster people with the same taste in music and recommend similar music to them, keeping them engaged with our music library, which is a service. You now have an idea of how you get those associations and recommendations on Amazon, YouTube, Netflix, etc.

The notional visualization below contrasts supervised learning (algorithms coming soon) with unsupervised learning (hierarchical clustering, addressed in this blog).

Notional Visuals: Unsupervised vs Supervised Machine Learning

3.2 Unsupervised Learning Algorithm

Clustering is one of the methods of unsupervised learning. Here we observe the data and try to relate each data point to other points with similar characteristics, thus forming clusters. Each cluster holds similar data that is distinct from the data in other clusters. For example, a cluster of people who like jazz music is distinct from the cluster of people who enjoy pop music. This work will help you gain knowledge of one clustering method, namely hierarchical clustering.

Hierarchical Clustering: As the name suggests, clustering is done on the basis of a hierarchy, which is mapped as a dendrogram (explained further in the practical example in this work).
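
To make the idea of a hierarchy concrete, here is a minimal sketch in base R. It is not the listing from the full blog: the six points are made-up toy values, and the choice of complete linkage and of cutting the tree into two clusters are assumptions made purely for illustration.

```r
# Toy data: six 2-D points forming two visibly separate groups (made-up values).
points <- matrix(
  c(1.0, 1.1,
    1.2, 0.9,
    0.8, 1.0,
    5.0, 5.2,
    5.1, 4.9,
    4.8, 5.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("p", 1:6), c("x", "y"))
)

d <- dist(points, method = "euclidean")  # pairwise distances between points
tree <- hclust(d, method = "complete")   # agglomerative clustering: merge closest groups step by step
groups <- cutree(tree, k = 2)            # cut the hierarchy into two clusters

print(groups)
# plot(tree)  # draws the dendrogram when run interactively
```

Each merge recorded in `tree` is one level of the hierarchy, and `plot(tree)` shows those merges as the branches of the dendrogram.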

4. Hierarchical Clustering: Approach

4.1 Density and Distances

From its early stages, clustering was developed on the basis of two notions: density and distance. Let us understand each of them:

Density: As the name suggests, if data points are densely packed in one region of a plane, and another dense group of points sits in the same plane but some distance away, each dense region forms what is known as a density cluster.

Distance: Whether two clusters are distinct, or whether a data point belongs to cluster 1 or cluster 2, depends on the separation distance between them. In data science, distance can be computed as Euclidean distance, Manhattan (city block) distance, Hamming distance, or cosine distance. These distances rest on basic algebra and geometry, and they are easy to refresh if you are not well versed in them. The figure below illustrates how density and distance lead to clustering: there are two types of data, each with a specific density in its own region of space, so each forms a cluster with distinct characteristics based on the data it contains. There is also a distance between the two clusters, which can be measured between individual datums or between the clusters taken as a whole.
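
As a quick refresher, the four distances named above can be sketched in a few lines of base R. The vectors `a`, `b`, `u`, and `v` are made-up examples, not values from the article's data, and cosine distance is taken here as one minus the cosine similarity, which is one common convention.

```r
# Two made-up numeric vectors for the numeric distances.
a <- c(1, 2, 3)
b <- c(4, 0, 3)

euclidean <- sqrt(sum((a - b)^2))  # straight-line distance
manhattan <- sum(abs(a - b))       # sum of absolute coordinate differences

# Cosine distance: 1 minus the cosine of the angle between the vectors.
cosine_distance <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hamming distance: number of positions at which two equal-length sequences differ.
u <- c("A", "C", "G", "T")
v <- c("A", "G", "G", "A")
hamming <- sum(u != v)

# Base R's dist() covers the first two directly.
euclidean_dist <- as.numeric(dist(rbind(a, b), method = "euclidean"))
manhattan_dist <- as.numeric(dist(rbind(a, b), method = "manhattan"))

print(c(euclidean = euclidean, manhattan = manhattan,
        cosine = cosine_distance, hamming = hamming))
```

In practice `dist()` is usually the convenient route, since its output feeds straight into `hclust()`.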

Notional Illustration of Density and Distance in Hierarchical Clustering

4.2 Implementing Hierarchical Clustering in R

4.2.1 The Model Data:

I used the R language to run clustering on World Health Organization (WHO) data covering 35 countries, limited to the region ‘America’[3]. The data fields are as follows:

  1. $Country: contains names of the countries of region America (datatype: string)
  2. $Population: contains population count of respective countries (datatype: integer)
  3. $Under15: count of population under 15 years of age (datatype: number)
  4. $Over60: count of population over 60 years of age (datatype: number)
  5. $FertilityRate: fertility rate of the respective countries (datatype: number)
  6. $LifeExpectancy: Life expectancy of people of respective countries (datatype: integer)
  7. $CellularSubscribers: Count of people of respective countries possessing cellular subscriptions (datatype: number)
  8. $LiteracyRate: Rate of Literacy of respective countries (datatype: number)
  9. $GNI: Gross National Income of respective countries (datatype: number)

4.2.2. Setting Goals and Expectations:

The goal here is to group countries in terms of their health, using the data fields mentioned above.
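
The following is only a rough sketch of how such a grouping could be set up in R, not the author's actual listing. The data frame `who` is a hypothetical stand-in: the country names are placeholders, the values are invented, and only four of the nine fields are included. Standardizing the fields with `scale()`, Ward linkage (`"ward.D2"`), and cutting at two clusters are all assumptions for illustration.

```r
# Hypothetical stand-in for the WHO data: every value below is invented.
who <- data.frame(
  Country        = c("Country A", "Country B", "Country C",
                     "Country D", "Country E", "Country F"),
  FertilityRate  = c(1.8, 1.9, 2.0, 3.4, 3.6, 3.2),
  LifeExpectancy = c(80, 79, 78, 66, 64, 68),
  LiteracyRate   = c(99, 98, 97, 75, 72, 78),
  GNI            = c(42000, 39000, 35000, 4000, 3500, 5000)
)

# Standardize the numeric fields so that large-valued ones (such as GNI)
# do not dominate the distance calculation.
features <- scale(who[, -1])
rownames(features) <- who$Country

# Build the hierarchy and cut it into two groups (both choices are assumptions).
tree <- hclust(dist(features), method = "ward.D2")
who$Cluster <- cutree(tree, k = 2)

print(who[, c("Country", "Cluster")])
```

Without the scaling step, a field measured in thousands would outweigh a field measured in single digits, so the clusters would reflect little more than that one field.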

For a complete explanation of implementing the unsupervised algorithm in R, visit:

There you will find the complete blog with R listings and the relevant output, along with access to the data used here.