The Datum - Machine Learning : Unsupervised – k-means Clustering and Bootstrapping

By Neeraj Jangid posted Tue June 11, 2019 02:29 PM

  

This article continues our previous topic, ‘Unsupervised Machine Learning’. Today I am giving you another powerful tool on this topic: ‘k-means clustering’. The work here continues with the WHO data set featured in ‘Machine Learning: Unsupervised – Hierarchical Clustering and Bootstrapping’. It demonstrates implementing k-means clustering and then bootstrapping the result to check that the clusters formed are stable. The analysis also evaluates whether we have the right number of clusters and, if not, what ‘k’ (the number of clusters) should be, based on the ‘Calinski-Harabasz Index’ and the ‘Average Silhouette Width (ASW)’. Further, this report looks at how closely the k-means results match the hierarchical clusters formed in the previous post. This matters because, whichever algorithm I use to compute clusters, the relationships within the data stay the same. Different algorithms should therefore form more or less the same clusters, with only minor differences. Excited?
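To make the last point concrete, here is a minimal sketch of how the agreement between the two algorithms can be checked. It runs on synthetic stand-in data, not the WHO data, and the choice of Ward linkage and of three clusters is an assumption made only for this illustration; the previous post may have used different settings.

```r
# Synthetic stand-in data (NOT the WHO data): three well-separated groups of 20 points
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Hierarchical clustering (Ward linkage, assumed here), cut into 3 groups
hc_groups <- cutree(hclust(dist(toy), method = "ward.D2"), k = 3)

# k-means with the same number of clusters and several random starts
km_groups <- kmeans(toy, centers = 3, nstart = 25)$cluster

# Cross-tabulate the two partitions: if every row and every column has
# exactly one non-zero cell, both algorithms found the same groups
# (the cluster labels themselves may be numbered differently)
agreement <- table(hierarchical = hc_groups, kmeans = km_groups)
print(agreement)
```

On real data the table will rarely be this clean; a few countries moving between neighbouring clusters is the kind of minor difference to expect.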

2. The Datum Promise

Before we dive straight into business: ‘The Datum’ is overwhelmed by the response received for its first blog, ‘Machine Learning: Unsupervised – Hierarchical Clustering and Bootstrapping’. The blog was viewed in 13 countries: the United States, India, Germany, the United Kingdom, Mexico, Jordan, the Netherlands, Georgia, Nigeria, Brazil, Chile, Poland, and France. ‘The Datum’ extends its gratitude for such a welcome for its very first article. The Datum promises to continue its work with the same energy and sincerity, with a single motivation: giving you practical Data Science and Machine Learning tools, with algorithms, code, and data, so that you can build upon the strong base that The Datum provides. Data science is booming and has a bright future, and The Datum is striving to make your data science experience a cakewalk. In return, to keep us going here at ‘The Datum’, we encourage you to spend a few minutes to like, share, and subscribe to ‘The Datum’ blog space so that we can better understand your needs and come back better each time. The Datum is ready to provide support if any errors occur while implementing or executing the algorithms, so please feel free to write back. You keep us going, and we hope for two-way communication so that we can give you more in Data Science.

Tools

Feedback on the previous article showed that it was unclear which tools I use to execute the algorithms. I execute my algorithms in R, a statistical programming language widely used in Data Science. RStudio is a development environment for R, and I use it mainly for its friendlier interface. To set up RStudio, first install R and then install RStudio. This link can probably help get you through (ping me if you have any further difficulties): https://www.andrewheiss.com/blog/2012/04/17/install-r-rstudio-r-commander-windows-osx/

3. The Audience

From our audience we expect only a beginner-level working knowledge of R. We try to explain every listing as clearly and simply as possible, so you need not worry about the code used to implement the algorithms. The Datum will try to cover the major practical algorithms used in Data Science and Machine Learning to empower you in this field. All we need is your interest, your willingness to learn, and your continued support.

4. Content Briefing

The following topics on k-means clustering are covered in this article:

  1. k-means theoretical implementation
  2. Implementing k-means algorithm in R
  3. Addressing optimal ‘k’ value using Calinski-Harabasz Index and Average Silhouette Width Index
  4. Bootstrapping k-means clusters for its validation and confidence
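The steps above can be sketched in a few lines of R. This is a minimal, hand-rolled illustration on synthetic data, not the listing from the full article: the data, the range of ‘k’ values, the number of bootstrap runs, and the helper names `ch_index`, `jaccard`, and `boot_once` are all assumptions made for the example. It uses base R's `kmeans()` plus `silhouette()` from the `cluster` package, which normally ships with R. Packages such as `fpc` offer ready-made versions of these steps (`kmeansruns()` and `clusterboot()`).

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Synthetic stand-in data (NOT the WHO data): 3 groups of 20 points in 2D
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)
n <- nrow(toy)
d <- dist(toy)

# Calinski-Harabasz index: (between-SS / (k - 1)) / (within-SS / (n - k))
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Score every candidate k with both criteria (higher is better for both)
k_range <- 2:6
scores <- t(sapply(k_range, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  asw <- mean(silhouette(fit$cluster, d)[, "sil_width"])
  c(k = k, ch = ch_index(fit, n), asw = asw)
}))
print(round(scores, 2))

best_k_ch  <- k_range[which.max(scores[, "ch"])]
best_k_asw <- k_range[which.max(scores[, "asw"])]

# Bootstrapping: re-cluster resampled data and measure, for each original
# cluster, its best Jaccard similarity with any of the new clusters
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

base_fit     <- kmeans(toy, centers = best_k_ch, nstart = 25)
base_members <- split(seq_len(n), base_fit$cluster)  # row ids per cluster

boot_once <- function() {
  idx <- sample(n, replace = TRUE)                    # bootstrap sample of rows
  fit <- kmeans(toy[idx, , drop = FALSE], centers = best_k_ch, nstart = 10)
  new_members <- split(idx, fit$cluster)              # original row ids per new cluster
  sapply(base_members, function(orig) {
    orig_in_sample <- intersect(orig, idx)            # compare only on sampled rows
    max(sapply(new_members, function(new) jaccard(orig_in_sample, new)))
  })
}

n_boot <- 50
stability <- rowMeans(replicate(n_boot, boot_once()))
print(round(stability, 2))  # values near 1 indicate a stable cluster
```

A cluster whose average Jaccard similarity stays close to 1 across the bootstrap runs keeps reappearing even when the data is perturbed, which is what gives us confidence in it. Low values suggest the cluster may be an artefact of the particular sample or of a poorly chosen ‘k’.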

5. The Model

5.1. Data

I used R to run the clustering on World Health Organization (WHO) data covering 35 countries, limited to the region of ‘America’ [3]. The data fields are:

  1. $Country: contains names of the countries of region America (datatype: string)
  2. $Population: contains population count of respective countries (datatype: integer)
  3. $Under15: count of population under 15 years of age (datatype: number)
  4. $Over60: count of population over 60 years of age (datatype: number)
  5. $FertilityRate: fertility rate of the respective countries (datatype: number)
  6. $LifeExpectancy: Life expectancy of people of respective countries (datatype: integer)
  7. $CellularSubscribers: Count of people of respective countries possessing cellular subscriptions (datatype: number)
  8. $LiteracyRate: Rate of Literacy of respective countries (datatype: number)
  9. $GNI: Gross National Income of respective countries (datatype: number)

To view the complete article on k-means clustering and bootstrapping visit:
https://thedatum.data.blog/2019/06/02/machine-learning-unsupervised-k-means-clustering-and-bootstrapping/

There you will find the complete listings and outputs of the algorithm implemented in R, along with the data used in this article to implement clustering. Hit the link above and get ready with R to implement your k-means machine learning algorithm!

Cheers, 
The Datum
https://thedatum.data.blog/