Global AI and Data Science

  • 1.  k-means elbow method

    Posted Thu March 19, 2020 11:47 AM
    I am trying to use K-means clustering to segment areas in Vancouver. Every time I run the code to find the optimum value of k, the distortion vs. k curve changes. What could be the reason for this?

    Thanks,

    ------------------------------
    Aparna Mookerjee
    ------------------------------

    #GlobalAIandDataScience
    #GlobalDataScience


  • 2.  RE: k-means elbow method

    Posted Fri March 20, 2020 07:05 AM
    Hi Aparna,

    I don't have an answer for you but rather more questions.  Do you happen to have the code?  And can you share here?  And at any time are you seeding that algorithm?  If so, are you using the same seed or do you vary it?  It's been about 2 years since I last worked with K-means clustering.  So, excuse me if my questions are a little off :)  Very curious to see if anybody else has an answer.   

    Thanks,

    Chris

    ------------------------------
    Chris Hoina, Offering Manager
    IBM Application Performance Analyzer for z/OS
    chrishoina@ibm.com
    ------------------------------



  • 3.  RE: k-means elbow method

    Posted Fri March 20, 2020 08:41 AM
    It's normal to see the curve change between runs. K-means starts from randomly chosen centroids, so each run can converge to a different local minimum and report a slightly different distortion for the same K. If you fix the seed (the random_state parameter of KMeans in scikit-learn), the curve stays the same from run to run.

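    One way is to keep trying values of K by trial and error until you find the point where the distortion stops dropping sharply. The distortion itself keeps decreasing as K grows, so the smallest distortion alone does not tell you the right K.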

    Another way is to use grid search for hyperparameter tuning. The scikit-learn library lets you define a grid of parameters and pick the best one. Use the grid to try values of K from 1 to 10 for the K-means elbow method.
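
    To check whether a changing curve comes from the random initialization rather than from the data, here is a minimal sketch. The data is a synthetic stand-in built with make_blobs and the helper name elbow_curve is made up for illustration; neither comes from the original code in this thread.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature matrix (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)


def elbow_curve(X, k_values, seed=None):
    """Return the inertia (distortion) for each k, optionally with a fixed seed."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


k_values = range(2, 10)  # k = 2, ..., 9

# Two runs with the same seed should give the same curve.
curve_a = elbow_curve(X, k_values, seed=42)
curve_b = elbow_curve(X, k_values, seed=42)
same_seed_identical = np.allclose(curve_a, curve_b)
```

    Calling elbow_curve with seed=None instead leaves the initialization random, which is the situation where the curve can differ from run to run.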

    ------------------------------
    Muhammad Tehseen
    ------------------------------



  • 4.  RE: k-means elbow method

    Posted Sat March 21, 2020 05:45 AM
    Here is the code that I am running. I loop over k from 2 to 9 (range(2, 10)) to find the optimum value of k.

    from sklearn.cluster import KMeans
    from matplotlib import pyplot as plt

    # Feature matrix: every column except the neighborhood name
    X = vancouver_grouped.drop('Neighborhood', axis=1)
    dist = []
    for k in range(2, 10):  # k = 2, ..., 9
        # No random_state is set, so the initial centroids differ on every run
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X)
        dist.append(kmeans.inertia_)  # within-cluster sum of squared distances

    fig = plt.figure(figsize=(10, 10))
    plt.plot(range(2, 10), dist, '-o')
    plt.grid(True)

    plt.xlabel('Number of clusters k')
    plt.ylabel('Distortion')
    plt.title('The Elbow Method showing the optimal k')

    I just ran the same code 4 times and attached the screenshots. I wasn't sure which value of k to use, so I used 5 and then 4 (I am aware that I can take an arbitrary value); currently I am using 4. Tell me what you think k should be.

    ------------------------------
    Aparna Mookerjee
    ------------------------------



  • 5.  RE: k-means elbow method

    Posted Sat March 21, 2020 07:53 AM
    From the looks of the graph, you should use a value of k between 6 and 7. If we use 10, we might overfit the solution by splitting the data into more clusters than it supports.
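
    Reading the elbow off a plot is subjective, so as a rough cross-check one could pick the k where the curve bends most sharply, i.e. where the second difference of the distortion is largest. This is only one heuristic among several, the function name is made up for illustration, and the distortion values below are invented numbers, not the ones from the screenshots.

```python
import numpy as np


def elbow_by_second_difference(k_values, distortions):
    """Return the k with the largest second difference of the distortion curve."""
    ks = np.asarray(list(k_values))
    d = np.asarray(distortions, dtype=float)
    # Discrete curvature at each interior point: d[i-1] - 2*d[i] + d[i+1]
    curvature = d[:-2] - 2 * d[1:-1] + d[2:]
    # The first and last k have no curvature value, so index into ks[1:-1]
    return int(ks[1:-1][np.argmax(curvature)])


# Invented distortion values for k = 2, ..., 9 (illustration only)
example_dist = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0, 19.0, 18.5]
best_k = elbow_by_second_difference(range(2, 10), example_dist)
```

    With these invented numbers the sharpest bend is at k = 4. On a real curve the result is only meaningful if the distortions come from a run with a fixed seed, since a different run can move the bend.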

    ------------------------------
    Muhammad Tehseen
    ------------------------------