Global AI and Data Science

  • 1.  DBSCAN on categorical data

    Posted Thu March 05, 2020 01:47 PM
    Hi,

    I am trying to find anomalies in my data, and I have read that DBSCAN is one of the best clustering algorithms for anomaly detection. My data consists of categorical features. Does distance measurement still work? How should I go about it?

    I would appreciate it if anyone could share their knowledge with me. Thank you in advance!
    #GlobalAIandDataScience
    #GlobalDataScience


  • 2.  RE: DBSCAN on categorical data

    Posted Fri March 06, 2020 08:41 AM
    Hi,

    DBSCAN works well for continuous data. With categorical data, the distance principle no longer makes sense. For example, I used cluster analysis on log data, and measuring the distance between response codes 200 and 505 would be meaningless.

    Take a look at the k-modes method. For mixed data (categorical and continuous), there is also the k-prototypes method.
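
As a rough illustration of the idea behind k-modes, the sketch below shows its two building blocks: a simple-matching dissimilarity (the number of attributes that differ) and a cluster "mode" (the most frequent value per attribute). The record values and function names are made up for illustration, and this is not a full k-modes implementation.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Return the most frequent value of each attribute across the records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical log records: (method, status, client)
records = [
    ("GET", "200", "mobile"),
    ("GET", "200", "desktop"),
    ("GET", "404", "mobile"),
]

mode = cluster_mode(records)
distances = [matching_dissimilarity(r, mode) for r in records]
print(mode, distances)
```

A full k-modes run would alternate between assigning records to the nearest mode and recomputing the modes, in the same way k-means alternates between assignment and centroid updates.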

    ------------------------------
    Mariia Denysenko
    ------------------------------



  • 3.  RE: DBSCAN on categorical data

    Posted Fri March 06, 2020 08:42 AM
    Your data is already in categories, which means you don't necessarily have to use clustering algorithms. That said, you can regroup your data using DBSCAN and take the noise points from that analysis: the more noise points there are, the more anomalies or outliers you can see in your data.

    I can't be certain about distance measurement for DBSCAN because I don't know much about your data. If your categories are derived from numerical ranges, you might be able to tweak it to get results.

    ------------------------------
    Muhammad Tehseen
    ------------------------------



  • 4.  RE: DBSCAN on categorical data

    Posted Fri March 06, 2020 01:21 PM
    Euclidean distance would not make sense with categorical data
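
A small sketch of why this is the case, assuming the categories are first mapped to integers: the Euclidean distance then depends on the arbitrary code assignment, whereas a mismatch count (Hamming distance) does not. The color values and both code tables are hypothetical.

```python
import math

# Two equally valid integer encodings of the same unordered categories.
codes_a = {"red": 0, "green": 1, "blue": 2}
codes_b = {"green": 0, "blue": 1, "red": 2}

def euclidean(u, v):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def hamming(u, v):
    """Number of positions where two categorical records differ."""
    return sum(x != y for x, y in zip(u, v))

p, q = ("red",), ("blue",)

d_a = euclidean([codes_a[p[0]]], [codes_a[q[0]]])  # |0 - 2| = 2.0
d_b = euclidean([codes_b[p[0]]], [codes_b[q[0]]])  # |2 - 1| = 1.0
d_h = hamming(p, q)                                # 1 under any encoding

print(d_a, d_b, d_h)
```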

    ------------------------------
    Jon Peck
    ------------------------------



  • 5.  RE: DBSCAN on categorical data

    Posted Mon March 09, 2020 09:10 AM
    I am not a specialist in clustering, but it may be better to look into similarity measures rather than distance measures.
    Cheers
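
DBSCAN itself only needs a pairwise dissimilarity, so one possible way to follow this suggestion is to plug in a matching-based measure. The sketch below is a minimal pure-Python DBSCAN written for illustration (not a library implementation); the records, `eps`, and `min_pts` values are assumptions chosen for the example.

```python
from collections import deque

def mismatch_fraction(a, b):
    """Fraction of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself (distance 0).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

# Hypothetical log records: (method, status, client)
rows = [
    ("GET", "200", "mobile"),
    ("GET", "200", "desktop"),
    ("GET", "200", "mobile"),
    ("GET", "404", "mobile"),
    ("DELETE", "500", "bot"),
]

# eps=0.34 means "at most one of the three attributes may differ".
labels = dbscan(rows, eps=0.34, min_pts=3, dist=mismatch_fraction)
print(labels)
```

Points labeled -1 are the noise points, which are the anomaly candidates. If scikit-learn is available, its `DBSCAN` accepts `metric="precomputed"`, so the same dissimilarities could be passed in as a matrix instead.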

    ------------------------------
    Konrad Borowiec
    ------------------------------



  • 6.  RE: DBSCAN on categorical data

    Posted Mon March 09, 2020 09:14 AM
    It depends on how many levels your categorical features have. If they have only 1-3 levels (e.g. male/female/other), you can try running DBSCAN separately for each level, which makes sense for many use cases. However, if they have many levels (e.g. postcodes), you can consider using a numerical representation of them, because such features often carry interval information anyway.
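
The per-level approach could be sketched as follows, assuming each row has one low-cardinality categorical field and some numeric fields. The field names and values are hypothetical; each resulting group of numeric points would then be clustered on its own.

```python
from collections import defaultdict

def split_by_level(rows, key):
    """Partition rows by the value of one categorical field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Hypothetical records with one categorical and two numeric fields.
rows = [
    {"gender": "female", "age": 34, "spend": 120.0},
    {"gender": "male", "age": 41, "spend": 80.5},
    {"gender": "female", "age": 29, "spend": 95.0},
]

groups = split_by_level(rows, "gender")

# Numeric points per level, ready to be passed to DBSCAN one level at a time.
numeric_points = {
    level: [(r["age"], r["spend"]) for r in members]
    for level, members in groups.items()
}
print(numeric_points)
```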

    ------------------------------
    Kevin Siswandi
    ------------------------------



  • 7.  RE: DBSCAN on categorical data

    Posted Mon March 09, 2020 09:19 AM
    Hi, if the data is categorical, an anomaly can be an entry that does not belong to any of the defined categories.
    The defined categories for a given feature are generally known,
    and you can easily check for such anomalies by testing whether each data entry differs from all of them.
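
A minimal sketch of this check, assuming the allowed categories for each feature are known in advance. The `ALLOWED` table, field names, and records are hypothetical.

```python
# Hypothetical set of valid categories per feature.
ALLOWED = {
    "method": {"GET", "POST"},
    "status": {"200", "404", "500"},
}

def find_invalid(records, allowed):
    """Return (row index, field, value) for every value outside its allowed set."""
    return [
        (i, field, value)
        for i, record in enumerate(records)
        for field, value in record.items()
        if field in allowed and value not in allowed[field]
    ]

records = [
    {"method": "GET", "status": "200"},
    {"method": "GTE", "status": "200"},   # typo in the method
    {"method": "POST", "status": "999"},  # unknown status code
]

invalid = find_invalid(records, ALLOWED)
print(invalid)
```

This only catches values outside the known categories; a valid but unusual combination of categories would still need one of the clustering approaches discussed above.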

    ------------------------------
    rahul rahul
    ------------------------------