Global AI and Data Science

  • 1.  Seeking Advice on Optimizing ML Model for Imbalanced Datasets

    Posted 6 days ago

    Hi everyone,

    I'm currently working on a machine learning project where I need to classify a highly imbalanced dataset. The dataset pertains to healthcare diagnostics, with about 95% of the data belonging to the "negative" class and only 5% to the "positive" class.

    I've tried using oversampling techniques like SMOTE and undersampling, but I'm still facing issues with model performance. While precision has improved, recall remains suboptimal, and the model struggles to generalize during cross-validation.

    Here's what I've done so far:

    • Experimented with class weights in algorithms like Random Forest and XGBoost.
    • Tuned hyperparameters extensively, including max_depth, learning rate, and n_estimators.
    • Tested different evaluation metrics like F1-score and AUC-ROC instead of accuracy.
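    As a minimal sketch of the class-weight approach described above (not the poster's actual code, and using synthetic 95/5 data as a stand-in for the diagnostics dataset), scikit-learn's `class_weight="balanced"` combined with F1 scoring in cross-validation might look like this:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a 95% negative / 5% positive dataset
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.95, 0.05], random_state=42)

    # class_weight="balanced" reweights errors inversely to class frequency,
    # an alternative (or complement) to SMOTE-style resampling
    clf = RandomForestClassifier(n_estimators=200, max_depth=8,
                                 class_weight="balanced", random_state=42)

    # Score on F1 rather than accuracy, since accuracy is misleading at 95/5
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(round(scores.mean(), 3))
    ```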

    Despite these efforts, I'm unable to strike the right balance between false positives and false negatives, which is critical for this use case.
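    One way to inspect that false-positive/false-negative tradeoff directly (a hedged sketch, again on synthetic imbalanced data with a simple logistic model rather than my actual pipeline) is to sweep the decision threshold instead of accepting the default 0.5:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000,
                             class_weight="balanced").fit(X_tr, y_tr)
    probs = clf.predict_proba(X_te)[:, 1]

    # Each threshold trades false positives against false negatives;
    # pick the operating point that matches the clinical cost of each error
    precision, recall, thresholds = precision_recall_curve(y_te, probs)
    for p, r, t in list(zip(precision, recall, thresholds))[::20]:
        print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
    ```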

    Has anyone tackled a similar issue? I'd love to hear about specific techniques, libraries, or workflows that have worked for you in handling imbalanced datasets. Also, any advice on addressing overfitting in such cases would be appreciated!

    Thanks in advance for your insights!



    ------------------------------
    Ethan Levi
    ------------------------------


  • 2.  RE: Seeking Advice on Optimizing ML Model for Imbalanced Datasets

    Posted 3 days ago

    Hi Ethan!

    It seems you are doing everything right; SMOTE, undersampling, and AUC-ROC are excellent choices.

    I haven't worked on these issues for a while now, but from my experience in science I would say you have likely exhausted the information that can be extracted from your data with these methods.


    So I suggest two related ways forward that you haven't mentioned trying, which I believe can help, especially if you have "a lot" of features:
    1. Use external knowledge about the data: What do experts say? How have they been classifying so far? Are there features to be ignored or created? Could it be that some data are erroneous? And have you done the basic 101 removal of duplicates? :-)
    2. Principal Component Analysis (PCA): What features are correlated or can be combined? Which are the important features? Josh Starmer of StatQuest has a few awesome videos explaining PCA. It seems that you have enough knowledge of programming and libraries to figure out the implementation.
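    To illustrate point 2, here is a minimal scikit-learn sketch of PCA (on synthetic data, since I don't know your feature set); standardizing first matters because PCA is sensitive to feature scale:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Synthetic wide dataset: 30 features, only 5 truly informative
    X, _ = make_classification(n_samples=1000, n_features=30,
                               n_informative=5, random_state=0)

    # Standardize first: PCA is driven by variance, so unscaled features
    # with large ranges would dominate the components
    X_scaled = StandardScaler().fit_transform(X)

    # A float n_components keeps the fewest components that together
    # explain at least 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)
    print(X_reduced.shape[1], "components retained out of", X.shape[1])
    ```

    Correlated or redundant features collapse into shared components, which often helps both generalization and training time on imbalanced problems.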

    Lastly, I would love to know how IBM's AutoAI performs versus your implementations. IBM Cloud has a free tier, and with 10 CUH per month you should be able to run a few experiments there. You will find it as an asset in Projects (formerly known as Watson Studio).

    I hope this helps,

    Loucas 



    ------------------------------
    Loucas Loumakos
    ------------------------------