Global AI and Data Science

  • 1.  Seeking Advice on Optimizing ML Model for Imbalanced Datasets

    Posted 6 days ago

    Hi everyone,

    I'm currently working on a machine learning project where I need to classify a highly imbalanced dataset. The dataset pertains to healthcare diagnostics, with about 95% of the data belonging to the "negative" class and only 5% to the "positive" class.

    I've tried using oversampling techniques like SMOTE and undersampling, but I'm still facing issues with model performance. While precision has improved, recall remains suboptimal, and the model struggles to generalize during cross-validation.

    Here's what I've done so far:

    • Experimented with class weights in algorithms like Random Forest and XGBoost.
    • Tuned hyperparameters extensively, including max_depth, learning rate, and n_estimators.
    • Tested different evaluation metrics like F1-score and AUC-ROC instead of accuracy.
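    As a minimal sketch of the class-weight approach described above (not the poster's actual code, and using synthetic 95/5 data as a stand-in for the diagnostics dataset), scikit-learn's `class_weight="balanced"` combined with F1 scoring in cross-validation might look like this:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a 95% negative / 5% positive dataset
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.95, 0.05], random_state=42)

    # class_weight="balanced" reweights errors inversely to class frequency,
    # an alternative (or complement) to SMOTE-style resampling
    clf = RandomForestClassifier(n_estimators=200, max_depth=8,
                                 class_weight="balanced", random_state=42)

    # Score on F1 rather than accuracy, since accuracy is misleading at 95/5
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(round(scores.mean(), 3))
    ```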

    Despite these efforts, I'm unable to strike the right balance between false positives and false negatives, which is critical for this use case.
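    One way to inspect that false-positive/false-negative tradeoff directly (a hedged sketch, again on synthetic imbalanced data with a simple logistic model rather than my actual pipeline) is to sweep the decision threshold instead of accepting the default 0.5:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000,
                             class_weight="balanced").fit(X_tr, y_tr)
    probs = clf.predict_proba(X_te)[:, 1]

    # Each threshold trades false positives against false negatives;
    # pick the operating point that matches the clinical cost of each error
    precision, recall, thresholds = precision_recall_curve(y_te, probs)
    for p, r, t in list(zip(precision, recall, thresholds))[::20]:
        print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
    ```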

    Has anyone tackled a similar issue? I'd love to hear about specific techniques, libraries, or workflows that have worked for you in handling imbalanced datasets. Also, any advice on addressing overfitting in such cases would be appreciated!

    Thanks in advance for your insights!



    ------------------------------
    Ethan Levi
    ------------------------------


  • 2.  RE: Seeking Advice on Optimizing ML Model for Imbalanced Datasets

    Posted 3 days ago

    Hi Ethan!

    It seems you are doing everything right; SMOTE, undersampling, and AUC-ROC are excellent choices.

    I haven't worked on these issues for a while now, but from my experience in science I would say you have likely exhausted the information that can be extracted from your data with these methods.


    So I suggest two related ways forward that you haven't mentioned trying, which I believe can help, especially if you have "a lot" of features:
    1. Use external knowledge about the data: What do experts say? How have they been classifying so far? Are there features to be ignored or created? Could it be that some data are erroneous? And have you done the basic 101 removal of duplicates? :-)
    2. Principal Component Analysis (PCA): What features are correlated or can be combined? Which are the important features? Josh Starmer of StatQuest has a few awesome videos explaining PCA. It seems that you have enough knowledge of programming and libraries to figure out the implementation.
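    To illustrate point 2, here is a minimal scikit-learn sketch of PCA (on synthetic data, since I don't know your feature set); standardizing first matters because PCA is sensitive to feature scale:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Synthetic wide dataset: 30 features, only 5 truly informative
    X, _ = make_classification(n_samples=1000, n_features=30,
                               n_informative=5, random_state=0)

    # Standardize first: PCA is driven by variance, so unscaled features
    # with large ranges would dominate the components
    X_scaled = StandardScaler().fit_transform(X)

    # A float n_components keeps the fewest components that together
    # explain at least 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)
    print(X_reduced.shape[1], "components retained out of", X.shape[1])
    ```

    Correlated or redundant features collapse into shared components, which often helps both generalization and training time on imbalanced problems.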

    Lastly, I would love to know how IBM's AutoAI performs versus your implementations. IBM Cloud has a free tier, and with 10 CUH per month you should be able to run a few experiments there. You will find it as an asset in Projects (formerly known as Watson Studio).

    I hope this helps,

    Loucas 



    ------------------------------
    Loucas Loumakos
    ------------------------------