Hi everyone,
I'm currently working on a machine learning project where I need to classify a highly imbalanced dataset. The dataset pertains to healthcare diagnostics, with about 95% of the data belonging to the "negative" class and only 5% to the "positive" class.
I've tried resampling techniques such as SMOTE (oversampling) and majority-class undersampling, but I'm still facing issues with model performance: precision has improved, but recall remains suboptimal, and the model struggles to generalize during cross-validation.
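For reference, here's a minimal sketch of my resampling setup (imbalanced-learn plus scikit-learn; the synthetic data below is just a stand-in for my real dataset):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for my data: roughly a 95/5 class split
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=42
)

# SMOTE inside the pipeline, so resampling touches only the training folds
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Recall is where I'm struggling, so I score on it directly
scores = cross_val_score(pipeline, X, y, cv=5, scoring="recall")
print(scores.mean())
```

SMOTE sits inside an imblearn Pipeline so the oversampling is applied only within each training fold during cross-validation rather than to the full dataset.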
Here's what I've done so far:
- Experimented with class weights in algorithms like Random Forest and XGBoost (see the sketch after this list).
- Tuned hyperparameters extensively, including max_depth, learning_rate, and n_estimators.
- Tested different evaluation metrics, such as F1-score and AUC-ROC, instead of accuracy (also covered in the sketch below).
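To make the first and third bullets concrete, here's a sketch of the weighted XGBoost variant I've been running (assuming a recent xgboost; the hyperparameter values are illustrative, and X, y come from the snippet above):

```python
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

# scale_pos_weight is roughly n_negative / n_positive, so about 95/5 = 19 here
model = XGBClassifier(
    scale_pos_weight=19,
    max_depth=4,            # illustrative values from my tuning runs
    learning_rate=0.05,
    n_estimators=300,
    eval_metric="logloss",
    random_state=42,
)

# Scoring on F1 and AUC-ROC instead of accuracy
results = cross_validate(model, X, y, cv=5, scoring=["f1", "roc_auc"])
print(results["test_f1"].mean(), results["test_roc_auc"].mean())
```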
Despite these efforts, I'm unable to strike the right balance between false positives and false negatives, which is critical for this use case.
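To show what I mean by the trade-off: sweeping the decision threshold instead of using the default 0.5 shifts errors between false positives and false negatives, but no threshold so far gives me an acceptable level of both. A quick sketch of how I inspect this (reusing the pipeline and synthetic data from the first snippet; the thresholds are arbitrary):

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hold-out split, reusing X, y, and pipeline from the first snippet
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)
probs = pipeline.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Count false positives and false negatives at a few candidate thresholds
for threshold in (0.2, 0.35, 0.5):
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```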
Has anyone tackled a similar issue? I'd love to hear about specific techniques, libraries, or workflows that have worked for you in handling imbalanced datasets. Also, any advice on addressing overfitting in such cases would be appreciated!
Thanks in advance for your insights!
------------------------------
Ethan Levi
------------------------------