Improving Breast Cancer Prediction Using Adaptive Synthetic Sampling: A Study on the Coimbra Dataset

Ayman Alsabry, Abeer A Shujaaddeen, Mogeeb AA Mosleh

2025 5th International Conference on Emerging Smart Technologies and Applications (eSmarTA) · 2025 · pp. 1–8

IEEE

Abstract

Class imbalance remains a significant challenge in breast cancer classification because it biases predictive models toward the majority class. Addressing it is crucial for improving early detection and diagnostic accuracy. This study investigates the impact of Adaptive Synthetic Sampling (ADASYN) on the performance of machine learning models for breast cancer prediction using the Breast Cancer Coimbra Dataset (BCCD). A total of 36 models, including decision trees, support vector machines (SVMs), k-nearest neighbors (KNN) classifiers, neural networks, and ensemble-based methods, were trained and evaluated both before and after applying ADASYN, with accuracy as the primary metric. The findings show that balancing the dataset can substantially improve classification performance: Subspace KNN achieved the highest accuracy (91.7%) after ADASYN. However, some models, such as Linear SVM and certain neural networks, performed worse after oversampling, which shows that the effect of synthetic oversampling varies across algorithms. The study underscores the importance of data preprocessing in medical diagnostics: adaptive oversampling can improve predictive accuracy, but it requires careful model selection. Future research should explore hybrid balancing techniques and feature selection methods to further improve classification robustness.
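
To make the oversampling step concrete, the following is a minimal sketch of the ADASYN idea in plain Python. It is not the authors' implementation, and the abstract does not state which tooling they used. The function name `adasyn`, its parameters (`k`, `beta`, `seed`), and the toy data are all assumptions made for illustration. The sketch weights each minority sample by the share of majority-class points among its `k` nearest neighbors, then interpolates new points between that sample and nearby minority samples, so that harder regions receive more synthetic data.

```python
import math
import random


def adasyn(X, y, minority_label, k=5, beta=1.0, seed=0):
    """Simplified ADASYN sketch: return a list of synthetic minority points.

    X is a list of numeric tuples, y the matching labels. All names and
    defaults here are illustrative assumptions, not the study's settings.
    """
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_majority = len(X) - len(minority)

    # Number of synthetic samples needed to reach the requested balance.
    total_needed = int(round((n_majority - len(minority)) * beta))
    if total_needed <= 0 or len(minority) < 2:
        return []

    # Difficulty ratio: share of majority points among the k nearest
    # neighbours of each minority sample (searched over the whole dataset).
    labelled = list(zip(X, y))
    ratios = []
    for x in minority:
        others = [(p, lab) for p, lab in labelled if p is not x]
        others.sort(key=lambda pair: math.dist(x, pair[0]))
        majority_hits = sum(1 for _, lab in others[:k] if lab != minority_label)
        ratios.append(majority_hits / k)

    # Normalise the ratios into weights; fall back to uniform weights
    # when no minority sample has a majority neighbour.
    total = sum(ratios)
    if total > 0:
        weights = [r / total for r in ratios]
    else:
        weights = [1.0 / len(minority)] * len(minority)

    synthetic = []
    for x, w in zip(minority, weights):
        n_new = int(round(w * total_needed))
        # Interpolate only toward nearby minority samples.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        for _ in range(n_new):
            z = rng.choice(neighbours)
            lam = rng.random()  # position along the segment from x to z
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic


# Toy two-feature data (hypothetical values, not taken from the BCCD).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority_points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (3.0, 2.0),
                   (2.5, 2.5), (1.2, 1.2), (1.5, 1.0), (1.0, 1.5)]
X = minority_points + majority_points
y = [1] * len(minority_points) + [0] * len(majority_points)

synthetic = adasyn(X, y, minority_label=1, k=3)
print(len(synthetic), "synthetic minority samples generated")
```

Because each per-sample count is rounded independently, the number of generated points can differ slightly from the exact class gap. In practice a maintained implementation, such as the `ADASYN` class in the imbalanced-learn package, would likely be preferable, and oversampling should be applied to training folds only so that synthetic points do not leak into the evaluation data.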

Keywords

Classification; Machine Learning; Imbalanced Data; Predictive Models; Imbalance Dataset; Oversampling