Predicting Glioblastoma Patient Survival: A Comprehensive Machine Learning Approach

Introduction:

Glioblastoma is the most aggressive type of brain cancer, and survival rates remain poor. Accurate prediction of patient outcomes can help guide treatment decisions and inform prognosis. This study develops and evaluates machine learning (ML) models that predict glioblastoma patient survival using a large-scale dataset from the Surveillance, Epidemiology, and End Results (SEER) program.

Dataset:

The SEER database collects cancer incidence and survival data from registries covering approximately 48% of the U.S. population. This study extracted glioblastoma patient records from 2007 to 2016. The resulting dataset is the basis for developing and evaluating the ML models.

Data Preprocessing:

The data were preprocessed before modeling. Irrelevant features were removed, as were features with more than 30% missing values. Patients who died of causes other than glioblastoma, or who had no survival data, were also excluded. The final dataset consisted of 19,564 samples and 17 numerical and categorical features.
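
The two filtering rules described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the study's actual pipeline: the field names (`marker`, `survival_months`, `died_of_other_cause`) and the toy records are hypothetical, and `None` is assumed to mark a missing value.

```python
def drop_sparse_features(rows, max_missing=0.30):
    """Remove features whose share of missing values exceeds max_missing.

    rows is a list of dicts sharing the same keys; None marks a missing
    value (an assumption of this sketch).
    """
    n = len(rows)
    keep = [
        name for name in rows[0]
        if sum(row[name] is None for row in rows) / n <= max_missing
    ]
    return [{name: row[name] for name in keep} for row in rows]


def drop_unusable_patients(rows):
    """Drop patients without survival data or who died of another cause."""
    return [
        row for row in rows
        if row["survival_months"] is not None
        and not row["died_of_other_cause"]
    ]


# Hypothetical toy records; all field names are illustrative only.
records = [
    {"age": 61, "tumor_size_mm": 42, "marker": None,
     "survival_months": 14, "died_of_other_cause": False},
    {"age": 55, "tumor_size_mm": None, "marker": 1,
     "survival_months": 9, "died_of_other_cause": False},
    {"age": 70, "tumor_size_mm": 30, "marker": None,
     "survival_months": None, "died_of_other_cause": False},
    {"age": 48, "tumor_size_mm": 51, "marker": 0,
     "survival_months": 22, "died_of_other_cause": True},
]

cleaned = drop_unusable_patients(drop_sparse_features(records))
```

Here `marker` is missing in half of the records and is dropped, while `tumor_size_mm` is missing in a quarter of them and is kept.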

Feature Importance:

Knowing how much each feature contributes to a model’s predictions helps identify the factors most strongly associated with patient survival. Feature importance was calculated with the SHAP library, which attributes each prediction to the input features using Shapley values. This analysis identified the features that contributed most to the survival predictions.
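
To make the idea behind this analysis concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every feature coalition. It does not use the SHAP library, which relies on faster approximations and model-specific algorithms. The toy model, its three features, and the baseline values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this only works for a handful of features.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model: a linear risk score over three made-up features.
def toy_model(x):
    age, tumor_size, had_surgery = x
    return 0.04 * age + 0.02 * tumor_size - 1.5 * had_surgery


patient = [70, 50, 1]    # hypothetical patient
reference = [60, 40, 0]  # hypothetical baseline (e.g. an average patient)
phi = shapley_values(toy_model, patient, reference)
```

The contributions always sum to the difference between the prediction for the patient and the prediction for the baseline, which is what makes them usable as a per-feature breakdown.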

Data Imbalance:

The dataset was significantly imbalanced: some survival classes contained far more patients than others. To address this, the Synthetic Minority Oversampling Technique (SMOTE) was applied for the classification task and the Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise (SMOGN) for the regression task. Both techniques generate synthetic samples for underrepresented classes or target ranges, which balances the dataset for both approaches.
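
The core step of SMOTE is interpolation: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that step, for numeric features and with made-up points. It is not the implementation used in the study, and the function name and parameters are assumptions of this example.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Each synthetic point lies between a randomly chosen minority point and
    one of its k nearest minority neighbours (squared Euclidean distance).
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sq_dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2], [1.8, 2.1]]
new_points = smote(minority_points, n_new=10)
```

Oversampling is normally applied to the training portion only, so that synthetic samples do not leak into the test set. Categorical features need a variant of the method, since interpolating between category codes is not meaningful.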

Predictive Models Development:

Five ML models and a Deep Neural Network (DNN) were developed. The ML models were XGBoost, AdaBoost, Decision Tree, K-Nearest Neighbors, and Random Forest. The classification models predicted patient survival as one of five clinically relevant classes, while the regression model estimated the number of months a patient survived.
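
The difference between the two targets can be shown with a small sketch: the regression target is the survival time in months, and the classification target is that same value binned into five classes. The cut points below are hypothetical, since the article does not state the class boundaries the study used.

```python
from bisect import bisect_right

# Hypothetical class boundaries in months; the study's actual cut points
# are not given in this article.
CUT_POINTS = [6, 12, 18, 24]


def survival_class(months):
    """Map survival in months to a class index from 0 to 4."""
    return bisect_right(CUT_POINTS, months)


survival_months = [3, 6, 11.5, 20, 40]                      # regression targets
class_labels = [survival_class(m) for m in survival_months]  # classification targets
```

With these assumed boundaries, a patient who survived 20 months falls into class 3 (18 to under 24 months).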

Data Sampling Strategy:

Model performance was evaluated with a hold-out split: 80% of the data was used for training and the remaining 20% was reserved for testing. Five-fold cross-validation was also used, so that the reported performance does not depend on a single split of the data.
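
The sampling strategy above can be sketched with index lists. This example uses only the standard library, and it assumes that cross-validation is run on the training portion, which is one common arrangement; the article does not say whether the study did this or cross-validated the full dataset. The function names and the random seed are choices made for this example.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists.

    Every index appears in exactly one validation fold.
    """
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


# 19,564 is the sample count reported for the final dataset.
train_idx, test_idx = holdout_split(19564)
splits = list(k_fold(train_idx, k=5))
```

Each model would be fitted five times, once per fold, and the validation scores averaged; the held-out 20% is touched only for the final test.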

Predictive Models Evaluation:

The classification models were evaluated with accuracy, F1-score, specificity, sensitivity, and area under the ROC curve (AUC), which together measure how well the models assign patients to their survival classes. The regression model was evaluated with Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²), which measure how closely its predictions match the number of months patients survived.
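
Most of these metrics follow directly from counts of correct and incorrect predictions. The sketch below computes them for made-up labels. For a multi-class problem, sensitivity, specificity, and F1 are computed here for one class against the rest; how the study averaged them across its five classes is not stated, so that choice is an assumption. AUC is omitted because it needs predicted probabilities rather than labels.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and F1 for one class against all others.

    Assumes the class occurs at least once in both y_true and y_pred,
    otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MSE, RMSE and R² for predicted versus actual values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Made-up labels and survival times, for illustration only.
acc = accuracy([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
cls = one_vs_rest_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], positive=1)
reg = regression_metrics([10, 14, 6, 18], [12, 13, 7, 16])
```

Because RMSE is the square root of MSE, it is expressed in the same unit as the target, here months.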

Results:

The DNN performed best among all models in both the classification and regression tasks. In classification, it achieved an accuracy of 84.2%, an F1-score of 83.9%, a specificity of 91.4%, a sensitivity of 82.5%, and an AUC of 0.92. In regression, it achieved an MSE of 12.3, an RMSE of 3.5, and an R² of 0.87; the RMSE corresponds to a typical prediction error of about 3.5 months.

Conclusion:

The ML models developed in this study, particularly the DNN, performed well in predicting glioblastoma patient survival. Such models could help clinicians make informed treatment decisions and personalize patient care, with the aim of improving patient outcomes. Future research should incorporate additional data sources, explore more advanced ML techniques, and conduct prospective studies to further improve prediction accuracy. With continued work, ML can contribute to progress against glioblastoma and offer hope to patients and their families.

Call to Action:

Join the fight against glioblastoma! Spread awareness about this aggressive cancer, support organizations dedicated to research and patient advocacy, and encourage individuals to participate in clinical trials. Together, we can make a difference in the lives of those affected by this devastating disease.