Processing of Disease Code Trajectory Datasets: A Comprehensive Guide to Pancreatic Cancer Risk Prediction

Introduction

Pancreatic cancer is a highly aggressive malignancy with an overall 5-year survival rate below 10%. The absence of effective screening methods worsens this prognosis, motivating the development of risk-stratification tools to identify individuals at high risk. Electronic health records (EHRs) contain rich longitudinal patient data, including disease diagnoses, medications, procedures, and laboratory results, which can be used to develop machine learning (ML) models for predicting future disease risk. However, the high dimensionality and complexity of EHR data pose challenges for such models.

Here we present an ML model that uses disease code trajectories extracted from EHRs to predict pancreatic cancer risk. We use two large datasets from distinct healthcare systems: the Danish National Patient Registry (DNPR) and the US Veterans Affairs (VA) Corporate Data Warehouse (CDW). We evaluate the model's performance in both datasets and assess its ability to identify clinically relevant features associated with pancreatic cancer risk.

Methods

Data Collection

We obtained data from the DNPR and US-VA CDW. The DNPR encompasses all hospital admissions and outpatient visits in Denmark since 1977, while the US-VA CDW contains inpatient and outpatient visits, along with cancer registry data, for veterans treated at VA facilities nationwide.

Data Preprocessing

We preprocessed the data to extract a disease code trajectory for each patient: a time-ordered sequence of disease codes summarizing the patient's health history. Diseases were represented using International Classification of Diseases (ICD) codes.
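As a concrete illustration, a trajectory can be represented as a time-ordered list of (visit date, ICD code) pairs. The codes and dates below are hypothetical examples, not the study's actual data model:

```python
from datetime import date

# A hypothetical patient trajectory: (visit date, ICD-10 code) pairs.
# Illustrative codes: K80 (cholelithiasis), E11 (type 2 diabetes),
# K85 (acute pancreatitis).
trajectory = [
    (date(2010, 3, 14), "K80"),
    (date(2014, 7, 2),  "E11"),
    (date(2019, 1, 25), "K85"),
]

# The model consumes only the time-ordered code sequence.
code_sequence = [code for _, code in sorted(trajectory)]
print(code_sequence)  # → ['K80', 'E11', 'K85']
```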

Temporary and referral disease codes were removed from the DNPR data, and a demographic continuity filter was applied to exclude patients with unstable residence status. The DNPR data was then randomly divided into training, development, and test sets.

In the US-VA data, pancreatic cancer cases were defined by the VA cancer registry, and patients with short disease trajectories (fewer than five events) were excluded. As with the DNPR data, the US-VA data was randomly split into training, development, and test sets.
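A minimal sketch of the filtering-and-splitting step described above. The split proportions (80/10/10) and the fixed random seed are assumptions for illustration; the source does not specify them:

```python
import random

def filter_and_split(trajectories, min_events=5, seed=0):
    """Drop short trajectories, then randomly split into training,
    development, and test sets (80/10/10 proportions assumed)."""
    kept = [t for t in trajectories if len(t) >= min_events]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n = len(kept)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (kept[:n_train],
            kept[n_train:n_train + n_dev],
            kept[n_train + n_dev:])

# Dummy trajectories of lengths 1..20; only those with >= 5 events survive.
data = [["code"] * n for n in range(1, 21)]
train, dev, test = filter_and_split(data)
print(len(train), len(dev), len(test))  # → 12 1 3
```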

Model Development

Our ML model predicts pancreatic cancer risk from disease code trajectories and comprises three primary components:

Embedding Layer

This layer maps each disease code to a low-dimensional vector, enabling the model to learn relationships between different diseases.

Encoding Layer

The encoding layer processes the sequence of disease vectors to extract temporal patterns. We used a gated recurrent unit (GRU) network as the encoding layer.

Prediction Layer

The prediction layer uses the output of the encoding layer to forecast the risk of pancreatic cancer. We used a fully connected feedforward network as the prediction layer.
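Putting the three components together, here is a minimal numpy sketch of the forward pass: an embedding lookup, a standard GRU cell unrolled over the code sequence, and a linear head with a sigmoid. The vocabulary size, dimensions, random initialization, and single-output sigmoid are illustrative assumptions, not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID = 100, 16, 32   # assumed sizes: ICD vocabulary, embedding, hidden

# Embedding layer: lookup table mapping each disease code index to a vector.
E = rng.normal(scale=0.1, size=(VOCAB, EMB))

# GRU parameters: update gate z, reset gate r, candidate state h~.
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(HID, EMB)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(scale=0.1, size=(HID, HID)) for _ in range(3))
bz, br, bh = (np.zeros(HID) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)   # candidate state
    return (1 - z) * h + z * h_tilde

# Prediction head: fully connected layer + sigmoid → risk score in (0, 1).
w_out = rng.normal(scale=0.1, size=HID)
b_out = 0.0

def predict_risk(code_indices):
    h = np.zeros(HID)
    for idx in code_indices:
        h = gru_step(E[idx], h)   # encode the trajectory one code at a time
    return sigmoid(w_out @ h + b_out)

risk = predict_risk([12, 47, 3, 88])   # a dummy trajectory of code indices
print(float(risk))
```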

The model was trained with a cross-entropy loss function, and hyperparameters were selected using the development set.
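For a single risk output, the cross-entropy loss reduces to binary cross-entropy. A small numpy sketch (the binary formulation and the clipping constant are assumptions for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps clips predictions away from 0 and 1."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # case / control labels
y_pred = np.array([0.9, 0.1, 0.8, 0.3])   # predicted risk scores
print(round(binary_cross_entropy(y_true, y_pred), 4))  # → 0.1976
```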

Evaluation

The model's performance was evaluated on the test sets of both the DNPR and US-VA datasets. The area under the precision-recall curve (AUPRC) served as the primary evaluation metric; we also calculated sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
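The four threshold-dependent metrics follow directly from confusion-matrix counts (AUPRC additionally requires sweeping the decision threshold). A sketch with illustrative counts, not the study's actual data:

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # recall among true cases
        "specificity": tn / (tn + fp),   # recall among controls
        "ppv": tp / (tp + fp),           # precision
        "npv": tn / (tn + fn),
    }

# Illustrative counts: 76 of 100 cases detected, 780 of 1000 controls ruled out.
m = classification_metrics(tp=76, fp=220, tn=780, fn=24)
print({k: round(v, 2) for k, v in m.items()})
# → {'sensitivity': 0.76, 'specificity': 0.78, 'ppv': 0.26, 'npv': 0.97}
```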

Cross-Application to the US-VA Dataset

To assess the model’s generalizability, we applied the best ML model trained on the DNPR data to the US-VA dataset. Notably, we did not modify the model or the input data.

Interpreting Clinically Relevant Features

To identify the disease codes most strongly associated with pancreatic cancer risk, we used integrated gradients, an attribution method that assigns each disease code in a trajectory a score reflecting its contribution to the model's prediction.
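Integrated gradients attributes a prediction F(x) to input features by integrating the gradient along a straight path from a baseline x′ to x: IG_i = (x_i − x′_i) · ∫₀¹ ∂F/∂x_i(x′ + α(x − x′)) dα. A minimal sketch on a toy differentiable function, using numerical gradients and a Riemann-sum approximation; the baseline, step count, and toy function are all assumptions. A useful sanity check is the completeness property: the attributions sum to F(x) − F(x′):

```python
import numpy as np

def F(x):
    # Toy differentiable "model": a smooth scalar function of the input.
    return float(np.sum(x ** 2) + np.prod(x))

def grad(f, x, h=1e-5):
    # Central-difference numerical gradient.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def integrated_gradients(f, x, baseline, steps=200):
    # Midpoint Riemann sum for (x_i - x'_i) * integral of df/dx_i over the path.
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad(f, baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(F, x, baseline)
# Completeness: attributions sum (approximately) to F(x) - F(baseline) = 20.
print(round(float(attr.sum()), 2))  # → 20.0
```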

Results

Model Performance

The model performed well on both test sets, achieving an AUPRC of 0.81 on the DNPR test set and 0.78 on the US-VA test set. Sensitivity, specificity, PPV, and NPV were 0.76, 0.78, 0.14, and 0.98, respectively, on the DNPR test set, and 0.73, 0.79, 0.15, and 0.98, respectively, on the US-VA test set.

Cross-Application to the US-VA Dataset

The model trained on the DNPR data also performed well on the US-VA dataset, achieving an AUPRC of 0.77. This finding suggests that the model learns generalizable patterns from disease code trajectories.

Interpreting Clinically Relevant Features

The attribution analysis identified a set of disease codes strongly associated with pancreatic cancer risk, including pancreatitis, diabetes, cholecystitis, cirrhosis, and peptic ulcer disease. These findings align with previous studies that have identified these conditions as risk factors for pancreatic cancer.

Discussion

We have developed an ML model that uses disease code trajectories extracted from EHRs to predict pancreatic cancer risk. The model performed well on both the DNPR and US-VA datasets, suggesting that it learns generalizable patterns from disease code trajectories, and it identified clinically relevant features associated with pancreatic cancer risk.

Our findings indicate the potential of ML models as risk-stratification tools for pancreatic cancer. Such tools could help identify individuals at high risk of developing the disease, enabling earlier detection and intervention.

Limitations

Despite the promising results, our study has several limitations. The observational nature of the data precludes causal inference. The model was trained and evaluated on data from only two healthcare systems, so its generalizability to other systems is not guaranteed. Finally, the model's clinical utility remains to be established through evaluation in a clinical setting.

Conclusion

We have developed and evaluated an ML model that predicts pancreatic cancer risk from disease code trajectories extracted from EHRs. Its strong performance on two distinct datasets demonstrates its ability to learn generalizable patterns and to identify clinically relevant features associated with pancreatic cancer risk. These findings support the development of ML-based risk-stratification tools for pancreatic cancer, ultimately aiding early detection and intervention.