Exploring the Diagnostic Reasoning Capabilities of Large Language Models: GPT-3.5 and GPT-4

Abstract

With the advent of large language models (LLMs) such as GPT-3.5 and GPT-4, there is considerable anticipation in healthcare about their potential to transform clinical diagnosis. This study examines the diagnostic reasoning abilities of GPT-3.5 and GPT-4, asking whether these models can simulate the clinical reasoning processes of human healthcare providers.

Introduction

LLMs have demonstrated human-like performance on a range of tasks, including writing clinical notes, passing medical exams, and generating patient summaries. However, their ability to engage in clinical diagnostic reasoning, a cornerstone of medical practice, remains largely unexplored.

Study Design and Methods

To assess the diagnostic reasoning capabilities of GPT-3.5 and GPT-4, the researchers used two datasets: the revised MedQA United States Medical Licensing Exam (USMLE) dataset and a collection of case series from the New England Journal of Medicine (NEJM).

To evaluate the models' performance, the team used a range of prompting techniques, including conventional chain-of-thought (CoT) prompting and novel diagnostic reasoning prompts inspired by the cognitive processes of human clinicians. These prompts were designed to mirror the reasoning clinicians use when forming differential diagnoses, reasoning analytically, applying Bayesian inference, and reasoning intuitively.
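
The exact prompt wording is not reproduced here; the minimal Python sketch below only illustrates how such strategy-specific prompt templates might be assembled. All wording is hypothetical rather than the authors' prompts.

    # Illustrative strategy-specific prompt templates for diagnostic reasoning.
    # All wording is hypothetical and not taken from the study's actual prompts.
    PROMPT_TEMPLATES = {
        "chain_of_thought": (
            "Read the clinical vignette and reason through it step by step "
            "before stating a final diagnosis.\n\nCase: {case}"
        ),
        "differential": (
            "List the most likely differential diagnoses for the case below, rule "
            "each one in or out using the findings, then state the final diagnosis."
            "\n\nCase: {case}"
        ),
        "analytical": (
            "Relate each symptom, sign, and laboratory result to the underlying "
            "pathophysiology, then state the diagnosis it best supports.\n\nCase: {case}"
        ),
        "bayesian": (
            "Start from the pre-test probability of each candidate diagnosis, update "
            "it with every finding, and report the most probable diagnosis.\n\nCase: {case}"
        ),
        "intuitive": (
            "State which typical presentation this case most closely resembles and "
            "briefly justify the pattern match.\n\nCase: {case}"
        ),
    }

    def build_prompt(strategy: str, case_text: str) -> str:
        """Fill the chosen reasoning-strategy template with a clinical vignette."""
        return PROMPT_TEMPLATES[strategy].format(case=case_text)

    # Example: build_prompt("bayesian", "A 54-year-old presents with acute chest pain...")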

Results

The study found that GPT-4 performed well on clinical diagnostic reasoning tasks, consistently outperforming GPT-3.5. GPT-4 achieved accuracies of 76% with classical chain-of-thought prompting, 77% with intuitive-type reasoning, 78% with differential diagnostic reasoning, 78% with analytical reasoning, and 72% with Bayesian inference prompts.

Interestingly, the study found that prompts promoting step-by-step reasoning and focusing on a single diagnostic reasoning strategy yielded superior results compared to those combining multiple strategies. This suggests that LLMs may benefit from structured and focused prompts to enhance their diagnostic reasoning capabilities.
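
For illustration only, the contrast between a focused single-strategy prompt and one that mixes strategies might look like the following (hypothetical wording, not quoted from the study):

    # Hypothetical contrast between a focused prompt and a mixed-strategy prompt;
    # neither is quoted from the study.
    single_strategy = (
        "Use differential diagnostic reasoning only: list candidate diagnoses, "
        "then rule each in or out step by step using the case findings.\n\nCase: {case}"
    )

    mixed_strategy = (
        "Combine differential, Bayesian, and intuitive reasoning: list candidates, "
        "update their probabilities with each finding, note which classic presentation "
        "the case resembles, and give a final diagnosis.\n\nCase: {case}"
    )
    # Per the study's finding, prompts like the first, which keep the model on a single
    # step-by-step strategy, scored higher than combined prompts like the second.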

Discussion

The study’s findings provide valuable insights into the diagnostic reasoning abilities of LLMs, highlighting their potential to assist clinicians in diverse healthcare settings. However, further research is warranted to fully comprehend the limitations and biases of these models and to develop strategies for their safe and effective integration into clinical practice.

Conclusion

This study represents a significant leap forward in understanding the clinical diagnostic reasoning capabilities of LLMs. The findings suggest that these models have the potential to revolutionize clinical diagnosis, but further research is necessary to ensure their accuracy, reliability, and trustworthiness in real-world clinical settings.