When evaluating simulated clinical cases, OpenAI's GPT-4 chatbot outperformed physicians in clinical reasoning, a cross-sectional study showed.
Median R-IDEA scores -- an assessment of clinical reasoning -- were 10 for the chatbot, 9 for attending physicians, and 8 for internal medicine residents, Adam Rodman, MD, of Beth Israel Deaconess Medical Center in Boston, and colleagues reported in a research letter in JAMA Internal Medicine.
In logistic regression analysis, GPT-4 had the highest estimated probability of achieving high R-IDEA scores (0.99, 95% CI 0.98-1.00) followed by attendings (0.76, 95% CI 0.51-1.00) and residents (0.56, 95% CI 0.23-0.90), with the chatbot being significantly higher than both attendings (P=0.002) and residents (P<0.001), they reported.
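For readers unfamiliar with how a logistic regression yields an "estimated probability," the sketch below shows one common approach: the model produces an estimate on the log-odds scale, and the inverse logit function maps that estimate and its confidence bounds to probabilities. The function names and the numbers are hypothetical and chosen only for illustration; they are not taken from the study, and the researchers may have computed their intervals differently.

```python
import math


def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def probability_with_ci(log_odds: float, std_err: float, z: float = 1.96):
    """Return (estimate, lower, upper) on the probability scale.

    The interval is built on the log-odds scale and then transformed,
    so both bounds stay strictly between 0 and 1.
    """
    lower = inv_logit(log_odds - z * std_err)
    upper = inv_logit(log_odds + z * std_err)
    return inv_logit(log_odds), lower, upper


# Hypothetical log-odds and standard error, for illustration only.
estimate, lower, upper = probability_with_ci(log_odds=1.2, std_err=0.5)
print(f"{estimate:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

A wider standard error, as would be expected with a small group of clinicians, stretches the interval on the probability scale.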
Rodman said that medicine has been searching for ways to improve clinical decision-making, given that misdiagnosis can result in up to 800,000 deaths each year in the U.S. He said large language models (LLMs) like GPT-4 "are one of the most exciting interventions in 50 years" for the clinical reasoning field.
One advantage of AI in this context is that it "doesn't have the cognitive biases humans do," he said. "It's able to second-guess itself and, because of that, can have a more robust differential."
Nonetheless, he said doctors who use GPT-4 today for assistance in diagnosis should be aware of its limitations.
"The justification for physicians using it right now is for experienced physicians who are aware that they could be missing something to de-bias themselves," he said. "You have to have a lot of domain knowledge because otherwise if it tells you something wrong, you might not know."
While GPT-4 distinguished itself on R-IDEA scores in the study, it had similar outcomes to attendings and residents on diagnostic accuracy and cannot-miss diagnoses, the researchers reported. Median inclusion rates of cannot-miss diagnoses in initial differentials were 66.7% for GPT-4, 50% for attendings, and 66.7% for residents.
GPT-4 did, however, have more frequent instances of incorrect clinical reasoning than residents (13.8% vs 2.8%, P=0.04), though not more than attendings (12.5%), they reported.
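As a rough illustration of what a median inclusion rate measures, the sketch below computes, for each case, the share of cannot-miss diagnoses that appear in the initial differential, then takes the median across cases. The case data and the `inclusion_rate` helper are hypothetical and are not drawn from the study; the researchers' exact scoring procedure may differ.

```python
from statistics import median


def inclusion_rate(differential, cannot_miss):
    """Fraction of cannot-miss diagnoses that appear in the differential."""
    if not cannot_miss:
        raise ValueError("case has no cannot-miss diagnoses")
    listed = {d.lower() for d in differential}
    hits = sum(1 for d in cannot_miss if d.lower() in listed)
    return hits / len(cannot_miss)


# Hypothetical cases: (initial differential, cannot-miss diagnoses).
cases = [
    (["pulmonary embolism", "pneumonia", "acute coronary syndrome"],
     ["pulmonary embolism", "acute coronary syndrome", "aortic dissection"]),
    (["meningitis", "subarachnoid hemorrhage", "migraine"],
     ["meningitis", "subarachnoid hemorrhage"]),
    (["appendicitis", "gastroenteritis"],
     ["appendicitis", "ectopic pregnancy"]),
]

rates = [inclusion_rate(diff, must) for diff, must in cases]
median_rate = median(rates)
print(f"Median inclusion rate: {median_rate:.1%}")
```

In this made-up example the per-case rates are two of three, two of two, and one of two, so the median is the two-of-three case.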
"This observation underscores the importance of multifaceted evaluations of LLM capabilities preceding their integration into the clinical workflow," the researchers wrote.
For their analysis, Rodman and colleagues recruited internal medicine residents and attending physicians from two academic medical centers from July to August 2023. For the reasoning assessment, they used 20 clinical cases from NEJM Healer, an educational tool that uses virtual patients to assess clinical reasoning. Each case had four sections: triage presentation; review of systems; physical examination; and diagnostic testing.
Each physician took on one case, writing up a differential diagnosis for each of the four sections, and GPT-4 completed all 20 cases on August 17-18, 2023.
The study was limited by the fact that it was based on clinical data from simulated cases, so it's "unclear how the chatbot would perform in clinical scenarios." Also, they noted that they used a "zero-shot approach" to prompt the chatbot, so iterative training may be able to boost the AI's performance.
"Future research should assess clinical reasoning of the LLM-physician interaction, as LLMs will more likely augment, not replace, the human reasoning process," they concluded.
Disclosures
Rodman reported receiving funding from the Gordon and Betty Moore Foundation. The other authors reported receiving personal fees from publishing companies outside the submitted work or being employed by healthcare companies or medical societies.
The study was supported by the Harvard Clinical and Translational Science Center and Harvard University.
Primary Source
JAMA Internal Medicine
Rodman A, et al "Clinical reasoning of a generative artificial intelligence model compared with physicians" JAMA Intern Med 2024;DOI:10.1001/jamainternmed.2024.0295.