The artificial intelligence (AI) chatbot ChatGPT outperformed physicians when answering patient questions, based on quality of response and empathy, according to a cross-sectional study.
Of 195 exchanges, evaluators preferred ChatGPT responses to physician responses in 78.6% (95% CI 75.0-81.8) of the 585 evaluations, reported John Ayers, PhD, MA, of the Qualcomm Institute at the University of California San Diego in La Jolla, and co-authors.
The AI chatbot responses were given a significantly higher quality rating than physician responses (t=13.3, P<0.001), with the proportion of responses rated as good or very good quality (≥4) higher for ChatGPT (78.5%) than physicians (22.1%), amounting to a 3.6 times higher prevalence of good or very good quality responses for the chatbot, they noted in JAMA Internal Medicine.
Furthermore, ChatGPT's responses were rated as being significantly more empathetic than physician responses (t=18.9, P<0.001), with the proportion of responses rated as empathetic or very empathetic (≥4) higher for ChatGPT (45.1%) than for physicians (4.6%), amounting to a 9.8 times higher prevalence of empathetic or very empathetic responses for the chatbot.
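The "times higher" figures above are simple ratios of the two reported proportions. As a minimal sketch for checking the arithmetic (the function name `prevalence_ratio` is illustrative and not taken from the study), the calculation looks like this:

```python
def prevalence_ratio(chatbot_pct: float, physician_pct: float) -> float:
    """Return the ratio of two prevalences, each given as a percentage."""
    return chatbot_pct / physician_pct


# Proportions of responses rated 4 or higher, as reported in the article.
quality_ratio = prevalence_ratio(78.5, 22.1)  # good or very good quality
empathy_ratio = prevalence_ratio(45.1, 4.6)   # empathetic or very empathetic

print(f"quality: {quality_ratio:.1f} times higher")
print(f"empathy: {empathy_ratio:.1f} times higher")
```

Rounded to one decimal place, the two ratios match the 3.6 and 9.8 figures cited above.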
"ChatGPT provides a better answer," Ayers said. "I think of our study as a phase zero study, and it clearly shows that ChatGPT wins in a landslide compared to physicians, and I wouldn't say we expected that at all."
He said they were trying to figure out how ChatGPT, developed by OpenAI, could potentially help resolve the burden of answering patient messages for physicians, which he noted is a well-documented contributor to burnout.
Ayers said that he approached this study with his focus on another population as well, pointing out that the burnout crisis might be affecting roughly 1.1 million providers across the U.S., but it is also affecting about 329 million patients who are engaging with overburdened healthcare professionals.
"There are a lot of people out there asking questions that maybe go unanswered or get bad answers. What do we do to help them?" he said. "I think AI-assisted messaging could be a game changer for public health."
He noted that AI-assisted messaging could change patient outcomes, and he wants to see more studies that focus on evaluating these outcomes. He said he hopes this study will motivate more research on this use of AI because of its potential to improve productivity and free up clinical staff for more complex tasks.
In an accompanying commentary, Jonathan H. Chen, MD, PhD, of Stanford University School of Medicine in Palo Alto, California, and co-authors highlighted ways for physicians to begin integrating the technology into clinical practice, such as using it to simplify text-based tasks or improve medical training, but cautioned that AI models also have the potential to exacerbate biases and produce other harms.
"Medicine is much more than just processing information and associating words with concepts; it is ascribing meaning to those concepts while connecting with patients as a trusted partner to build healthier lives," they wrote.
In another accompanying article, Teva D. Brender, MD, of the University of California San Francisco School of Medicine, wrote that the promise of AI to ease the burden of documentation and other common, often repetitive, written tasks should be weighed against the potential harms, such as adding to "note bloat" or exacerbating existing biases.
"Physicians will need to learn how to integrate these tools into clinical practice, defining clear boundaries between full, supervised, and proscribed autonomy," he added. "And yet, I am cautiously optimistic about a future of improved healthcare system efficiency, better patient outcomes, and reduced burnout."
After seeing the results of this study, Ayers thinks that the research community should be working on randomized controlled trials to study the effects of AI messaging, so that the future development of AI models will be able to account for patient outcomes.
"If we do the studies and if we create the incentives for patient outcomes to become the priority of AI system messaging, then we can discover those benefits and maximize those and we can discover any incidental harms and minimize those," he added. "I'm pretty optimistic about what it could do for people's health."
For this study, the researchers randomly selected 195 exchanges from a public Reddit forum in October 2022 in which a verified physician responded to a public question. The researchers entered each original question into a new session of ChatGPT 3.5 in late December, with the physicians anonymized.
Each set of questions and responses was evaluated by three licensed physicians who were asked to choose "which response was better," and to judge "the quality of information provided" and "the empathy or bedside manner provided." They scored each assessment on five-tier scales, from "very poor" to "very good" for quality and from "not empathetic" to "very empathetic" for empathy.
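The total of 585 evaluations follows from this design: three evaluators each judged all 195 exchanges. A brief sketch of that arithmetic is below; the variable names are illustrative, and the count of evaluations favoring ChatGPT is an approximation derived from the reported 78.6%, not a figure stated in the article.

```python
EXCHANGES = 195               # question-and-response sets sampled
EVALUATORS_PER_EXCHANGE = 3   # licensed physicians rating each set
PREFERRED_SHARE = 0.786       # share of evaluations preferring ChatGPT

total_evaluations = EXCHANGES * EVALUATORS_PER_EXCHANGE

# Approximate count only: back-calculated from a rounded percentage.
approx_preferred = round(total_evaluations * PREFERRED_SHARE)

print(f"total evaluations: {total_evaluations}")
print(f"approx. evaluations preferring ChatGPT: {approx_preferred}")
```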
Physician responses were significantly shorter on average than chatbot responses (mean 52 words vs 211 words; t=25.4, P<0.001).
Ayers and co-authors noted several limitations to their study, including the fact that it was not designed to show how ChatGPT would perform in a clinical setting. In addition, the measures of quality and empathy were not validated, and the evaluators did not assess responses for accuracy.
Disclosures
This work was supported by the Burroughs Wellcome Fund, University of California San Diego PREPARE Institute, and National Institutes of Health. A co-author acknowledged salary support from a grant from the National Institute on Drug Abuse.
Ayers reported owning equity in companies focused on data analytics, Good Analytics, of which he was CEO until June 2018, and HealthWatcher. Co-authors reported relationships with Bloomberg LP, Sickweather, Good Analytics, Seattle Genetics, LifeLink, Doximity, Linear Therapies, Arena Pharmaceuticals, Model Medicines, Pharma Holdings, Bayer Pharmaceuticals, Evidera, Signant Health, Fluxergy, Lucira, and Kiadis.
Chen reported grants from Stanford Artificial Intelligence in Medicine and Imaging–Human-Centered Artificial Intelligence, the National Institutes of Health/National Institute on Drug Abuse Clinical Trials Network, Google, the Doris Duke Foundation COVID-19 Fund to Retain Clinical Scientists, and the American Heart Association Strategically Focused Research Network–Diversity in Clinical Trials; co-ownership of Reaction Explorer; and personal fees from Younker Hyde Macfarlane and Sutton Pierce. A co-author reported personal fees from Roche and grants from Google.
Brender reported no conflicts of interest.
Primary Source
JAMA Internal Medicine
Ayers JW, et al "Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum" JAMA Intern Med 2023; DOI: 10.1001/jamainternmed.2023.1838.
Secondary Source
JAMA Internal Medicine
Li R, et al "How chatbots and large language model artificial intelligence systems will reshape modern medicine: fountain of creativity or Pandora's box?" JAMA Intern Med 2023; DOI: 10.1001/jamainternmed.2023.1835.
Additional Source
JAMA Internal Medicine
Brender TD "Medicine in the era of artificial intelligence: Hey chatbot, write me an H&P" JAMA Intern Med 2023; DOI: 10.1001/jamainternmed.2023.1832.