_version_ 1785114832249290752
author Goodman, Rachel S.
Patrinely, J. Randall
Stone, Cosby A.
Zimmerman, Eli
Donald, Rebecca R.
Chang, Sam S.
Berkowitz, Sean T.
Finn, Avni P.
Jahangir, Eiman
Scoville, Elizabeth A.
Reese, Tyler S.
Friedman, Debra L.
Bastarache, Julie A.
van der Heijden, Yuri F.
Wright, Jordan J.
Ye, Fei
Carter, Nicholas
Alexander, Matthew R.
Choe, Jennifer H.
Chastain, Cody A.
Zic, John A.
Horst, Sara N.
Turker, Isik
Agarwal, Rajiv
Osmundson, Evan
Idrees, Kamran
Kiernan, Colleen M.
Padmanabhan, Chandrasekhar
Bailey, Christina E.
Schlegel, Cameron E.
Chambless, Lola B.
Gibson, Michael K.
Osterman, Travis J.
Wheless, Lee E.
Johnson, Douglas B.
author_facet Goodman, Rachel S.
Patrinely, J. Randall
Stone, Cosby A.
Zimmerman, Eli
Donald, Rebecca R.
Chang, Sam S.
Berkowitz, Sean T.
Finn, Avni P.
Jahangir, Eiman
Scoville, Elizabeth A.
Reese, Tyler S.
Friedman, Debra L.
Bastarache, Julie A.
van der Heijden, Yuri F.
Wright, Jordan J.
Ye, Fei
Carter, Nicholas
Alexander, Matthew R.
Choe, Jennifer H.
Chastain, Cody A.
Zic, John A.
Horst, Sara N.
Turker, Isik
Agarwal, Rajiv
Osmundson, Evan
Idrees, Kamran
Kiernan, Colleen M.
Padmanabhan, Chandrasekhar
Bailey, Christina E.
Schlegel, Cameron E.
Chambless, Lola B.
Gibson, Michael K.
Osterman, Travis J.
Wheless, Lee E.
Johnson, Douglas B.
author_sort Goodman, Rachel S.
collection PubMed
description IMPORTANCE: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency. OBJECTIVE: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence–generated medical information. DESIGN, SETTING, AND PARTICIPANTS: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale, with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023. MAIN OUTCOMES AND MEASURES: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses. RESULTS: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores were 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score, 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), was regenerated and rescored using version 4, with improvement (mean accuracy [SD] score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002). CONCLUSIONS AND RELEVANCE: In this cross-sectional study, the chatbot generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and to validate these tools.
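The score comparisons described in the abstract rely on standard nonparametric tests. The following is a minimal illustrative sketch in Python using scipy.stats (mannwhitneyu and kruskal are real SciPy functions); all score values shown are placeholder numbers for demonstration only, not data from the study.

```python
# Illustrative sketch (not the study's actual analysis code): comparing
# physician-assigned accuracy grades with the nonparametric tests named
# in the abstract. All scores below are placeholder values, not study data.
from scipy.stats import mannwhitneyu, kruskal

# Hypothetical 6-point accuracy grades for binary vs descriptive questions
binary_scores = [6, 5, 6, 4, 6, 5, 3, 6]
descriptive_scores = [5, 4, 6, 3, 5, 6, 4, 5]

# Two-group comparison (e.g., binary vs descriptive): Mann-Whitney U test
u_stat, u_p = mannwhitneyu(binary_scores, descriptive_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, P = {u_p:.3f}")

# Three-group comparison (e.g., easy vs medium vs hard): Kruskal-Wallis test
easy = [6, 6, 5, 6, 5]
medium = [5, 6, 5, 4, 6]
hard = [5, 4, 6, 4, 5]
h_stat, h_p = kruskal(easy, medium, hard)
print(f"Kruskal-Wallis H = {h_stat:.1f}, P = {h_p:.3f}")
```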
format Online
Article
Text
id pubmed-10546234
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Medical Association
record_format MEDLINE/PubMed
spelling pubmed-105462342023-10-04 Accuracy and Reliability of Chatbot Responses to Physician Questions Goodman, Rachel S. Patrinely, J. Randall Stone, Cosby A. Zimmerman, Eli Donald, Rebecca R. Chang, Sam S. Berkowitz, Sean T. Finn, Avni P. Jahangir, Eiman Scoville, Elizabeth A. Reese, Tyler S. Friedman, Debra L. Bastarache, Julie A. van der Heijden, Yuri F. Wright, Jordan J. Ye, Fei Carter, Nicholas Alexander, Matthew R. Choe, Jennifer H. Chastain, Cody A. Zic, John A. Horst, Sara N. Turker, Isik Agarwal, Rajiv Osmundson, Evan Idrees, Kamran Kiernan, Colleen M. Padmanabhan, Chandrasekhar Bailey, Christina E. Schlegel, Cameron E. Chambless, Lola B. Gibson, Michael K. Osterman, Travis J. Wheless, Lee E. Johnson, Douglas B. JAMA Netw Open Original Investigation IMPORTANCE: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency. OBJECTIVE: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence–generated medical information. DESIGN, SETTING, AND PARTICIPANTS: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale, with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023. MAIN OUTCOMES AND MEASURES: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses. RESULTS: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores were 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score, 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), was regenerated and rescored using version 4, with improvement (mean accuracy [SD] score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002). CONCLUSIONS AND RELEVANCE: In this cross-sectional study, the chatbot generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and to validate these tools. American Medical Association 2023-10-02 /pmc/articles/PMC10546234/ /pubmed/37782499 http://dx.doi.org/10.1001/jamanetworkopen.2023.36483 Text en Copyright 2023 Goodman RS et al. JAMA Network Open. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the CC-BY License.
spellingShingle Original Investigation
Goodman, Rachel S.
Patrinely, J. Randall
Stone, Cosby A.
Zimmerman, Eli
Donald, Rebecca R.
Chang, Sam S.
Berkowitz, Sean T.
Finn, Avni P.
Jahangir, Eiman
Scoville, Elizabeth A.
Reese, Tyler S.
Friedman, Debra L.
Bastarache, Julie A.
van der Heijden, Yuri F.
Wright, Jordan J.
Ye, Fei
Carter, Nicholas
Alexander, Matthew R.
Choe, Jennifer H.
Chastain, Cody A.
Zic, John A.
Horst, Sara N.
Turker, Isik
Agarwal, Rajiv
Osmundson, Evan
Idrees, Kamran
Kiernan, Colleen M.
Padmanabhan, Chandrasekhar
Bailey, Christina E.
Schlegel, Cameron E.
Chambless, Lola B.
Gibson, Michael K.
Osterman, Travis J.
Wheless, Lee E.
Johnson, Douglas B.
Accuracy and Reliability of Chatbot Responses to Physician Questions
title Accuracy and Reliability of Chatbot Responses to Physician Questions
title_full Accuracy and Reliability of Chatbot Responses to Physician Questions
title_fullStr Accuracy and Reliability of Chatbot Responses to Physician Questions
title_full_unstemmed Accuracy and Reliability of Chatbot Responses to Physician Questions
title_short Accuracy and Reliability of Chatbot Responses to Physician Questions
title_sort accuracy and reliability of chatbot responses to physician questions
topic Original Investigation
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10546234/
https://www.ncbi.nlm.nih.gov/pubmed/37782499
http://dx.doi.org/10.1001/jamanetworkopen.2023.36483
work_keys_str_mv AT goodmanrachels accuracyandreliabilityofchatbotresponsestophysicianquestions
AT patrinelyjrandall accuracyandreliabilityofchatbotresponsestophysicianquestions
AT stonecosbya accuracyandreliabilityofchatbotresponsestophysicianquestions
AT zimmermaneli accuracyandreliabilityofchatbotresponsestophysicianquestions
AT donaldrebeccar accuracyandreliabilityofchatbotresponsestophysicianquestions
AT changsams accuracyandreliabilityofchatbotresponsestophysicianquestions
AT berkowitzseant accuracyandreliabilityofchatbotresponsestophysicianquestions
AT finnavnip accuracyandreliabilityofchatbotresponsestophysicianquestions
AT jahangireiman accuracyandreliabilityofchatbotresponsestophysicianquestions
AT scovilleelizabetha accuracyandreliabilityofchatbotresponsestophysicianquestions
AT reesetylers accuracyandreliabilityofchatbotresponsestophysicianquestions
AT friedmandebral accuracyandreliabilityofchatbotresponsestophysicianquestions
AT bastarachejuliea accuracyandreliabilityofchatbotresponsestophysicianquestions
AT vanderheijdenyurif accuracyandreliabilityofchatbotresponsestophysicianquestions
AT wrightjordanj accuracyandreliabilityofchatbotresponsestophysicianquestions
AT yefei accuracyandreliabilityofchatbotresponsestophysicianquestions
AT carternicholas accuracyandreliabilityofchatbotresponsestophysicianquestions
AT alexandermatthewr accuracyandreliabilityofchatbotresponsestophysicianquestions
AT choejenniferh accuracyandreliabilityofchatbotresponsestophysicianquestions
AT chastaincodya accuracyandreliabilityofchatbotresponsestophysicianquestions
AT zicjohna accuracyandreliabilityofchatbotresponsestophysicianquestions
AT horstsaran accuracyandreliabilityofchatbotresponsestophysicianquestions
AT turkerisik accuracyandreliabilityofchatbotresponsestophysicianquestions
AT agarwalrajiv accuracyandreliabilityofchatbotresponsestophysicianquestions
AT osmundsonevan accuracyandreliabilityofchatbotresponsestophysicianquestions
AT idreeskamran accuracyandreliabilityofchatbotresponsestophysicianquestions
AT kiernancolleenm accuracyandreliabilityofchatbotresponsestophysicianquestions
AT padmanabhanchandrasekhar accuracyandreliabilityofchatbotresponsestophysicianquestions
AT baileychristinae accuracyandreliabilityofchatbotresponsestophysicianquestions
AT schlegelcamerone accuracyandreliabilityofchatbotresponsestophysicianquestions
AT chamblesslolab accuracyandreliabilityofchatbotresponsestophysicianquestions
AT gibsonmichaelk accuracyandreliabilityofchatbotresponsestophysicianquestions
AT ostermantravisj accuracyandreliabilityofchatbotresponsestophysicianquestions
AT whelessleee accuracyandreliabilityofchatbotresponsestophysicianquestions
AT johnsondouglasb accuracyandreliabilityofchatbotresponsestophysicianquestions