
Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model

BACKGROUND: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT for medical queries is not known. METHODS: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard, with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; range 1 – completely incorrect to 6 – completely correct) and completeness (3-point Likert scale; range 1 – incomplete to 3 – complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing. RESULTS: Across all questions (n=284), the median accuracy score was 5.5 (between almost completely and completely correct) with a mean score of 4.8 (between mostly and almost completely correct). The median completeness score was 3 (complete and comprehensive) with a mean score of 2.5. For questions rated easy, medium, and hard, median accuracy scores were 6, 5.5, and 5 (mean 5.0, 4.7, and 4.6; p=0.05). Accuracy scores for binary and descriptive questions were similar (median 6 vs. 5; mean 4.9 vs. 4.7; p=0.07). Of 36 questions with scores of 1-2, 34 were re-queried/re-graded 8-17 days later with substantial improvement (median 2 vs. 4; p<0.01). CONCLUSIONS: ChatGPT generated largely accurate information in response to diverse medical queries as judged by academic physician specialists, although with important limitations. Further research and model development are needed to correct inaccuracies and for validation.
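
The grading and comparison procedure described in the abstract (ordinal Likert scores summarized by median and mean, then compared with Mann-Whitney U or Kruskal-Wallis tests) can be sketched with standard statistical tooling. The following minimal Python sketch is illustrative only: the score arrays and group labels are hypothetical stand-ins, not the study's data, and this is not the authors' analysis code.

# Illustrative sketch only: hypothetical scores, not the study's data or analysis code.
import numpy as np
from scipy import stats

# Hypothetical 6-point accuracy scores grouped by physician-rated difficulty.
easy   = np.array([6, 6, 5, 6, 4, 6, 5])
medium = np.array([5, 6, 4, 6, 5, 3, 6])
hard   = np.array([5, 4, 6, 3, 5, 2, 6])

# Descriptive statistics, mirroring the paper's per-group summaries (median and mean).
for name, scores in [("easy", easy), ("medium", medium), ("hard", hard)]:
    print(f"{name}: median={np.median(scores):.1f}, mean={scores.mean():.2f}")

# Kruskal-Wallis test across the three difficulty groups (more than two groups).
h_stat, p_kw = stats.kruskal(easy, medium, hard)
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.3f}")

# Mann-Whitney U test for a two-group comparison (e.g., binary vs. descriptive questions).
binary = np.array([6, 6, 5, 6, 4, 6])
descriptive = np.array([5, 5, 6, 4, 5, 3])
u_stat, p_mw = stats.mannwhitneyu(binary, descriptive, alternative="two-sided")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_mw:.3f}")

Both tests are rank-based, which suits ordinal Likert-style scores and matches the nonparametric approach named in the abstract.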

Bibliographic Details
Main Authors: Johnson, Douglas; Goodman, Rachel; Patrinely, J; Stone, Cosby; Zimmerman, Eli; Donald, Rebecca; Chang, Sam; Berkowitz, Sean; Finn, Avni; Jahangir, Eiman; Scoville, Elizabeth; Reese, Tyler; Friedman, Debra; Bastarache, Julie; van der Heijden, Yuri; Wright, Jordan; Carter, Nicholas; Alexander, Matthew; Choe, Jennifer; Chastain, Cody; Zic, John; Horst, Sara; Turker, Isik; Agarwal, Rajiv; Osmundson, Evan; Idrees, Kamran; Kieman, Colleen; Padmanabhan, Chandrasekhar; Bailey, Christina; Schlegel, Cameron; Chambless, Lola; Gibson, Mike; Osterman, Travis; Wheless, Lee
Format: Online Article Text
Language: English
Published: American Journal Experts, 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10002821/
https://www.ncbi.nlm.nih.gov/pubmed/36909565
http://dx.doi.org/10.21203/rs.3.rs-2566942/v1
Collection: PubMed
Record ID: pubmed-10002821
Institution: National Center for Biotechnology Information
Record format: MEDLINE/PubMed
Source: Res Sq
Published online: 2023-02-28
License: Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.