Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model
BACKGROUND: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT for medical queries are not known. METHODS: Thirty-three physicians across 17 specialties...
Main Authors: Johnson, Douglas; Goodman, Rachel; Patrinely, J; Stone, Cosby; Zimmerman, Eli; Donald, Rebecca; Chang, Sam; Berkowitz, Sean; Finn, Avni; Jahangir, Eiman; Scoville, Elizabeth; Reese, Tyler; Friedman, Debra; Bastarache, Julie; van der Heijden, Yuri; Wright, Jordan; Carter, Nicholas; Alexander, Matthew; Choe, Jennifer; Chastain, Cody; Zic, John; Horst, Sara; Turker, Isik; Agarwal, Rajiv; Osmundson, Evan; Idrees, Kamran; Kieman, Colleen; Padmanabhan, Chandrasekhar; Bailey, Christina; Schlegel, Cameron; Chambless, Lola; Gibson, Mike; Osterman, Travis; Wheless, Lee
Format: Online Article Text
Language: English
Published: American Journal Experts, 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10002821/ https://www.ncbi.nlm.nih.gov/pubmed/36909565 http://dx.doi.org/10.21203/rs.3.rs-2566942/v1
_version_ | 1784904467286589440 |
author | Johnson, Douglas; Goodman, Rachel; Patrinely, J; Stone, Cosby; Zimmerman, Eli; Donald, Rebecca; Chang, Sam; Berkowitz, Sean; Finn, Avni; Jahangir, Eiman; Scoville, Elizabeth; Reese, Tyler; Friedman, Debra; Bastarache, Julie; van der Heijden, Yuri; Wright, Jordan; Carter, Nicholas; Alexander, Matthew; Choe, Jennifer; Chastain, Cody; Zic, John; Horst, Sara; Turker, Isik; Agarwal, Rajiv; Osmundson, Evan; Idrees, Kamran; Kieman, Colleen; Padmanabhan, Chandrasekhar; Bailey, Christina; Schlegel, Cameron; Chambless, Lola; Gibson, Mike; Osterman, Travis; Wheless, Lee |
author_sort | Johnson, Douglas |
collection | PubMed |
description | BACKGROUND: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT for medical queries are not known. METHODS: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard, with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; range 1 – completely incorrect to 6 – completely correct) and completeness (3-point Likert scale; range 1 – incomplete to 3 – complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing. RESULTS: Across all questions (n=284), the median accuracy score was 5.5 (between almost completely and completely correct), with a mean score of 4.8 (between mostly and almost completely correct). The median completeness score was 3 (complete and comprehensive), with a mean score of 2.5. For questions rated easy, medium, and hard, median accuracy scores were 6, 5.5, and 5, respectively (means 5.0, 4.7, and 4.6; p=0.05). Accuracy scores for binary and descriptive questions were similar (median 6 vs. 5; mean 4.9 vs. 4.7; p=0.07). Of 36 questions with scores of 1–2, 34 were re-queried and re-graded 8–17 days later, with substantial improvement (median 2 vs. 4; p<0.01). CONCLUSIONS: ChatGPT generated largely accurate information in response to diverse medical queries as judged by academic physician specialists, although with important limitations. Further research and model development are needed to correct inaccuracies and for validation. |
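The analysis described in the abstract reduces to descriptive statistics on ordinal Likert scores plus two nonparametric comparisons: Kruskal-Wallis across the three difficulty groups and Mann-Whitney U between binary and descriptive questions (nonparametric tests are the natural choice because Likert scores are ordinal rather than normally distributed). The sketch below is a minimal illustration of that kind of analysis, not the authors' code; the score arrays are invented placeholders, and SciPy's mannwhitneyu and kruskal functions stand in for the tests named in the abstract.

```python
# Minimal sketch of the statistical comparisons described in the abstract.
# All score values below are hypothetical placeholders, not the study data.
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

# Accuracy scores (1 = completely incorrect ... 6 = completely correct),
# grouped by physician-assigned difficulty (invented for illustration).
easy = np.array([6, 6, 5, 6, 4, 6, 5, 6])
medium = np.array([6, 5, 5, 6, 3, 6, 4, 5])
hard = np.array([5, 4, 6, 5, 2, 6, 5, 3])

# Accuracy scores grouped by answer type (also invented).
binary_q = np.array([6, 6, 5, 6, 4, 6])
descriptive_q = np.array([5, 6, 4, 5, 6, 3])

all_scores = np.concatenate([easy, medium, hard])

# Descriptive statistics, as reported in the abstract (median and mean).
print(f"median accuracy = {np.median(all_scores):.1f}, mean = {all_scores.mean():.1f}")

# Kruskal-Wallis test across the three difficulty groups.
h_stat, p_difficulty = kruskal(easy, medium, hard)
print(f"easy vs. medium vs. hard: H = {h_stat:.2f}, p = {p_difficulty:.3f}")

# Mann-Whitney U test comparing binary vs. descriptive questions.
u_stat, p_type = mannwhitneyu(binary_q, descriptive_q)
print(f"binary vs. descriptive: U = {u_stat:.1f}, p = {p_type:.3f}")
```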
format | Online Article Text |
id | pubmed-10002821 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | American Journal Experts |
record_format | MEDLINE/PubMed |
spelling | pubmed-10002821 2023-03-11; Res Sq (Article); American Journal Experts 2023-02-28; /pmc/articles/PMC10002821/ /pubmed/36909565 http://dx.doi.org/10.21203/rs.3.rs-2566942/v1; Text en; licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator; the license allows for commercial use. |
title | Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10002821/ https://www.ncbi.nlm.nih.gov/pubmed/36909565 http://dx.doi.org/10.21203/rs.3.rs-2566942/v1 |