Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments
Main authors: Beaulieu-Jones, Brendin R; Shah, Sahaj; Berrigan, Margaret T; Marwaha, Jayson S; Lai, Shuo-Lun; Brat, Gabriel A
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory, 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10371188/ https://www.ncbi.nlm.nih.gov/pubmed/37502981 http://dx.doi.org/10.1101/2023.07.16.23292743
_version_ | 1785078101771812864 |
author | Beaulieu-Jones, Brendin R; Shah, Sahaj; Berrigan, Margaret T; Marwaha, Jayson S; Lai, Shuo-Lun; Brat, Gabriel A
author_facet | Beaulieu-Jones, Brendin R; Shah, Sahaj; Berrigan, Margaret T; Marwaha, Jayson S; Lai, Shuo-Lun; Brat, Gabriel A
author_sort | Beaulieu-Jones, Brendin R |
collection | PubMed |
description | BACKGROUND: Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions. METHODS: We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat encounters. RESULTS: A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with a circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of inaccurate questions; the response accuracy changed for 6/16 questions.
This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care. |
format | Online Article Text |
id | pubmed-10371188 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-103711882023-07-27 Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments Beaulieu-Jones, Brendin R Shah, Sahaj Berrigan, Margaret T Marwaha, Jayson S Lai, Shuo-Lun Brat, Gabriel A medRxiv Article BACKGROUND: Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions. METHODS: We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat encounters. RESULTS: A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with a circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of inaccurate questions; the response accuracy changed for 6/16 questions.
CONCLUSION: Consistent with prior findings, we demonstrate robust near or above human-level performance of ChatGPT within the surgical domain. Unique to this study, we demonstrate a substantial inconsistency in ChatGPT responses with repeat query. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care. Cold Spring Harbor Laboratory 2023-07-24 /pmc/articles/PMC10371188/ /pubmed/37502981 http://dx.doi.org/10.1101/2023.07.16.23292743 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article Beaulieu-Jones, Brendin R Shah, Sahaj Berrigan, Margaret T Marwaha, Jayson S Lai, Shuo-Lun Brat, Gabriel A Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments |
title | Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments |
title_full | Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments |
title_fullStr | Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments |
title_full_unstemmed | Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments |
title_short | Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments |
title_sort | evaluating capabilities of large language models: performance of gpt4 on surgical knowledge assessments |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10371188/ https://www.ncbi.nlm.nih.gov/pubmed/37502981 http://dx.doi.org/10.1101/2023.07.16.23292743 |
work_keys_str_mv | AT beaulieujonesbrendinr evaluatingcapabilitiesoflargelanguagemodelsperformanceofgpt4onsurgicalknowledgeassessments AT shahsahaj evaluatingcapabilitiesoflargelanguagemodelsperformanceofgpt4onsurgicalknowledgeassessments AT berriganmargarett evaluatingcapabilitiesoflargelanguagemodelsperformanceofgpt4onsurgicalknowledgeassessments AT marwahajaysons evaluatingcapabilitiesoflargelanguagemodelsperformanceofgpt4onsurgicalknowledgeassessments AT laishuolun evaluatingcapabilitiesoflargelanguagemodelsperformanceofgpt4onsurgicalknowledgeassessments AT bratgabriela evaluatingcapabilitiesoflargelanguagemodelsperformanceofgpt4onsurgicalknowledgeassessments |