
How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment

Bibliographic Details
Main Authors: Gilson, Aidan; Safranek, Conrad W; Huang, Thomas; Socrates, Vimig; Chi, Ling; Taylor, Richard Andrew; Chartash, David
Format: Online, Article, Text
Language: English
Published: JMIR Publications, 2023
Subjects: Original Paper
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947764/
https://www.ncbi.nlm.nih.gov/pubmed/36753318
http://dx.doi.org/10.2196/45312
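
The identifiers above can also be resolved programmatically. Below is a minimal Python sketch that pulls this record's summary from NCBI's public E-utilities esummary endpoint; the PMID comes from this record, while the specific JSON field names used at the end (title, source, pubdate) are assumptions based on the standard esummary JSON response and may need adjusting.

```python
# Minimal sketch: fetch this record's PubMed summary via NCBI E-utilities.
# PMID 36753318 is taken from the record above.
import json
import urllib.request

PMID = "36753318"
url = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    f"?db=pubmed&id={PMID}&retmode=json"
)

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Field names below are assumptions based on the usual esummary JSON layout.
summary = data["result"][PMID]
print(summary["title"])    # article title
print(summary["source"])   # journal abbreviation, e.g. "JMIR Med Educ"
print(summary["pubdate"])  # publication date
```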
Collection: PubMed
Description:
BACKGROUND: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.
OBJECTIVE: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination Step 1 and Step 2 exams, as well as to analyze its responses for user interpretability.
METHODS: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each containing questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a question bank commonly used by medical students, which also provides statistics on question difficulty and on performance relative to its user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to that of 2 other large language models, GPT-3 and InstructGPT. The text of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the selected answer, presence of information internal to the question, and presence of information external to the question.
RESULTS: Across the 4 data sets (AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2), ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, while GPT-3 performed similarly to random chance. Within the AMBOSS-Step1 data set, the model showed a significant decrease in performance as question difficulty increased (P=.01). Logical justification for ChatGPT's answer selection was present in 100% of outputs on the NBME data sets, and information internal to the question was present in 96.8% (183/189) of outputs. The presence of information external to the question was 44.5% and 27% lower for incorrect answers than for correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.
CONCLUSIONS: ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By exceeding the 60% threshold on the NBME-Free-Step1 data set, the model performs at the equivalent of a passing score for a third-year medical student. Additionally, ChatGPT provides logical justification and informational context for the majority of its answers. Taken together, these findings make a compelling case for ChatGPT's potential application as an interactive medical education tool to support learning.
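
To make the reported figures concrete, here is a minimal Python sketch (not the authors' code): it recomputes the per-dataset accuracies from the correct/total counts in RESULTS and implements a pooled two-proportion z-test of the kind that could underlie the reported group comparisons. The dataset counts are taken verbatim from the abstract; the comparison at the end is a hypothetical illustration.

```python
# Illustrative sketch, not the study's analysis code: recompute the reported
# per-dataset accuracies and run a pooled two-proportion z-test.
import math

# (correct, total) pairs as reported in RESULTS
datasets = {
    "AMBOSS-Step1":    (44, 100),
    "AMBOSS-Step2":    (42, 100),
    "NBME-Free-Step1": (56, 87),
    "NBME-Free-Step2": (59, 102),
}

for name, (correct, total) in datasets.items():
    print(f"{name}: {correct}/{total} = {correct / total:.1%}")

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical example: compare NBME-Free-Step1 accuracy (56/87)
# against AMBOSS-Step1 accuracy (44/100).
z, p = two_proportion_z_test(56, 87, 44, 100)
print(f"z = {z:.2f}, p = {p:.3f}")
```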
Record ID: pubmed-9947764
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: JMIR Medical Education (JMIR Med Educ)
Article Type: Original Paper
Publication Date: 2023-02-08
Copyright: ©Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash. Originally published in JMIR Medical Education (https://mededu.jmir.org), 08.02.2023.
License: This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.