Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study

Bibliographic Details
Main Authors: Flores-Cohaila, Javier A, García-Vicente, Abigaíl, Vizcarra-Jiménez, Sonia F, De la Cruz-Galán, Janith P, Gutiérrez-Arratia, Jesús D, Quiroga Torres, Blanca Geraldine, Taype-Rondan, Alvaro
Format: Online Article Text
Language: English
Published: JMIR Publications 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10570896/
https://www.ncbi.nlm.nih.gov/pubmed/37768724
http://dx.doi.org/10.2196/48039
_version_ 1785119869425942528
author Flores-Cohaila, Javier A
García-Vicente, Abigaíl
Vizcarra-Jiménez, Sonia F
De la Cruz-Galán, Janith P
Gutiérrez-Arratia, Jesús D
Quiroga Torres, Blanca Geraldine
Taype-Rondan, Alvaro
author_facet Flores-Cohaila, Javier A
García-Vicente, Abigaíl
Vizcarra-Jiménez, Sonia F
De la Cruz-Galán, Janith P
Gutiérrez-Arratia, Jesús D
Quiroga Torres, Blanca Geraldine
Taype-Rondan, Alvaro
author_sort Flores-Cohaila, Javier A
collection PubMed
description BACKGROUND: ChatGPT has shown impressive performance on national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), even passing it with expert-level performance. However, there is a lack of research on its performance on the national licensing medical examinations of low-income countries. In Peru, where almost one out of three examinees fails the national licensing medical examination, ChatGPT has the potential to enhance medical education.
OBJECTIVE: We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT.
METHODS: We used the ENAM 2022 data set, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Various prompts were used, and accuracy was evaluated. The performance of ChatGPT was compared to that of a sample of 1025 examinees. Factors such as question type, Peruvian-specific knowledge, discrimination, difficulty, quality of questions, and subject were analyzed to determine their influence on incorrect answers. Questions that received incorrect answers underwent a three-step process involving different prompts to explore the potential impact of adding roles and context on ChatGPT's accuracy.
RESULTS: GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%. The accuracy obtained by the 1025 examinees was 55%. There was fair agreement (κ=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in both the crude and adjusted models for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After reinputting questions that received incorrect answers, GPT-3.5 went from 41 (100%) to 12 (29%) incorrect answers, and GPT-4 from 25 (100%) to 4 (16%).
CONCLUSIONS: Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between GPT-3.5 and GPT-4. Incorrect answers were associated with question difficulty, which may resemble human performance. Furthermore, by reinputting questions that initially received incorrect answers with different prompts containing additional roles and context, ChatGPT achieved improved accuracy.
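As an illustration of the accuracy and agreement figures reported in the RESULTS, the following is a minimal Python sketch, not the authors' code: it recomputes per-model accuracy and Cohen's kappa between GPT-3.5 and GPT-4 answers. The file name graded_answers.csv and its columns (correct, gpt35, gpt4) are hypothetical placeholders.

```python
# Minimal sketch (not the study's analysis code): per-model accuracy and
# Cohen's kappa between GPT-3.5 and GPT-4 answers on a 180-question exam.
# The CSV name and column names are hypothetical placeholders.
import csv
from collections import Counter

def accuracy(predicted, correct):
    """Fraction of questions answered correctly."""
    return sum(p == c for p, c in zip(predicted, correct)) / len(correct)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two answer sets over the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

with open("graded_answers.csv", newline="", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))  # assumed columns: correct, gpt35, gpt4 (options A-E)

correct = [r["correct"] for r in rows]
gpt35 = [r["gpt35"] for r in rows]
gpt4 = [r["gpt4"] for r in rows]

print(f"GPT-3.5 accuracy: {accuracy(gpt35, correct):.0%}")                  # study reports 77%
print(f"GPT-4 accuracy:   {accuracy(gpt4, correct):.0%}")                   # study reports 86%
print(f"Cohen's kappa (GPT-3.5 vs GPT-4): {cohen_kappa(gpt35, gpt4):.2f}")  # study reports 0.38
```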
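The odds ratios for question difficulty could be estimated with crude and adjusted logistic regression models along the following lines. This is only a sketch under assumed variable names (incorrect, high_difficulty, peru_specific, case_vignette, subject) and an assumed input file; the study's actual model specification is not reproduced here.

```python
# Minimal sketch (not the authors' analysis code): crude and adjusted odds
# ratios for an incorrect ChatGPT answer by question difficulty, using
# statsmodels. Data frame, file name, and covariates are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("question_level_results.csv")  # hypothetical file
# assumed columns: incorrect (0/1), high_difficulty (0/1),
# peru_specific (0/1), case_vignette (0/1), subject (categorical)

crude = smf.logit("incorrect ~ high_difficulty", data=df).fit(disp=False)
adjusted = smf.logit(
    "incorrect ~ high_difficulty + peru_specific + case_vignette + C(subject)",
    data=df,
).fit(disp=False)

# Odds ratios and 95% CIs are the exponentiated coefficients and CI bounds.
for name, model in [("crude", crude), ("adjusted", adjusted)]:
    odds_ratio = np.exp(model.params["high_difficulty"])
    lo, hi = np.exp(model.conf_int().loc["high_difficulty"])
    print(f"{name}: OR={odds_ratio:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```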
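The METHODS also mention a three-step reinput process that adds roles and context to questions that were initially answered incorrectly. The exact prompts are not given here, so the sketch below only illustrates the general idea of progressively enriching a prompt, using the OpenAI Python client (v1 chat completions interface); the prompt wording, model name, and helper functions are assumptions for illustration.

```python
# Illustrative sketch only: building three increasingly contextualized prompt
# variants for one multiple-choice question, in the spirit of the three-step
# reinput process described above. Prompt wording and model are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(messages, model="gpt-4"):
    """Send a chat prompt and return the model's text reply."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

def three_step_prompts(question: str):
    """Return three prompt variants: plain question, plus role, plus role and context."""
    base = [{"role": "user", "content": question}]
    with_role = [
        {"role": "system", "content": "You are a physician taking the Peruvian "
                                      "National Licensing Medical Examination (ENAM)."},
        {"role": "user", "content": question},
    ]
    with_role_and_context = [
        with_role[0],
        {"role": "user", "content": "Answer considering Peruvian clinical guidelines "
                                    "and epidemiology. Choose one option (A-E).\n\n" + question},
    ]
    return [base, with_role, with_role_and_context]

# Example usage on one (hypothetical) incorrectly answered question:
# for step, messages in enumerate(three_step_prompts(mcq_text), start=1):
#     print(step, ask(messages))
```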
format Online
Article
Text
id pubmed-10570896
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-10570896 2023-10-14
Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
Flores-Cohaila, Javier A; García-Vicente, Abigaíl; Vizcarra-Jiménez, Sonia F; De la Cruz-Galán, Janith P; Gutiérrez-Arratia, Jesús D; Quiroga Torres, Blanca Geraldine; Taype-Rondan, Alvaro
JMIR Med Educ, Original Paper
JMIR Publications 2023-09-28
/pmc/articles/PMC10570896/
/pubmed/37768724
http://dx.doi.org/10.2196/48039
Text en
©Javier A Flores-Cohaila, Abigaíl García-Vicente, Sonia F Vizcarra-Jiménez, Janith P De la Cruz-Galán, Jesús D Gutiérrez-Arratia, Blanca Geraldine Quiroga Torres, Alvaro Taype-Rondan. Originally published in JMIR Medical Education (https://mededu.jmir.org), 28.09.2023.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Flores-Cohaila, Javier A
García-Vicente, Abigaíl
Vizcarra-Jiménez, Sonia F
De la Cruz-Galán, Janith P
Gutiérrez-Arratia, Jesús D
Quiroga Torres, Blanca Geraldine
Taype-Rondan, Alvaro
Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
title Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
title_full Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
title_fullStr Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
title_full_unstemmed Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
title_short Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
title_sort performance of chatgpt on the peruvian national licensing medical examination: cross-sectional study
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10570896/
https://www.ncbi.nlm.nih.gov/pubmed/37768724
http://dx.doi.org/10.2196/48039
work_keys_str_mv AT florescohailajaviera performanceofchatgptontheperuviannationallicensingmedicalexaminationcrosssectionalstudy
AT garciavicenteabigail performanceofchatgptontheperuviannationallicensingmedicalexaminationcrosssectionalstudy
AT vizcarrajimenezsoniaf performanceofchatgptontheperuviannationallicensingmedicalexaminationcrosssectionalstudy
AT delacruzgalanjanithp performanceofchatgptontheperuviannationallicensingmedicalexaminationcrosssectionalstudy
AT gutierrezarratiajesusd performanceofchatgptontheperuviannationallicensingmedicalexaminationcrosssectionalstudy
AT quirogatorresblancageraldine performanceofchatgptontheperuviannationallicensingmedicalexaminationcrosssectionalstudy
AT tayperondanalvaro performanceofchatgptontheperuviannationallicensingmedicalexaminationcrosssectionalstudy