Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment

Bibliographic Details
Main Authors: Lai, U Hin, Wu, Keng Sam, Hsu, Ting-Yu, Kan, Jessie Kai Ching
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Medicine
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547055/
https://www.ncbi.nlm.nih.gov/pubmed/37795422
http://dx.doi.org/10.3389/fmed.2023.1240915
author Lai, U Hin
Wu, Keng Sam
Hsu, Ting-Yu
Kan, Jessie Kai Ching
collection PubMed
description INTRODUCTION: Recent developments in artificial intelligence large language models (LLMs), such as ChatGPT, have enabled the understanding and generation of human-like text. Studies have found that LLMs perform well in various examinations, including law, business and medicine. This study aims to evaluate the performance of ChatGPT in the United Kingdom Medical Licensing Assessment (UKMLA). METHODS: Two publicly available UKMLA papers consisting of 200 single-best-answer (SBA) questions were screened. Nine SBAs were omitted as they contained images that were not suitable for input. Each question was assigned a specialty based on the UKMLA content map published by the General Medical Council. A total of 191 SBAs were input into ChatGPT-4 across three attempts over the course of three weeks (once per week). RESULTS: ChatGPT scored 74.9% (143/191), 78.0% (149/191) and 75.6% (145/191) on the three attempts, respectively. The average across all three attempts was 76.3% (437/573), with a 95% confidence interval of 74.46% to 78.08%. ChatGPT answered 129 SBAs correctly and 32 SBAs incorrectly on all three attempts. Across the three attempts, ChatGPT performed well in mental health (8/9 SBAs), cancer (11/14 SBAs) and cardiovascular (10/13 SBAs), but did not perform well in clinical haematology (3/7 SBAs), endocrine and metabolic (2/5 SBAs) and gastrointestinal including liver (3/10 SBAs). Regarding response consistency, ChatGPT provided consistently correct answers in 67.5% (129/191) of SBAs, consistently incorrect answers in 12.6% (24/191) and inconsistent responses in 19.9% (38/191). DISCUSSION AND CONCLUSION: This study suggests that ChatGPT performs well on the UKMLA, and performance appeared to vary across specialties. The ability of LLMs to correctly answer SBAs suggests that they could be utilised as a supplementary learning tool in medical education, with appropriate supervision from medical educators.
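As an aside on the arithmetic: the pooled figure above combines 143 + 149 + 145 = 437 correct answers over 3 × 191 = 573 responses. A minimal Python sketch of this pooling follows; note that the paper does not state how its 95% confidence interval was derived, so the Wald (normal-approximation) interval computed here is an assumption and need not reproduce the reported 74.46% to 78.08%.

    # Minimal sketch (assumption: Wald interval; the paper's CI method is unstated).
    import math

    correct_per_attempt = [143, 149, 145]   # reported scores out of 191 SBAs each
    n_questions = 191
    total_correct = sum(correct_per_attempt)                 # 437
    total_answers = n_questions * len(correct_per_attempt)   # 573
    p = total_correct / total_answers                        # ~0.763 -> 76.3%

    # Wald 95% CI: p +/- 1.96 * sqrt(p * (1 - p) / n)
    half_width = 1.96 * math.sqrt(p * (1 - p) / total_answers)
    print(f"pooled score: {p:.1%}")
    print(f"Wald 95% CI: ({p - half_width:.2%}, {p + half_width:.2%})")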
format Online
Article
Text
id pubmed-10547055
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10547055 2023-10-04 Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment. Lai, U Hin; Wu, Keng Sam; Hsu, Ting-Yu; Kan, Jessie Kai Ching. Front Med (Lausanne), Medicine. Frontiers Media S.A., published online 2023-09-19. Text en Copyright © 2023 Lai, Wu, Hsu and Kan. Open access under the Creative Commons Attribution License (CC BY): https://creativecommons.org/licenses/by/4.0/
title Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment
topic Medicine
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547055/
https://www.ncbi.nlm.nih.gov/pubmed/37795422
http://dx.doi.org/10.3389/fmed.2023.1240915