Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations

INTRODUCTION: Artificial intelligence (AI) programs can answer complex queries, including medical profession examination questions. The purpose of this study was to compare the performance of orthopaedic residents (ortho residents) against Chat Generative Pretrained Transformer (ChatGPT)-3.5 and GPT-4 on orthopaedic assessment examinations. A secondary objective was to perform a subgroup analysis comparing each group's performance on questions that included image interpretation versus text-only questions.

METHODS: The ResStudy orthopaedic examination question bank was used as the primary source of questions. One hundred eighty questions and answer choices from nine orthopaedic subspecialties were input directly into ChatGPT-3.5 and then GPT-4. ChatGPT did not have consistently available image interpretation, so no images were provided to either model. Each chatbot answer was recorded as correct or incorrect, and resident performance was recorded from user data provided by ResStudy.

RESULTS: Overall, ChatGPT-3.5, GPT-4, and ortho residents scored 29.4%, 47.2%, and 74.2%, respectively. Testing success differed among the three groups, with ortho residents scoring higher than both ChatGPT-3.5 and GPT-4 (both P < 0.001). GPT-4 scored higher than ChatGPT-3.5 (P = 0.002). A subgroup analysis divided questions into stems without images and stems with images. Both chatbots were more accurate on text-only questions than on questions with images: ChatGPT-3.5 scored 37.8% versus 22.4% (OR = 2.1, P = 0.033) and GPT-4 scored 61.0% versus 35.7% (OR = 2.8, P < 0.001). Residents scored 72.6% on text-only questions versus 75.5% on questions with images, with no significant difference (P = 0.302).

CONCLUSION: Orthopaedic residents answered more questions correctly than ChatGPT-3.5 and GPT-4 on orthopaedic assessment examinations. GPT-4 is superior to ChatGPT-3.5 for answering orthopaedic resident assessment examination questions. Both ChatGPT-3.5 and GPT-4 performed better on text-only questions than on questions with images. It is unlikely that GPT-4 or ChatGPT-3.5 would pass the American Board of Orthopaedic Surgery written examination.
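The subgroup results above are 2x2 contingency-table comparisons (correct versus incorrect by question type). As a rough illustration only, the Python sketch below reproduces the ChatGPT-3.5 comparison. The group sizes (82 text-only and 98 image questions) are assumptions back-calculated from the reported percentages and odds ratio, not figures stated in the abstract, and Fisher's exact test is used here only as one reasonable choice, since the abstract does not name the statistical test the authors applied.

from scipy.stats import fisher_exact

# Hypothetical 2x2 table for ChatGPT-3.5: rows are question type, columns are
# (correct, incorrect). The totals of 82 text-only and 98 image questions are
# back-calculated from the reported percentages and are NOT stated in the abstract.
text_only = [31, 82 - 31]    # 31/82 correct ≈ 37.8%
with_image = [22, 98 - 22]   # 22/98 correct ≈ 22.4%
table = [text_only, with_image]

# Odds of a correct answer on text-only versus image-based questions.
odds_ratio = (text_only[0] * with_image[1]) / (text_only[1] * with_image[0])

# Fisher's exact test is an assumption; the abstract does not specify the method.
_, p_value = fisher_exact(table, alternative="two-sided")

print(f"OR = {odds_ratio:.1f}, P = {p_value:.3f}")   # OR should come out ≈ 2.1

The same check applies to the GPT-4 comparison with assumed counts of 50/82 and 35/98, which reproduce the reported 61.0%, 35.7%, and OR of roughly 2.8.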

Bibliographic Details
Main authors: Massey, Patrick A.; Montgomery, Carver; Zhang, Andrew S
Format: Online Article Text
Language: English
Published: Lippincott Williams & Wilkins, 2023
Subjects: Reviews
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10627532/
https://www.ncbi.nlm.nih.gov/pubmed/37671415
http://dx.doi.org/10.5435/JAAOS-D-23-00396
Published in: J Am Acad Orthop Surg (Reviews), Lippincott Williams & Wilkins; issue date 2023-12-01, published online 2023-09-04.
Copyright © 2023 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of the American Academy of Orthopaedic Surgeons. This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND, https://creativecommons.org/licenses/by-nc-nd/4.0/): the work may be downloaded and shared provided it is properly cited, but it cannot be changed in any way or used commercially without permission from the journal.