Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings

Bibliographic Details
Main Authors: Antaki, Fares, Touma, Samir, Milad, Daniel, El-Khoury, Jonathan, Duval, Renaud
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10272508/
https://www.ncbi.nlm.nih.gov/pubmed/37334036
http://dx.doi.org/10.1016/j.xops.2023.100324
_version_ 1785059510433349632
author Antaki, Fares
Touma, Samir
Milad, Daniel
El-Khoury, Jonathan
Duval, Renaud
author_facet Antaki, Fares
Touma, Samir
Milad, Daniel
El-Khoury, Jonathan
Duval, Renaud
author_sort Antaki, Fares
collection PubMed
description PURPOSE: Foundation models are a novel class of artificial intelligence algorithms in which models are pretrained at scale on unannotated data and then fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.
DESIGN: Evaluation of diagnostic test or technology.
PARTICIPANTS: ChatGPT is a publicly available LLM.
METHODS: We tested 2 versions of ChatGPT (January 9 “legacy” and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey’s test to determine whether there were meaningful differences between the tested subspecialties.
MAIN OUTCOME MEASURES: We reported the accuracy of ChatGPT for each examination section in percentage correct by comparing ChatGPT’s outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.
RESULTS: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT’s answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.
CONCLUSION: ChatGPT has encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.
FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found after the references.
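The methods above map onto a standard statistical workflow: score each answer against the question bank's key, compute percentage correct per examination section, fit a logistic regression of answer accuracy on section, cognitive level, and difficulty with likelihood-ratio (LR) chi-square tests, then run Tukey's post hoc comparisons between sections. The following is a minimal Python sketch of that workflow, not the authors' code; the synthetic data and column names (section, cognitive_level, difficulty, correct) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the analysis described in the abstract, assuming a
# per-question results table. All data here are synthetic placeholders;
# column names are hypothetical, not taken from the paper.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(seed=42)
n = 260  # one simulated 260-question OKAP-style exam

sections = ["general medicine", "retina", "neuro-ophthalmology",
            "ocular pathology", "glaucoma"]
df = pd.DataFrame({
    "section": rng.choice(sections, size=n),
    "cognitive_level": rng.choice(["recall", "higher-order"], size=n),
    "difficulty": rng.uniform(0.2, 0.9, size=n),  # question-bank difficulty index
})
# Outcome: 1 if the model's answer matched the answer key, else 0
# (synthetic here; in the study this comes from grading against the key).
df["correct"] = rng.binomial(1, 0.25 + 0.4 * df["difficulty"])

# Accuracy per examination section, in percent correct.
print(df.groupby("section")["correct"].mean().mul(100).round(1))

# Logistic regression of answer accuracy on section, cognitive level,
# and difficulty. The LR chi-square for each predictor compares the
# full model against a reduced model with that predictor dropped.
full = smf.logit(
    "correct ~ C(section) + C(cognitive_level) + difficulty", data=df
).fit(disp=0)
reduced_formulas = {
    "section": "correct ~ C(cognitive_level) + difficulty",
    "cognitive_level": "correct ~ C(section) + difficulty",
    "difficulty": "correct ~ C(section) + C(cognitive_level)",
}
for term, formula in reduced_formulas.items():
    reduced = smf.logit(formula, data=df).fit(disp=0)
    lr = 2 * (full.llf - reduced.llf)        # likelihood-ratio statistic
    ddof = full.df_model - reduced.df_model  # difference in model df
    p = stats.chi2.sf(lr, ddof)
    print(f"{term}: LR = {lr:.2f}, df = {ddof:.0f}, p = {p:.3f}")

# Tukey's HSD post hoc test for pairwise differences between sections.
print(pairwise_tukeyhsd(df["correct"], df["section"], alpha=0.05))
```

In the study itself, the outcome column would record whether ChatGPT's answer matched the bank's answer key for each of the 260 questions; the synthetic outcome stands in so the sketch runs end to end.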
format Online
Article
Text
id pubmed-10272508
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-10272508 2023-06-17 Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings Antaki, Fares Touma, Samir Milad, Daniel El-Khoury, Jonathan Duval, Renaud Ophthalmol Sci Original Article PURPOSE: Foundation models are a novel class of artificial intelligence algorithms in which models are pretrained at scale on unannotated data and then fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space. DESIGN: Evaluation of diagnostic test or technology. PARTICIPANTS: ChatGPT is a publicly available LLM. METHODS: We tested 2 versions of ChatGPT (January 9 “legacy” and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey’s test to determine whether there were meaningful differences between the tested subspecialties. MAIN OUTCOME MEASURES: We reported the accuracy of ChatGPT for each examination section in percentage correct by comparing ChatGPT’s outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05. RESULTS: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT’s answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections. CONCLUSION: ChatGPT has encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties. FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found after the references. Elsevier 2023-05-05 /pmc/articles/PMC10272508/ /pubmed/37334036 http://dx.doi.org/10.1016/j.xops.2023.100324 Text en © 2023 by the American Academy of Ophthalmology. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Original Article
Antaki, Fares
Touma, Samir
Milad, Daniel
El-Khoury, Jonathan
Duval, Renaud
Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings
title Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings
title_full Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings
title_fullStr Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings
title_full_unstemmed Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings
title_short Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings
title_sort evaluating the performance of chatgpt in ophthalmology: an analysis of its successes and shortcomings
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10272508/
https://www.ncbi.nlm.nih.gov/pubmed/37334036
http://dx.doi.org/10.1016/j.xops.2023.100324
work_keys_str_mv AT antakifares evaluatingtheperformanceofchatgptinophthalmologyananalysisofitssuccessesandshortcomings
AT toumasamir evaluatingtheperformanceofchatgptinophthalmologyananalysisofitssuccessesandshortcomings
AT miladdaniel evaluatingtheperformanceofchatgptinophthalmologyananalysisofitssuccessesandshortcomings
AT elkhouryjonathan evaluatingtheperformanceofchatgptinophthalmologyananalysisofitssuccessesandshortcomings
AT duvalrenaud evaluatingtheperformanceofchatgptinophthalmologyananalysisofitssuccessesandshortcomings