Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard


Bibliographic Details
Main Authors: Lim, Zhi Wei, Pushpanathan, Krithi, Yew, Samantha Min Er, Lai, Yien, Sun, Chen-Hsin, Lam, Janice Sing Harn, Chen, David Ziyou, Goh, Jocelyn Hui Lin, Tan, Marcus Chun Jin, Sheng, Bin, Cheng, Ching-Yu, Koh, Victor Teck Chang, Tham, Yih-Chung
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470220/
https://www.ncbi.nlm.nih.gov/pubmed/37625267
http://dx.doi.org/10.1016/j.ebiom.2023.104770
collection PubMed
description BACKGROUND: Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs' accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a frequent topic on which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries. METHODS: We curated thirty-one commonly asked myopia care-related questions, categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority-consensus approach was used to determine the final rating for each response. 'Good'-rated responses were further evaluated for comprehensiveness on a five-point scale; conversely, 'poor'-rated responses were prompted for self-correction and then re-evaluated for accuracy. FINDINGS: ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated 'good', compared with 61.3% for ChatGPT-3.5 and 54.8% for Google Bard (Pearson's chi-squared test, all p ≤ 0.009). All three LLM chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum of 5). All LLM chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 of 3) of ChatGPT-4.0's, 40% (2 of 5) of ChatGPT-3.5's, and 60% (3 of 5) of Google Bard's responses improved after self-correction. The LLM chatbots performed consistently across domains, except for 'treatment and prevention'. Even in this domain, however, ChatGPT-4.0 still performed best, receiving 70% 'good' ratings, compared with 40% for ChatGPT-3.5 and 45% for Google Bard (Pearson's chi-squared test, all p ≤ 0.001). INTERPRETATION: Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continued strategies and evaluations to improve LLMs' accuracy remain crucial. FUNDING: Dr Yih-Chung Tham was supported by the National Medical Research Council of Singapore (NMRC/MOH/HCSAINV21nov-0001).
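The grading protocol described above (three independent raters, a majority-consensus final rating, and a chi-squared comparison of 'good'-rating proportions) can be sketched in a few lines. This is an illustrative reconstruction from the abstract, not the authors' analysis code: the tie-break for three-way disagreement and the reconstructed counts (80.6% of 31 questions → 25, 54.8% → 17) are assumptions derived from the reported percentages.

```python
from collections import Counter

def consensus(ratings):
    """Majority-consensus rating from three independent grader labels.

    Each label is 'poor', 'borderline', or 'good'. With three graders,
    a majority needs at least two in agreement; when all three differ,
    the abstract does not specify the tie-break, so 'borderline' is
    returned as a neutral placeholder (an assumption).
    """
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else "borderline"

def chi2_2x2(good_a, n_a, good_b, n_b):
    """Pearson's chi-squared statistic (1 df, no continuity correction)
    comparing the proportion of 'good' ratings between two chatbots."""
    a, b = good_a, n_a - good_a
    c, d = good_b, n_b - good_b
    n = n_a + n_b
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Counts reconstructed from the reported percentages of 31 questions:
# ChatGPT-4.0 80.6% -> 25/31 'good'; Google Bard 54.8% -> 17/31 'good'.
print(round(chi2_2x2(25, 31, 17, 31), 2))           # → 4.72 (statistic only)
print(consensus(["good", "good", "borderline"]))    # → good
```

Converting the statistic to a p-value requires a chi-squared survival function (e.g. `scipy.stats.chi2.sf`), and the paper's own pairwise comparisons may be set up differently, so the statistic here is illustrative only.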
id pubmed-10470220
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-10470220 2023-09-01 eBioMedicine (Articles). Elsevier, 2023-08-23. /pmc/articles/PMC10470220/ /pubmed/37625267 http://dx.doi.org/10.1016/j.ebiom.2023.104770 Text en © 2023 The Author(s). This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
topic Articles
work_keys_str_mv AT limzhiwei benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT pushpanathankrithi benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT yewsamanthaminer benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT laiyien benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT sunchenhsin benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT lamjanicesingharn benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT chendavidziyou benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT gohjocelynhuilin benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT tanmarcuschunjin benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT shengbin benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT chengchingyu benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT kohvictorteckchang benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard
AT thamyihchung benchmarkinglargelanguagemodelsperformancesformyopiacareacomparativeanalysisofchatgpt35chatgpt40andgooglebard