Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries

Bibliographic Details
Main Authors: Pushpanathan, Krithi; Lim, Zhi Wei; Er Yew, Samantha Min; Chen, David Ziyou; Hui'En Lin, Hazel Anne; Lin Goh, Jocelyn Hui; Wong, Wendy Meihua; Wang, Xiaofei; Jin Tan, Marcus Chun; Chang Koh, Victor Teck; Tham, Yih-Chung
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10616302/
https://www.ncbi.nlm.nih.gov/pubmed/37915603
http://dx.doi.org/10.1016/j.isci.2023.108163
Description
Summary: In light of growing interest in using emerging large language models (LLMs) for self-diagnosis, we systematically assessed the performance of ChatGPT-3.5, ChatGPT-4.0, and Google Bard in delivering proficient responses to 37 common inquiries regarding ocular symptoms. Responses were masked, randomly shuffled, and then graded by three consultant-level ophthalmologists for accuracy (poor, borderline, good) and comprehensiveness. Additionally, we evaluated the self-awareness capabilities (ability to self-check and self-correct) of the LLM-Chatbots. 89.2% of ChatGPT-4.0 responses were ‘good’-rated, outperforming ChatGPT-3.5 (59.5%) and Google Bard (40.5%) significantly (all p < 0.001). All three LLM-Chatbots showed optimal mean comprehensiveness scores as well (ranging from 4.6 to 4.7 out of 5). However, they exhibited subpar to moderate self-awareness capabilities. Our study underscores the potential of ChatGPT-4.0 in delivering accurate and comprehensive responses to ocular symptom inquiries. Future rigorous validation of their performance is crucial to ensure their reliability and appropriateness for actual clinical use.