
Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study

BACKGROUND: As advancements in artificial intelligence (AI) continue, large language models (LLMs) have emerged as promising tools for generating medical information. Their rapid adaptation and potential benefits in health care require rigorous assessment in terms of the quality, accuracy, and safet...


Bibliographic Details
Main Authors:	Wilhelm, Theresa Isabelle; Roos, Jonas; Kaczmarczyk, Robert
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2023
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10644179/ https://www.ncbi.nlm.nih.gov/pubmed/37902826 http://dx.doi.org/10.2196/49324

_version_	1785134497757396992
author	Wilhelm, Theresa Isabelle; Roos, Jonas; Kaczmarczyk, Robert
author_sort	Wilhelm, Theresa Isabelle
collection	PubMed
description	BACKGROUND: As advancements in artificial intelligence (AI) continue, large language models (LLMs) have emerged as promising tools for generating medical information. Their rapid adaptation and potential benefits in health care require rigorous assessment in terms of the quality, accuracy, and safety of the generated information across diverse medical specialties. OBJECTIVE: This study aimed to evaluate the performance of 4 prominent LLMs, namely, Claude-instant-v1.0, GPT-3.5-Turbo, Command-xlarge-nightly, and Bloomz, in generating medical content spanning the clinical specialties of ophthalmology, orthopedics, and dermatology. METHODS: Three domain-specific physicians evaluated the AI-generated therapeutic recommendations for a diverse set of 60 diseases. The evaluation criteria involved the mDISCERN score, correctness, and potential harmfulness of the recommendations. ANOVA and pairwise t tests were used to explore discrepancies in content quality and safety across models and specialties. Additionally, using the capabilities of OpenAI’s most advanced model, GPT-4, an automated evaluation of each model’s responses to the diseases was performed using the same criteria and compared to the physicians’ assessments through Pearson correlation analysis. RESULTS: Claude-instant-v1.0 emerged with the highest mean mDISCERN score (3.35, 95% CI 3.23-3.46). In contrast, Bloomz lagged with the lowest score (1.07, 95% CI 1.03-1.10). Our analysis revealed significant differences among the models in terms of quality (P<.001). Evaluating their reliability, the models displayed strong contrasts in their falseness ratings, with variations both across models (P<.001) and specialties (P<.001). Distinct error patterns emerged, such as confusing diagnoses; providing vague, ambiguous advice; or omitting critical treatments, such as antibiotics for infectious diseases. Regarding potential harm, GPT-3.5-Turbo was found to be the safest, with the lowest harmfulness rating. 
All models lagged in detailing the risks associated with treatment procedures, explaining the effects of therapies on quality of life, and offering additional sources of information. Pearson correlation analysis underscored a substantial alignment between physician assessments and GPT-4’s evaluations across all established criteria (P<.01). CONCLUSIONS: This study, while comprehensive, was limited by the involvement of a select number of specialties and physician evaluators. The straightforward prompting strategy (“How to treat…”) and the assessment benchmarks, initially conceptualized for human-authored content, might have potential gaps in capturing the nuances of AI-driven information. The LLMs evaluated showed a notable capability in generating valuable medical content; however, evident lapses in content quality and potential harm signal the need for further refinements. Given the dynamic landscape of LLMs, this study’s findings emphasize the need for regular and methodical assessments, oversight, and fine-tuning of these AI tools to ensure they produce consistently trustworthy and clinically safe medical advice. Notably, the introduction of an auto-evaluation mechanism using GPT-4, as detailed in this study, provides a scalable, transferable method for domain-agnostic evaluations, extending beyond therapy recommendation assessments.
format	Online Article Text
id	pubmed-10644179
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-10644179 2023-10-30 Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study Wilhelm, Theresa Isabelle; Roos, Jonas; Kaczmarczyk, Robert J Med Internet Res Original Paper JMIR Publications 2023-10-30 /pmc/articles/PMC10644179/ /pubmed/37902826 http://dx.doi.org/10.2196/49324 Text en ©Theresa Isabelle Wilhelm, Jonas Roos, Robert Kaczmarczyk. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 30.10.2023. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
title	Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study
topic	Original Paper
