Comparing the Efficacy of Large Language Models ChatGPT, BARD, and Bing AI in Providing Information on Rhinoplasty: An Observational Study

BACKGROUND: Large language models (LLMs) are emerging artificial intelligence (AI) technologies refining research and healthcare. However, the impact of these models on presurgical planning and education remains under-explored. OBJECTIVES: This study aims to assess 3 prominent LLMs—Google's AI...

Bibliographic Details
Main authors: Seth, Ishith; Lim, Bryan; Xie, Yi; Cevik, Jevan; Rozen, Warren M; Ross, Richard J; Lee, Mathew
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547367/
https://www.ncbi.nlm.nih.gov/pubmed/37795257
http://dx.doi.org/10.1093/asjof/ojad084
author Seth, Ishith
Lim, Bryan
Xie, Yi
Cevik, Jevan
Rozen, Warren M
Ross, Richard J
Lee, Mathew
collection PubMed
description BACKGROUND: Large language models (LLMs) are emerging artificial intelligence (AI) technologies refining research and healthcare. However, the impact of these models on presurgical planning and education remains under-explored. OBJECTIVES: This study aims to assess 3 prominent LLMs—Google's AI BARD (Mountain View, CA), Bing AI (Microsoft, Redmond, WA), and ChatGPT-3.5 (Open AI, San Francisco, CA) in providing safe medical information for rhinoplasty. METHODS: Six questions regarding rhinoplasty were prompted to ChatGPT, BARD, and Bing AI. A Likert scale was used to evaluate these responses by a panel of Specialist Plastic and Reconstructive Surgeons with extensive experience in rhinoplasty. To measure reliability, the Flesch Reading Ease Score, the Flesch–Kincaid Grade Level, and the Coleman–Liau Index were used. The modified DISCERN score was chosen as the criterion for assessing suitability and reliability. A t test was performed to calculate the difference between the LLMs, and a double-sided P-value <.05 was considered statistically significant. RESULTS: In terms of reliability, BARD and ChatGPT demonstrated a significantly (P < .05) greater Flesch Reading Ease Score of 47.47 (±15.32) and 37.68 (±12.96), Flesch–Kincaid Grade Level of 9.7 (±3.12) and 10.15 (±1.84), and a Coleman–Liau Index of 10.83 (±2.14) and 12.17 (±1.17) than Bing AI. In terms of suitability, BARD (46.3 ± 2.8) demonstrated a significantly greater DISCERN score than ChatGPT and Bing AI. In terms of Likert score, ChatGPT and BARD demonstrated similar scores and were greater than Bing AI. CONCLUSIONS: BARD delivered the most succinct and comprehensible information, followed by ChatGPT and Bing AI. Although these models demonstrate potential, challenges regarding their depth and specificity remain. Therefore, future research should aim to augment LLM performance through the integration of specialized databases and expert knowledge, while also refining their algorithms. 
LEVEL OF EVIDENCE: 5
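The three readability indices named in the abstract follow standard published formulas, which can be sketched as below. This is a minimal illustration, not the authors' method: the record does not say which tool the study used, the function names are hypothetical, and the counts in the usage example are made up rather than taken from the study. The functions take precomputed counts because syllable counting itself depends on the tool chosen.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index: grade level from letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


if __name__ == "__main__":
    # Hypothetical counts for a 100-word response (not data from the study).
    words, sentences, syllables, letters = 100, 5, 150, 450
    print(round(flesch_reading_ease(words, sentences, syllables), 3))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print(round(coleman_liau_index(letters, words, sentences), 2))
```

A higher Flesch Reading Ease Score indicates easier text, whereas higher Flesch–Kincaid and Coleman–Liau values indicate that more years of schooling are needed to follow it.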
format Online
Article
Text
id pubmed-10547367
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10547367 2023-10-04 Aesthet Surg J Open Forum Original Article Oxford University Press 2023-09-14 /pmc/articles/PMC10547367/ /pubmed/37795257 http://dx.doi.org/10.1093/asjof/ojad084 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of The Aesthetic Society. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Comparing the Efficacy of Large Language Models ChatGPT, BARD, and Bing AI in Providing Information on Rhinoplasty: An Observational Study
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547367/
https://www.ncbi.nlm.nih.gov/pubmed/37795257
http://dx.doi.org/10.1093/asjof/ojad084