Evaluating large language models on a highly-specialized topic, radiation oncology physics

Bibliographic Details
Main Authors: Holmes, Jason, Liu, Zhengliang, Zhang, Lian, Ding, Yuzhen, Sio, Terence T., McGee, Lisa A., Ashman, Jonathan B., Li, Xiang, Liu, Tianming, Shen, Jiajian, Liu, Wei
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Oncology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388568/
https://www.ncbi.nlm.nih.gov/pubmed/37529688
http://dx.doi.org/10.3389/fonc.2023.1219326
_version_ 1785082148934385664
author Holmes, Jason
Liu, Zhengliang
Zhang, Lian
Ding, Yuzhen
Sio, Terence T.
McGee, Lisa A.
Ashman, Jonathan B.
Li, Xiang
Liu, Tianming
Shen, Jiajian
Liu, Wei
author_facet Holmes, Jason
Liu, Zhengliang
Zhang, Lian
Ding, Yuzhen
Sio, Terence T.
McGee, Lisa A.
Ashman, Jonathan B.
Li, Xiang
Liu, Tianming
Shen, Jiajian
Liu, Wei
author_sort Holmes, Jason
collection PubMed
description PURPOSE: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, the LSAT, and the GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to the scientific and medical communities in addition to being a valuable benchmark of LLMs. METHODS: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by asking it to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with “None of the above choices is the correct answer.”). A majority vote analysis was used to approximate how well each group could score when working together. RESULTS: ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. Both ChatGPT models (GPT-3.5 and GPT-4) showed a high level of consistency in their answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring was based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote. CONCLUSION: This study suggests great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
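The majority vote analysis mentioned in the abstract can be illustrated with a minimal Python sketch: for each question, the most common answer across trials (or across members of a group) is taken as the consensus and scored against the key. The function name and example data below are hypothetical illustrations under that assumption, not the authors' actual code.

```python
from collections import Counter

def majority_vote_score(trial_answers, answer_key):
    """Approximate a group's collective score by majority vote.

    trial_answers: one answer list per trial or test-taker,
                   e.g. [["A", "C", ...], ["A", "B", ...], ...]
    answer_key:    the correct choice for each question.
    """
    correct = 0
    for q, key_choice in enumerate(answer_key):
        votes = Counter(answers[q] for answers in trial_answers)
        consensus, _ = votes.most_common(1)[0]  # ties broken arbitrarily
        if consensus == key_choice:
            correct += 1
    return correct / len(answer_key)

# Hypothetical example: three trials over a five-question exam.
key = ["A", "B", "C", "D", "A"]
trials = [
    ["A", "B", "C", "D", "B"],
    ["A", "C", "C", "D", "A"],
    ["A", "B", "B", "D", "A"],
]
print(majority_vote_score(trials, key))  # 1.0 - consensus recovers the key
```

This also makes the abstract's contrast concrete: a model that answers identically across trials gains nothing from majority voting, whereas human experts whose errors differ across individuals can score higher together than any one of them alone.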
format Online
Article
Text
id pubmed-10388568
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10388568 2023-08-01 Evaluating large language models on a highly-specialized topic, radiation oncology physics Holmes, Jason Liu, Zhengliang Zhang, Lian Ding, Yuzhen Sio, Terence T. McGee, Lisa A. Ashman, Jonathan B. Li, Xiang Liu, Tianming Shen, Jiajian Liu, Wei Front Oncol Oncology PURPOSE: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, the LSAT, and the GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to the scientific and medical communities in addition to being a valuable benchmark of LLMs. METHODS: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by asking it to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with “None of the above choices is the correct answer.”). A majority vote analysis was used to approximate how well each group could score when working together. RESULTS: ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. Both ChatGPT models (GPT-3.5 and GPT-4) showed a high level of consistency in their answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring was based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote. CONCLUSION: This study suggests great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants. Frontiers Media S.A. 2023-07-17 /pmc/articles/PMC10388568/ /pubmed/37529688 http://dx.doi.org/10.3389/fonc.2023.1219326 Text en Copyright © 2023 Holmes, Liu, Zhang, Ding, Sio, McGee, Ashman, Li, Liu, Shen and Liu https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Oncology
Holmes, Jason
Liu, Zhengliang
Zhang, Lian
Ding, Yuzhen
Sio, Terence T.
McGee, Lisa A.
Ashman, Jonathan B.
Li, Xiang
Liu, Tianming
Shen, Jiajian
Liu, Wei
Evaluating large language models on a highly-specialized topic, radiation oncology physics
title Evaluating large language models on a highly-specialized topic, radiation oncology physics
title_full Evaluating large language models on a highly-specialized topic, radiation oncology physics
title_fullStr Evaluating large language models on a highly-specialized topic, radiation oncology physics
title_full_unstemmed Evaluating large language models on a highly-specialized topic, radiation oncology physics
title_short Evaluating large language models on a highly-specialized topic, radiation oncology physics
title_sort evaluating large language models on a highly-specialized topic, radiation oncology physics
topic Oncology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388568/
https://www.ncbi.nlm.nih.gov/pubmed/37529688
http://dx.doi.org/10.3389/fonc.2023.1219326
work_keys_str_mv AT holmesjason evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT liuzhengliang evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT zhanglian evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT dingyuzhen evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT sioterencet evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT mcgeelisaa evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT ashmanjonathanb evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT lixiang evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT liutianming evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT shenjiajian evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics
AT liuwei evaluatinglargelanguagemodelsonahighlyspecializedtopicradiationoncologyphysics