
Evaluating Large Language Models on Medical Evidence Summarization

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study has demonstrated that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.


Bibliographic Details
Main Authors: Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G; Soroush, Ali; Elias, Pierre A.; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin; Weng, Chunhua; Peng, Yifan
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory, 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10168498/
https://www.ncbi.nlm.nih.gov/pubmed/37162998
http://dx.doi.org/10.1101/2023.04.22.23288967
Collection: PubMed (PMC10168498)
Record Format: MEDLINE/PubMed
Publication Date: 2023-04-24
License: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.