Evaluating Large Language Models on Medical Evidence Summarization
Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains.
Main Authors: | Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G; Soroush, Ali; Elias, Pierre A.; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin; Weng, Chunhua; Peng, Yifan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory, 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10168498/ https://www.ncbi.nlm.nih.gov/pubmed/37162998 http://dx.doi.org/10.1101/2023.04.22.23288967 |
_version_ | 1785038866277728256 |
author | Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G; Soroush, Ali; Elias, Pierre A.; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin; Weng, Chunhua; Peng, Yifan |
author_facet | Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G; Soroush, Ali; Elias, Pierre A.; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin; Weng, Chunhua; Peng, Yifan |
author_sort | Tang, Liyan |
collection | PubMed |
description | Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study has demonstrated that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts. |
format | Online Article Text |
id | pubmed-10168498 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-101684982023-05-10 Evaluating Large Language Models on Medical Evidence Summarization Tang, Liyan Sun, Zhaoyi Idnay, Betina Nestor, Jordan G Soroush, Ali Elias, Pierre A. Xu, Ziyang Ding, Ying Durrett, Greg Rousseau, Justin Weng, Chunhua Peng, Yifan medRxiv Article Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study has demonstrated that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts. Cold Spring Harbor Laboratory 2023-04-24 /pmc/articles/PMC10168498/ /pubmed/37162998 http://dx.doi.org/10.1101/2023.04.22.23288967 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/) , which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article Tang, Liyan Sun, Zhaoyi Idnay, Betina Nestor, Jordan G Soroush, Ali Elias, Pierre A. Xu, Ziyang Ding, Ying Durrett, Greg Rousseau, Justin Weng, Chunhua Peng, Yifan Evaluating Large Language Models on Medical Evidence Summarization |
title | Evaluating Large Language Models on Medical Evidence Summarization |
title_full | Evaluating Large Language Models on Medical Evidence Summarization |
title_fullStr | Evaluating Large Language Models on Medical Evidence Summarization |
title_full_unstemmed | Evaluating Large Language Models on Medical Evidence Summarization |
title_short | Evaluating Large Language Models on Medical Evidence Summarization |
title_sort | evaluating large language models on medical evidence summarization |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10168498/ https://www.ncbi.nlm.nih.gov/pubmed/37162998 http://dx.doi.org/10.1101/2023.04.22.23288967 |
work_keys_str_mv | AT tangliyan evaluatinglargelanguagemodelsonmedicalevidencesummarization AT sunzhaoyi evaluatinglargelanguagemodelsonmedicalevidencesummarization AT idnaybetina evaluatinglargelanguagemodelsonmedicalevidencesummarization AT nestorjordang evaluatinglargelanguagemodelsonmedicalevidencesummarization AT soroushali evaluatinglargelanguagemodelsonmedicalevidencesummarization AT eliaspierrea evaluatinglargelanguagemodelsonmedicalevidencesummarization AT xuziyang evaluatinglargelanguagemodelsonmedicalevidencesummarization AT dingying evaluatinglargelanguagemodelsonmedicalevidencesummarization AT durrettgreg evaluatinglargelanguagemodelsonmedicalevidencesummarization AT rousseaujustin evaluatinglargelanguagemodelsonmedicalevidencesummarization AT wengchunhua evaluatinglargelanguagemodelsonmedicalevidencesummarization AT pengyifan evaluatinglargelanguagemodelsonmedicalevidencesummarization |