
Large language models encode clinical knowledge

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model(1) (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM(2), on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA(3), MedMCQA(4), PubMedQA(5) and Measuring Massive Multitask Language Understanding (MMLU) clinical topics(6)), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
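
The abstract describes instruction prompt tuning as a parameter-efficient way to align a frozen LLM to a new domain using a few exemplars: only a short sequence of learned "soft prompt" vectors is trained while the model's billions of weights stay fixed. The sketch below illustrates that general mechanism; it is a minimal toy reconstruction, not the authors' implementation, and every name in it (SoftPromptedLM, base_lm, d_model) is an assumption standing in for PaLM-scale components.

import torch
import torch.nn as nn

d_model, vocab = 64, 1000

# Toy stand-in for a pretrained LLM: embeddings in, next-token logits out.
# (Med-PaLM was built on PaLM, a 540-billion-parameter model; this is not it.)
base_lm = nn.Sequential(
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(d_model, vocab),
)

class SoftPromptedLM(nn.Module):
    """Freeze the base model; train only a short learned prompt prefix."""

    def __init__(self, base_lm: nn.Module, d_model: int, prompt_len: int = 8):
        super().__init__()
        self.base_lm = base_lm
        for p in self.base_lm.parameters():
            p.requires_grad = False  # the pretrained weights stay fixed
        # The only trainable parameters: prompt_len soft prompt vectors.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        batch = token_embeddings.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned prompt to every example, then run the frozen LM.
        return self.base_lm(torch.cat([prompt, token_embeddings], dim=1))

model = SoftPromptedLM(base_lm, d_model=d_model)
x = torch.randn(2, 16, d_model)   # stand-in for a batch of token embeddings
logits = model(x)                 # shape: (2, 8 + 16, vocab)
# Only 8 * 64 = 512 parameters require gradients here, versus the frozen base
# model; at PaLM scale the ratio is a few thousand versus 540 billion.

In the paper's actual setup the prompt parameters are learned from a small number of clinician-curated exemplars; the toy above only shows where the trainable parameters sit relative to the frozen model.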

Bibliographic Details
Main Authors: Singhal, Karan; Azizi, Shekoofeh; Tu, Tao; Mahdavi, S. Sara; Wei, Jason; Chung, Hyung Won; Scales, Nathan; Tanwani, Ajay; Cole-Lewis, Heather; Pfohl, Stephen; Payne, Perry; Seneviratne, Martin; Gamble, Paul; Kelly, Chris; Babiker, Abubakr; Schärli, Nathanael; Chowdhery, Aakanksha; Mansfield, Philip; Demner-Fushman, Dina; Agüera y Arcas, Blaise; Webster, Dale; Corrado, Greg S.; Matias, Yossi; Chou, Katherine; Gottweis, Juraj; Tomasev, Nenad; Liu, Yun; Rajkomar, Alvin; Barral, Joelle; Semturs, Christopher; Karthikesalingam, Alan; Natarajan, Vivek
Format: Online Article (Text)
Language: English
Published: Nature Publishing Group UK, 12 July 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10396962/
https://www.ncbi.nlm.nih.gov/pubmed/37438534
http://dx.doi.org/10.1038/s41586-023-06291-2
Collection: PubMed (record id pubmed-10396962), National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Rights: © The Author(s) 2023, corrected publication 2023. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, provided appropriate credit is given to the original author(s) and the source, a link to the licence is provided and any changes are indicated. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.