On the limitations of large language models in clinical diagnosis

Bibliographic Details
Main Authors: Reese, Justin T, Danis, Daniel, Caufield, J Harry, Casiraghi, Elena, Valentini, Giorgio, Mungall, Christopher J, Robinson, Peter N
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10370243/
https://www.ncbi.nlm.nih.gov/pubmed/37503093
http://dx.doi.org/10.1101/2023.07.13.23292613
_version_ 1785077909155741696
author Reese, Justin T
Danis, Daniel
Caufield, J Harry
Casiraghi, Elena
Valentini, Giorgio
Mungall, Christopher J
Robinson, Peter N
author_facet Reese, Justin T
Danis, Daniel
Caufield, J Harry
Casiraghi, Elena
Valentini, Giorgio
Mungall, Christopher J
Robinson, Peter N
author_sort Reese, Justin T
collection PubMed
description BACKGROUND: The potential of large language models (LLMs) such as GPT to support complex tasks such as differential diagnosis has been a subject of debate, with some ascribing near-sentient abilities to the models and others claiming that LLMs merely perform “autocomplete on steroids”. A recent study reported that the Generative Pretrained Transformer 4 (GPT-4) model performed well in complex differential diagnostic reasoning. The authors assessed the performance of GPT-4 in identifying the correct diagnosis in a series of case records from the New England Journal of Medicine. The authors constructed prompts based on the clinical presentation section of the case reports and compared the results of GPT-4 to the actual diagnosis. GPT-4 returned the correct diagnosis as part of its response in 64% of cases, with the correct diagnosis at rank 1 in 39% of cases. However, such concise but comprehensive narratives of the clinical course are not typically available in electronic health records (EHRs). Further, even if they were available, EHRs contain identifying information whose transmission is prohibited by Health Insurance Portability and Accountability Act (HIPAA) regulations. METHODS: To assess the expected performance of GPT on comparable datasets that can be generated by text mining and that by design cannot contain identifiable information, we parsed the texts of the case reports and extracted Human Phenotype Ontology (HPO) terms, from which prompts for GPT were constructed that contain largely the same clinical abnormalities but lack the surrounding narrative. RESULTS: While the performance of GPT-4 on the original narrative-based text was good, with the final diagnosis included in its differential in 29/75 cases (38.7%; rank 1 in 17.3% of cases; mean rank of 3.4), the performance of GPT-4 on the feature-based approach, which includes the major clinical abnormalities without the additional narrative text, was substantially worse, with GPT-4 including the final diagnosis in its differential in 8/75 cases (10.7%; rank 1 in 4.0% of cases; mean rank of 3.9). INTERPRETATION: We consider the feature-based queries to be a more appropriate test of the performance of GPT-4 in diagnostic tasks, since it is unlikely that the narrative approach can be used in actual clinical practice. Future research and algorithmic development are needed to determine the optimal approach to leveraging LLMs for clinical diagnosis.
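As a concrete illustration of the METHODS paragraph above, the following Python sketch shows how a feature-based prompt might be assembled from extracted HPO terms. The term list and the prompt wording are hypothetical stand-ins: the study derived its HPO terms from the case-report texts by text mining, and its exact prompt text is not reproduced in this record.

```python
# Minimal sketch of feature-based prompt construction, as described in METHODS.
# The HPO terms below are hypothetical examples; in the study they were
# extracted from NEJM case-report texts by text mining. The prompt wording is
# likewise an assumption, not the study's actual instruction text.

hpo_terms = [
    ("HP:0001945", "Fever"),
    ("HP:0012735", "Cough"),
    ("HP:0002094", "Dyspnea"),
    ("HP:0002098", "Respiratory distress"),
]

def build_feature_prompt(terms: list[tuple[str, str]]) -> str:
    """Assemble a prompt listing clinical abnormalities without any narrative."""
    findings = "\n".join(f"- {label}" for _, label in terms)
    return (
        "Provide a ranked differential diagnosis for a patient with the "
        "following clinical findings:\n" + findings
    )

print(build_feature_prompt(hpo_terms))
```

The contrast tested in RESULTS is between such bare feature lists and the original narrative text of the clinical presentation sections, which conveys the same findings embedded in contextual detail.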
format Online
Article
Text
id pubmed-10370243
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10370243 2023-07-27 On the limitations of large language models in clinical diagnosis Reese, Justin T Danis, Daniel Caufield, J Harry Casiraghi, Elena Valentini, Giorgio Mungall, Christopher J Robinson, Peter N medRxiv Article BACKGROUND: The potential of large language models (LLMs) such as GPT to support complex tasks such as differential diagnosis has been a subject of debate, with some ascribing near-sentient abilities to the models and others claiming that LLMs merely perform “autocomplete on steroids”. A recent study reported that the Generative Pretrained Transformer 4 (GPT-4) model performed well in complex differential diagnostic reasoning. The authors assessed the performance of GPT-4 in identifying the correct diagnosis in a series of case records from the New England Journal of Medicine. The authors constructed prompts based on the clinical presentation section of the case reports and compared the results of GPT-4 to the actual diagnosis. GPT-4 returned the correct diagnosis as part of its response in 64% of cases, with the correct diagnosis at rank 1 in 39% of cases. However, such concise but comprehensive narratives of the clinical course are not typically available in electronic health records (EHRs). Further, even if they were available, EHRs contain identifying information whose transmission is prohibited by Health Insurance Portability and Accountability Act (HIPAA) regulations. METHODS: To assess the expected performance of GPT on comparable datasets that can be generated by text mining and that by design cannot contain identifiable information, we parsed the texts of the case reports and extracted Human Phenotype Ontology (HPO) terms, from which prompts for GPT were constructed that contain largely the same clinical abnormalities but lack the surrounding narrative. RESULTS: While the performance of GPT-4 on the original narrative-based text was good, with the final diagnosis included in its differential in 29/75 cases (38.7%; rank 1 in 17.3% of cases; mean rank of 3.4), the performance of GPT-4 on the feature-based approach, which includes the major clinical abnormalities without the additional narrative text, was substantially worse, with GPT-4 including the final diagnosis in its differential in 8/75 cases (10.7%; rank 1 in 4.0% of cases; mean rank of 3.9). INTERPRETATION: We consider the feature-based queries to be a more appropriate test of the performance of GPT-4 in diagnostic tasks, since it is unlikely that the narrative approach can be used in actual clinical practice. Future research and algorithmic development are needed to determine the optimal approach to leveraging LLMs for clinical diagnosis. Cold Spring Harbor Laboratory 2023-07-14 /pmc/articles/PMC10370243/ /pubmed/37503093 http://dx.doi.org/10.1101/2023.07.13.23292613 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
spellingShingle Article
Reese, Justin T
Danis, Daniel
Caufield, J Harry
Casiraghi, Elena
Valentini, Giorgio
Mungall, Christopher J
Robinson, Peter N
On the limitations of large language models in clinical diagnosis
title On the limitations of large language models in clinical diagnosis
title_full On the limitations of large language models in clinical diagnosis
title_fullStr On the limitations of large language models in clinical diagnosis
title_full_unstemmed On the limitations of large language models in clinical diagnosis
title_short On the limitations of large language models in clinical diagnosis
title_sort on the limitations of large language models in clinical diagnosis
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10370243/
https://www.ncbi.nlm.nih.gov/pubmed/37503093
http://dx.doi.org/10.1101/2023.07.13.23292613
work_keys_str_mv AT reesejustint onthelimitationsoflargelanguagemodelsinclinicaldiagnosis
AT danisdaniel onthelimitationsoflargelanguagemodelsinclinicaldiagnosis
AT caufieldjharry onthelimitationsoflargelanguagemodelsinclinicaldiagnosis
AT casiraghielena onthelimitationsoflargelanguagemodelsinclinicaldiagnosis
AT valentinigiorgio onthelimitationsoflargelanguagemodelsinclinicaldiagnosis
AT mungallchristopherj onthelimitationsoflargelanguagemodelsinclinicaldiagnosis
AT robinsonpetern onthelimitationsoflargelanguagemodelsinclinicaldiagnosis