
The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model

IMPORTANCE: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis a...


Bibliographic Details
Main authors: Levine, David M, Tuwani, Rudraksh, Kompa, Benjamin, Varma, Amita, Finlayson, Samuel G., Mehrotra, Ateev, Beam, Andrew
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9915829/
https://www.ncbi.nlm.nih.gov/pubmed/36778449
http://dx.doi.org/10.1101/2023.01.30.23285067
_version_ 1784885980996567040
author Levine, David M
Tuwani, Rudraksh
Kompa, Benjamin
Varma, Amita
Finlayson, Samuel G.
Mehrotra, Ateev
Beam, Andrew
collection PubMed
description IMPORTANCE: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown. OBJECTIVE: Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model’s diagnostic and triage performance to attending physicians and lay adults who use the Internet. DESIGN: We compared the accuracy of GPT-3’s diagnostic and triage ability for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions to lay people and practicing physicians. Finally, we examined how well calibrated GPT-3’s confidence was for diagnosis and triage. SETTING AND PARTICIPANTS: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians. EXPOSURE: Validated case vignettes (<60 words; <6th grade reading level). MAIN OUTCOMES AND MEASURES: Correct diagnosis, correct triage. RESULTS: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22). CONCLUSIONS AND RELEVANCE: A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, physicians and better than lay individuals. The model performed less well on triage, where its performance was closer to that of lay individuals.
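The abstract reports calibration as a Brier score without defining it. As a minimal sketch, assuming the standard binary form (the mean squared difference between the stated confidence in the top prediction and a 0/1 indicator of whether that prediction was correct), it could be computed as below. The function name and the four example values are hypothetical illustrations, not data or code from the study.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted confidence and the 0/1 outcome.

    0.0 is perfect; always answering 0.5 yields 0.25; lower is better.
    """
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical example (NOT study data): confidence in the top answer for
# four vignettes, and whether that top answer was correct (1) or not (0).
confidences = [0.9, 0.8, 0.6, 0.7]
outcomes = [1, 1, 0, 1]
score = brier_score(confidences, outcomes)
print(f"Brier score: {score:.3f}")
```

Read this way, the reported values of 0.18 (diagnosis) and 0.22 (triage) both sit below the 0.25 that a constant 50% confidence would earn.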
format Online
Article
Text
id pubmed-9915829
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-9915829 2023-02-11 medRxiv Article Cold Spring Harbor Laboratory 2023-02-01 /pmc/articles/PMC9915829/ /pubmed/36778449 http://dx.doi.org/10.1101/2023.01.30.23285067 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
title The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9915829/
https://www.ncbi.nlm.nih.gov/pubmed/36778449
http://dx.doi.org/10.1101/2023.01.30.23285067