The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model
IMPORTANCE: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis a...
Main authors: | Levine, David M; Tuwani, Rudraksh; Kompa, Benjamin; Varma, Amita; Finlayson, Samuel G.; Mehrotra, Ateev; Beam, Andrew |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9915829/ https://www.ncbi.nlm.nih.gov/pubmed/36778449 http://dx.doi.org/10.1101/2023.01.30.23285067 |
_version_ | 1784885980996567040 |
---|---|
author | Levine, David M Tuwani, Rudraksh Kompa, Benjamin Varma, Amita Finlayson, Samuel G. Mehrotra, Ateev Beam, Andrew |
author_facet | Levine, David M Tuwani, Rudraksh Kompa, Benjamin Varma, Amita Finlayson, Samuel G. Mehrotra, Ateev Beam, Andrew |
author_sort | Levine, David M |
collection | PubMed |
description | IMPORTANCE: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown. OBJECTIVE: Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model’s diagnostic and triage performance to attending physicians and lay adults who use the Internet. DESIGN: We compared the accuracy of GPT-3’s diagnostic and triage ability for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions to lay people and practicing physicians. Finally, we examined how well calibrated GPT-3’s confidence was for diagnosis and triage. SETTING AND PARTICIPANTS: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians. EXPOSURE: Validated case vignettes (<60 words; <6th grade reading level). MAIN OUTCOMES AND MEASURES: Correct diagnosis, correct triage. RESULTS: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22). CONCLUSIONS AND RELEVANCE: A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, physicians and better than lay individuals. The model performed less well on triage, where its performance was closer to that of lay individuals. |
format | Online Article Text |
id | pubmed-9915829 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-99158292023-02-11 The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model Levine, David M Tuwani, Rudraksh Kompa, Benjamin Varma, Amita Finlayson, Samuel G. Mehrotra, Ateev Beam, Andrew medRxiv Article IMPORTANCE: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown. OBJECTIVE: Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model’s diagnostic and triage performance to attending physicians and lay adults who use the Internet. DESIGN: We compared the accuracy of GPT-3’s diagnostic and triage ability for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions to lay people and practicing physicians. Finally, we examined how well calibrated GPT-3’s confidence was for diagnosis and triage. SETTING AND PARTICIPANTS: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians. EXPOSURE: Validated case vignettes (<60 words; <6th grade reading level). MAIN OUTCOMES AND MEASURES: Correct diagnosis, correct triage. RESULTS: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22). CONCLUSIONS AND RELEVANCE: A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, physicians and better than lay individuals. The model performed less well on triage, where its performance was closer to that of lay individuals. Cold Spring Harbor Laboratory 2023-02-01 /pmc/articles/PMC9915829/ /pubmed/36778449 http://dx.doi.org/10.1101/2023.01.30.23285067 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article Levine, David M Tuwani, Rudraksh Kompa, Benjamin Varma, Amita Finlayson, Samuel G. Mehrotra, Ateev Beam, Andrew The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model |
title | The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model |
title_full | The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model |
title_fullStr | The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model |
title_full_unstemmed | The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model |
title_short | The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model |
title_sort | diagnostic and triage accuracy of the gpt-3 artificial intelligence model |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9915829/ https://www.ncbi.nlm.nih.gov/pubmed/36778449 http://dx.doi.org/10.1101/2023.01.30.23285067 |
work_keys_str_mv | AT levinedavidm thediagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT tuwanirudraksh thediagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT kompabenjamin thediagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT varmaamita thediagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT finlaysonsamuelg thediagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT mehrotraateev thediagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT beamandrew thediagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT levinedavidm diagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT tuwanirudraksh diagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT kompabenjamin diagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT varmaamita diagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT finlaysonsamuelg diagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT mehrotraateev diagnosticandtriageaccuracyofthegpt3artificialintelligencemodel AT beamandrew diagnosticandtriageaccuracyofthegpt3artificialintelligencemodel |
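
The description field reports 95% confidence intervals on proportions out of 48 vignettes and Brier scores for calibration. The sketch below illustrates how such figures can be computed. It is not the authors' code: the record does not state which interval method was used, so the Wilson score interval is an assumption, and the counts 42/48 and 34/48 are inferred here from the reported 88% and 71%. The function names are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def brier_score(confidences, outcomes):
    """Mean squared difference between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


# Counts inferred from the reported percentages (assumption, not stated in the record).
dx_low, dx_high = wilson_interval(42, 48)        # diagnosis: 42 of 48 correct
triage_low, triage_high = wilson_interval(34, 48)  # triage: 34 of 48 correct
print(f"diagnosis 95% CI: {dx_low:.0%} to {dx_high:.0%}")
print(f"triage 95% CI: {triage_low:.0%} to {triage_high:.0%}")

# A forecaster that always says 50% scores 0.25, a common reference point.
print(brier_score([0.5, 0.5], [1, 0]))
```

Lower Brier scores indicate better calibration; an uninformative constant 50% confidence yields 0.25, which gives context for the reported 0.18 and 0.22.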