The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study

BACKGROUND: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear. OBJECTIVE: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health condit...

Bibliographic Details
Main authors: Ito, Naoki, Kadomatsu, Sakina, Fujisawa, Mineto, Fukaguchi, Kiyomitsu, Ishizawa, Ryo, Kanda, Naoki, Kasugai, Daisuke, Nakajima, Mikio, Goto, Tadahiro, Tsugawa, Yusuke
Format: Online Article Text
Language: English
Published: JMIR Publications 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10654908/
https://www.ncbi.nlm.nih.gov/pubmed/37917120
http://dx.doi.org/10.2196/47532
_version_ 1785136713201352704
author Ito, Naoki
Kadomatsu, Sakina
Fujisawa, Mineto
Fukaguchi, Kiyomitsu
Ishizawa, Ryo
Kanda, Naoki
Kasugai, Daisuke
Nakajima, Mikio
Goto, Tadahiro
Tsugawa, Yusuke
author_sort Ito, Naoki
collection PubMed
description BACKGROUND: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear.
OBJECTIVE: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity.
METHODS: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as “correct” or “incorrect.” Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes.
RESULTS: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients’ race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity.
CONCLUSIONS: GPT-4’s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
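The confidence intervals quoted in the abstract can be sanity-checked with a short sketch. The abstract does not say which interval method was used; the code below assumes an exact (Clopper-Pearson) binomial interval, and the function names are illustrative only. Under that assumption, 44 correct out of 45 gives roughly 88.2% to 99.9%, in line with the reported figures for GPT-4's diagnostic accuracy.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def find_root_increasing(g, lo: float = 0.0, hi: float = 1.0, iters: int = 60) -> float:
    """Bisection for a function g that goes from negative to positive on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided binomial confidence interval for x successes in n trials.

    Assumption: this is one common choice of interval, not necessarily
    the method the study authors used.
    """
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else find_root_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else find_root_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p)
    )
    return lower, upper


# Counts taken from the abstract (correct answers out of 45 vignettes).
for label, x, n in [
    ("GPT-4 diagnosis", 44, 45),
    ("Physician diagnosis", 41, 45),
    ("Triage (GPT-4 and physicians)", 30, 45),
]:
    low, high = clopper_pearson(x, n)
    print(f"{label}: {x}/{n} = {x / n:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

The bisection solver keeps the sketch free of third-party dependencies; a statistics library with an exact binomial interval routine would serve equally well.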
format Online
Article
Text
id pubmed-10654908
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-10654908 2023-11-02 The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study Ito, Naoki Kadomatsu, Sakina Fujisawa, Mineto Fukaguchi, Kiyomitsu Ishizawa, Ryo Kanda, Naoki Kasugai, Daisuke Nakajima, Mikio Goto, Tadahiro Tsugawa, Yusuke JMIR Med Educ Original Paper JMIR Publications 2023-11-02 /pmc/articles/PMC10654908/ /pubmed/37917120 http://dx.doi.org/10.2196/47532 Text en ©Naoki Ito, Sakina Kadomatsu, Mineto Fujisawa, Kiyomitsu Fukaguchi, Ryo Ishizawa, Naoki Kanda, Daisuke Kasugai, Mikio Nakajima, Tadahiro Goto, Yusuke Tsugawa. Originally published in JMIR Medical Education (https://mededu.jmir.org), 02.11.2023. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.
title The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10654908/
https://www.ncbi.nlm.nih.gov/pubmed/37917120
http://dx.doi.org/10.2196/47532