
Differential Biases and Variabilities of Deep Learning–Based Artificial Intelligence and Human Experts in Clinical Diagnosis: Retrospective Cohort and Survey Study

BACKGROUND: Deep learning (DL)–based artificial intelligence may have diagnostic characteristics that differ from those of human experts. Because DL is a data-driven knowledge system, heterogeneous disease incidence across clinical populations is considered to bias DL more than it does clinicians. Conversely, because human experts encounter only a limited number of cases, they may exhibit large interindividual variability. Understanding how the 2 groups classify the same data differently is therefore an essential step toward the cooperative use of DL in clinical applications.

OBJECTIVE: This study aimed to evaluate and compare how clinical experience affects otoendoscopic image diagnosis by computers and by physicians, as exemplified by the class imbalance problem, and to guide clinicians in using decision support systems.

METHODS: We used 22,707 digital otoendoscopic images of patients who visited the outpatient clinic of the Department of Otorhinolaryngology at Severance Hospital, Seoul, South Korea, from January 2013 to June 2019. After excluding similar images, 7500 otoendoscopic images were selected for labeling. We built a DL-based image classification model that assigns a given image to 1 of 6 disease categories. Two test sets of 300 images were populated: a balanced set and an imbalanced set. We included 14 clinicians (otolaryngologists and nonotolaryngology specialists, including general practitioners) and 13 DL-based models, and compared individual physicians and the machine learning (ML) models using accuracy (overall and per class) and kappa statistics.

RESULTS: Our ML models had consistently high accuracies (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%), equivalent to those of otolaryngologists (balanced: mean 71.17%, SD 3.37%; imbalanced: mean 72.84%, SD 6.41%) and far better than those of nonotolaryngologists (balanced: mean 45.63%, SD 7.89%; imbalanced: mean 44.08%, SD 15.83%). However, the ML models suffered from the class imbalance problem: their accuracy was higher on the imbalanced test set (mean 82.03%, SD 3.06%) than on the balanced test set (mean 77.14%, SD 1.83%), reflecting a bias toward prevalent classes. Data augmentation mitigated this, particularly for low-incidence classes, but rare disease classes still had low per-class accuracies. Human physicians, although less affected by prevalence, showed high interphysician variability (ML models: kappa=0.83, SD 0.02; otolaryngologists: kappa=0.60, SD 0.07).

CONCLUSIONS: Although ML models deliver excellent performance in classifying ear disease, physicians and ML models each have their own strengths. ML models achieve consistent, high accuracy from the given image alone but are biased toward prevalent diseases, whereas human physicians vary in performance, are not biased toward prevalence, and can draw on information beyond the image. Given the shortage of otolaryngologists, our ML model can play a cooperative role for clinicians with diverse expertise, provided it is kept in mind that such models consider only images and can remain biased toward prevalent diseases even after data augmentation.
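As a concrete reading of the metrics in the abstract, the sketch below computes overall accuracy, per-class accuracy, and Cohen kappa for a 6-class rater comparison. It is a minimal illustration assuming integer-encoded labels and scikit-learn; the variable names, the pairwise-kappa grouping, and the library choice are assumptions, not the authors' published code.

```python
# Minimal sketch of the evaluation described in the abstract: overall accuracy,
# per-class accuracy, and Cohen kappa between raters. The 6-class integer
# encoding and scikit-learn usage are illustrative assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

NUM_CLASSES = 6  # the study classifies images into 6 disease categories

def rater_accuracy(y_true, y_pred):
    """Overall accuracy plus per-class accuracy (recall within each true class)."""
    overall = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=range(NUM_CLASSES))
    per_class = cm.diagonal() / cm.sum(axis=1).clip(min=1)
    return overall, per_class

def mean_pairwise_kappa(predictions):
    """One plausible way to obtain group agreement figures like those in the
    results: average Cohen kappa over all pairs of raters in a group
    (e.g., the 13 ML models, or the otolaryngologists)."""
    ks = [cohen_kappa_score(a, b)
          for i, a in enumerate(predictions)
          for b in predictions[i + 1:]]
    return float(np.mean(ks))
```

On this reading, a high mean pairwise kappa (0.83 for the models) indicates the consistent behavior the abstract attributes to ML, while the lower kappa among otolaryngologists (0.60) captures the interphysician variability.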

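The abstract also notes that class imbalance was mitigated by data augmentation for low-incidence classes. A hypothetical sketch of one common recipe follows: inverse-frequency oversampling combined with random image augmentations. PyTorch/torchvision and the specific transforms are assumptions for illustration; the paper's actual pipeline may differ.

```python
# Hypothetical sketch of mitigating class imbalance as described above:
# rare classes are oversampled by inverse frequency while random transforms
# add image-level variety. PyTorch/torchvision and these exact transforms
# are assumptions, not the authors' published pipeline.
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),          # slight rotation of the endoscope view
    transforms.ColorJitter(0.2, 0.2, 0.2),  # brightness/contrast/saturation jitter
    transforms.ToTensor(),
])

def make_balanced_loader(dataset, labels, batch_size=32):
    """Draw samples with probability inversely proportional to class
    frequency, so each minibatch sees rare disease categories more often."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Even with such oversampling, the results show that rare classes retained low per-class accuracy, which is why the conclusions caution that the models may stay biased toward prevalent diseases.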

Bibliographic Details
Main Authors: Cha, Dongchul; Pae, Chongwon; Lee, Se A; Na, Gina; Hur, Young Kyun; Lee, Ho Young; Cho, A Ra; Cho, Young Joon; Han, Sang Gil; Kim, Sung Huhn; Choi, Jae Young; Park, Hae-Jeong
Format: Online Article Text
Language: English
Published: JMIR Publications, 2021
Subjects: Original Paper
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8701703/
https://www.ncbi.nlm.nih.gov/pubmed/34889764
http://dx.doi.org/10.2196/33049
Journal: JMIR Medical Informatics (Original Paper); published online December 8, 2021.
License: © Dongchul Cha, Chongwon Pae, Se A Lee, Gina Na, Young Kyun Hur, Ho Young Lee, A Ra Cho, Young Joon Cho, Sang Gil Han, Sung Huhn Kim, Jae Young Choi, Hae-Jeong Park. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 08.12.2021. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited.