Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study
BACKGROUND: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization. OBJECTIVE: The aim of our study was to use...
Main authors: | Oloruntoba, Ayooluwatomiwa I, Vestergaard, Tine, Nguyen, Toan D, Yu, Zhen, Sashindranath, Maithili, Betz-Stablein, Brigid, Soyer, H Peter, Ge, Zongyuan, Mar, Victoria |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10334907/ http://dx.doi.org/10.2196/35150 |
_version_ | 1785070945180844032 |
---|---|
author | Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria |
author_facet | Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria |
author_sort | Oloruntoba, Ayooluwatomiwa I |
collection | PubMed |
description | BACKGROUND: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization. OBJECTIVE: The aim of our study was to use CNN models with the same architecture—trained on image sets acquired with either the same image capture device and technique (standardized) or with varied devices and capture techniques (nonstandardized)—and test variability in performance when classifying skin cancer images in different populations. METHODS: In all, 3 CNNs with the same architecture were trained. CNN nonstandardized (CNN-NS) was trained on 25,331 images taken from the International Skin Imaging Collaboration (ISIC) using different image capture devices. CNN standardized (CNN-S) was trained on 177,475 MoleMap images taken with the same capture device, and CNN standardized number 2 (CNN-S2) was trained on a subset of 25,331 standardized MoleMap images (matched for number and classes of training images to CNN-NS). These 3 models were then tested on 3 external test sets: 569 Danish images, the publicly available ISIC 2020 data set consisting of 33,126 images, and The University of Queensland (UQ) data set of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish data set were used to determine model performance compared to teledermatologists. RESULTS: When tested on the 569 Danish images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889) and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; standardized models), with both outperforming CNN-NS (nonstandardized model; P=.001 and P=.009, respectively), which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When tested on 2 additional data sets (ISIC 2020 and UQ), CNN-S (P<.001 and P<.001, respectively) and CNN-S2 (P=.08 and P=.35, respectively) still outperformed CNN-NS. When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set, the models’ resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (sensitivity: P=.10; specificity: P=.053). Performance across all CNN models as well as teledermatologists was influenced by image quality. CONCLUSIONS: CNNs trained on standardized images had improved performance and, therefore, greater generalizability in skin cancer classification when applied to unseen data sets. This finding is an important consideration for future algorithm development, regulation, and approval. |
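As context for the abstract above, the following is a minimal sketch of how the study's primary outcome measures (sensitivity, specificity, and AUROC) are typically computed for a binary skin cancer classifier. This is not the authors' code: `y_true`, `y_score`, and the 0.85 target sensitivity are hypothetical placeholders, and scikit-learn is assumed purely for illustration.

```python
# Hedged sketch (not the authors' code): computing the study's primary
# outcome measures -- sensitivity, specificity, and AUROC -- from a
# binary classifier's outputs, using scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical stand-ins for ground-truth labels (1 = malignant) and
# predicted malignancy probabilities on a 569-image test set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=569)
y_score = np.clip(y_true * 0.3 + rng.random(569) * 0.7, 0, 1)

# Area under the ROC curve: the headline, threshold-free metric
# reported per model in the abstract.
auroc = roc_auc_score(y_true, y_score)

# Sensitivity and specificity depend on an operating threshold. To
# mirror the teledermatologist comparison, pick the ROC point whose
# sensitivity is closest to a target (a hypothetical 0.85 here).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
target_sensitivity = 0.85
idx = int(np.argmin(np.abs(tpr - target_sensitivity)))
sensitivity, specificity = tpr[idx], 1 - fpr[idx]

print(f"AUROC={auroc:.3f}  sens={sensitivity:.3f}  "
      f"spec={specificity:.3f}  threshold={thresholds[idx]:.3f}")
```

Matching a fixed operating point in this way mirrors the comparison described in the abstract, where each CNN was matched to the teledermatologists' mean sensitivity and specificity before the remaining metric was compared.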
format | Online Article Text |
id | pubmed-10334907 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-103349072023-07-18 Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria JMIR Dermatol Original Paper BACKGROUND: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization. OBJECTIVE: The aim of our study was to use CNN models with the same architecture—trained on image sets acquired with either the same image capture device and technique (standardized) or with varied devices and capture techniques (nonstandardized)—and test variability in performance when classifying skin cancer images in different populations. METHODS: In all, 3 CNNs with the same architecture were trained. CNN nonstandardized (CNN-NS) was trained on 25,331 images taken from the International Skin Imaging Collaboration (ISIC) using different image capture devices. CNN standardized (CNN-S) was trained on 177,475 MoleMap images taken with the same capture device, and CNN standardized number 2 (CNN-S2) was trained on a subset of 25,331 standardized MoleMap images (matched for number and classes of training images to CNN-NS). These 3 models were then tested on 3 external test sets: 569 Danish images, the publicly available ISIC 2020 data set consisting of 33,126 images, and The University of Queensland (UQ) data set of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish data set were used to determine model performance compared to teledermatologists. RESULTS: When tested on the 569 Danish images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889) and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; standardized models), with both outperforming CNN-NS (nonstandardized model; P=.001 and P=.009, respectively), which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When tested on 2 additional data sets (ISIC 2020 and UQ), CNN-S (P<.001 and P<.001, respectively) and CNN-S2 (P=.08 and P=.35, respectively) still outperformed CNN-NS. When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set, the models’ resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (sensitivity: P=.10; specificity: P=.053). Performance across all CNN models as well as teledermatologists was influenced by image quality. CONCLUSIONS: CNNs trained on standardized images had improved performance and, therefore, greater generalizability in skin cancer classification when applied to unseen data sets. This finding is an important consideration for future algorithm development, regulation, and approval. JMIR Publications 2022-09-12 /pmc/articles/PMC10334907/ http://dx.doi.org/10.2196/35150 Text en ©Ayooluwatomiwa I Oloruntoba, Tine Vestergaard, Toan D Nguyen, Zhen Yu, Maithili Sashindranath, Brigid Betz-Stablein, H Peter Soyer, Zongyuan Ge, Victoria Mar. Originally published in JMIR Dermatology (http://derma.jmir.org), 12.09.2022. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Dermatology, is properly cited. The complete bibliographic information, a link to the original publication on http://derma.jmir.org, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_full | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_fullStr | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_full_unstemmed | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_short | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_sort | assessing the generalizability of deep learning models trained on standardized and nonstandardized images and their performance against teledermatologists: retrospective comparative study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10334907/ http://dx.doi.org/10.2196/35150 |
work_keys_str_mv | AT oloruntobaayooluwatomiwai assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT vestergaardtine assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT nguyentoand assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT yuzhen assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT sashindranathmaithili assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT betzstableinbrigid assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT soyerhpeter assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT gezongyuan assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT marvictoria assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy |