Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study
BACKGROUND: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization. OBJECTIVE: The aim of our study was to use...
Main authors: | Oloruntoba, Ayooluwatomiwa I, Vestergaard, Tine, Nguyen, Toan D, Yu, Zhen, Sashindranath, Maithili, Betz-Stablein, Brigid, Soyer, H Peter, Ge, Zongyuan, Mar, Victoria |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10334907/ http://dx.doi.org/10.2196/35150 |
_version_ | 1785070945180844032 |
---|---|
author | Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria |
author_facet | Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria |
author_sort | Oloruntoba, Ayooluwatomiwa I |
collection | PubMed |
description | BACKGROUND: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization. OBJECTIVE: The aim of our study was to use CNN models with the same architecture—trained on image sets acquired with either the same image capture device and technique (standardized) or with varied devices and capture techniques (nonstandardized)—and test variability in performance when classifying skin cancer images in different populations. METHODS: In all, 3 CNNs with the same architecture were trained. CNN nonstandardized (CNN-NS) was trained on 25,331 images taken from the International Skin Imaging Collaboration (ISIC) using different image capture devices. CNN standardized (CNN-S) was trained on 177,475 MoleMap images taken with the same capture device, and CNN standardized number 2 (CNN-S2) was trained on a subset of 25,331 standardized MoleMap images (matched for number and classes of training images to CNN-NS). These 3 models were then tested on 3 external test sets: 569 Danish images, the publicly available ISIC 2020 data set consisting of 33,126 images, and The University of Queensland (UQ) data set of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish data set were used to determine model performance compared to teledermatologists. RESULTS: When tested on the 569 Danish images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889) and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; standardized models), with both outperforming CNN-NS (nonstandardized model; P=.001 and P=.009, respectively), which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When tested on 2 additional data sets (ISIC 2020 and UQ), CNN-S (P<.001 and P<.001, respectively) and CNN-S2 (P=.08 and P=.35, respectively) still outperformed CNN-NS. When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set, the models’ resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (sensitivity: P=.10; specificity: P=.053). Performance across all CNN models as well as teledermatologists was influenced by image quality. CONCLUSIONS: CNNs trained on standardized images had improved performance and, therefore, greater generalizability in skin cancer classification when applied to unseen data sets. This finding is an important consideration for future algorithm development, regulation, and approval. |
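As context for the abstract above, the following is a minimal sketch of how the study's primary outcome measures (sensitivity, specificity, and AUROC) are typically computed for a binary skin cancer classifier. This is not the authors' code: `y_true`, `y_score`, and the 0.85 target sensitivity are hypothetical placeholders, and scikit-learn is assumed purely for illustration.

```python
# Hedged sketch (not the authors' code): computing the study's primary
# outcome measures -- sensitivity, specificity, and AUROC -- from a
# binary classifier's outputs, using scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical stand-ins for ground-truth labels (1 = malignant) and
# predicted malignancy probabilities on a 569-image test set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=569)
y_score = np.clip(y_true * 0.3 + rng.random(569) * 0.7, 0, 1)

# Area under the ROC curve: the headline, threshold-free metric
# reported per model in the abstract.
auroc = roc_auc_score(y_true, y_score)

# Sensitivity and specificity depend on an operating threshold. To
# mirror the teledermatologist comparison, pick the ROC point whose
# sensitivity is closest to a target (a hypothetical 0.85 here).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
target_sensitivity = 0.85
idx = int(np.argmin(np.abs(tpr - target_sensitivity)))
sensitivity, specificity = tpr[idx], 1 - fpr[idx]

print(f"AUROC={auroc:.3f}  sens={sensitivity:.3f}  "
      f"spec={specificity:.3f}  threshold={thresholds[idx]:.3f}")
```

Matching a fixed operating point in this way mirrors the comparison described in the abstract, where each CNN was matched to the teledermatologists' mean sensitivity and specificity before the remaining metric was compared.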
format | Online Article Text |
id | pubmed-10334907 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-103349072023-07-18 Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria JMIR Dermatol Original Paper BACKGROUND: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization. OBJECTIVE: The aim of our study was to use CNN models with the same architecture—trained on image sets acquired with either the same image capture device and technique (standardized) or with varied devices and capture techniques (nonstandardized)—and test variability in performance when classifying skin cancer images in different populations. METHODS: In all, 3 CNNs with the same architecture were trained. CNN nonstandardized (CNN-NS) was trained on 25,331 images taken from the International Skin Imaging Collaboration (ISIC) using different image capture devices. CNN standardized (CNN-S) was trained on 177,475 MoleMap images taken with the same capture device, and CNN standardized number 2 (CNN-S2) was trained on a subset of 25,331 standardized MoleMap images (matched for number and classes of training images to CNN-NS). These 3 models were then tested on 3 external test sets: 569 Danish images, the publicly available ISIC 2020 data set consisting of 33,126 images, and The University of Queensland (UQ) data set of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish data set were used to determine model performance compared to teledermatologists. RESULTS: When tested on the 569 Danish images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889) and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; standardized models), with both outperforming CNN-NS (nonstandardized model; P=.001 and P=.009, respectively), which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When tested on 2 additional data sets (ISIC 2020 and UQ), CNN-S (P<.001 and P<.001, respectively) and CNN-S2 (P=.08 and P=.35, respectively) still outperformed CNN-NS. When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set, the models’ resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (sensitivity: P=.10; specificity: P=.053). Performance across all CNN models as well as teledermatologists was influenced by image quality. CONCLUSIONS: CNNs trained on standardized images had improved performance and, therefore, greater generalizability in skin cancer classification when applied to unseen data sets. This finding is an important consideration for future algorithm development, regulation, and approval. JMIR Publications 2022-09-12 /pmc/articles/PMC10334907/ http://dx.doi.org/10.2196/35150 Text en ©Ayooluwatomiwa I Oloruntoba, Tine Vestergaard, Toan D Nguyen, Zhen Yu, Maithili Sashindranath, Brigid Betz-Stablein, H Peter Soyer, Zongyuan Ge, Victoria Mar. Originally published in JMIR Dermatology (http://derma.jmir.org), 12.09.2022. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Dermatology, is properly cited. The complete bibliographic information, a link to the original publication on http://derma.jmir.org, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Oloruntoba, Ayooluwatomiwa I Vestergaard, Tine Nguyen, Toan D Yu, Zhen Sashindranath, Maithili Betz-Stablein, Brigid Soyer, H Peter Ge, Zongyuan Mar, Victoria Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_full | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_fullStr | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_full_unstemmed | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_short | Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study |
title_sort | assessing the generalizability of deep learning models trained on standardized and nonstandardized images and their performance against teledermatologists: retrospective comparative study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10334907/ http://dx.doi.org/10.2196/35150 |
work_keys_str_mv | AT oloruntobaayooluwatomiwai assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT vestergaardtine assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT nguyentoand assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT yuzhen assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT sashindranathmaithili assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT betzstableinbrigid assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT soyerhpeter assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT gezongyuan assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy AT marvictoria assessingthegeneralizabilityofdeeplearningmodelstrainedonstandardizedandnonstandardizedimagesandtheirperformanceagainstteledermatologistsretrospectivecomparativestudy |