Validation of artificial intelligence prediction models for skin cancer diagnosis using dermoscopy images: the 2019 International Skin Imaging Collaboration Grand Challenge

BACKGROUND: Previous studies of artificial intelligence (AI) applied to dermatology have shown AI to have higher diagnostic classification accuracy than expert dermatologists; however, these studies did not adequately assess clinically realistic scenarios, such as how AI systems behave when presente...

Bibliographic Details
Main Authors: Combalia, Marc, Codella, Noel, Rotemberg, Veronica, Carrera, Cristina, Dusza, Stephen, Gutman, David, Helba, Brian, Kittler, Harald, Kurtansky, Nicholas R, Liopyris, Konstantinos, Marchetti, Michael A, Podlipnik, Sebastian, Puig, Susana, Rinner, Christoph, Tschandl, Philipp, Weber, Jochen, Halpern, Allan, Malvehy, Josep
Format: Online Article Text
Language: English
Published: 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9295694/
https://www.ncbi.nlm.nih.gov/pubmed/35461690
http://dx.doi.org/10.1016/S2589-7500(22)00021-8
author Combalia, Marc
Codella, Noel
Rotemberg, Veronica
Carrera, Cristina
Dusza, Stephen
Gutman, David
Helba, Brian
Kittler, Harald
Kurtansky, Nicholas R
Liopyris, Konstantinos
Marchetti, Michael A
Podlipnik, Sebastian
Puig, Susana
Rinner, Christoph
Tschandl, Philipp
Weber, Jochen
Halpern, Allan
Malvehy, Josep
collection PubMed
description BACKGROUND: Previous studies of artificial intelligence (AI) applied to dermatology have shown AI to have higher diagnostic classification accuracy than expert dermatologists; however, these studies did not adequately assess clinically realistic scenarios, such as how AI systems behave when presented with images of disease categories that are not included in the training dataset or images drawn from statistical distributions with significant shifts from training distributions. We aimed to simulate these real-world scenarios and evaluate the effects of image source institution, diagnoses outside of the training set, and other image artifacts on classification accuracy, with the goal of informing clinicians and regulatory agencies about safety and real-world accuracy.
METHODS: We designed a large dermoscopic image classification challenge to quantify the performance of machine learning algorithms for the task of skin cancer classification from dermoscopic images, and how this performance is affected by shifts in statistical distributions of data, disease categories not represented in training datasets, and imaging or lesion artifacts. Factors that might be beneficial to performance, such as clinical metadata and external training data collected by challenge participants, were also evaluated. 25 331 training images collected from two datasets (in Vienna [HAM10000] and Barcelona [BCN20000]) between Jan 1, 2000, and Dec 31, 2018, across eight skin diseases, were provided to challenge participants to design appropriate algorithms. The trained algorithms were then tested for balanced accuracy against the HAM10000 and BCN20000 test datasets and data from countries not included in the training dataset (Turkey, New Zealand, Sweden, and Argentina). Test datasets contained images of all diagnostic categories available in training plus other diagnoses not included in training data (not trained category). We compared the performance of the algorithms against that of 18 dermatologists in a simulated setting that reflected intended clinical use.
FINDINGS: 64 teams submitted 129 state-of-the-art algorithm predictions on a test set of 8238 images. The best performing algorithm achieved 58·8% balanced accuracy on the BCN20000 data, which was designed to better reflect realistic clinical scenarios, compared with 82·0% balanced accuracy on HAM10000, which was used in a previously published benchmark. Shifted statistical distributions and disease categories not included in training data contributed to decreases in accuracy. Image artifacts, including hair, pen markings, ulceration, and imaging source institution, decreased accuracy in a complex manner that varied based on the underlying diagnosis. When comparing algorithms to expert dermatologists (2460 ratings on 1269 images), algorithms performed better than experts in most categories, except for actinic keratoses (similar accuracy on average) and images from categories not included in training data (26% correct for experts vs 6% correct for algorithms, p<0·0001). For the top 25 submitted algorithms, 47·1% of the images from categories not included in training data were misclassified as malignant diagnoses, which would lead to a substantial number of unnecessary biopsies if current state-of-the-art AI technologies were clinically deployed.
INTERPRETATION: We have identified specific deficiencies and safety issues in AI diagnostic systems for skin cancer that should be addressed in future diagnostic evaluation protocols to improve safety and reliability in clinical practice.
FUNDING: Melanoma Research Alliance and La Marató de TV3.
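The abstract reports results as balanced accuracy but does not spell out the formula. The following is a minimal sketch, assuming the common definition (the unweighted mean of per-class recall); the function name, the class codes "NV" and "MEL", and the toy labels are hypothetical illustrations, not data or code from the challenge.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true.

    Assumed definition; the catalog record does not state the exact formula.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Each class contributes equally, regardless of how many samples it has.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical toy labels: a frequent class and a rare class.
y_true = ["NV"] * 8 + ["MEL"] * 2
y_pred = ["NV"] * 8 + ["NV", "MEL"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy: {plain:.2f}, balanced accuracy: {bal:.2f}")
```

In this toy case one of the two rare-class samples is missed, so plain accuracy stays high while balanced accuracy drops, which is why a class-balanced metric is a natural choice when diagnostic categories are unevenly represented.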
format Online
Article
Text
id pubmed-9295694
institution National Center for Biotechnology Information
language English
publishDate 2022
record_format MEDLINE/PubMed
spelling pubmed-9295694 2022-07-19 Lancet Digit Health Article 2022-05 /pmc/articles/PMC9295694/ /pubmed/35461690 http://dx.doi.org/10.1016/S2589-7500(22)00021-8 Text en This is an Open Access article under the CC BY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Validation of artificial intelligence prediction models for skin cancer diagnosis using dermoscopy images: the 2019 International Skin Imaging Collaboration Grand Challenge
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9295694/
https://www.ncbi.nlm.nih.gov/pubmed/35461690
http://dx.doi.org/10.1016/S2589-7500(22)00021-8