
Impact of the Volume and Distribution of Training Datasets in the Development of Deep-Learning Models for the Diagnosis of Colorectal Polyps in Endoscopy Images

Background: Establishment of an artificial intelligence model in gastrointestinal endoscopy has no standardized dataset. The optimal volume or class distribution of training datasets has not been evaluated. An artificial intelligence model was previously created by the authors to classify endoscopic...


Bibliographic Details
Main authors: Gong, Eun Jeong; Bang, Chang Seok; Lee, Jae Jun; Yang, Young Joo; Baik, Gwang Ho
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9505038/
https://www.ncbi.nlm.nih.gov/pubmed/36143146
http://dx.doi.org/10.3390/jpm12091361
author Gong, Eun Jeong
Bang, Chang Seok
Lee, Jae Jun
Yang, Young Joo
Baik, Gwang Ho
collection PubMed
description Background: Establishment of an artificial intelligence model in gastrointestinal endoscopy has no standardized dataset. The optimal volume or class distribution of training datasets has not been evaluated. An artificial intelligence model was previously created by the authors to classify endoscopic images of colorectal polyps into four categories: advanced colorectal cancer, early cancers/high-grade dysplasia, tubular adenoma, and non-neoplasm. The aim of this study was to evaluate the impact of the volume and distribution of training dataset classes in the development of deep-learning models for colorectal polyp histopathology prediction from endoscopic images. Methods: The same 3828 endoscopic images used to create the earlier models were used again. An additional 6838 images were used to find the optimal volume and class distribution for a deep-learning model. Models were built with various data volumes and class distributions. All models were trained on the no-code platform Neuro-T. The primary outcome was accuracy on four-class prediction. Results: In the original, doubled, and tripled datasets alike, the highest internal-test classification accuracy was obtained by doubling the proportion of data for the categories with fewer images (2:2:1:1 for advanced colorectal cancer : early cancers/high-grade dysplasia : tubular adenoma : non-neoplasm). With this distribution, the original dataset showed the highest accuracy (86.4%, 95% confidence interval: 85.0–97.8%) compared with the doubled or tripled dataset. This performance required a total of only 2418 images. Gradient-weighted class activation mapping confirmed that the regions the deep-learning model attends to coincide with the regions the endoscopist attends to.
Conclusion: Because classification performance in colonoscopy plateaus as data volume increases, doubling or tripling a dataset is not always beneficial to training. Deep-learning models would be more accurate if the proportion of lesions in the categories with fewer images was increased.
format Online
Article
Text
id pubmed-9505038
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9505038 2022-09-24 J Pers Med Article MDPI 2022-08-24 /pmc/articles/PMC9505038/ /pubmed/36143146 http://dx.doi.org/10.3390/jpm12091361 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Impact of the Volume and Distribution of Training Datasets in the Development of Deep-Learning Models for the Diagnosis of Colorectal Polyps in Endoscopy Images
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9505038/
https://www.ncbi.nlm.nih.gov/pubmed/36143146
http://dx.doi.org/10.3390/jpm12091361