Impact of the Volume and Distribution of Training Datasets in the Development of Deep-Learning Models for the Diagnosis of Colorectal Polyps in Endoscopy Images
Background: Establishment of an artificial intelligence model in gastrointestinal endoscopy has no standardized dataset. The optimal volume or class distribution of training datasets has not been evaluated. An artificial intelligence model was previously created by the authors to classify endoscopic...
Main authors: | Gong, Eun Jeong; Bang, Chang Seok; Lee, Jae Jun; Yang, Young Joo; Baik, Gwang Ho |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9505038/ https://www.ncbi.nlm.nih.gov/pubmed/36143146 http://dx.doi.org/10.3390/jpm12091361 |
_version_ | 1784796372693680128 |
---|---|
author | Gong, Eun Jeong Bang, Chang Seok Lee, Jae Jun Yang, Young Joo Baik, Gwang Ho |
collection | PubMed |
description | Background: Establishment of an artificial intelligence model in gastrointestinal endoscopy has no standardized dataset. The optimal volume or class distribution of training datasets has not been evaluated. An artificial intelligence model was previously created by the authors to classify endoscopic images of colorectal polyps into four categories, including advanced colorectal cancer, early cancers/high-grade dysplasia, tubular adenoma, and non-neoplasm. The aim of this study was to evaluate the impact of the volume and distribution of training dataset classes in the development of deep-learning models for colorectal polyp histopathology prediction from endoscopic images. Methods: The same 3828 endoscopic images that were used to create earlier models were used. An additional 6838 images were used to find the optimal volume and class distribution for a deep-learning model. Various amounts of data volume and class distributions were tried to establish deep-learning models. The training of deep-learning models uniformly used no-code platform Neuro-T. Accuracy was the primary outcome on four-class prediction. Results: The highest internal-test classification accuracy in the original dataset, doubled dataset, and tripled dataset was commonly shown by doubling the proportion of data for fewer categories (2:2:1:1 for advanced colorectal cancer: early cancers/high-grade dysplasia: tubular adenoma: non-neoplasm). Doubling the proportion of data for fewer categories in the original dataset showed the highest accuracy (86.4%, 95% confidence interval: 85.0–97.8%) compared to that of the doubled or tripled dataset. The total required number of images in this performance was only 2418 images. Gradient-weighted class activation mapping confirmed that the part that the deep-learning model pays attention to coincides with the part that the endoscopist pays attention to. Conclusion: As a result of a data-volume-dependent performance plateau in the classification model of colonoscopy, a dataset that has been doubled or tripled is not always beneficial to training. Deep-learning models would be more accurate if the proportion of fewer category lesions was increased. |
format | Online Article Text |
id | pubmed-9505038 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9505038 2022-09-24 J Pers Med Article MDPI 2022-08-24 /pmc/articles/PMC9505038/ /pubmed/36143146 http://dx.doi.org/10.3390/jpm12091361 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
title | Impact of the Volume and Distribution of Training Datasets in the Development of Deep-Learning Models for the Diagnosis of Colorectal Polyps in Endoscopy Images |
topic | Article |
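
The abstract reports that a 2:2:1:1 class distribution (advanced colorectal cancer : early cancers/high-grade dysplasia : tubular adenoma : non-neoplasm) gave the highest accuracy. The record does not say how that distribution was assembled, and the study itself used the no-code platform Neuro-T, so the following is only a minimal sketch of one way to draw a subset with a fixed class ratio. The function name `subsample_by_ratio`, the class labels, and the toy counts are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def subsample_by_ratio(labels, ratio, seed=0):
    """Return sorted indices of a subset whose class counts follow `ratio`.

    labels: list of class names, one per image (hypothetical label strings).
    ratio:  dict mapping class name -> positive integer weight, e.g. 2:2:1:1.
    The subset is the largest one that the available images allow.
    """
    # Group image indices by class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Largest multiplier such that `unit * weight` images exist for every class.
    unit = min(len(by_class[c]) // w for c, w in ratio.items())

    # Draw without replacement inside each class; seeded for reproducibility.
    rng = random.Random(seed)
    chosen = []
    for c, w in ratio.items():
        chosen.extend(rng.sample(by_class[c], unit * w))
    return sorted(chosen)


# Toy data: these counts are made up for illustration only.
toy_labels = (
    ["advanced_crc"] * 30
    + ["early_crc_hgd"] * 50
    + ["tubular_adenoma"] * 200
    + ["non_neoplasm"] * 120
)
toy_ratio = {
    "advanced_crc": 2,
    "early_crc_hgd": 2,
    "tubular_adenoma": 1,
    "non_neoplasm": 1,
}

subset = subsample_by_ratio(toy_labels, toy_ratio, seed=1)
print(Counter(toy_labels[i] for i in subset))
```

In this sketch the scarcest class relative to its weight sets the size of the whole subset, so enforcing a ratio by subsampling can shrink the training set considerably. That is one possible reading of how a rebalanced set could end up smaller than the full image pool, not a description of the authors' procedure.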