Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays

Bibliographic Details
Main Authors:	Rajaraman, Sivaramakrishnan; Zamzmi, Ghada; Yang, Feng; Liang, Zhaohui; Xue, Zhiyun; Antani, Sameer
Format:	Online Article Text
Language:	English
Published:	Cornell University 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10659445/ https://www.ncbi.nlm.nih.gov/pubmed/37986725

Description:	Deep learning (DL) has demonstrated its innate capacity to independently learn hierarchical features from complex and multi-dimensional data. A common understanding is that its performance scales up with the amount of training data. Another data attribute is the inherent variety. It follows, therefore, that semantic redundancy, which is the presence of similar or repetitive information, would tend to lower performance and limit generalizability to unseen data. In medical imaging data, semantic redundancy can occur due to the presence of multiple images that have highly similar presentations for the disease of interest. Further, the common use of augmentation methods to generate variety in DL training may be limiting performance when applied to semantically redundant data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. We demonstrate using the publicly available NIH chest X-ray dataset that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set, during both internal (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.
Source:	ArXiv
Publication date:	2023-09-18
Record ID:	pubmed-10659445
Collection:	PubMed
Institution:	National Center for Biotechnology Information
Record format:	MEDLINE/PubMed
License:	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
