
Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets

The specificity of toxicant-target biomolecule interactions contributes to the highly imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such a...


Bibliographic Details
Main Authors:	Idakwo, Gabriel; Thangapandian, Sundar; Luttrell, Joseph; Li, Yan; Wang, Nan; Zhou, Zhaoxian; Hong, Huixiao; Yang, Bei; Zhang, Chaoyang; Gong, Ping
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7592558/ https://www.ncbi.nlm.nih.gov/pubmed/33372637 http://dx.doi.org/10.1186/s13321-020-00468-x

author	Idakwo, Gabriel; Thangapandian, Sundar; Luttrell, Joseph; Li, Yan; Wang, Nan; Zhou, Zhaoxian; Hong, Huixiao; Yang, Bei; Zhang, Chaoyang; Gong, Ping
collection	PubMed
description	The specificity of toxicant-target biomolecule interactions contributes to the highly imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such imbalance. However, removing inactive compound instances from the majority class by undersampling can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlap and a higher false prediction rate. In this study, to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples and then clean the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consists of 12 in vitro bioassays for more than 10,000 chemicals distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods: RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which the F1 score, Matthews correlation coefficient, and Brier score provided a more consistent assessment of overall performance across the 12 datasets. Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods.
We also found a strong negative correlation between prediction accuracy and the imbalance ratio (IR), defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through data rebalancing.
format	Online Article Text
id	pubmed-7592558
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-7592558 2020-10-29 Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets Idakwo, Gabriel; Thangapandian, Sundar; Luttrell, Joseph; Li, Yan; Wang, Nan; Zhou, Zhaoxian; Hong, Huixiao; Yang, Bei; Zhang, Chaoyang; Gong, Ping J Cheminform Research Article Springer International Publishing 2020-10-27 /pmc/articles/PMC7592558/ /pubmed/33372637 http://dx.doi.org/10.1186/s13321-020-00468-x Text en © The Author(s) 2020 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7592558/ https://www.ncbi.nlm.nih.gov/pubmed/33372637 http://dx.doi.org/10.1186/s13321-020-00468-x


Similar Items