
Comparing the performance of meta-classifiers—a case study on selected imbalanced data sets relevant for prediction of liver toxicity

ABSTRACT: Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in development of in silico prediction models, as the traditional machine learning algorithms are known to w...


Bibliographic Details
Main authors:	Jain, Sankalp; Kotsampasakou, Eleni; Ecker, Gerhard F.
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2018
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5919997/ https://www.ncbi.nlm.nih.gov/pubmed/29626291 http://dx.doi.org/10.1007/s10822-018-0116-z

author	Jain, Sankalp Kotsampasakou, Eleni Ecker, Gerhard F.
collection	PubMed
description	ABSTRACT: Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in development of in silico prediction models, as the traditional machine learning algorithms are known to work best on balanced datasets. The class imbalance introduces a bias in the performance of these algorithms due to their preference towards the majority class. Here, we present a comparison of the performance of seven different meta-classifiers for their ability to handle imbalanced datasets, whereby Random Forest is used as base-classifier. Four different datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptide 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio in these datasets ranges between 4:1 and 20:1 for negative and positive classes, respectively. Three different sets of molecular descriptors for model development were used, and their performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified bagging, MetaCost and CostSensitiveClassifier were found to be the best performing among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, Stratified Bagging resulted in high balanced accuracies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s10822-018-0116-z) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5919997
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-5919997 2018-05-01 Jain, Sankalp; Kotsampasakou, Eleni; Ecker, Gerhard F. J Comput Aided Mol Des. Article. Springer International Publishing 2018-04-06 2018 /pmc/articles/PMC5919997/ /pubmed/29626291 http://dx.doi.org/10.1007/s10822-018-0116-z Text en © The Author(s) 2018. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title	Comparing the performance of meta-classifiers—a case study on selected imbalanced data sets relevant for prediction of liver toxicity
topic	Article


Similar Items