
The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics

Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or o...


Bibliographic Details
Main Authors:	Wei, Qiong; Dunbrack, Roland L.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3706434/ https://www.ncbi.nlm.nih.gov/pubmed/23874456 http://dx.doi.org/10.1371/journal.pone.0067863

_version_	1782476560484270080
author	Wei, Qiong Dunbrack, Roland L.
author_facet	Wei, Qiong Dunbrack, Roland L.
author_sort	Wei, Qiong
collection	PubMed
description	Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
format	Online Article Text
id	pubmed-3706434
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3706434 2013-07-19 The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics Wei, Qiong Dunbrack, Roland L. PLoS One Research Article Public Library of Science 2013-07-09 /pmc/articles/PMC3706434/ /pubmed/23874456 http://dx.doi.org/10.1371/journal.pone.0067863 Text en © 2013 Wei, Dunbrack http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Wei, Qiong Dunbrack, Roland L. The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics
title	The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics
title_full	The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics
title_fullStr	The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics
title_full_unstemmed	The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics
title_short	The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics
title_sort	role of balanced training and testing data sets for binary classifiers in bioinformatics
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3706434/ https://www.ncbi.nlm.nih.gov/pubmed/23874456 http://dx.doi.org/10.1371/journal.pone.0067863
work_keys_str_mv	AT weiqiong theroleofbalancedtrainingandtestingdatasetsforbinaryclassifiersinbioinformatics AT dunbrackrolandl theroleofbalancedtrainingandtestingdatasetsforbinaryclassifiersinbioinformatics AT weiqiong roleofbalancedtrainingandtestingdatasetsforbinaryclassifiersinbioinformatics AT dunbrackrolandl roleofbalancedtrainingandtestingdatasetsforbinaryclassifiersinbioinformatics
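
The description above defines balanced accuracy as the average of True Positive Rate and True Negative Rate, and mentions undersampling the majority class as one way to balance training data. The following is a minimal Python sketch of those ideas, not the authors' code. The function names, the label encoding (1 = deleterious, 0 = neutral) and the toy data are all assumptions made for illustration.

```python
import math
import random


def undersample(labels, seed=0):
    """Return sorted indices of a balanced subset: every minority-class item
    plus an equally sized random draw from the majority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


def confusion(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Average of True Positive Rate and True Negative Rate.
    Assumes both classes are present in the true labels."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy, hypothetical data: 10 "deleterious" (1) and 90 "neutral" (0) labels.
y_true = [1] * 10 + [0] * 90

# A trivial predictor that always answers "neutral".
y_all_neutral = [0] * 100

plain_accuracy = sum(t == p for t, p in zip(y_true, y_all_neutral)) / len(y_true)
bal_acc = balanced_accuracy(*confusion(y_true, y_all_neutral))
mcc_value = mcc(*confusion(y_true, y_all_neutral))

# Indices of a 50/50 training subset drawn from the imbalanced labels.
balanced_idx = undersample(y_true)
```

On this toy data the always-neutral predictor has a plain accuracy of 0.9 but a balanced accuracy of 0.5 and an MCC of 0.0, which illustrates why the description favors those two measures when class proportions are skewed or unknown.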
