How to balance the bioinformatics data: pseudo-negative sampling

BACKGROUND: Imbalanced datasets are common in bioinformatics classification problems; that is, the number of negative samples is much larger than the number of positive samples. In particular, data imbalance leads us to underestimate the performance of the minority class of pos...

Bibliographic Details
Main authors:	Zhang, Yongqing, Qiao, Shaojie, Lu, Rongzhao, Han, Nan, Liu, Dingxiang, Zhou, Jiliu
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Methodology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929457/ https://www.ncbi.nlm.nih.gov/pubmed/31874622 http://dx.doi.org/10.1186/s12859-019-3269-4

_version_	1783482704871817216
author	Zhang, Yongqing Qiao, Shaojie Lu, Rongzhao Han, Nan Liu, Dingxiang Zhou, Jiliu
author_facet	Zhang, Yongqing Qiao, Shaojie Lu, Rongzhao Han, Nan Liu, Dingxiang Zhou, Jiliu
author_sort	Zhang, Yongqing
collection	PubMed
description	BACKGROUND: Imbalanced datasets are common in bioinformatics classification problems; that is, the number of negative samples is much larger than the number of positive samples. In particular, data imbalance leads us to underestimate the performance of the minority class of positive samples. How to balance bioinformatics data is therefore a challenging problem. RESULTS: In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be applied effectively when negative samples greatly outnumber positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond the Pearson correlation coefficient (MMPCC), which chooses pseudo-negative samples from the negative samples and treats them as positive samples. In addition, MMPCC uses an incremental search technique to select optimal pseudo-negative samples, which reduces the computational cost. Consequently, the discovered pseudo-negative samples have strong relevance to the positive samples and low redundancy with the negative ones. CONCLUSIONS: To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. The experimental results show that MMPCC performs better than other sampling methods in terms of sensitivity, specificity, accuracy and the Matthews correlation coefficient. This indicates that pseudo-negative samples are particularly helpful for solving the imbalanced dataset problem. Moreover, the gain in sensitivity from the minority samples with pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
format	Online Article Text
id	pubmed-6929457
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6929457 2019-12-30 How to balance the bioinformatics data: pseudo-negative sampling Zhang, Yongqing Qiao, Shaojie Lu, Rongzhao Han, Nan Liu, Dingxiang Zhou, Jiliu BMC Bioinformatics Methodology
BioMed Central 2019-12-24 /pmc/articles/PMC6929457/ /pubmed/31874622 http://dx.doi.org/10.1186/s12859-019-3269-4 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Zhang, Yongqing Qiao, Shaojie Lu, Rongzhao Han, Nan Liu, Dingxiang Zhou, Jiliu How to balance the bioinformatics data: pseudo-negative sampling
title	How to balance the bioinformatics data: pseudo-negative sampling
title_full	How to balance the bioinformatics data: pseudo-negative sampling
title_fullStr	How to balance the bioinformatics data: pseudo-negative sampling
title_full_unstemmed	How to balance the bioinformatics data: pseudo-negative sampling
title_short	How to balance the bioinformatics data: pseudo-negative sampling
title_sort	how to balance the bioinformatics data: pseudo-negative sampling
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929457/ https://www.ncbi.nlm.nih.gov/pubmed/31874622 http://dx.doi.org/10.1186/s12859-019-3269-4
work_keys_str_mv	AT zhangyongqing howtobalancethebioinformaticsdatapseudonegativesampling AT qiaoshaojie howtobalancethebioinformaticsdatapseudonegativesampling AT lurongzhao howtobalancethebioinformaticsdatapseudonegativesampling AT hannan howtobalancethebioinformaticsdatapseudonegativesampling AT liudingxiang howtobalancethebioinformaticsdatapseudonegativesampling AT zhoujiliu howtobalancethebioinformaticsdatapseudonegativesampling
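
The abstract above describes choosing pseudo-negative samples incrementally by a max-relevance min-redundancy criterion built on Pearson correlation. The following is only a minimal sketch of that idea, not the authors' implementation: the scoring rule (mean correlation with the positives minus mean correlation with the remaining negatives), the function names `row_correlations` and `select_pseudo_negatives`, and the parameter `k` are all assumptions made for illustration. It also assumes that no sample has a constant feature vector, since Pearson correlation is undefined in that case.

```python
import numpy as np


def row_correlations(a, b):
    """Pearson correlation between every row of `a` and every row of `b`."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    # Assumption: no row is constant, so the norms below are non-zero.
    a_n = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    b_n = b_c / np.linalg.norm(b_c, axis=1, keepdims=True)
    return a_n @ b_n.T


def select_pseudo_negatives(positives, negatives, k):
    """Greedily pick k negative samples to relabel as positive.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score = mean correlation with positives        (relevance)
          - mean correlation with other negatives  (redundancy)
    """
    positives = np.asarray(positives, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    if not 0 <= k <= len(negatives):
        raise ValueError("k must be between 0 and the number of negatives")

    relevance = row_correlations(negatives, positives).mean(axis=1)
    neg_corr = row_correlations(negatives, negatives)

    remaining = list(range(len(negatives)))
    chosen = []
    for _ in range(k):  # incremental search: one sample per step
        best, best_score = None, -np.inf
        for i in remaining:
            others = [j for j in remaining if j != i]
            redundancy = neg_corr[i, others].mean() if others else 0.0
            score = relevance[i] - redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    pos = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5]]
    neg = [[1.0, 2.0, 3.0, 5.0], [4.0, 3.0, 2.0, 1.0],
           [5.0, 3.0, 2.0, 0.0], [4.0, 3.0, 1.0, 1.0]]
    print(select_pseudo_negatives(pos, neg, 1))
```

In this sketch the selected indices would be moved from the negative set to the positive set before training a classifier. Recomputing the redundancy term against the shrinking pool of negatives at each step is one possible reading of "incremental search"; the paper itself should be consulted for the exact criterion.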

Similar items