An efficient ensemble method for missing value imputation in microarray gene expression data

BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of these data. Current imputation methods based on a single learner often...

Bibliographic Details
Main authors:	Zhu, Xinshan; Wang, Jiayu; Sun, Biao; Ren, Chao; Yang, Ting; Ding, Jie
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8045198/ https://www.ncbi.nlm.nih.gov/pubmed/33849444 http://dx.doi.org/10.1186/s12859-021-04109-4

_version_	1783678635336531968
author	Zhu, Xinshan Wang, Jiayu Sun, Biao Ren, Chao Yang, Ting Ding, Jie
author_sort	Zhu, Xinshan
collection	PubMed
description	BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of these data. Current imputation methods based on a single learner often exploit only part of the information in the known genomic data, which degrades imputation performance. RESULTS: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied to obtain predictions of the missing values from each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and the expression for the optimal weights is derived in closed form. Additionally, the performance of the ensemble method is investigated analytically in terms of the sum of squared regression errors. The proposed method is evaluated by simulation on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and data missing rates. Experimental results show that the proposed method achieves improved imputation performance in terms of imputation accuracy, robustness and generalization. CONCLUSION: The ensemble method achieves superior imputation performance because it uses the information in the known data more efficiently for missing data imputation, by integrating diverse imputation methods and learning the integration weights in a data-driven way.
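The weighting step summarized in the description can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives neither the cost function nor the closed-form expression, so the sketch assumes a plain sum-of-squared-errors cost on known entries, whose minimizer is the ordinary least-squares solution. The bootstrap step is omitted, and the names `learn_weights`, `ensemble_impute`, and the synthetic component predictions are hypothetical.

```python
import numpy as np


def learn_weights(preds, truth):
    """Least-squares weights for combining component predictions.

    preds: (n_known, n_methods) predictions of each component method
           for entries whose true value is known.
    truth: (n_known,) true values of those entries.
    Assumption: the cost is the sum of squared errors, so the minimizer
    is the least-squares solution of preds @ w = truth.
    """
    weights, *_ = np.linalg.lstsq(preds, truth, rcond=None)
    return weights


def ensemble_impute(preds_missing, weights):
    """Weighted sum of component predictions for the missing entries."""
    return preds_missing @ weights


# Synthetic stand-ins for three component imputation methods.
rng = np.random.default_rng(0)
truth = rng.normal(size=200)
preds = np.column_stack([
    truth + rng.normal(scale=0.3, size=200),        # fairly accurate method
    truth + rng.normal(scale=0.6, size=200),        # noisier method
    0.5 * truth + rng.normal(scale=0.3, size=200),  # biased method
])

weights = learn_weights(preds, truth)
combined = ensemble_impute(preds, weights)

# In-sample squared error of the ensemble and of each single method.
sse_ensemble = float(np.sum((combined - truth) ** 2))
sse_single = np.sum((preds - truth[:, None]) ** 2, axis=0)
```

On the entries used to fit the weights, the ensemble error cannot exceed that of any single component, because using one method alone corresponds to a particular choice of weights. Performance on held-out entries is not guaranteed by this argument and would need to be checked separately.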
format	Online Article Text
id	pubmed-8045198
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8045198 2021-04-14 An efficient ensemble method for missing value imputation in microarray gene expression data Zhu, Xinshan Wang, Jiayu Sun, Biao Ren, Chao Yang, Ting Ding, Jie BMC Bioinformatics Methodology Article
BioMed Central 2021-04-13 /pmc/articles/PMC8045198/ /pubmed/33849444 http://dx.doi.org/10.1186/s12859-021-04109-4 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	An efficient ensemble method for missing value imputation in microarray gene expression data
topic	Methodology Article
