An efficient ensemble method for missing value imputation in microarray gene expression data

BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of such data. Current imputation methods based on a single learner often...

Bibliographic Details
Main authors: Zhu, Xinshan, Wang, Jiayu, Sun, Biao, Ren, Chao, Yang, Ting, Ding, Jie
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8045198/
https://www.ncbi.nlm.nih.gov/pubmed/33849444
http://dx.doi.org/10.1186/s12859-021-04109-4
_version_ 1783678635336531968
author Zhu, Xinshan
Wang, Jiayu
Sun, Biao
Ren, Chao
Yang, Ting
Ding, Jie
author_facet Zhu, Xinshan
Wang, Jiayu
Sun, Biao
Ren, Chao
Yang, Ting
Ding, Jie
author_sort Zhu, Xinshan
collection PubMed
description BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of such data. Current imputation methods based on a single learner often exploit only part of the known genomic data, which degrades imputation performance. RESULTS: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied to obtain predictions of the missing values from each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and their expression is derived in closed form. Additionally, the performance of the ensemble method is analytically investigated in terms of the sum of squared regression errors. The proposed method is evaluated on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and missing-data rates. Experimental results show that the proposed method achieves improved imputation performance in terms of accuracy, robustness and generalization. CONCLUSION: The ensemble method achieves superior imputation performance because it uses known data more efficiently for missing data imputation by integrating diverse imputation methods and learning the integration weights in a data-driven way.
format Online
Article
Text
id pubmed-8045198
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8045198 2021-04-14 An efficient ensemble method for missing value imputation in microarray gene expression data Zhu, Xinshan Wang, Jiayu Sun, Biao Ren, Chao Yang, Ting Ding, Jie BMC Bioinformatics Methodology Article
BioMed Central 2021-04-13 /pmc/articles/PMC8045198/ /pubmed/33849444 http://dx.doi.org/10.1186/s12859-021-04109-4 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology Article
Zhu, Xinshan
Wang, Jiayu
Sun, Biao
Ren, Chao
Yang, Ting
Ding, Jie
An efficient ensemble method for missing value imputation in microarray gene expression data
title An efficient ensemble method for missing value imputation in microarray gene expression data
title_full An efficient ensemble method for missing value imputation in microarray gene expression data
title_fullStr An efficient ensemble method for missing value imputation in microarray gene expression data
title_full_unstemmed An efficient ensemble method for missing value imputation in microarray gene expression data
title_short An efficient ensemble method for missing value imputation in microarray gene expression data
title_sort efficient ensemble method for missing value imputation in microarray gene expression data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8045198/
https://www.ncbi.nlm.nih.gov/pubmed/33849444
http://dx.doi.org/10.1186/s12859-021-04109-4
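
The abstract in this record describes combining several single imputation methods through a weighted sum, with weights learned from known data by minimizing a squared imputation error in closed form. The following is only a minimal sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the three component imputers (`row_mean`, `col_mean`, `row_median`), the function names, the use of a single random hold-out mask in place of the bootstrap sampling mentioned in the abstract, and the small `ridge` term added for numerical stability. Rows are taken to be genes and columns to be samples.

```python
import numpy as np

# Hypothetical component imputers; the record does not say which ones the paper uses.
def row_mean(X):
    """Fill missing entries with the mean of the observed values in the same row."""
    return np.where(np.isnan(X), np.nanmean(X, axis=1, keepdims=True), X)

def col_mean(X):
    """Fill missing entries with the mean of the observed values in the same column."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0, keepdims=True), X)

def row_median(X):
    """Fill missing entries with the median of the observed values in the same row."""
    return np.where(np.isnan(X), np.nanmedian(X, axis=1, keepdims=True), X)

IMPUTERS = [row_mean, col_mean, row_median]

def learn_weights(X, imputers, holdout_frac=0.1, ridge=1e-6, seed=0):
    """Learn combination weights from known entries via regularized least squares.

    A random subset of observed entries is hidden, each imputer predicts them,
    and the weights minimizing the squared error are obtained in closed form:
    w = (P^T P + ridge * I)^{-1} P^T y.
    """
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))
    n_hold = max(1, int(holdout_frac * len(observed)))
    held = observed[rng.choice(len(observed), size=n_hold, replace=False)]
    rows, cols = held[:, 0], held[:, 1]

    y = X[rows, cols]                 # true values of the hidden entries
    X_masked = X.copy()
    X_masked[rows, cols] = np.nan     # hide them from the component imputers

    # One column of predictions per component imputer.
    P = np.column_stack([f(X_masked)[rows, cols] for f in imputers])
    gram = P.T @ P + ridge * np.eye(P.shape[1])
    return np.linalg.solve(gram, P.T @ y)

def ensemble_impute(X, imputers, weights):
    """Replace missing entries with the weighted sum of the component predictions."""
    combined = sum(w * f(X) for w, f in zip(weights, imputers))
    return np.where(np.isnan(X), combined, X)
```

A typical call would be `w = learn_weights(X, IMPUTERS)` followed by `ensemble_impute(X, IMPUTERS, w)`, where `X` is a NumPy array with `np.nan` marking missing values. The sketch assumes every row and column keeps at least one observed value after masking; otherwise the mean and median imputers return `nan`.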