An efficient ensemble method for missing value imputation in microarray gene expression data
BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of genomics data. Current imputation methods based on a single learner often...
Main Authors: | Zhu, Xinshan; Wang, Jiayu; Sun, Biao; Ren, Chao; Yang, Ting; Ding, Jie |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | Methodology Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8045198/ https://www.ncbi.nlm.nih.gov/pubmed/33849444 http://dx.doi.org/10.1186/s12859-021-04109-4 |
_version_ | 1783678635336531968 |
---|---|
author | Zhu, Xinshan; Wang, Jiayu; Sun, Biao; Ren, Chao; Yang, Ting; Ding, Jie |
author_facet | Zhu, Xinshan; Wang, Jiayu; Sun, Biao; Ren, Chao; Yang, Ting; Ding, Jie |
author_sort | Zhu, Xinshan |
collection | PubMed |
description | BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of genomics data. Current imputation methods based on a single learner often exploit only part of the information in the known genomic data, which degrades imputation performance. RESULTS: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied to obtain predictions of missing values from each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and their expression is derived in closed form. Additionally, the performance of the ensemble method is analytically investigated in terms of the sum of squared regression errors. The proposed method is tested on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and missing-data rates. Experimental results show that the proposed method achieves improved imputation performance in terms of accuracy, robustness and generalization. CONCLUSION: The ensemble method achieves superior imputation performance because it makes more efficient use of known data for missing data imputation by integrating diverse imputation methods and learning the integration weights in a data-driven way. |
format | Online Article Text |
id | pubmed-8045198 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8045198 2021-04-14 An efficient ensemble method for missing value imputation in microarray gene expression data Zhu, Xinshan; Wang, Jiayu; Sun, Biao; Ren, Chao; Yang, Ting; Ding, Jie BMC Bioinformatics Methodology Article BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of genomics data. Current imputation methods based on a single learner often exploit only part of the information in the known genomic data, which degrades imputation performance. RESULTS: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied to obtain predictions of missing values from each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and their expression is derived in closed form. Additionally, the performance of the ensemble method is analytically investigated in terms of the sum of squared regression errors. The proposed method is tested on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and missing-data rates. Experimental results show that the proposed method achieves improved imputation performance in terms of accuracy, robustness and generalization. CONCLUSION: The ensemble method achieves superior imputation performance because it makes more efficient use of known data for missing data imputation by integrating diverse imputation methods and learning the integration weights in a data-driven way. BioMed Central 2021-04-13 /pmc/articles/PMC8045198/ /pubmed/33849444 http://dx.doi.org/10.1186/s12859-021-04109-4 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Methodology Article Zhu, Xinshan Wang, Jiayu Sun, Biao Ren, Chao Yang, Ting Ding, Jie An efficient ensemble method for missing value imputation in microarray gene expression data |
title | An efficient ensemble method for missing value imputation in microarray gene expression data |
title_full | An efficient ensemble method for missing value imputation in microarray gene expression data |
title_fullStr | An efficient ensemble method for missing value imputation in microarray gene expression data |
title_full_unstemmed | An efficient ensemble method for missing value imputation in microarray gene expression data |
title_short | An efficient ensemble method for missing value imputation in microarray gene expression data |
title_sort | efficient ensemble method for missing value imputation in microarray gene expression data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8045198/ https://www.ncbi.nlm.nih.gov/pubmed/33849444 http://dx.doi.org/10.1186/s12859-021-04109-4 |
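
The description field says that predictions from several component imputation methods are weighted and summed, with the weights learned from known data by minimizing a cost function of the imputation error. A minimal sketch of that idea follows. It is not the authors' implementation, and several parts are assumptions: the record does not name the component methods, so two simple mean imputers stand in for them; ordinary least squares stands in for the cost function; the bootstrap sampling step is omitted for brevity; and all function names (`row_mean_impute`, `col_mean_impute`, `learn_weights`, `ensemble_impute`) are hypothetical.

```python
import numpy as np


def _fill_with(X, means, axis):
    """Fill NaNs in X using per-row (axis=1) or per-column (axis=0) means."""
    # Fall back to the global mean if a whole row/column is missing.
    means = np.where(np.isnan(means), np.nanmean(X), means)
    filled = X.copy()
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = means[rows] if axis == 1 else means[cols]
    return filled


def row_mean_impute(X):
    """Stand-in component imputer: mean of the gene (row)."""
    return _fill_with(X, np.nanmean(X, axis=1), axis=1)


def col_mean_impute(X):
    """Stand-in component imputer: mean of the sample (column)."""
    return _fill_with(X, np.nanmean(X, axis=0), axis=0)


def learn_weights(X, imputers, hide_frac=0.1, seed=0):
    """Hide some known entries, let each imputer predict them, and solve
    a least-squares problem for the combination weights (assumed cost)."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(~np.isnan(X))
    n_hide = max(len(imputers), int(hide_frac * len(known)))
    picked = known[rng.choice(len(known), size=n_hide, replace=False)]
    r, c = picked[:, 0], picked[:, 1]
    truth = X[r, c]
    X_train = X.copy()
    X_train[r, c] = np.nan
    # One column of predictions per component imputer.
    P = np.column_stack([imp(X_train)[r, c] for imp in imputers])
    w, *_ = np.linalg.lstsq(P, truth, rcond=None)
    return w


def ensemble_impute(X, imputers, weights):
    """Replace missing entries with the weighted sum of component predictions."""
    missing = np.isnan(X)
    combined = sum(w * imp(X) for w, imp in zip(weights, imputers))
    out = X.copy()
    out[missing] = combined[missing]
    return out
```

To use the sketch, pass a genes-by-samples array with `np.nan` marking missing entries: call `learn_weights(X, [row_mean_impute, col_mean_impute])` and then `ensemble_impute` with the returned weights. Observed entries are left unchanged.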