Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

BACKGROUND: Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression...

Bibliographic Details
Main Authors:	Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2253514/ https://www.ncbi.nlm.nih.gov/pubmed/18186917 http://dx.doi.org/10.1186/1471-2105-9-12

_version_	1782151110806470656
author	Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C
collection	PubMed
description	BACKGROUND: Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. RESULTS: We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. CONCLUSION: Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better in data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
format	Text
id	pubmed-2253514
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2253514 2008-03-19 BMC Bioinformatics Research Article BioMed Central 2008-01-10 /pmc/articles/PMC2253514/ /pubmed/18186917 http://dx.doi.org/10.1186/1471-2105-9-12 Text en Copyright © 2008 Brock et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2253514/ https://www.ncbi.nlm.nih.gov/pubmed/18186917 http://dx.doi.org/10.1186/1471-2105-9-12
