Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

BACKGROUND: In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods r...

Bibliographic Details
Main Authors:	Liao, Serena G; Lin, Yan; Kang, Dongwan D; Chandra, Divay; Bon, Jessica; Kaminski, Naftali; Sciurba, Frank C; Tseng, George C
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228077/ https://www.ncbi.nlm.nih.gov/pubmed/25371041 http://dx.doi.org/10.1186/s12859-014-0346-6

author	Liao, Serena G; Lin, Yan; Kang, Dongwan D; Chandra, Divay; Bon, Jessica; Kaminski, Naftali; Sciurba, Frank C; Tseng, George C
collection	PubMed
description	BACKGROUND: In modern biomedical research on complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which preclude the application of most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation. RESULTS: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of “imputability measure” (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package “phenomeImpute” is made publicly available.
CONCLUSIONS: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers, although no method universally performed best. Imputing missing values with low imputability measures greatly increased imputation errors and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author’s publication website. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-014-0346-6) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4228077
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4228077 2014-11-12 BMC Bioinformatics Research Article BioMed Central 2014-11-05 /pmc/articles/PMC4228077/ /pubmed/25371041 http://dx.doi.org/10.1186/s12859-014-0346-6 Text en © Liao et al; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Missing value imputation in high-dimensional phenomic data: imputable or not, and how?
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4228077/ https://www.ncbi.nlm.nih.gov/pubmed/25371041 http://dx.doi.org/10.1186/s12859-014-0346-6
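
The abstract describes the STS scheme as evaluating candidate methods in a second layer of missingness simulation. The following is a minimal Python sketch of that idea, not the phenomeImpute implementation: it hides a random share of the observed entries, imputes them with each candidate, and keeps the candidate with the lowest error. Everything named here is an assumption made for illustration: the function names, the parameters frac and n_rep, the restriction to continuous variables, the column-mean baseline, the simplified subject-wise KNN, and the use of a scaled root mean squared error as the error measure.

```python
import numpy as np


def impute_mean(X):
    """Baseline (assumed for illustration): fill gaps with column means."""
    out = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    out[rows, cols] = col_means[cols]
    return out


def impute_knn_subjects(X, k=5):
    """Simplified subject-wise KNN: average the k most similar rows."""
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are the rows in which variable j is observed.
        donors = np.where(~np.isnan(X[:, j]))[0]
        # Distances use only variables observed in both rows.
        shared = ~np.isnan(X[i]) & ~np.isnan(X[donors])
        diff = np.where(shared, X[donors] - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(len(donors), np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
        nearest = donors[np.argsort(dist)[:k]]
        out[i, j] = X[nearest, j].mean()
    return out


def self_training_selection(X, methods, frac=0.1, n_rep=5, seed=0):
    """Pick the method that best recovers artificially hidden entries."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))
    n_mask = max(1, int(frac * len(observed)))
    scale = np.nanstd(X, axis=0)  # puts variables on a comparable scale
    errors = {name: [] for name in methods}
    for _ in range(n_rep):
        # Second layer of missingness: hide entries whose truth is known.
        pick = observed[rng.choice(len(observed), n_mask, replace=False)]
        r, c = pick[:, 0], pick[:, 1]
        X_masked = X.copy()
        X_masked[r, c] = np.nan
        for name, impute in methods.items():
            estimate = impute(X_masked)
            err = (estimate[r, c] - X[r, c]) / scale[c]
            errors[name].append(np.sqrt(np.mean(err ** 2)))
    mean_err = {name: float(np.mean(v)) for name, v in errors.items()}
    best = min(mean_err, key=mean_err.get)
    return best, mean_err
```

Calling self_training_selection(X, {"mean": impute_mean, "knn_subjects": impute_knn_subjects}) on a numeric matrix X with NaN for missing values returns the name of the selected method and the average error of each candidate. Mixed nominal, binary and ordinal variables, which the paper targets, would need type-specific distances and error measures that this sketch omits.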
