Multiple imputation and analysis for high‐dimensional incomplete proteomics data

Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort...

Bibliographic Details
Main authors:	Yin, Xiaoyan, Levy, Daniel, Willinger, Christine, Adourian, Aram, Larson, Martin G.
Format:	Online Article Text
Language:	English
Published:	John Wiley and Sons Inc. 2015
Subjects:	Research Articles
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4777663/ https://www.ncbi.nlm.nih.gov/pubmed/26565662 http://dx.doi.org/10.1002/sim.6800

_version_	1782419333495914496
author	Yin, Xiaoyan Levy, Daniel Willinger, Christine Adourian, Aram Larson, Martin G.
author_facet	Yin, Xiaoyan Levy, Daniel Willinger, Christine Adourian, Aram Larson, Martin G.
author_sort	Yin, Xiaoyan
collection	PubMed
description	Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case–control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction. © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
format	Online Article Text
id	pubmed-4777663
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	John Wiley and Sons Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-4777663 2016-10-19 Multiple imputation and analysis for high‐dimensional incomplete proteomics data Yin, Xiaoyan Levy, Daniel Willinger, Christine Adourian, Aram Larson, Martin G. Stat Med Research Articles John Wiley and Sons Inc. 2015-11-12 2016-04-15 /pmc/articles/PMC4777663/ /pubmed/26565662 http://dx.doi.org/10.1002/sim.6800 Text en © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs (http://creativecommons.org/licenses/by-nc-nd/4.0/) License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
spellingShingle	Research Articles Yin, Xiaoyan Levy, Daniel Willinger, Christine Adourian, Aram Larson, Martin G. Multiple imputation and analysis for high‐dimensional incomplete proteomics data
title	Multiple imputation and analysis for high‐dimensional incomplete proteomics data
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4777663/ https://www.ncbi.nlm.nih.gov/pubmed/26565662 http://dx.doi.org/10.1002/sim.6800
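
The description above outlines a repeated shuffle-and-bin procedure: proteins are randomly assigned to bins, each bin is imputed and analysed on its own, and the whole process is repeated many times. The following is a minimal sketch of that outer loop only, not the authors' implementation. The function names, the bin size of 20, the 100 shuffles, and the protein labels are all assumptions made for illustration. The per-bin step (multiple imputation followed by stepwise conditional logistic regression) is represented by a caller-supplied `select_in_bin` function, replaced here by a toy stand-in.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence


def shuffle_into_bins(proteins: Sequence[str], bin_size: int,
                      rng: random.Random) -> List[List[str]]:
    """Randomly permute the proteins and cut the permutation into bins.

    The last bin may be smaller when len(proteins) is not a multiple of
    bin_size.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    shuffled = list(proteins)
    rng.shuffle(shuffled)
    return [shuffled[i:i + bin_size] for i in range(0, len(shuffled), bin_size)]


def selection_frequency(proteins: Sequence[str], bin_size: int, n_shuffles: int,
                        select_in_bin: Callable[[List[str]], List[str]],
                        seed: int = 0) -> Dict[str, float]:
    """Fraction of shuffles in which each protein was selected within its bin.

    select_in_bin is a hypothetical hook: in a real analysis it would impute
    the bin's missing values and run the within-bin variable selection.
    """
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n_shuffles):
        for bin_ in shuffle_into_bins(proteins, bin_size, rng):
            counts.update(select_in_bin(bin_))
    return {p: counts[p] / n_shuffles for p in proteins}


# Toy usage: 544 made-up protein labels, two of which the stand-in
# selector always keeps. Real selection would depend on the data.
proteins = [f"P{i:03d}" for i in range(544)]
signal = {"P001", "P002"}


def toy_selector(bin_: List[str]) -> List[str]:
    return [p for p in bin_ if p in signal]


freq = selection_frequency(proteins, bin_size=20, n_shuffles=100,
                           select_in_bin=toy_selector)
top = sorted(freq, key=freq.get, reverse=True)[:2]
```

Because every protein lands in exactly one bin per shuffle, the returned frequencies lie between 0 and 1 and can be compared across proteins. Ranking proteins by how often they survive within-bin selection is one plausible way to summarise the repeated shuffles; the record does not say which performance statistics the authors used.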
