
Multiple imputation and analysis for high‐dimensional incomplete proteomics data

Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort...


Bibliographic Details
Main Authors: Yin, Xiaoyan, Levy, Daniel, Willinger, Christine, Adourian, Aram, Larson, Martin G.
Format: Online Article Text
Language: English
Published: John Wiley and Sons Inc. 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4777663/
https://www.ncbi.nlm.nih.gov/pubmed/26565662
http://dx.doi.org/10.1002/sim.6800
_version_ 1782419333495914496
author Yin, Xiaoyan
Levy, Daniel
Willinger, Christine
Adourian, Aram
Larson, Martin G.
author_facet Yin, Xiaoyan
Levy, Daniel
Willinger, Christine
Adourian, Aram
Larson, Martin G.
author_sort Yin, Xiaoyan
collection PubMed
description Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case–control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction. © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
format Online
Article
Text
id pubmed-4777663
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher John Wiley and Sons Inc.
record_format MEDLINE/PubMed
spelling pubmed-4777663 2016-10-19 Multiple imputation and analysis for high‐dimensional incomplete proteomics data Yin, Xiaoyan Levy, Daniel Willinger, Christine Adourian, Aram Larson, Martin G. Stat Med Research Articles John Wiley and Sons Inc. 2015-11-12 2016-04-15 /pmc/articles/PMC4777663/ /pubmed/26565662 http://dx.doi.org/10.1002/sim.6800 Text en © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs (http://creativecommons.org/licenses/by-nc-nd/4.0/) License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
spellingShingle Research Articles
Yin, Xiaoyan
Levy, Daniel
Willinger, Christine
Adourian, Aram
Larson, Martin G.
Multiple imputation and analysis for high‐dimensional incomplete proteomics data
title Multiple imputation and analysis for high‐dimensional incomplete proteomics data
title_full Multiple imputation and analysis for high‐dimensional incomplete proteomics data
title_fullStr Multiple imputation and analysis for high‐dimensional incomplete proteomics data
title_full_unstemmed Multiple imputation and analysis for high‐dimensional incomplete proteomics data
title_short Multiple imputation and analysis for high‐dimensional incomplete proteomics data
title_sort multiple imputation and analysis for high‐dimensional incomplete proteomics data
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4777663/
https://www.ncbi.nlm.nih.gov/pubmed/26565662
http://dx.doi.org/10.1002/sim.6800
work_keys_str_mv AT yinxiaoyan multipleimputationandanalysisforhighdimensionalincompleteproteomicsdata
AT levydaniel multipleimputationandanalysisforhighdimensionalincompleteproteomicsdata
AT willingerchristine multipleimputationandanalysisforhighdimensionalincompleteproteomicsdata
AT adourianaram multipleimputationandanalysisforhighdimensionalincompleteproteomicsdata
AT larsonmarting multipleimputationandanalysisforhighdimensionalincompleteproteomicsdata
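
Illustrative sketch (not part of the catalogued record): the description field above outlines a procedure in which some complete values are randomly masked for simulation, proteins are randomly shuffled into bins, imputation and stepwise selection are run within each bin, and the whole process is repeated many times. The Python below is a minimal sketch of only the masking, binning, and repetition bookkeeping. All function names, the bin size of 20, and the `select_in_bin` callback are hypothetical; the record does not give the authors' code or their tuned settings, and the imputation and conditional logistic regression steps are deliberately left to the caller.

```python
import random
from collections import Counter


def mask_at_random(matrix, fraction, rng):
    """Return a copy of `matrix` (list of rows) with roughly `fraction`
    of its cells replaced by None, chosen uniformly at random."""
    masked = [row[:] for row in matrix]
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    n_mask = round(fraction * len(cells))
    for i, j in rng.sample(cells, n_mask):
        masked[i][j] = None
    return masked


def shuffle_into_bins(protein_ids, bin_size, rng):
    """Randomly permute the protein identifiers and cut them into
    consecutive bins of at most `bin_size` proteins each."""
    shuffled = list(protein_ids)
    rng.shuffle(shuffled)
    return [shuffled[k:k + bin_size] for k in range(0, len(shuffled), bin_size)]


def tally_selections(protein_ids, bin_size, n_shuffles, select_in_bin, seed=0):
    """Repeat the random binning `n_shuffles` times and count how often
    each protein is returned by `select_in_bin`.

    `select_in_bin` is a caller-supplied placeholder standing in for
    "impute within the bin, then run stepwise selection"; it receives a
    list of protein identifiers and returns the ones it selects.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_shuffles):
        for bin_ids in shuffle_into_bins(protein_ids, bin_size, rng):
            counts.update(select_in_bin(bin_ids))
    return counts


if __name__ == "__main__":
    proteins = [f"P{i}" for i in range(544)]  # 544 proteins, as in the abstract
    # Dummy selector (assumption): keeps the first protein of each bin.
    counts = tally_selections(proteins, bin_size=20, n_shuffles=100,
                              select_in_bin=lambda ids: ids[:1])
    print(counts.most_common(5))
```

Proteins that are selected in many shuffles, regardless of which other proteins share their bin, would be the natural candidates for a final joint panel. How the selection counts are turned into a panel is a design choice the abstract does not specify.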