Multiple imputation and analysis for high‐dimensional incomplete proteomics data
Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort...
Main authors: | Yin, Xiaoyan; Levy, Daniel; Willinger, Christine; Adourian, Aram; Larson, Martin G. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | John Wiley and Sons Inc. 2015 |
Subjects: | Research Articles |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4777663/ https://www.ncbi.nlm.nih.gov/pubmed/26565662 http://dx.doi.org/10.1002/sim.6800 |
_version_ | 1782419333495914496 |
---|---|
author | Yin, Xiaoyan Levy, Daniel Willinger, Christine Adourian, Aram Larson, Martin G. |
author_sort | Yin, Xiaoyan |
collection | PubMed |
description | Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case–control study of 135 incident cases of myocardial infarction and 135 pair‐matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case–control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction. © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. |
format | Online Article Text |
id | pubmed-4777663 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | John Wiley and Sons Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-4777663 2016-10-19 Stat Med Research Articles John Wiley and Sons Inc. 2015-11-12 2016-04-15 /pmc/articles/PMC4777663/ /pubmed/26565662 http://dx.doi.org/10.1002/sim.6800 Text en © 2015 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs (http://creativecommons.org/licenses/by-nc-nd/4.0/) License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made. |
title | Multiple imputation and analysis for high‐dimensional incomplete proteomics data |
topic | Research Articles |
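
The description field above outlines two mechanical steps that are easy to illustrate: randomly masking values in the complete-data proteins to build a simulation, and randomly shuffling proteins into bins before imputing and selecting within each bin. The following is only a minimal sketch of those two steps, not the authors' implementation. The function names, the bin size of 20, the masking fraction of 0.2, and the toy matrix are all hypothetical; the record does not state the values the authors settled on. Multiple imputation and conditional logistic regression are deliberately left out.

```python
import random


def shuffle_into_bins(protein_ids, bin_size, rng):
    """Randomly partition protein IDs into bins of at most `bin_size` items."""
    shuffled = list(protein_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + bin_size] for i in range(0, len(shuffled), bin_size)]


def mask_at_random(matrix, fraction, rng):
    """Return a copy of `matrix` (rows = samples, columns = proteins) in which
    about `fraction` of the cells, chosen uniformly at random, are set to None."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    n_mask = round(fraction * len(cells))
    masked = [list(row) for row in matrix]
    for r, c in rng.sample(cells, n_mask):
        masked[r][c] = None
    return masked


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# 261 proteins with complete data, as in the description; the IDs are made up.
proteins = [f"protein_{k}" for k in range(261)]
bins = shuffle_into_bins(proteins, bin_size=20, rng=rng)  # hypothetical bin size

# Toy 10 x 5 expression matrix standing in for real measurements.
toy = [[float(r * 5 + c) for c in range(5)] for r in range(10)]
toy_masked = mask_at_random(toy, fraction=0.2, rng=rng)  # hypothetical fraction
n_missing = sum(cell is None for row in toy_masked for cell in row)
```

In the workflow the description summarizes, each bin would then be imputed and analyzed separately, and the shuffle would be repeated many times with fresh random partitions, so a loop over different seeds or repeated calls to `shuffle_into_bins` would wrap the steps shown here.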