Efficient Implementation of Penalized Regression for Genetic Risk Prediction
Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk ind...
Main authors: | Privé, Florian; Aschard, Hugues; Blum, Michael G. B. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Genetics Society of America 2019 |
Subjects: | Investigations |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6499521/ https://www.ncbi.nlm.nih.gov/pubmed/30808621 http://dx.doi.org/10.1534/genetics.119.302019 |
_version_ | 1783415803523104768 |
---|---|
author | Privé, Florian Aschard, Hugues Blum, Michael G. B. |
author_facet | Privé, Florian Aschard, Hugues Blum, Michael G. B. |
author_sort | Privé, Florian |
collection | PubMed |
description | Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk individuals for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease where PLR and the standard C+T method achieve AUC values of 89% and of 82.5%. Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in a few minutes only. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr. |
format | Online Article Text |
id | pubmed-6499521 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Genetics Society of America |
record_format | MEDLINE/PubMed |
spelling | pubmed-6499521 2020-05-01 Efficient Implementation of Penalized Regression for Genetic Risk Prediction Privé, Florian Aschard, Hugues Blum, Michael G. B. Genetics Investigations Genetics Society of America 2019-05 2019-02-26 /pmc/articles/PMC6499521/ /pubmed/30808621 http://dx.doi.org/10.1534/genetics.119.302019 Text en Copyright © 2019 Privé et al. Available freely online through the author-supported open access option. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Investigations Privé, Florian Aschard, Hugues Blum, Michael G. B. Efficient Implementation of Penalized Regression for Genetic Risk Prediction |
title | Efficient Implementation of Penalized Regression for Genetic Risk Prediction |
title_full | Efficient Implementation of Penalized Regression for Genetic Risk Prediction |
title_fullStr | Efficient Implementation of Penalized Regression for Genetic Risk Prediction |
title_full_unstemmed | Efficient Implementation of Penalized Regression for Genetic Risk Prediction |
title_short | Efficient Implementation of Penalized Regression for Genetic Risk Prediction |
title_sort | efficient implementation of penalized regression for genetic risk prediction |
topic | Investigations |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6499521/ https://www.ncbi.nlm.nih.gov/pubmed/30808621 http://dx.doi.org/10.1534/genetics.119.302019 |
work_keys_str_mv | AT priveflorian efficientimplementationofpenalizedregressionforgeneticriskprediction AT aschardhugues efficientimplementationofpenalizedregressionforgeneticriskprediction AT blummichaelgb efficientimplementationofpenalizedregressionforgeneticriskprediction |