Using penalized regression to predict phenotype from SNP data
BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome t...
Main authors: | Cherlin, Svetlana; Howey, Richard A. J.; Cordell, Heather J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157193/ https://www.ncbi.nlm.nih.gov/pubmed/30275888 http://dx.doi.org/10.1186/s12919-018-0149-2 |
_version_ | 1783358230754230272 |
---|---|
author | Cherlin, Svetlana Howey, Richard A. J. Cordell, Heather J. |
author_facet | Cherlin, Svetlana Howey, Richard A. J. Cordell, Heather J. |
author_sort | Cherlin, Svetlana |
collection | PubMed |
description | BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome this problem, penalized regression methods have been proposed, aiming at shrinking the coefficients toward zero. METHODS: We explore prediction of phenotype from single nucleotide polymorphism (SNP) data in the GAW20 data set using a penalized regression approach (LASSO [least absolute shrinkage and selection operator] regression). We use 10-fold cross-validation to assess predictive performance and 10-fold nested cross-validation to specify a penalty parameter. RESULTS: By analyzing approximately 600,000 SNPs we find that, when the sample size comprises a few hundred individuals, SNP effects are heavily penalized, resulting in a poor predictive performance. Increasing the sample size to a few thousand individuals results in a much smaller penalization of the true effects, thus greatly improving the prediction. CONCLUSIONS: LASSO regression results in a heavy shrinkage of the regression coefficients, and also requires large sample sizes (several thousand individuals) to achieve good prediction. |
format | Online Article Text |
id | pubmed-6157193 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-61571932018-10-01 Using penalized regression to predict phenotype from SNP data Cherlin, Svetlana Howey, Richard A. J. Cordell, Heather J. BMC Proc Proceedings BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome this problem, penalized regression methods have been proposed, aiming at shrinking the coefficients toward zero. METHODS: We explore prediction of phenotype from single nucleotide polymorphism (SNP) data in the GAW20 data set using a penalized regression approach (LASSO [least absolute shrinkage and selection operator] regression). We use 10-fold cross-validation to assess predictive performance and 10-fold nested cross-validation to specify a penalty parameter. RESULTS: By analyzing approximately 600,000 SNPs we find that, when the sample size comprises a few hundred individuals, SNP effects are heavily penalized, resulting in a poor predictive performance. Increasing the sample size to a few thousand individuals results in a much smaller penalization of the true effects, thus greatly improving the prediction. CONCLUSIONS: LASSO regression results in a heavy shrinkage of the regression coefficients, and also requires large sample sizes (several thousand individuals) to achieve good prediction. BioMed Central 2018-09-17 /pmc/articles/PMC6157193/ /pubmed/30275888 http://dx.doi.org/10.1186/s12919-018-0149-2 Text en © The Author(s). 
2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Proceedings Cherlin, Svetlana Howey, Richard A. J. Cordell, Heather J. Using penalized regression to predict phenotype from SNP data |
title | Using penalized regression to predict phenotype from SNP data |
title_full | Using penalized regression to predict phenotype from SNP data |
title_fullStr | Using penalized regression to predict phenotype from SNP data |
title_full_unstemmed | Using penalized regression to predict phenotype from SNP data |
title_short | Using penalized regression to predict phenotype from SNP data |
title_sort | using penalized regression to predict phenotype from snp data |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157193/ https://www.ncbi.nlm.nih.gov/pubmed/30275888 http://dx.doi.org/10.1186/s12919-018-0149-2 |
work_keys_str_mv | AT cherlinsvetlana usingpenalizedregressiontopredictphenotypefromsnpdata AT howeyrichardaj usingpenalizedregressiontopredictphenotypefromsnpdata AT cordellheatherj usingpenalizedregressiontopredictphenotypefromsnpdata |
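
The METHODS summary in the description field mentions LASSO regression with 10-fold cross-validation for assessing prediction and a nested 10-fold cross-validation for choosing the penalty parameter. The following is a minimal sketch of that general idea, not the authors' pipeline: it assumes a simulated toy data set (60 individuals, 8 SNPs coded 0/1/2), a plain coordinate-descent LASSO solver with the objective scaled as (1/(2n))·RSS + λ·‖β‖₁, and 5 folds instead of 10 to keep the run short. All function names (`soft_threshold`, `lasso_fit`, `nested_cv`, and so on) and the candidate penalty values are hypothetical and chosen for illustration only.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_fit(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n))*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of column j with the partial residual (beta_j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
        # The intercept is unpenalized: absorb the mean residual into it
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
    return intercept, beta


def predict(model, X):
    b0, beta = model
    return [b0 + sum(b * x for b, x in zip(beta, row)) for row in X]


def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)


def folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[f::k] for f in range(k)]


def cv_error(X, y, idx, lam, k):
    """Mean held-out MSE of the LASSO with penalty lam over k folds of idx."""
    total = 0.0
    for test in folds(idx, k):
        held = set(test)
        train = [i for i in idx if i not in held]
        model = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        total += mse([y[i] for i in test], predict(model, [X[i] for i in test]))
    return total / k


def nested_cv(X, y, lambdas, k_outer=10, k_inner=10, seed=0):
    """Outer folds estimate prediction error; inner folds pick the penalty."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    errors, chosen = [], []
    for test in folds(idx, k_outer):
        held = set(test)
        train = [i for i in idx if i not in held]
        # The penalty is tuned using the outer training data only
        best = min(lambdas, key=lambda lam: cv_error(X, y, train, lam, k_inner))
        model = lasso_fit([X[i] for i in train], [y[i] for i in train], best)
        errors.append(mse([y[i] for i in test], predict(model, [X[i] for i in test])))
        chosen.append(best)
    return sum(errors) / k_outer, chosen


# Toy data (an assumption for illustration): only the first two SNPs have an effect
rng = random.Random(1)
X = [[float(rng.choice([0, 1, 2])) for _ in range(8)] for _ in range(60)]
y = [1.0 * row[0] - 0.5 * row[1] + rng.gauss(0.0, 0.5) for row in X]

lambdas = [0.01, 0.1, 0.5]
err, chosen = nested_cv(X, y, lambdas, k_outer=5, k_inner=5)
print("nested-CV mean squared error:", err)
print("penalty chosen in each outer fold:", chosen)
```

Tuning the penalty inside each outer training set keeps the held-out fold untouched by model selection, which is the reason for nesting the two cross-validation loops. A realistic analysis of several hundred thousand SNPs would need an optimized solver; this pure-Python version is only meant to make the procedure concrete.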