
Using penalized regression to predict phenotype from SNP data

BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome t...


Bibliographic Details
Main Authors: Cherlin, Svetlana; Howey, Richard A. J.; Cordell, Heather J.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157193/
https://www.ncbi.nlm.nih.gov/pubmed/30275888
http://dx.doi.org/10.1186/s12919-018-0149-2
author Cherlin, Svetlana
Howey, Richard A. J.
Cordell, Heather J.
collection PubMed
description BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome this problem, penalized regression methods have been proposed, aiming at shrinking the coefficients toward zero. METHODS: We explore prediction of phenotype from single nucleotide polymorphism (SNP) data in the GAW20 data set using a penalized regression approach (LASSO [least absolute shrinkage and selection operator] regression). We use 10-fold cross-validation to assess predictive performance and 10-fold nested cross-validation to specify a penalty parameter. RESULTS: By analyzing approximately 600,000 SNPs we find that, when the sample size comprises a few hundred individuals, SNP effects are heavily penalized, resulting in a poor predictive performance. Increasing the sample size to a few thousand individuals results in a much smaller penalization of the true effects, thus greatly improving the prediction. CONCLUSIONS: LASSO regression results in a heavy shrinkage of the regression coefficients, and also requires large sample sizes (several thousand individuals) to achieve good prediction.
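The procedure described under METHODS (LASSO regression, with an outer cross-validation loop to assess prediction and an inner, nested loop to choose the penalty parameter) can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation: the coordinate-descent solver, the synthetic genotype data, the penalty grid, and all function names are assumptions made for the example. The demo uses 5 folds and 10 simulated SNPs to keep the run time short, whereas the study used 10 folds and approximately 600,000 SNPs.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic column carries no information
            # Covariance of column j with the residual that excludes j's own effect
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_b
    return beta

def predict(X, beta):
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def kfold(indices, k):
    """Return k (train, test) index pairs covering all indices."""
    folds = [indices[i::k] for i in range(k)]
    return [([i for f in folds[:m] + folds[m + 1:] for i in f], folds[m])
            for m in range(k)]

def cv_error(X, y, lam, k):
    """Mean held-out MSE of the LASSO fit with penalty lam over k folds."""
    errs = []
    for train, test in kfold(list(range(len(y))), k):
        beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        errs.append(mse([y[i] for i in test], predict([X[i] for i in test], beta)))
    return sum(errs) / len(errs)

def nested_cv(X, y, lambdas, k_outer=10, k_inner=10):
    """Outer folds estimate prediction error; the penalty is tuned by an
    inner CV that only ever sees the outer training set."""
    outer = []
    for train, test in kfold(list(range(len(y))), k_outer):
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        lam = min(lambdas, key=lambda l: cv_error(X_tr, y_tr, l, k_inner))
        beta = lasso_fit(X_tr, y_tr, lam)
        outer.append(mse([y[i] for i in test], predict([X[i] for i in test], beta)))
    return sum(outer) / len(outer)

# Synthetic toy data (assumption): 40 individuals, 10 SNPs coded -1/0/1,
# of which only the first two have a true effect on the phenotype.
rng = random.Random(0)
n, p = 40, 10
X = [[(rng.random() < 0.5) + (rng.random() < 0.5) - 1.0 for _ in range(p)]
     for _ in range(n)]
y = [row[0] - row[1] + rng.gauss(0.0, 0.5) for row in X]

err = nested_cv(X, y, lambdas=[0.01, 0.1, 0.5], k_outer=5, k_inner=5)
print("nested-CV mean squared error:", err)
```

Tuning the penalty inside each outer training set, rather than on the full data, keeps the outer error estimate from being optimistically biased by the choice of penalty. The soft-thresholding step is what shrinks coefficients toward zero and sets small ones exactly to zero.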
format Online
Article
Text
id pubmed-6157193
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6157193 2018-10-01 Using penalized regression to predict phenotype from SNP data Cherlin, Svetlana Howey, Richard A. J. Cordell, Heather J. BMC Proc Proceedings BioMed Central 2018-09-17 /pmc/articles/PMC6157193/ /pubmed/30275888 http://dx.doi.org/10.1186/s12919-018-0149-2 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Using penalized regression to predict phenotype from SNP data
topic Proceedings