
Using penalized regression to predict phenotype from SNP data

BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome t...


Bibliographic Details
Main Authors:	Cherlin, Svetlana; Howey, Richard A. J.; Cordell, Heather J.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157193/ https://www.ncbi.nlm.nih.gov/pubmed/30275888 http://dx.doi.org/10.1186/s12919-018-0149-2

_version_	1783358230754230272
author	Cherlin, Svetlana; Howey, Richard A. J.; Cordell, Heather J.
author_sort	Cherlin, Svetlana
collection	PubMed
description	BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome this problem, penalized regression methods have been proposed, aiming at shrinking the coefficients toward zero. METHODS: We explore prediction of phenotype from single nucleotide polymorphism (SNP) data in the GAW20 data set using a penalized regression approach (LASSO [least absolute shrinkage and selection operator] regression). We use 10-fold cross-validation to assess predictive performance and 10-fold nested cross-validation to specify a penalty parameter. RESULTS: By analyzing approximately 600,000 SNPs we find that, when the sample size comprises a few hundred individuals, SNP effects are heavily penalized, resulting in a poor predictive performance. Increasing the sample size to a few thousand individuals results in a much smaller penalization of the true effects, thus greatly improving the prediction. CONCLUSIONS: LASSO regression results in a heavy shrinkage of the regression coefficients, and also requires large sample sizes (several thousand individuals) to achieve good prediction.
format	Online Article Text
id	pubmed-6157193
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6157193 2018-10-01 Using penalized regression to predict phenotype from SNP data Cherlin, Svetlana; Howey, Richard A. J.; Cordell, Heather J. BMC Proc Proceedings BACKGROUND: In a typical genome-enabled prediction problem there are many more predictor variables than response variables. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome this problem, penalized regression methods have been proposed, aiming at shrinking the coefficients toward zero. METHODS: We explore prediction of phenotype from single nucleotide polymorphism (SNP) data in the GAW20 data set using a penalized regression approach (LASSO [least absolute shrinkage and selection operator] regression). We use 10-fold cross-validation to assess predictive performance and 10-fold nested cross-validation to specify a penalty parameter. RESULTS: By analyzing approximately 600,000 SNPs we find that, when the sample size comprises a few hundred individuals, SNP effects are heavily penalized, resulting in a poor predictive performance. Increasing the sample size to a few thousand individuals results in a much smaller penalization of the true effects, thus greatly improving the prediction. CONCLUSIONS: LASSO regression results in a heavy shrinkage of the regression coefficients, and also requires large sample sizes (several thousand individuals) to achieve good prediction. BioMed Central 2018-09-17 /pmc/articles/PMC6157193/ /pubmed/30275888 http://dx.doi.org/10.1186/s12919-018-0149-2 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Using penalized regression to predict phenotype from SNP data
topic	Proceedings
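
The procedure summarised in the description field (LASSO regression, 10-fold cross-validation to assess prediction, and nested 10-fold cross-validation to choose the penalty parameter) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the genotypes and phenotype are simulated rather than taken from GAW20, the solver is a plain coordinate-descent LASSO written in NumPy rather than the software the authors used, and the names `lasso_cd`, `cv_mse`, `nested_cv` and the penalty grid are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=30, tol=1e-6):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, resid = X - x_mean, y - y_mean          # centring absorbs the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:                # SNP with no variation in this fold
                continue
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding: this is where the penalty shrinks effects to zero.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= Xc[:, j] * delta
                beta[j] = new
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return y_mean - x_mean @ beta, beta

def kfold_indices(n, k, rng):
    """Split a random permutation of 0..n-1 into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def cv_mse(X, y, lam, k, rng):
    """Mean squared prediction error of one penalty value under k-fold CV."""
    errs = []
    for test in kfold_indices(len(y), k, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        b0, b = lasso_cd(X[train], y[train], lam)
        errs.append(np.mean((y[test] - b0 - X[test] @ b) ** 2))
    return float(np.mean(errs))

def nested_cv(X, y, lambdas, k_outer=10, k_inner=10, seed=0):
    """Outer folds assess prediction; inner folds (training part only) pick lam."""
    rng = np.random.default_rng(seed)
    preds, chosen = np.empty(len(y)), []
    for test in kfold_indices(len(y), k_outer, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        inner = [cv_mse(X[train], y[train], lam, k_inner, rng) for lam in lambdas]
        best = lambdas[int(np.argmin(inner))]
        chosen.append(best)
        b0, b = lasso_cd(X[train], y[train], best)
        preds[test] = b0 + X[test] @ b          # held-out individuals only
    return preds, chosen

# Simulated toy data (assumption): 120 individuals, 30 SNPs coded 0/1/2,
# of which the first three have a true effect on the phenotype.
rng = np.random.default_rng(1)
X = rng.binomial(2, 0.3, size=(120, 30)).astype(float)
true_beta = np.zeros(30)
true_beta[:3] = 1.0
y = X @ true_beta + rng.normal(size=120)

lambdas = [0.02, 0.1, 0.3]                      # hypothetical penalty grid
preds, chosen = nested_cv(X, y, lambdas)
print("penalty chosen in each outer fold:", chosen)
print("correlation of held-out predictions with phenotype:",
      np.corrcoef(preds, y)[0, 1])
```

The penalty is selected using only the training part of each outer fold, so the held-out predictions are not influenced by the individuals they are evaluated on. The toy dimensions are far smaller than the roughly 600,000 SNPs mentioned in the description, so the sketch shows the mechanics only and says nothing about the reported results.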
