Cargando…

Genome-wide prediction using Bayesian additive regression trees

BACKGROUND: The goal of genome-wide prediction (GWP) is to predict phenotypes based on marker genotypes, often obtained through single nucleotide polymorphism (SNP) chips. The major problem with GWP is high-dimensional data from many thousands of SNPs scored on several thousands of individuals. A la...

Descripción completa

Detalles Bibliográficos
Autor principal:	Waldmann, Patrik
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2016
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4901500/ https://www.ncbi.nlm.nih.gov/pubmed/27286957 http://dx.doi.org/10.1186/s12711-016-0219-8

_version_	1782436817982717952
author	Waldmann, Patrik
author_facet	Waldmann, Patrik
author_sort	Waldmann, Patrik
collection	PubMed
description	BACKGROUND: The goal of genome-wide prediction (GWP) is to predict phenotypes based on marker genotypes, often obtained through single nucleotide polymorphism (SNP) chips. The major problem with GWP is high-dimensional data from many thousands of SNPs scored on several thousands of individuals. A large number of methods have been developed for GWP, which are mostly parametric methods that assume statistical linearity and only additive genetic effects. The Bayesian additive regression trees (BART) method was recently proposed and is based on the sum of nonparametric regression trees with the priors being used to regularize the parameters. Each regression tree is based on a recursive binary partitioning of the predictor space that approximates an unknown function, which will automatically model nonlinearities within SNPs (dominance) and interactions between SNPs (epistasis). In this study, we introduced BART and compared its predictive performance with that of the LASSO, Bayesian LASSO (BLASSO), genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) regression and random forest (RF) methods. RESULTS: Tests on the QTLMAS2010 simulated data, which are mainly based on additive genetic effects, show that cross-validated optimization of BART provides a smaller prediction error than the RF, BLASSO, GBLUP and RKHS methods, and is almost as accurate as the LASSO method. If dominance and epistasis effects are added to the QTLMAS2010 data, the accuracy of BART relative to the other methods was increased. We also showed that BART can produce importance measures on the SNPs through variable inclusion proportions. In evaluations using real data on pigs, the prediction error was smaller with BART than with the other methods. CONCLUSIONS: BART was shown to be an accurate method for GWP, in which the regression trees guarantee a very sparse representation of additive and complex non-additive genetic effects. Moreover, the Markov chain Monte Carlo algorithm with Bayesian back-fitting provides a computationally efficient procedure that is suitable for high-dimensional genomic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12711-016-0219-8) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4901500
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-49015002016-06-11 Genome-wide prediction using Bayesian additive regression trees Waldmann, Patrik Genet Sel Evol Research Article BACKGROUND: The goal of genome-wide prediction (GWP) is to predict phenotypes based on marker genotypes, often obtained through single nucleotide polymorphism (SNP) chips. The major problem with GWP is high-dimensional data from many thousands of SNPs scored on several thousands of individuals. A large number of methods have been developed for GWP, which are mostly parametric methods that assume statistical linearity and only additive genetic effects. The Bayesian additive regression trees (BART) method was recently proposed and is based on the sum of nonparametric regression trees with the priors being used to regularize the parameters. Each regression tree is based on a recursive binary partitioning of the predictor space that approximates an unknown function, which will automatically model nonlinearities within SNPs (dominance) and interactions between SNPs (epistasis). In this study, we introduced BART and compared its predictive performance with that of the LASSO, Bayesian LASSO (BLASSO), genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) regression and random forest (RF) methods. RESULTS: Tests on the QTLMAS2010 simulated data, which are mainly based on additive genetic effects, show that cross-validated optimization of BART provides a smaller prediction error than the RF, BLASSO, GBLUP and RKHS methods, and is almost as accurate as the LASSO method. If dominance and epistasis effects are added to the QTLMAS2010 data, the accuracy of BART relative to the other methods was increased. We also showed that BART can produce importance measures on the SNPs through variable inclusion proportions. In evaluations using real data on pigs, the prediction error was smaller with BART than with the other methods. CONCLUSIONS: BART was shown to be an accurate method for GWP, in which the regression trees guarantee a very sparse representation of additive and complex non-additive genetic effects. Moreover, the Markov chain Monte Carlo algorithm with Bayesian back-fitting provides a computationally efficient procedure that is suitable for high-dimensional genomic data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12711-016-0219-8) contains supplementary material, which is available to authorized users. BioMed Central 2016-06-10 /pmc/articles/PMC4901500/ /pubmed/27286957 http://dx.doi.org/10.1186/s12711-016-0219-8 Text en © The Author(s) 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Waldmann, Patrik Genome-wide prediction using Bayesian additive regression trees
title	Genome-wide prediction using Bayesian additive regression trees
title_full	Genome-wide prediction using Bayesian additive regression trees
title_fullStr	Genome-wide prediction using Bayesian additive regression trees
title_full_unstemmed	Genome-wide prediction using Bayesian additive regression trees
title_short	Genome-wide prediction using Bayesian additive regression trees
title_sort	genome-wide prediction using bayesian additive regression trees
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4901500/ https://www.ncbi.nlm.nih.gov/pubmed/27286957 http://dx.doi.org/10.1186/s12711-016-0219-8
work_keys_str_mv	AT waldmannpatrik genomewidepredictionusingbayesianadditiveregressiontrees

Genome-wide prediction using Bayesian additive regression trees

Ejemplares similares