Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods

The analysis of large genomic data is hampered by issues such as a small number of observations and a large number of predictive variables (commonly known as “large P small N”), high dimensionality or highly correlated data structures. Machine learning methods are renowned for dealing with these pro...

Bibliographic Details
Main authors:	Li, Bo; Zhang, Nanxi; Wang, You-Gan; George, Andrew W.; Reverter, Antonio; Li, Yutao
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2018
Subjects:	Genetics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6039760/ https://www.ncbi.nlm.nih.gov/pubmed/30023001 http://dx.doi.org/10.3389/fgene.2018.00237

author	Li, Bo; Zhang, Nanxi; Wang, You-Gan; George, Andrew W.; Reverter, Antonio; Li, Yutao
collection	PubMed
description	The analysis of large genomic data is hampered by issues such as a small number of observations and a large number of predictive variables (commonly known as “large P small N”), high dimensionality or highly correlated data structures. Machine learning methods are renowned for dealing with these problems. To date, machine learning methods have been applied in Genome-Wide Association Studies for identification of candidate genes, epistasis detection, gene network pathway analyses and genomic prediction of phenotypic values. However, the utility of two machine learning methods, Gradient Boosting Machine (GBM) and Extreme Gradient Boosting Method (XgBoost), in identifying a subset of SNP markers for genomic prediction of breeding values has not previously been explored. In this study, using 38,082 SNP markers and body weight phenotypes from 2,093 Brahman cattle (1,097 bulls as a discovery population and 996 cows as a validation population), we examined the efficiency of three machine learning methods, namely Random Forests (RF), GBM and XgBoost, in (a) identifying the top 400, 1,000, and 3,000 ranked SNPs; and (b) using these subsets of SNPs to construct genomic relationship matrices (GRMs) for the estimation of genomic breeding values (GEBVs). For comparison purposes, we also calculated the GEBVs from (1) 400, 1,000, and 3,000 SNPs that were randomly selected and evenly spaced across the genome, and (2) all the SNPs. We found that RF and especially GBM are efficient methods for identifying a subset of SNPs with direct links to candidate genes affecting the growth trait. The prediction accuracy of GEBVs from the top 3,000 SNPs identified by RF (0.42) and GBM (0.46) was similar to that obtained from all SNPs (0.43). The performance of the subsets of SNPs from RF and GBM was substantially better than that of evenly spaced subsets across the genome (0.18–0.29). Of the three methods, RF and GBM consistently outperformed XgBoost in genomic prediction accuracy.
format	Online Article Text
id	pubmed-6039760
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-6039760 2018-07-18 Front Genet Genetics Frontiers Media S.A. 2018-07-04 /pmc/articles/PMC6039760/ /pubmed/30023001 http://dx.doi.org/10.3389/fgene.2018.00237 Text en Copyright © 2018 Li, Zhang, Wang, George, Reverter and Li. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6039760/ https://www.ncbi.nlm.nih.gov/pubmed/30023001 http://dx.doi.org/10.3389/fgene.2018.00237

Similar Items