A comparison of principal component regression and genomic REML for genomic prediction across populations

BACKGROUND: Genomic prediction faces two main statistical problems: multicollinearity and n ≪ p (many fewer observations than predictor variables). Principal component (PC) analysis is a multivariate statistical method that is often used to address these problems. The objective of this study was to...

Bibliographic Details
Main Authors: Dadousis, Christos; Veerkamp, Roel F; Heringstad, Bjørg; Pszczola, Marcin; Calus, Mario PL
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4220066/
https://www.ncbi.nlm.nih.gov/pubmed/25370926
http://dx.doi.org/10.1186/s12711-014-0060-x
author Dadousis, Christos
Veerkamp, Roel F
Heringstad, Bjørg
Pszczola, Marcin
Calus, Mario PL
collection PubMed
description
BACKGROUND: Genomic prediction faces two main statistical problems: multicollinearity and n ≪ p (many fewer observations than predictor variables). Principal component (PC) analysis is a multivariate statistical method that is often used to address these problems. The objective of this study was to compare the performance of PC regression (PCR) for genomic prediction with that of a commonly used REML model with a genomic relationship matrix (GREML) and to investigate the full potential of PCR for genomic prediction.

METHODS: The PCR model used either a common or a semi-supervised approach, where PC were selected based either on their eigenvalues (i.e. the proportion of variance in the single nucleotide polymorphism (SNP) genotypes that they explain) or on their association with phenotypic variance in the reference population (i.e. their contribution to the regression sum of squares). Cross-validation within the reference population was used to select the optimum PCR model, i.e. the one that minimizes mean squared error. Pre-corrected average daily milk, fat and protein yields of 1609 first-lactation Holstein heifers from Ireland, the UK, the Netherlands and Sweden, which were genotyped with 50k SNPs, were analysed. Each testing subset included animals from only one country, or from only one selection line for the UK.

RESULTS: In general, accuracies of GREML and PCR were similar, but GREML slightly outperformed PCR. Inclusion of genotyping information of validation animals in model training (semi-supervised PCR) did not result in more accurate genomic predictions. The highest achievable PCR accuracies were obtained across a wide range of numbers of PC fitted in the regression (from one to more than 1000), across test populations and traits. Using cross-validation within the reference population to derive the number of PC yielded substantially lower accuracies than the highest achievable accuracies obtained across all possible numbers of PC.

CONCLUSIONS: On average, PCR performed only slightly less well than GREML. When the optimal number of PC was determined based on realized accuracy in the testing population, PCR showed a higher potential in terms of achievable accuracy, which was not capitalized on when PC selection was based on cross-validation. A standard approach for selecting the optimal set of PC in PCR remains a challenge.

ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12711-014-0060-x) contains supplementary material, which is available to authorized users.
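The PC selection and cross-validation steps described in the METHODS part of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pcr_fit_predict`, `choose_k_by_cv`), the random fold assignment, and the use of NumPy's SVD are assumptions made for illustration only. PC are ranked either by eigenvalue or by their contribution to the regression sum of squares, and the number of PC is chosen by minimizing cross-validated mean squared error within the reference data.

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, k, order="eigenvalue"):
    """Principal component regression using k PCs (hypothetical helper).

    order="eigenvalue": keep the k PCs with the largest eigenvalues.
    order="ssr": keep the k PCs with the largest regression sum of squares.
    """
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    # Thin SVD: Xc = U diag(s) Vt, so PC scores are the columns of U * s.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > 1e-10 * s[0]          # drop numerically null components
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    yc = y_train - y_train.mean()
    proj = U.T @ yc
    # Scores are orthogonal, so each PC has its own OLS coefficient.
    gamma = proj / s
    if order == "eigenvalue":
        idx = np.arange(len(s))      # SVD returns singular values in descending order
    else:
        idx = np.argsort(proj ** 2)[::-1]  # per-PC regression sum of squares
    sel = idx[:k]
    beta = Vt[sel].T @ gamma[sel]    # map PC coefficients back to SNP effects
    return y_train.mean() + (X_test - mu) @ beta

def choose_k_by_cv(X, y, ks, n_folds=5, order="eigenvalue", seed=0):
    """Pick the number of PCs that minimizes cross-validated MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mse = np.zeros(len(ks))
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        for i, k in enumerate(ks):
            pred = pcr_fit_predict(X[train], y[train], X[fold], k, order)
            mse[i] += np.mean((y[fold] - pred) ** 2) / n_folds
    return ks[int(np.argmin(mse))], mse
```

In this sketch, a semi-supervised variant would compute the SVD on the genotypes of reference and validation animals together while still estimating the regression coefficients from reference phenotypes only; that variant is not shown here.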
format Online
Article
Text
id pubmed-4220066
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4220066 2014-11-07 A comparison of principal component regression and genomic REML for genomic prediction across populations. Dadousis, Christos; Veerkamp, Roel F; Heringstad, Bjørg; Pszczola, Marcin; Calus, Mario PL. Genet Sel Evol. Research. BioMed Central 2014-11-05 /pmc/articles/PMC4220066/ /pubmed/25370926 http://dx.doi.org/10.1186/s12711-014-0060-x Text en © Dadousis et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A comparison of principal component regression and genomic REML for genomic prediction across populations
topic Research