A comparison of principal component regression and genomic REML for genomic prediction across populations
BACKGROUND: Genomic prediction faces two main statistical problems: multicollinearity and n ≪ p (many fewer observations than predictor variables). Principal component (PC) analysis is a multivariate statistical method that is often used to address these problems. The objective of this study was to...
Main authors: | Dadousis, Christos; Veerkamp, Roel F; Heringstad, Bjørg; Pszczola, Marcin; Calus, Mario PL |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4220066/ https://www.ncbi.nlm.nih.gov/pubmed/25370926 http://dx.doi.org/10.1186/s12711-014-0060-x |
_version_ | 1782342688311345152 |
---|---|
author | Dadousis, Christos; Veerkamp, Roel F; Heringstad, Bjørg; Pszczola, Marcin; Calus, Mario PL |
author_sort | Dadousis, Christos |
collection | PubMed |
description | BACKGROUND: Genomic prediction faces two main statistical problems: multicollinearity and n ≪ p (many fewer observations than predictor variables). Principal component (PC) analysis is a multivariate statistical method that is often used to address these problems. The objective of this study was to compare the performance of PC regression (PCR) for genomic prediction with that of a commonly used REML model with a genomic relationship matrix (GREML) and to investigate the full potential of PCR for genomic prediction. METHODS: The PCR model used either a common or a semi-supervised approach, where PC were selected based either on their eigenvalues (i.e. proportion of variance explained by SNP (single nucleotide polymorphism) genotypes) or on their association with phenotypic variance in the reference population (i.e. the regression sum of squares contribution). Cross-validation within the reference population was used to select the optimum PCR model that minimizes mean squared error. Pre-corrected average daily milk, fat and protein yields of 1609 first lactation Holstein heifers, from Ireland, UK, the Netherlands and Sweden, which were genotyped with 50 k SNPs, were analysed. Each testing subset included animals from only one country, or from only one selection line for the UK. RESULTS: In general, accuracies of GREML and PCR were similar but GREML slightly outperformed PCR. Inclusion of genotyping information of validation animals into model training (semi-supervised PCR) did not result in more accurate genomic predictions. The highest achievable PCR accuracies were obtained across a wide range of numbers of PC fitted in the regression (from one to more than 1000), across test populations and traits. Using cross-validation within the reference population to derive the number of PC yielded substantially lower accuracies than the highest achievable accuracies obtained across all possible numbers of PC. CONCLUSIONS: On average, PCR performed only slightly less well than GREML. When the optimal number of PC was determined based on realized accuracy in the testing population, PCR showed a higher potential in terms of achievable accuracy that was not capitalized when PC selection was based on cross-validation. A standard approach for selecting the optimal set of PC in PCR remains a challenge. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12711-014-0060-x) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4220066 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4220066 2014-11-07 A comparison of principal component regression and genomic REML for genomic prediction across populations Dadousis, Christos; Veerkamp, Roel F; Heringstad, Bjørg; Pszczola, Marcin; Calus, Mario PL Genet Sel Evol Research BioMed Central 2014-11-05 /pmc/articles/PMC4220066/ /pubmed/25370926 http://dx.doi.org/10.1186/s12711-014-0060-x Text en © Dadousis et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | A comparison of principal component regression and genomic REML for genomic prediction across populations |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4220066/ https://www.ncbi.nlm.nih.gov/pubmed/25370926 http://dx.doi.org/10.1186/s12711-014-0060-x |
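
The METHODS summary in the description mentions two ways of ordering PC (by eigenvalue, or by regression sum of squares contribution) and cross-validation within the reference population to pick the number of PC that minimizes mean squared error. A minimal sketch of that idea follows, assuming NumPy and simulated 0/1/2 genotypes. All function names, the fold count, and the candidate numbers of PC are hypothetical choices made for illustration; this is not the authors' implementation, and the semi-supervised variant (PC computed with validation genotypes included) is not shown.

```python
import numpy as np

def pc_scores(X):
    """Centre X; return PC scores (n x r), loadings (p x r) and column means."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    keep = s > 1e-10 * s[0]              # drop numerically null components
    return U[:, keep] * s[keep], Vt[keep].T, mu

def rank_pcs(T, y, by="eigenvalue"):
    """Order PCs by variance explained, or by regression sum of squares."""
    if by == "eigenvalue":
        return np.arange(T.shape[1])     # SVD already sorts by singular value
    yc = y - y.mean()
    ss = (T.T @ yc) ** 2 / (T ** 2).sum(axis=0)   # SS contribution of each PC
    return np.argsort(ss)[::-1]

def fit_pcr(X, y, k, by="eigenvalue"):
    """Regress y on the first k PCs under the chosen ordering."""
    T, V, mu = pc_scores(X)
    idx = rank_pcs(T, y, by)[:k]
    Tk = T[:, idx]
    # PC scores are mutually orthogonal, so each coefficient is a simple ratio.
    gamma = (Tk.T @ (y - y.mean())) / (Tk ** 2).sum(axis=0)
    beta = V[:, idx] @ gamma             # back-transform to SNP effects
    return mu, beta, y.mean()

def predict_pcr(model, X):
    mu, beta, intercept = model
    return intercept + (X - mu) @ beta

def cv_select_k(X, y, ks, by="eigenvalue", folds=5, seed=0):
    """Return the k in ks with the lowest cross-validated MSE, and all MSEs."""
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(y)) % folds
    mse = np.zeros(len(ks))
    for f in range(folds):
        tr, te = fold_id != f, fold_id == f
        for j, k in enumerate(ks):
            model = fit_pcr(X[tr], y[tr], k, by)
            mse[j] += np.mean((y[te] - predict_pcr(model, X[te])) ** 2) / folds
    return ks[int(np.argmin(mse))], mse

# Simulated example (not the study data): 120 animals, 300 SNPs coded 0/1/2.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(120, 300)).astype(float)
y = X[:, :10] @ rng.normal(size=10) + rng.normal(size=120)
ks = [1, 2, 5, 10, 20, 40]
best_k, mse = cv_select_k(X, y, ks, by="eigenvalue")
best_k_ss, mse_ss = cv_select_k(X, y, ks, by="regression_ss")
```

Because the ordering by regression sum of squares uses the phenotypes, the sketch recomputes it inside each training fold so that held-out records do not influence which PC are selected.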