
Efficient genomic prediction based on whole-genome sequence data using split-and-merge Bayesian variable selection

BACKGROUND: Use of whole-genome sequence data is expected to increase persistency of genomic prediction across generations and breeds but affects model performance and requires increased computing time. In this study, we investigated whether the split-and-merge Bayesian stochastic search variable se...


Bibliographic Details
Main Authors:	Calus, Mario P. L., Bouwman, Aniek C., Schrooten, Chris, Veerkamp, Roel F.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926307/ https://www.ncbi.nlm.nih.gov/pubmed/27357580 http://dx.doi.org/10.1186/s12711-016-0225-x

_version_	1782440084809711616
author	Calus, Mario P. L. Bouwman, Aniek C. Schrooten, Chris Veerkamp, Roel F.
collection	PubMed
description	BACKGROUND: Use of whole-genome sequence data is expected to increase persistency of genomic prediction across generations and breeds but affects model performance and requires increased computing time. In this study, we investigated whether the split-and-merge Bayesian stochastic search variable selection (BSSVS) model could overcome these issues. BSSVS is performed first on subsets of sequence-based variants and then on a merged dataset containing variants selected in the first step. RESULTS: We used a dataset that included 4,154,064 variants after editing and de-regressed proofs for 3415 reference and 2138 validation bulls for somatic cell score, protein yield and interval first to last insemination. In the first step, BSSVS was performed on 106 subsets each containing ~39,189 variants. In the second step, 1060 up to 472,492 variants, selected from the first step, were included to estimate the accuracy of genomic prediction. Accuracies were at best equal to those achieved with the commonly used Bovine 50k-SNP chip, although the number of variants within a few well-known quantitative trait loci regions was considerably enriched. When variant selection and the final genomic prediction were performed on the same data, predictions were biased. Predictions computed as the average of the predictions computed for each subset achieved the highest accuracies, i.e. 0.5 to 1.1 % higher than the accuracies obtained with the 50k-SNP chip, and yielded the least biased predictions. Finally, the accuracy of genomic predictions obtained when all sequence-based variants were included was similar or up to 1.4 % lower compared to that based on the average predictions across the subsets. By applying parallelization, the split-and-merge procedure was completed in 5 days, while the standard analysis including all sequence-based variants took more than three months. 
CONCLUSIONS: The split-and-merge approach splits one large computational task into many much smaller ones, which allows the use of parallel processing and thus efficient genomic prediction based on whole-genome sequence data. The split-and-merge approach did not improve prediction accuracy, probably because we used data on a single breed for which relationships between individuals were high. Nevertheless, the split-and-merge approach may have potential for applications on data from multiple breeds. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12711-016-0225-x) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4926307
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4926307 2016-06-29 Genet Sel Evol Research Article BioMed Central /pmc/articles/PMC4926307/ /pubmed/27357580 http://dx.doi.org/10.1186/s12711-016-0225-x Text en © The Author(s) 2016. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Efficient genomic prediction based on whole-genome sequence data using split-and-merge Bayesian variable selection
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926307/ https://www.ncbi.nlm.nih.gov/pubmed/27357580 http://dx.doi.org/10.1186/s12711-016-0225-x
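
The split-and-merge procedure summarized in the description field can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`split_indices`, `marginal_score`, `split_and_merge`) and the `top_per_subset` parameter are hypothetical, and the per-subset selection step uses a simple absolute-covariance ranking as a stand-in for BSSVS, which the record does not describe in enough detail to reproduce. Only the counts (4,154,064 variants, 106 subsets) are taken from the abstract.

```python
def split_indices(n_variants, n_subsets):
    """Split range(n_variants) into n_subsets contiguous, near-equal chunks."""
    base, extra = divmod(n_variants, n_subsets)
    chunks, start = [], 0
    for i in range(n_subsets):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks


def marginal_score(genotypes, phenotypes, j):
    """Stand-in for BSSVS (assumption): |covariance| of variant j with the phenotype."""
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    mean_g = sum(row[j] for row in genotypes) / n
    cov = sum((row[j] - mean_g) * (y - mean_y)
              for row, y in zip(genotypes, phenotypes)) / n
    return abs(cov)


def split_and_merge(genotypes, phenotypes, n_subsets, top_per_subset):
    """Step 1: select variants within each subset. Step 2: merge the selections."""
    n_variants = len(genotypes[0])
    merged = []
    for chunk in split_indices(n_variants, n_subsets):
        # Each chunk is independent, so this loop could run in parallel.
        ranked = sorted(chunk,
                        key=lambda j: marginal_score(genotypes, phenotypes, j),
                        reverse=True)
        merged.extend(ranked[:top_per_subset])
    return sorted(merged)


# Subset sizes implied by the counts reported in the abstract.
sizes = [len(c) for c in split_indices(4_154_064, 106)]
print(min(sizes), max(sizes))  # 39189 39190
```

The merged variant list would then feed a second model fit on the reduced dataset. Because each subset is processed independently, the first step is where parallel processing applies, which is consistent with the runtime reduction the abstract reports.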

