A comparison of methods for training population optimization in genomic selection

KEY MESSAGE: Maximizing CDmean and Avg_GRM_self were the best criteria for training set optimization. A training set size of 50–55% (targeted) or 65–85% (untargeted) is needed to obtain 95% of the accuracy.  ABSTRACT: With the advent of genomic selection (GS) as a widespread breeding tool, mechanism...

Bibliographic Details
Main authors: Fernández-González, Javier; Akdemir, Deniz; Isidro y Sánchez, Julio
Format: Online Article Text
Language: English
Published: Springer Berlin Heidelberg 2023
Subjects: Original Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9998580/
https://www.ncbi.nlm.nih.gov/pubmed/36892603
http://dx.doi.org/10.1007/s00122-023-04265-6
author Fernández-González, Javier
Akdemir, Deniz
Isidro y Sánchez, Julio
collection PubMed
description KEY MESSAGE: Maximizing CDmean and Avg_GRM_self were the best criteria for training set optimization. A training set size of 50–55% (targeted) or 65–85% (untargeted) is needed to obtain 95% of the accuracy. ABSTRACT: With the advent of genomic selection (GS) as a widespread breeding tool, mechanisms to efficiently design an optimal training set for GS models have become more relevant, since they allow maximizing accuracy while minimizing phenotyping costs. The literature describes many training set optimization methods, but a comprehensive comparison among them is lacking. This work aimed to provide an extensive benchmark of optimization methods and optimal training set sizes by testing a wide range of them in seven datasets covering six species, different genetic architectures, population structures, and heritabilities, and with several GS models, in order to provide guidelines for their application in breeding programs. Our results showed that targeted optimization (which uses information from the test set) performed better than untargeted optimization (which does not use test set data), especially when heritability was low. The mean coefficient of determination (CDmean) was the best targeted method, although it was computationally intensive. Minimizing the average relationship within the training set was the best strategy for untargeted optimization. Regarding the optimal training set size, maximum accuracy was obtained when the training set was the entire candidate set. Nevertheless, 50–55% of the candidate set was enough to reach 95–100% of the maximum accuracy in the targeted scenario, whereas 65–85% was needed for untargeted optimization. Our results also suggested that a diverse training set makes GS robust against population structure, while including clustering information was less effective. The choice of GS model did not have a significant influence on prediction accuracy.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00122-023-04265-6.
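The untargeted strategy named in the abstract, minimizing the average relationship within the training set, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the article: the record does not define Avg_GRM_self, so the criterion is assumed here to be the mean off-diagonal genomic relationship among the selected individuals; the relationship matrix is built with a VanRaden-style formula; and the greedy search stands in for whatever optimizer the authors used. The function names are hypothetical.

```python
import numpy as np


def vanraden_grm(markers):
    """Genomic relationship matrix from an (individuals x markers) 0/1/2 matrix.

    Assumption: VanRaden-style scaling; the article may use another estimator.
    """
    p = markers.mean(axis=0) / 2.0            # allele frequency per marker
    z = markers - 2.0 * p                     # centre each marker column
    denom = 2.0 * np.sum(p * (1.0 - p))       # scaling constant
    return z @ z.T / denom


def avg_relationship(grm, idx):
    """Mean off-diagonal relationship among the individuals listed in idx."""
    sub = grm[np.ix_(idx, idx)]
    n = len(idx)
    return (sub.sum() - np.trace(sub)) / (n * (n - 1))


def greedy_diverse_set(grm, size, seed=0):
    """Greedy forward selection of a low-relatedness training set.

    Starts from a random individual, then repeatedly adds the candidate whose
    mean relationship to the already chosen individuals is smallest.
    This is a heuristic, not a guaranteed optimum.
    """
    rng = np.random.default_rng(seed)
    n = grm.shape[0]
    chosen = [int(rng.integers(n))]
    remaining = set(range(n)) - set(chosen)
    while len(chosen) < size:
        rest = sorted(remaining)
        scores = grm[np.ix_(rest, chosen)].mean(axis=1)
        best = rest[int(np.argmin(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy usage with simulated genotypes (made-up data, for illustration only).
rng = np.random.default_rng(42)
toy_markers = rng.integers(0, 3, size=(12, 200)).astype(float)
toy_grm = vanraden_grm(toy_markers)
training = greedy_diverse_set(toy_grm, size=6)
print(training, avg_relationship(toy_grm, training))
```

The greedy step only ever looks at relationships among candidates, which matches the untargeted setting described above: no test set information is used. A targeted criterion such as CDmean would additionally need the test set genotypes and is not sketched here.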
format Online
Article
Text
id pubmed-9998580
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer Berlin Heidelberg
record_format MEDLINE/PubMed
spelling pubmed-9998580 2023-03-11 Theor Appl Genet Original Article Springer Berlin Heidelberg 2023-03-09 2023 /pmc/articles/PMC9998580/ /pubmed/36892603 http://dx.doi.org/10.1007/s00122-023-04265-6 Text en
© The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title A comparison of methods for training population optimization in genomic selection
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9998580/
https://www.ncbi.nlm.nih.gov/pubmed/36892603
http://dx.doi.org/10.1007/s00122-023-04265-6