
Genomic prediction in plants: opportunities for ensemble machine learning based approaches

Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might d...


Bibliographic Details
Main authors:	Farooq, Muhammad; van Dijk, Aalt D.J.; Nijveen, Harm; Mansoor, Shahid; de Ridder, Dick
Format:	Online Article Text
Language:	English
Published:	F1000 Research Limited 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10080209/ https://www.ncbi.nlm.nih.gov/pubmed/37035464 http://dx.doi.org/10.12688/f1000research.122437.2

author	Farooq, Muhammad; van Dijk, Aalt D.J.; Nijveen, Harm; Mansoor, Shahid; de Ridder, Dick
collection	PubMed
description	Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability ($h^2$ and $h^2_e$), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
format	Online Article Text
id	pubmed-10080209
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	F1000 Research Limited
record_format	MEDLINE/PubMed
spelling	pubmed-10080209 2023-04-08 Genomic prediction in plants: opportunities for ensemble machine learning based approaches. F1000Res. Research Article. F1000 Research Limited 2023-01-10 /pmc/articles/PMC10080209/ /pubmed/37035464 http://dx.doi.org/10.12688/f1000research.122437.2 Text en Copyright: © 2023 Farooq M et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Genomic prediction in plants: opportunities for ensemble machine learning based approaches
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10080209/ https://www.ncbi.nlm.nih.gov/pubmed/37035464 http://dx.doi.org/10.12688/f1000research.122437.2
