
Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification

Modern plant breeding programs collect several data types, such as weather, images, and secondary or associated traits, in addition to the main trait (e.g., grain yield). Genomic data are high-dimensional and often crowd out smaller data types when naively combined to explain the response variable. There is...


Bibliographic Details
Main authors: Manthena, Vamsi, Jarquín, Diego, Howard, Reka
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10090538/
https://www.ncbi.nlm.nih.gov/pubmed/37065625
http://dx.doi.org/10.3389/fgene.2022.1032691
author Manthena, Vamsi
Jarquín, Diego
Howard, Reka
collection PubMed
description Modern plant breeding programs collect several data types, such as weather, images, and secondary or associated traits, in addition to the main trait (e.g., grain yield). Genomic data are high-dimensional and often crowd out smaller data types when naively combined to explain the response variable. Methods are needed that can effectively combine data types of differing sizes to improve predictions. Additionally, under changing climate conditions, methods are needed that can effectively combine weather information with genotype data to better predict the performance of lines. In this work, we develop a novel three-stage classifier that predicts multi-class traits by combining three data types: genomic, weather, and secondary trait. The method addresses several challenges in this problem, including confounding, differing sizes of data types, and threshold optimization. It was examined in different settings, including binary and multi-class responses, various penalization schemes, and class balances. We then compared it to standard machine learning methods such as random forests and support vector machines, using various classification accuracy metrics and using model size to evaluate the sparsity of the model. The results showed that our method performed similarly to or better than these machine learning methods across various settings. More importantly, the classifiers obtained were highly sparse, allowing for a straightforward interpretation of relationships between the response and the selected predictors.
format Online
Article
Text
id pubmed-10090538
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10090538; 2023-04-13; Front Genet; Genetics; Frontiers Media S.A.; 2023-03-29; /pmc/articles/PMC10090538/; /pubmed/37065625; http://dx.doi.org/10.3389/fgene.2022.1032691; Text; en
Copyright © 2023 Manthena, Jarquín and Howard. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY), https://creativecommons.org/licenses/by/4.0/. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10090538/
https://www.ncbi.nlm.nih.gov/pubmed/37065625
http://dx.doi.org/10.3389/fgene.2022.1032691