Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification
Modern plant breeding programs collect several data types such as weather, images, and secondary or associated traits besides the main trait (e.g., grain yield). Genomic data is high-dimensional and often over-crowds smaller data types when naively combined to explain the response variable. There is...
Main Authors: | Manthena, Vamsi; Jarquín, Diego; Howard, Reka |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2023 |
Subjects: | Genetics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10090538/ https://www.ncbi.nlm.nih.gov/pubmed/37065625 http://dx.doi.org/10.3389/fgene.2022.1032691 |
_version_ | 1785022980968939520 |
---|---|
author | Manthena, Vamsi Jarquín, Diego Howard, Reka |
author_facet | Manthena, Vamsi Jarquín, Diego Howard, Reka |
author_sort | Manthena, Vamsi |
collection | PubMed |
description | Modern plant breeding programs collect several data types such as weather, images, and secondary or associated traits besides the main trait (e.g., grain yield). Genomic data is high-dimensional and often over-crowds smaller data types when naively combined to explain the response variable. There is a need to develop methods able to effectively combine different data types of differing sizes to improve predictions. Additionally, in the face of changing climate conditions, there is a need to develop methods able to effectively combine weather information with genotype data to predict the performance of lines better. In this work, we develop a novel three-stage classifier to predict multi-class traits by combining three data types—genomic, weather, and secondary trait. The method addressed various challenges in this problem, such as confounding, differing sizes of data types, and threshold optimization. The method was examined in different settings, including binary and multi-class responses, various penalization schemes, and class balances. Then, our method was compared to standard machine learning methods such as random forests and support vector machines using various classification accuracy metrics and using model size to evaluate the sparsity of the model. The results showed that our method performed similarly to or better than machine learning methods across various settings. More importantly, the classifiers obtained were highly sparse, allowing for a straightforward interpretation of relationships between the response and the selected predictors. |
format | Online Article Text |
id | pubmed-10090538 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-100905382023-04-13 Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification Manthena, Vamsi Jarquín, Diego Howard, Reka Front Genet Genetics Modern plant breeding programs collect several data types such as weather, images, and secondary or associated traits besides the main trait (e.g., grain yield). Genomic data is high-dimensional and often over-crowds smaller data types when naively combined to explain the response variable. There is a need to develop methods able to effectively combine different data types of differing sizes to improve predictions. Additionally, in the face of changing climate conditions, there is a need to develop methods able to effectively combine weather information with genotype data to predict the performance of lines better. In this work, we develop a novel three-stage classifier to predict multi-class traits by combining three data types—genomic, weather, and secondary trait. The method addressed various challenges in this problem, such as confounding, differing sizes of data types, and threshold optimization. The method was examined in different settings, including binary and multi-class responses, various penalization schemes, and class balances. Then, our method was compared to standard machine learning methods such as random forests and support vector machines using various classification accuracy metrics and using model size to evaluate the sparsity of the model. The results showed that our method performed similarly to or better than machine learning methods across various settings. More importantly, the classifiers obtained were highly sparse, allowing for a straightforward interpretation of relationships between the response and the selected predictors. Frontiers Media S.A. 2023-03-29 /pmc/articles/PMC10090538/ /pubmed/37065625 http://dx.doi.org/10.3389/fgene.2022.1032691 Text en Copyright © 2023 Manthena, Jarquín and Howard. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Manthena, Vamsi Jarquín, Diego Howard, Reka Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification |
title | Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification |
title_full | Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification |
title_fullStr | Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification |
title_full_unstemmed | Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification |
title_short | Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification |
title_sort | integrating and optimizing genomic, weather, and secondary trait data for multiclass classification |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10090538/ https://www.ncbi.nlm.nih.gov/pubmed/37065625 http://dx.doi.org/10.3389/fgene.2022.1032691 |
work_keys_str_mv | AT manthenavamsi integratingandoptimizinggenomicweatherandsecondarytraitdataformulticlassclassification AT jarquindiego integratingandoptimizinggenomicweatherandsecondarytraitdataformulticlassclassification AT howardreka integratingandoptimizinggenomicweatherandsecondarytraitdataformulticlassclassification |