A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C(3) and C(4) Distinguishing Sequence Information

Bibliographic Details
Main Authors: Yogadasan, Nilanth; Doxey, Andrew C; Chuong, Simon D X
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10368328/
https://www.ncbi.nlm.nih.gov/pubmed/37462292
http://dx.doi.org/10.1093/gbe/evad129
author Yogadasan, Nilanth
Doxey, Andrew C
Chuong, Simon D X
collection PubMed
description C(4) photosynthesis is known to have at least 61 independent origins across plant lineages, making it one of the most notable examples of convergent evolution. Of these >60 independent origins, a predicted 22–24, encompassing more than 50% of all known C(4) species, lie within the Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae (PACMAD) clade of the Poaceae family. This clade is therefore rich in species well suited to studying the genomic changes associated with acquisition of the C(4) photosynthetic trait. In this study, we take advantage of the growing availability of sequenced plastid genomes and employ a machine learning (ML) approach to screen for plastid genes harboring C(3)- and C(4)-distinguishing information in PACMAD species. We demonstrate that certain plastid-encoded protein sequences carry enough distinguishing information to train accurate ML C(3)/C(4) classification models. Our RbcL-trained model, for example, classifies C(3)/C(4) status with greater than 99% accuracy. Accurate prediction of photosynthetic type from individual sequences suggests biologically relevant, and potentially differing, roles of these sequence products in C(3) versus C(4) metabolism. With this ML framework, we have identified several key sequences and sites that are most predictive of C(3)/C(4) status, including RbcL, subunits of the NAD(P)H dehydrogenase complex, and specific residues within them, further highlighting their potential significance in the evolution and/or maintenance of C(4) photosynthetic machinery. This general approach can be applied to uncover associations in other, similar genotype-phenotype relationships.
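The abstract above does not say which ML model or feature encoding the authors used, so the following is only a minimal sketch of the general idea of finding alignment sites that are predictive of C(3)/C(4) status. It swaps in a simple stand-in technique: the mutual information between the residue at each aligned position and the photosynthetic-type label. The sequences, labels, and function names (`ALIGNMENT`, `site_mutual_information`, `rank_sites`) are hypothetical and made up for illustration; they are not real RbcL data and not the authors' method.

```python
import math
from collections import Counter

# Hypothetical toy data: short, made-up aligned protein fragments with
# made-up photosynthetic-type labels. These are NOT real RbcL sequences.
ALIGNMENT = [
    ("MSPQAE", "C3"),
    ("MSPQAE", "C3"),
    ("MTPQAE", "C3"),
    ("MSPQSE", "C4"),
    ("MSPQSE", "C4"),
    ("MTPQSE", "C4"),
]


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)


def site_mutual_information(alignment, site):
    """Mutual information (bits) between the residue at `site` and the label."""
    residues = Counter(seq[site] for seq, _ in alignment)
    labels = Counter(label for _, label in alignment)
    joint = Counter((seq[site], label) for seq, label in alignment)
    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    return entropy(residues) + entropy(labels) - entropy(joint)


def rank_sites(alignment):
    """Return (site, score) pairs, most label-informative site first.

    Sites are 0-based column indices into the aligned sequences.
    """
    length = len(alignment[0][0])
    if any(len(seq) != length for seq, _ in alignment):
        raise ValueError("sequences must be aligned to the same length")
    scores = [(site, site_mutual_information(alignment, site)) for site in range(length)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = rank_sites(ALIGNMENT)
for site, score in ranking:
    print(f"site {site}: {score:.3f} bits")
```

In this toy alignment only one column separates the two labels perfectly, so it ranks first; a column that varies independently of the label scores about zero. A real analysis would need a proper multiple sequence alignment, a held-out test set, and some control for shared ancestry, none of which this sketch attempts.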
format Online
Article
Text
id pubmed-10368328
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10368328 2023-07-26 Genome Biol Evol Article
Oxford University Press 2023-07-18 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C(3) and C(4) Distinguishing Sequence Information
topic Article