A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C(3) and C(4) Distinguishing Sequence Information
C(4) photosynthesis is known to have at least 61 independent origins across plant lineages, making it one of the most notable examples of convergent evolution. Of these >60 independent origins, a predicted 22–24 origins, encompassing greater than 50% of all known C(4) species, exist within the Panic...
Main authors: | Yogadasan, Nilanth; Doxey, Andrew C; Chuong, Simon D X |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10368328/ https://www.ncbi.nlm.nih.gov/pubmed/37462292 http://dx.doi.org/10.1093/gbe/evad129 |
_version_ | 1785077488057057280 |
---|---|
author | Yogadasan, Nilanth; Doxey, Andrew C; Chuong, Simon D X |
author_facet | Yogadasan, Nilanth; Doxey, Andrew C; Chuong, Simon D X |
author_sort | Yogadasan, Nilanth |
collection | PubMed |
description | C(4) photosynthesis is known to have at least 61 independent origins across plant lineages, making it one of the most notable examples of convergent evolution. Of these >60 independent origins, a predicted 22–24 origins, encompassing greater than 50% of all known C(4) species, exist within the Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae (PACMAD) clade of the Poaceae family. This clade therefore contains species well suited to the study of genomic changes associated with the acquisition of the C(4) photosynthetic trait. In this study, we take advantage of the growing availability of sequenced plastid genomes and employ a machine learning (ML) approach to screen for plastid genes harboring information that distinguishes C(3) from C(4) in PACMAD species. We demonstrate that certain plastid-encoded protein sequences carry distinguishing sequence information sufficient to train accurate ML C(3)/C(4) classification models. For example, a model trained on RbcL sequences classifies C(3)/C(4) status with greater than 99% accuracy. Accurate prediction of photosynthetic type from individual sequences suggests biologically relevant, and potentially differing, roles of these sequence products in C(3) versus C(4) metabolism. With this ML framework, we have identified several key sequences and sites that are most predictive of C(3)/C(4) status, including RbcL, subunits of the NAD(P)H dehydrogenase complex, and specific residues within them, further highlighting their potential significance in the evolution and/or maintenance of C(4) photosynthetic machinery. This general approach can be applied to uncover associations in other, similar genotype-phenotype relationships. |
format | Online Article Text |
id | pubmed-10368328 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10368328 2023-07-26 A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C(3) and C(4) Distinguishing Sequence Information Yogadasan, Nilanth; Doxey, Andrew C; Chuong, Simon D X Genome Biol Evol Article Oxford University Press 2023-07-18 /pmc/articles/PMC10368328/ /pubmed/37462292 http://dx.doi.org/10.1093/gbe/evad129 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Article Yogadasan, Nilanth Doxey, Andrew C Chuong, Simon D X A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C(3) and C(4) Distinguishing Sequence Information |
title | A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C(3) and C(4) Distinguishing Sequence Information |
title_sort | machine learning framework identifies plastid-encoded proteins harboring c(3) and c(4) distinguishing sequence information |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10368328/ https://www.ncbi.nlm.nih.gov/pubmed/37462292 http://dx.doi.org/10.1093/gbe/evad129 |
work_keys_str_mv | AT yogadasannilanth amachinelearningframeworkidentifiesplastidencodedproteinsharboringc3andc4distinguishingsequenceinformation AT doxeyandrewc amachinelearningframeworkidentifiesplastidencodedproteinsharboringc3andc4distinguishingsequenceinformation AT chuongsimondx amachinelearningframeworkidentifiesplastidencodedproteinsharboringc3andc4distinguishingsequenceinformation AT yogadasannilanth machinelearningframeworkidentifiesplastidencodedproteinsharboringc3andc4distinguishingsequenceinformation AT doxeyandrewc machinelearningframeworkidentifiesplastidencodedproteinsharboringc3andc4distinguishingsequenceinformation AT chuongsimondx machinelearningframeworkidentifiesplastidencodedproteinsharboringc3andc4distinguishingsequenceinformation |
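
The description above outlines a workflow: train a C(3)/C(4) classifier on individual plastid-encoded protein sequences and identify the sites most predictive of photosynthetic type. The record does not state which models or features the authors used, so the following is only a minimal sketch of the general idea under stated assumptions. It uses a per-site naive Bayes classifier and a mutual-information site ranking as stand-in techniques, and the six short sequences in `TOY_ALIGNMENT` are made up for illustration (they are not real RbcL data). All function names are hypothetical.

```python
import math
from collections import Counter

# Made-up, pre-aligned toy fragments with illustrative labels.
# These are NOT real RbcL sequences and NOT the authors' data set.
TOY_ALIGNMENT = [
    ("MSPQTA", "C3"),
    ("MSPQTA", "C3"),
    ("MSPKTA", "C3"),
    ("MAPQSA", "C4"),
    ("MAPQSA", "C4"),
    ("MAPKSA", "C4"),
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus the gap symbol


def site_mutual_information(records, site):
    """Mutual information (in bits) between the residue at `site` and the label."""
    n = len(records)
    joint = Counter((seq[site], label) for seq, label in records)
    residues = Counter(seq[site] for seq, _ in records)
    labels = Counter(label for _, label in records)
    mi = 0.0
    for (res, lab), count in joint.items():
        p_joint = count / n
        p_indep = (residues[res] / n) * (labels[lab] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def rank_sites(records):
    """Return (site index, score) pairs, most label-informative site first."""
    length = len(records[0][0])
    scores = [(s, site_mutual_information(records, s)) for s in range(length)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def train_naive_bayes(records):
    """Count residues per site and per label (a stand-in for the paper's models)."""
    labels = Counter(label for _, label in records)
    length = len(records[0][0])
    counts = {lab: [Counter() for _ in range(length)] for lab in labels}
    for seq, lab in records:
        for site, res in enumerate(seq):
            counts[lab][site][res] += 1
    return {"labels": labels, "counts": counts, "n": len(records)}


def predict(model, seq):
    """Pick the label with the highest log-posterior, using add-one smoothing."""
    k = len(AMINO_ACIDS)
    best_label, best_score = None, -math.inf
    for lab, n_lab in model["labels"].items():
        score = math.log(n_lab / model["n"])
        for site, res in enumerate(seq):
            score += math.log((model["counts"][lab][site][res] + 1) / (n_lab + k))
        if score > best_score:
            best_label, best_score = lab, score
    return best_label


if __name__ == "__main__":
    model = train_naive_bayes(TOY_ALIGNMENT)
    print(rank_sites(TOY_ALIGNMENT)[:2])  # two most informative alignment columns
    print(predict(model, "MAPQSA"))       # predicted photosynthetic type
```

In this toy setup, alignment columns whose residues track the label score highest, which mirrors in miniature the idea of flagging specific residues as predictive of C(3)/C(4) status. A real analysis would need properly aligned sequences, held-out evaluation, and some control for phylogenetic relatedness among species.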