A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C₃ and C₄ Distinguishing Sequence Information


Bibliographic Details
Main Authors: Yogadasan, Nilanth; Doxey, Andrew C; Chuong, Simon D X
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10368328/
https://www.ncbi.nlm.nih.gov/pubmed/37462292
http://dx.doi.org/10.1093/gbe/evad129
Description
Summary: C₄ photosynthesis is known to have at least 61 independent origins across plant lineages, making it one of the most notable examples of convergent evolution. Of these >60 independent origins, a predicted 22–24, encompassing more than 50% of all known C₄ species, fall within the Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae (PACMAD) clade of the Poaceae family. This clade therefore offers species well suited to studying the genomic changes associated with acquisition of the C₄ photosynthetic trait. In this study, we take advantage of the growing availability of sequenced plastid genomes and employ a machine learning (ML) approach to screen for plastid genes harboring C₃ and C₄ distinguishing information in PACMAD species. We demonstrate that certain plastid-encoded protein sequences carry enough distinguishing sequence information to train accurate ML C₃/C₄ classification models. Our RbcL-trained model, for example, yields a C₃/C₄ classifier with greater than 99% accuracy. Accurate prediction of photosynthetic type from individual sequences suggests biologically relevant and potentially differing roles of these sequence products in C₃ versus C₄ metabolism. With this ML framework, we have identified several key sequences and sites that are most predictive of C₃/C₄ status, including RbcL, subunits of the NAD(P)H dehydrogenase complex, and specific residues within them, further highlighting their potential significance in the evolution and/or maintenance of C₄ photosynthetic machinery. This general approach can be applied to uncover intricate associations in other, similar genotype-phenotype relationships.
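
The idea of sites that are "most predictive" of photosynthetic type can be pictured with a much simpler stand-in than the ML framework the summary describes. The sketch below is not the authors' method: it scores each column of an aligned set of protein sequences by its mutual information with the C3/C4 label. The function name `site_information` and the four toy sequences are hypothetical and invented for illustration; they are not real RbcL data.

```python
from collections import Counter
from math import log2


def site_information(seqs, labels):
    """Mutual information (in bits) between the residue at each
    alignment column and the class label. Assumes pre-aligned,
    equal-length sequences; this is an illustrative stand-in,
    not the classifier used in the paper."""
    n = len(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    label_counts = Counter(labels)
    scores = []
    for i in range(length):
        column = [s[i] for s in seqs]
        residue_counts = Counter(column)
        joint_counts = Counter(zip(column, labels))
        mi = 0.0
        for (res, lab), c in joint_counts.items():
            p_joint = c / n
            p_res = residue_counts[res] / n
            p_lab = label_counts[lab] / n
            mi += p_joint * log2(p_joint / (p_res * p_lab))
        scores.append(mi)
    return scores


# Toy, made-up aligned fragments (hypothetical, NOT real sequences).
toy_seqs = ["MKAS", "MKAS", "MRAT", "MRAT"]
toy_labels = ["C3", "C3", "C4", "C4"]

scores = site_information(toy_seqs, toy_labels)
# Column indices ordered from most to least informative.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranked)
```

In this toy alignment, columns 1 and 3 separate the two labels perfectly and score 1 bit each, while the invariant columns score 0. On real data, such per-site scores would be confounded by shared ancestry among species, so they offer only a rough first look rather than evidence of functional importance.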