Robust predictions of specialized metabolism genes through machine learning

Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to a...

Bibliographic Details
Main Authors: Moore, Bethany M., Wang, Peipei, Fan, Pengxiang, Leong, Bryan, Schenck, Craig A., Lloyd, John P., Lehti-Shiu, Melissa D., Last, Robert L., Pichersky, Eran, Shiu, Shin-Han
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2019
Subjects: PNAS Plus
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6369796/
https://www.ncbi.nlm.nih.gov/pubmed/30674669
http://dx.doi.org/10.1073/pnas.1817074116
author Moore, Bethany M.
Wang, Peipei
Fan, Pengxiang
Leong, Bryan
Schenck, Craig A.
Lloyd, John P.
Lehti-Shiu, Melissa D.
Last, Robert L.
Pichersky, Eran
Shiu, Shin-Han
collection PubMed
description Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
format Online
Article
Text
id pubmed-6369796
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-6369796 2019-02-14 Robust predictions of specialized metabolism genes through machine learning Moore, Bethany M. Wang, Peipei Fan, Pengxiang Leong, Bryan Schenck, Craig A. Lloyd, John P. Lehti-Shiu, Melissa D. Last, Robert L. Pichersky, Eran Shiu, Shin-Han Proc Natl Acad Sci U S A PNAS Plus
National Academy of Sciences 2019-02-05 2019-01-23 /pmc/articles/PMC6369796/ /pubmed/30674669 http://dx.doi.org/10.1073/pnas.1817074116 Text en Copyright © 2019 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Robust predictions of specialized metabolism genes through machine learning
topic PNAS Plus
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6369796/
https://www.ncbi.nlm.nih.gov/pubmed/30674669
http://dx.doi.org/10.1073/pnas.1817074116