
Identification of the expressome by machine learning on omics data

Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic...

Bibliographic Details
Main authors: Sartor, Ryan C., Noshay, Jaclyn, Springer, Nathan M., Briggs, Steven P.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2019
Subjects: Biological Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6731682/
https://www.ncbi.nlm.nih.gov/pubmed/31420517
http://dx.doi.org/10.1073/pnas.1813645116
_version_ 1783449713280811008
author Sartor, Ryan C.
Noshay, Jaclyn
Springer, Nathan M.
Briggs, Steven P.
author_facet Sartor, Ryan C.
Noshay, Jaclyn
Springer, Nathan M.
Briggs, Steven P.
author_sort Sartor, Ryan C.
collection PubMed
description Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.
format Online
Article
Text
id pubmed-6731682
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-6731682 2019-09-18 Identification of the expressome by machine learning on omics data Sartor, Ryan C. Noshay, Jaclyn Springer, Nathan M. Briggs, Steven P. Proc Natl Acad Sci U S A Biological Sciences Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes. National Academy of Sciences 2019-09-03 2019-08-16 /pmc/articles/PMC6731682/ /pubmed/31420517 http://dx.doi.org/10.1073/pnas.1813645116 Text en Copyright © 2019 the Author(s). Published by PNAS.
https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Biological Sciences
Sartor, Ryan C.
Noshay, Jaclyn
Springer, Nathan M.
Briggs, Steven P.
Identification of the expressome by machine learning on omics data
title Identification of the expressome by machine learning on omics data
topic Biological Sciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6731682/
https://www.ncbi.nlm.nih.gov/pubmed/31420517
http://dx.doi.org/10.1073/pnas.1813645116