
Prediction of plant lncRNA by ensemble machine learning classifiers

Bibliographic Details
Main authors: Simopoulos, Caitlin M. A., Weretilnyk, Elizabeth A., Golding, G. Brian
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5930664/
https://www.ncbi.nlm.nih.gov/pubmed/29720103
http://dx.doi.org/10.1186/s12864-018-4665-2
_version_ 1783319519112986624
author Simopoulos, Caitlin M. A.
Weretilnyk, Elizabeth A.
Golding, G. Brian
collection PubMed
description BACKGROUND: In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances in discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class in plants is largely understudied. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have been previously used to identify this enigmatic RNA class with applications largely focused on animal systems. Our approach uses a training set composed only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation. RESULTS: Individual stochastic gradient boosting and random forest classifiers trained on only empirically validated long non-protein coding RNAs were constructed. To use the strengths of multiple classifiers, we combined multiple models into a single stacking meta-learner. This ensemble approach benefits from the diversity of several learners to effectively identify putative plant long non-coding RNAs from transcript sequence features. When the predicted genes identified by the ensemble classifier were compared to those listed in GreeNC, an established plant long non-coding RNA database, overlap for predicted genes from Arabidopsis thaliana, Oryza sativa and Eutrema salsugineum ranged from 51 to 83% with the highest agreement in Eutrema salsugineum. Most of the highest ranking predictions from Arabidopsis thaliana were annotated as potential natural antisense genes, pseudogenes, transposable elements, or simply computationally predicted hypothetical proteins. Due to the nature of this tool, the model can be updated as new long non-protein coding transcripts are identified and functionally verified. CONCLUSIONS: This ensemble classifier is an accurate tool that can be used to rank long non-protein coding RNA predictions for use in conjunction with gene expression studies. Selection of plant transcripts with a high potential for regulatory roles as long non-protein coding RNAs will advance research in the elucidation of long non-protein coding RNA function. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-4665-2) contains supplementary material, which is available to authorized users.
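The overlap figures quoted in the RESULTS portion of the abstract compare one set of gene predictions against a reference database. The abstract does not state how the percentage was computed, so the following is only a minimal sketch of one plausible reading: the share of predicted genes that also appear in the reference set. The function name `percent_overlap`, the choice of denominator, and every gene identifier are assumptions made for illustration and are not taken from the article.

```python
def percent_overlap(predicted, reference):
    """Percentage of predicted gene IDs that also occur in the reference set.

    Assumption: the denominator is the number of predicted genes. The
    abstract does not say which denominator the authors used.
    """
    predicted = set(predicted)
    if not predicted:
        raise ValueError("predicted set is empty")
    shared = predicted & set(reference)
    return 100.0 * len(shared) / len(predicted)


# Hypothetical gene identifiers, invented purely for this example.
predicted_lncrnas = {"geneA", "geneB", "geneC", "geneD"}
database_lncrnas = {"geneB", "geneC", "geneD", "geneX", "geneY"}

overlap = percent_overlap(predicted_lncrnas, database_lncrnas)
print(f"{overlap:.1f}% of predicted genes are also in the reference set")
```

Using sets keeps the comparison independent of ordering and of duplicate identifiers. Dividing by the size of the reference set instead would answer a different question, namely how much of the database the classifier recovers.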
format Online
Article
Text
id pubmed-5930664
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5930664 2018-05-09 Prediction of plant lncRNA by ensemble machine learning classifiers Simopoulos, Caitlin M. A. Weretilnyk, Elizabeth A. Golding, G. Brian BMC Genomics Methodology Article BioMed Central 2018-05-02 /pmc/articles/PMC5930664/ /pubmed/29720103 http://dx.doi.org/10.1186/s12864-018-4665-2 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Prediction of plant lncRNA by ensemble machine learning classifiers
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5930664/
https://www.ncbi.nlm.nih.gov/pubmed/29720103
http://dx.doi.org/10.1186/s12864-018-4665-2