PCirc: random forest-based plant circRNA identification software
BACKGROUND: Circular RNA (circRNA) is a novel type of RNA with a closed-loop structure. Increasing numbers of circRNAs are being identified in plants and animals, and recent studies have shown that circRNAs play an important role in gene regulation. Therefore, identifying circRNAs from increasing am...
Main Authors: | Yin, Shuwei; Tian, Xiao; Zhang, Jingjing; Sun, Peisen; Li, Guanglin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | Software |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7789375/ https://www.ncbi.nlm.nih.gov/pubmed/33407069 http://dx.doi.org/10.1186/s12859-020-03944-1 |
_version_ | 1783633225511337984 |
---|---|
author | Yin, Shuwei; Tian, Xiao; Zhang, Jingjing; Sun, Peisen; Li, Guanglin |
author_facet | Yin, Shuwei; Tian, Xiao; Zhang, Jingjing; Sun, Peisen; Li, Guanglin |
author_sort | Yin, Shuwei |
collection | PubMed |
description | BACKGROUND: Circular RNA (circRNA) is a novel type of RNA with a closed-loop structure. Increasing numbers of circRNAs are being identified in plants and animals, and recent studies have shown that circRNAs play an important role in gene regulation. Therefore, identifying circRNAs from the growing volume of RNA-seq data is very important. However, traditional circRNA recognition methods have limitations. In recent years, emerging machine learning techniques have provided a good approach for the identification of circRNAs in animals. However, using the features from these animal-based methods to identify plant circRNAs is infeasible because the characteristics of plant circRNA sequences differ from those of animal circRNAs. For example, plants are extremely rich in splicing signals and transposable elements, and their sequence conservation (in rice, for example) is far lower than that in mammals. To solve these problems and better identify circRNAs in plants, circRNA recognition software that uses machine learning based on the characteristics of plant circRNAs is urgently needed. RESULTS: In this study, we built a software program named PCirc that uses a machine learning method to predict plant circRNAs from RNA-seq data. First, we extracted different features, including open reading frames, numbers of k-mers, and splicing junction sequence coding, from rice circRNA and lncRNA data. Second, we trained a machine learning model using the random forest algorithm with tenfold cross-validation on the training set. Third, we evaluated our classification according to accuracy, precision, and F1 score, and all scores on the model test data were above 0.99. Fourth, we tested our model on data from other plants and obtained good results, with accuracy scores above 0.8. Finally, we packaged the trained machine learning model and the programming scripts into a locally run circular RNA prediction software, Pcirc (https://github.com/Lilab-SNNU/Pcirc). CONCLUSION: Based on rice circRNA and lncRNA data, a machine learning model for plant circRNA recognition was constructed in this study using the random forest algorithm, and the model can also be applied to circRNA recognition in other plants, such as Arabidopsis thaliana and maize. The trained model and the programming scripts used in this study are packaged into a locally run circRNA prediction software, Pcirc, which is convenient for plant circRNA researchers to use. |
format | Online Article Text |
id | pubmed-7789375 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7789375 2021-01-07 PCirc: random forest-based plant circRNA identification software Yin, Shuwei Tian, Xiao Zhang, Jingjing Sun, Peisen Li, Guanglin BMC Bioinformatics Software BACKGROUND: Circular RNA (circRNA) is a novel type of RNA with a closed-loop structure. Increasing numbers of circRNAs are being identified in plants and animals, and recent studies have shown that circRNAs play an important role in gene regulation. Therefore, identifying circRNAs from the growing volume of RNA-seq data is very important. However, traditional circRNA recognition methods have limitations. In recent years, emerging machine learning techniques have provided a good approach for the identification of circRNAs in animals. However, using the features from these animal-based methods to identify plant circRNAs is infeasible because the characteristics of plant circRNA sequences differ from those of animal circRNAs. For example, plants are extremely rich in splicing signals and transposable elements, and their sequence conservation (in rice, for example) is far lower than that in mammals. To solve these problems and better identify circRNAs in plants, circRNA recognition software that uses machine learning based on the characteristics of plant circRNAs is urgently needed. RESULTS: In this study, we built a software program named PCirc that uses a machine learning method to predict plant circRNAs from RNA-seq data. First, we extracted different features, including open reading frames, numbers of k-mers, and splicing junction sequence coding, from rice circRNA and lncRNA data. Second, we trained a machine learning model using the random forest algorithm with tenfold cross-validation on the training set. Third, we evaluated our classification according to accuracy, precision, and F1 score, and all scores on the model test data were above 0.99. Fourth, we tested our model on data from other plants and obtained good results, with accuracy scores above 0.8. Finally, we packaged the trained machine learning model and the programming scripts into a locally run circular RNA prediction software, Pcirc (https://github.com/Lilab-SNNU/Pcirc). CONCLUSION: Based on rice circRNA and lncRNA data, a machine learning model for plant circRNA recognition was constructed in this study using the random forest algorithm, and the model can also be applied to circRNA recognition in other plants, such as Arabidopsis thaliana and maize. The trained model and the programming scripts used in this study are packaged into a locally run circRNA prediction software, Pcirc, which is convenient for plant circRNA researchers to use. BioMed Central 2021-01-06 /pmc/articles/PMC7789375/ /pubmed/33407069 http://dx.doi.org/10.1186/s12859-020-03944-1 Text en © The Author(s) 2021 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Yin, Shuwei Tian, Xiao Zhang, Jingjing Sun, Peisen Li, Guanglin PCirc: random forest-based plant circRNA identification software |
title | PCirc: random forest-based plant circRNA identification software |
title_full | PCirc: random forest-based plant circRNA identification software |
title_fullStr | PCirc: random forest-based plant circRNA identification software |
title_full_unstemmed | PCirc: random forest-based plant circRNA identification software |
title_short | PCirc: random forest-based plant circRNA identification software |
title_sort | pcirc: random forest-based plant circrna identification software |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7789375/ https://www.ncbi.nlm.nih.gov/pubmed/33407069 http://dx.doi.org/10.1186/s12859-020-03944-1 |
work_keys_str_mv | AT yinshuwei pcircrandomforestbasedplantcircrnaidentificationsoftware AT tianxiao pcircrandomforestbasedplantcircrnaidentificationsoftware AT zhangjingjing pcircrandomforestbasedplantcircrnaidentificationsoftware AT sunpeisen pcircrandomforestbasedplantcircrnaidentificationsoftware AT liguanglin pcircrandomforestbasedplantcircrnaidentificationsoftware |
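
The description field in this record names open reading frames and k-mer counts among the input features, and accuracy, precision, and F1 score as the evaluation metrics. The sketch below illustrates, in plain Python, one way such quantities could be computed. It is not taken from the Pcirc repository: the function names (`kmer_counts`, `longest_orf_length`, `classification_scores`), the choice of k = 3, and the forward-strand, linear-sequence ORF scan are all assumptions made for illustration. In particular, a real circRNA ORF may span the back-splice junction, which this simplified scan ignores.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_counts(seq, k=3):
    """Count overlapping k-mers over the A/C/G/T alphabet (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")
    # Fixed key order so every sequence maps to a vector of the same layout.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    return counts


def longest_orf_length(seq):
    """Length in nt of the longest ATG-to-stop ORF on the forward strand.

    Assumption: the sequence is treated as linear, and the stop codon is
    included in the reported length.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def classification_scores(y_true, y_pred, positive=1):
    """Accuracy, precision, and F1 for a binary labelling (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1}


if __name__ == "__main__":
    toy = "CCATGAAATAGCC"  # made-up sequence, not real data
    print(longest_orf_length(toy))
    print(kmer_counts(toy)["ATG"])
    print(classification_scores([1, 1, 0, 0], [1, 0, 0, 1]))
```

Feature vectors built this way (for example, the 64 trinucleotide counts followed by the ORF length) could then be passed to any random forest implementation; the record itself does not state which library or parameters the authors used.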