PCirc: random forest-based plant circRNA identification software

BACKGROUND: Circular RNA (circRNA) is a novel type of RNA with a closed-loop structure. Increasing numbers of circRNAs are being identified in plants and animals, and recent studies have shown that circRNAs play an important role in gene regulation. Therefore, identifying circRNAs from increasing am...

Bibliographic Details
Main authors: Yin, Shuwei, Tian, Xiao, Zhang, Jingjing, Sun, Peisen, Li, Guanglin
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7789375/
https://www.ncbi.nlm.nih.gov/pubmed/33407069
http://dx.doi.org/10.1186/s12859-020-03944-1
_version_ 1783633225511337984
author Yin, Shuwei
Tian, Xiao
Zhang, Jingjing
Sun, Peisen
Li, Guanglin
collection PubMed
description BACKGROUND: Circular RNA (circRNA) is a novel type of RNA with a closed-loop structure. Increasing numbers of circRNAs are being identified in plants and animals, and recent studies have shown that circRNAs play an important role in gene regulation. Identifying circRNAs from the growing volume of RNA-seq data is therefore important. However, traditional circRNA recognition methods have limitations. In recent years, machine learning techniques have provided a good approach for identifying circRNAs in animals. However, using the features developed for animal circRNAs to identify plant circRNAs is infeasible, because the characteristics of plant circRNA sequences differ from those of animal circRNAs. For example, plants are extremely rich in splicing signals and transposable elements, and sequence conservation in plants such as rice is far lower than in mammals. To address these problems and better identify circRNAs in plants, circRNA recognition software that applies machine learning to the characteristics of plant circRNAs is urgently needed. RESULTS: In this study, we built a software program named PCirc that uses a machine learning method to predict plant circRNAs from RNA-seq data. First, we extracted different features, including open reading frames, numbers of k-mers, and splicing junction sequence coding, from rice circRNA and lncRNA data. Second, we trained a machine learning model with the random forest algorithm, using tenfold cross-validation on the training set. Third, we evaluated the classification by accuracy, precision, and F1 score; all scores on the model test data were above 0.99. Fourth, we tested the model on data from other plants and obtained good results, with accuracy scores above 0.8. Finally, we packaged the trained machine learning model and the scripts used into a locally run circular RNA prediction software, Pcirc (https://github.com/Lilab-SNNU/Pcirc).
CONCLUSION: Based on rice circRNA and lncRNA data, this study constructed a machine learning model for plant circRNA recognition using the random forest algorithm. The model can also be applied to circRNA recognition in other plants, such as Arabidopsis thaliana and maize. The trained model and the scripts used in this study are packaged into a locally run circRNA prediction software, Pcirc, which is convenient for plant circRNA researchers to use.
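Illustrative note: the abstract names k-mer counts as one of the input features and accuracy, precision, and F1 score as the evaluation measures. The sketch below shows one way these could be computed using only the Python standard library. It is not taken from the Pcirc code base: the function names kmer_counts and binary_scores, the default k=3, the A/C/G/T alphabet, and the U-to-T conversion are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=3):
    """Return counts of overlapping k-mers as a fixed-order vector.

    Assumption: sequences use the A/C/G/T alphabet (U is mapped to T);
    k-mers containing other symbols, such as N, are ignored.
    """
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k possible k-mers in lexicographic order so that
    # every sequence yields a vector of the same length and ordering.
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def binary_scores(y_true, y_pred, positive=1):
    """Return (accuracy, precision, F1) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, f1


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the study.
    vector = kmer_counts("AUGGCUAUGGC", k=3)
    print(len(vector), sum(vector))
    print(binary_scores([1, 1, 0, 0], [1, 0, 0, 1]))
```

A vector built this way could be concatenated with other features, such as open reading frame statistics, before being passed to a random forest classifier. How Pcirc itself encodes and combines its features is documented in the repository linked above.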
format Online
Article
Text
id pubmed-7789375
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7789375 2021-01-07 PCirc: random forest-based plant circRNA identification software Yin, Shuwei Tian, Xiao Zhang, Jingjing Sun, Peisen Li, Guanglin BMC Bioinformatics Software BioMed Central 2021-01-06 /pmc/articles/PMC7789375/ /pubmed/33407069 http://dx.doi.org/10.1186/s12859-020-03944-1 Text en © The Author(s) 2021 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title PCirc: random forest-based plant circRNA identification software
topic Software