RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition
5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies showed that m5C plays important roles in many biological functions such as RNA metabolism and cell fate decision. Because most experimental methods that determine m5C sites...
Main authors: | Fang, Ting; Zhang, Zizheng; Sun, Rui; Zhu, Lin; He, Jingjing; Huang, Bei; Xiong, Yi; Zhu, Xiaolei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Society of Gene & Cell Therapy, 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6859278/ https://www.ncbi.nlm.nih.gov/pubmed/31726390 http://dx.doi.org/10.1016/j.omtn.2019.10.008 |
_version_ | 1783471095638130688 |
---|---|
author | Fang, Ting; Zhang, Zizheng; Sun, Rui; Zhu, Lin; He, Jingjing; Huang, Bei; Xiong, Yi; Zhu, Xiaolei |
author_facet | Fang, Ting; Zhang, Zizheng; Sun, Rui; Zhu, Lin; He, Jingjing; Huang, Bei; Xiong, Yi; Zhu, Xiaolei |
author_sort | Fang, Ting |
collection | PubMed |
description | 5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies showed that m5C plays important roles in many biological functions such as RNA metabolism and cell fate decision. Because most experimental methods that determine m5C sites across the transcriptome are time-consuming and expensive, it is urgent to develop accurate computational methods to identify m5C sites effectively. A benchmark dataset is important for developing and evaluating computational methods. In this work, we constructed four different datasets according to the data redundancy and imbalance. Based on these datasets, we generated three different kinds of features, i.e., KNFs (K-nucleotide frequencies), KSNPFs (K-spaced nucleotide pair frequencies), and pseDNC (pseudo-dinucleotide composition), and then used a support vector machine (SVM) to build our models. Based on the imbalanced and nonredundant dataset, Met935, we extensively studied the three kinds of features and determined an optimal combination of the features. Based on the feature combination, we built models on the three different datasets and compared them with state-of-the-art models. According to the predictive results of the stringent jackknife test, the models based on the three features, 4NF, 1SNPF, and pseDNC, are superior or comparable to other methods. To determine the best model between the models based on the imbalanced dataset Met935 and the balanced dataset Met240, we further evaluated the two models on an independent test set Test1157. Our results demonstrate that the model based on the balanced dataset Met240 achieved the highest recall (68.79%) and the highest Matthews correlation coefficient (MCC) (0.154). In addition, the model is also superior to other state-of-the-art methods according to the integrated parameter MCC on the independent test set. Thus, we selected the model based on Met240 as our final model, which was named RNAm5CPred. In addition, a web server for RNAm5CPred (http://zhulab.ahu.edu.cn/RNAm5CPred/) has been provided to facilitate experimental research. |
format | Online Article Text |
id | pubmed-6859278 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | American Society of Gene & Cell Therapy |
record_format | MEDLINE/PubMed |
spelling | pubmed-6859278 2019-11-22 Mol Ther Nucleic Acids Article American Society of Gene & Cell Therapy 2019-10-18 /pmc/articles/PMC6859278/ /pubmed/31726390 http://dx.doi.org/10.1016/j.omtn.2019.10.008 Text en © 2019 The Authors http://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
title | RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition |
title_sort | rnam5cpred: prediction of rna 5-methylcytosine sites based on three different kinds of nucleotide composition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6859278/ https://www.ncbi.nlm.nih.gov/pubmed/31726390 http://dx.doi.org/10.1016/j.omtn.2019.10.008 |
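
The description in this record names two sequence encodings, K-nucleotide frequencies (KNFs) and K-spaced nucleotide pair frequencies (KSNPFs), and reports results as a Matthews correlation coefficient (MCC). The following is a minimal sketch of how such quantities are commonly computed, not the authors' implementation. The function names `knf`, `ksnpf`, and `mcc`, the `ACGU` alphabet, the normalization by window count, and the example sequence are all assumptions made for illustration; the record does not specify them. The pseDNC features and the SVM training step are not shown.

```python
from itertools import product
from math import sqrt

# Assumed RNA alphabet; the record does not state how sequences are encoded.
ALPHABET = "ACGU"


def knf(seq, k):
    """K-nucleotide frequencies: share of each k-mer among all sliding windows.

    Returns 4**k values in lexicographic k-mer order (hypothetical helper).
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing unexpected symbols
            counts[kmer] += 1
    total = max(windows, 1)  # avoid division by zero for short sequences
    return [counts[m] / total for m in kmers]


def ksnpf(seq, k):
    """K-spaced nucleotide pair frequencies: pairs separated by k positions.

    Returns 16 values in lexicographic pair order (hypothetical helper).
    """
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    windows = len(seq) - k - 1
    for i in range(windows):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    total = max(windows, 1)
    return [counts[p] / total for p in pairs]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up example sequence, not taken from any dataset named in the record.
example = "ACGUACGUAC"
features = knf(example, 4) + ksnpf(example, 1)  # 4NF followed by 1SNPF
print(len(features))
```

Under these assumptions, concatenating 4NF and 1SNPF gives 256 + 16 = 272 values per sequence, which could then be passed to any classifier.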