
RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition

5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies showed that m5C plays important roles in many biological functions such as RNA metabolism and cell fate decision. Because most experimental methods that determine m5C sites...


Bibliographic Details
Main Authors: Fang, Ting; Zhang, Zizheng; Sun, Rui; Zhu, Lin; He, Jingjing; Huang, Bei; Xiong, Yi; Zhu, Xiaolei
Format: Online Article Text
Language: English
Published: American Society of Gene & Cell Therapy 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6859278/
https://www.ncbi.nlm.nih.gov/pubmed/31726390
http://dx.doi.org/10.1016/j.omtn.2019.10.008
author Fang, Ting
Zhang, Zizheng
Sun, Rui
Zhu, Lin
He, Jingjing
Huang, Bei
Xiong, Yi
Zhu, Xiaolei
collection PubMed
description 5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies showed that m5C plays important roles in many biological functions such as RNA metabolism and cell fate decision. Because most experimental methods that determine m5C sites across the transcriptome are time-consuming and expensive, it is urgent to develop accurate computational methods to identify m5C sites effectively. A benchmark dataset is important for developing and evaluating computational methods. In this work, we constructed four different datasets according to the data redundancy and imbalance. Based on these datasets, we generated three different kinds of features, i.e., KNFs (K-nucleotide frequencies), KSNPFs (K-spaced nucleotide pair frequencies), and pseDNC (pseudo-dinucleotide composition), and then used a support vector machine (SVM) to build our models. Based on the imbalanced and nonredundant dataset, Met935, we extensively studied the three kinds of features and determined an optimal combination of the features. Based on the feature combination, we built models on the three different datasets and compared them with state-of-the-art models. According to the predictive results of the stringent jackknife test, the models based on the three features, 4NF, 1SNPF, and pseDNC, are superior or comparable to other methods. To determine the best model between the models based on the imbalanced dataset Met935 and the balanced dataset Met240, we further evaluated the two models on an independent test set Test1157. Our results demonstrate that the model based on the balanced dataset Met240 achieved the highest recall (68.79%) and the highest Matthews correlation coefficient (MCC) (0.154). In addition, the model is also superior to other state-of-the-art methods according to the integrated parameter MCC on the independent test set. Thus, we selected the model based on Met240 as our final model, which was named RNAm5CPred. 
In addition, a web server for RNAm5CPred (http://zhulab.ahu.edu.cn/RNAm5CPred/) has been provided to facilitate experimental research.
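The two frequency-based feature types named in the description, and the MCC used to compare models, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the RNA alphabet ordering, and the normalization by window count are assumptions made for illustration, and pseDNC is omitted because it requires physicochemical property tables that the abstract does not give. The values k=4 for KNF and k=1 for KSNPF follow the "4NF" and "1SNPF" features mentioned in the abstract.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGU"  # assumed RNA alphabet and ordering (illustrative)


def k_nucleotide_freqs(seq, k):
    """Frequency of every possible k-mer in seq (4**k values)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows
    for i in range(max(total, 0)):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with non-ACGU symbols
            counts[kmer] += 1
    if total <= 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def k_spaced_pair_freqs(seq, k):
    """Frequency of nucleotide pairs separated by k positions (16 values)."""
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - k - 1  # number of (i, i + k + 1) position pairs
    for i in range(max(total, 0)):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    if total <= 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def encode(seq):
    """Concatenate 4NF (256 values) and 1SNPF (16 values) into one vector."""
    return k_nucleotide_freqs(seq, 4) + k_spaced_pair_freqs(seq, 1)
```

A vector produced by `encode` could then be passed to any SVM implementation; the kernel and hyperparameters used in the paper are not stated in this record.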
format Online
Article
Text
id pubmed-6859278
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher American Society of Gene & Cell Therapy
record_format MEDLINE/PubMed
spelling pubmed-6859278 2019-11-22 Mol Ther Nucleic Acids Article American Society of Gene & Cell Therapy 2019-10-18 /pmc/articles/PMC6859278/ /pubmed/31726390 http://dx.doi.org/10.1016/j.omtn.2019.10.008 Text en © 2019 The Authors http://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title RNAm5CPred: Prediction of RNA 5-Methylcytosine Sites Based on Three Different Kinds of Nucleotide Composition
topic Article