Cargando…

Prediction of m5C Modifications in RNA Sequences by Combining Multiple Sequence Features

5-Methylcytosine (m5C) is a well-known post-transcriptional modification that plays significant roles in biological processes, such as RNA metabolism, tRNA recognition, and stress responses. Traditional high-throughput techniques on identification of m5C sites are usually time consuming and expensiv...

Descripción completa

Detalles Bibliográficos
Autores principales: Dou, Lijun, Li, Xiaoling, Ding, Hui, Xu, Lei, Xiang, Huaikun
Formato: Online Artículo Texto
Lenguaje:English
Publicado: American Society of Gene & Cell Therapy 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7340967/
https://www.ncbi.nlm.nih.gov/pubmed/32645685
http://dx.doi.org/10.1016/j.omtn.2020.06.004
Descripción
Sumario:5-Methylcytosine (m5C) is a well-known post-transcriptional modification that plays significant roles in biological processes, such as RNA metabolism, tRNA recognition, and stress responses. Traditional high-throughput techniques on identification of m5C sites are usually time consuming and expensive. In addition, the number of RNA sequences shows explosive growth in the post-genomic era. Thus, machine-learning-based methods are urgently requested to quickly predict RNA m5C modifications with high accuracy. Here, we propose a noval support-vector-machine (SVM)-based tool, called iRNA-m5C_SVM, by combining multiple sequence features to identify m5C sites in Arabidopsis thaliana. Eight kinds of popular feature-extraction methods were first investigated systematically. Then, four well-performing features were incorporated to construct a comprehensive model, including position-specific propensity (PSP) (PSNP, PSDP, and PSTP, associated with frequencies of nucleotides, dinucleotides, and trinucleotides, respectively), nucleotide composition (nucleic acid, di-nucleotide, and tri-nucleotide compositions; NAC, DNC, and TNC, respectively), electron-ion interaction pseudopotentials of trinucleotide (PseEIIPs), and general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-general). Evaluated accuracies over 10-fold cross-validation and independent tests achieved 73.06% and 80.15%, respectively, which showed the best predictive performances in A. thaliana among existing models. It is believed that the proposed model in this work can be a promising alternative for further research on m5C modification sites in plant.