Predicting protein-binding regions in RNA using nucleotide profiles and compositions

BACKGROUND: Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limite...

Bibliographic Details
Main authors: Choi, Daesik; Park, Byungkyu; Chae, Hanju; Lee, Wook; Han, Kyungsook
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374631/
https://www.ncbi.nlm.nih.gov/pubmed/28361677
http://dx.doi.org/10.1186/s12918-017-0386-4
collection PubMed
description BACKGROUND: Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use. RESULTS: We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides and nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The best performance of the model was obtained in a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed a lower performance than the 10-fold cross validation, but the performance remained high (87.6% accuracy and 0.752 MCC). In testing the model on independent datasets, it achieved an accuracy of 82.2% and an MCC of 0.656. Testing of our model and other state-of-the-art methods on the same dataset showed that our model performed better than the others.
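As a reading aid for the performance measures named in the RESULTS paragraph, the following minimal Python sketch computes sensitivity, specificity, accuracy, PPV, NPV and MCC from confusion-matrix counts using their standard definitions. The function name `binary_metrics` and the counts are hypothetical: the record does not give the actual counts, so the example assumes a balanced set of 1000 binding and 1000 non-binding regions chosen to mirror the reported sensitivity and specificity. Values derived from these assumed counts need not match every figure reported in the abstract.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures from confusion-matrix counts.

    Assumes both classes are non-empty and at least one instance is
    predicted in each class, so that no ratio divides by zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts (not taken from the paper): 1000 positives, 1000 negatives.
m = binary_metrics(tp=916, fn=84, tn=924, fp=76)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the accuracy is 0.920 and the MCC is approximately 0.840. On unbalanced datasets, such as the different positive-to-negative ratios mentioned in the abstract, MCC is usually the more informative of the two.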
CONCLUSIONS: Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. However, a slight performance gain was obtained when using the sequence profiles along with nucleotide compositions. These are preliminary results of ongoing research, but they demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12918-017-0386-4) contains supplementary material, which is available to authorized users.
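The two feature types contrasted in the CONCLUSIONS can be illustrated with a small Python sketch. This is not the authors' implementation: the record does not state the exact scoring formula, so the sketch assumes a common form of log-odds score, log2 of the pseudocount-smoothed k-mer frequency in binding regions over that in non-binding regions. The function names, the pseudocount and the toy sequences are all hypothetical, and sequences are assumed to contain only A, C, G and U.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumption: no ambiguous bases in the input


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def log_odds_table(positives, negatives, k, pseudocount=1.0):
    """Assumed scoring: log2(freq in binding / freq in non-binding) per k-mer."""
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    pos = Counter(km for s in positives for km in kmers(s, k))
    neg = Counter(km for s in negatives for km in kmers(s, k))
    pos_total = sum(pos[v] for v in vocab) + pseudocount * len(vocab)
    neg_total = sum(neg[v] for v in vocab) + pseudocount * len(vocab)
    return {
        v: math.log2(((pos[v] + pseudocount) / pos_total)
                     / ((neg[v] + pseudocount) / neg_total))
        for v in vocab
    }


def features(seq, mono, di):
    """Per-position mono- and di-nucleotide scores plus base composition."""
    profile = [mono[x] for x in kmers(seq, 1)] + [di[x] for x in kmers(seq, 2)]
    composition = [seq.count(b) / len(seq) for b in ALPHABET]
    return profile + composition


# Toy training regions (hypothetical, not from the paper's datasets).
positives = ["UUUUAC", "UUUGUU"]
negatives = ["GCGCAC", "CCGGCA"]
mono = log_odds_table(positives, negatives, k=1)
di = log_odds_table(positives, negatives, k=2)
vector = features("UUGCAU", mono, di)
print(len(vector), round(mono["U"], 3))
```

The feature vector length depends on the region length, so a classifier such as an SVM would need regions of a fixed length; that choice, like the scoring formula, is an assumption of this sketch.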
format Online
Article
Text
id pubmed-5374631
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5374631 2017-04-03 Predicting protein-binding regions in RNA using nucleotide profiles and compositions Choi, Daesik Park, Byungkyu Chae, Hanju Lee, Wook Han, Kyungsook BMC Syst Biol Research BioMed Central 2017-03-14 /pmc/articles/PMC5374631/ /pubmed/28361677 http://dx.doi.org/10.1186/s12918-017-0386-4 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
topic Research