Predicting protein-binding regions in RNA using nucleotide profiles and compositions
BACKGROUND: Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limite...
Main authors: | Choi, Daesik; Park, Byungkyu; Chae, Hanju; Lee, Wook; Han, Kyungsook |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374631/ https://www.ncbi.nlm.nih.gov/pubmed/28361677 http://dx.doi.org/10.1186/s12918-017-0386-4 |
_version_ | 1782518929556504576 |
---|---|
author | Choi, Daesik Park, Byungkyu Chae, Hanju Lee, Wook Han, Kyungsook |
author_facet | Choi, Daesik Park, Byungkyu Chae, Hanju Lee, Wook Han, Kyungsook |
author_sort | Choi, Daesik |
collection | PubMed |
description | BACKGROUND: Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use. RESULTS: We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides and nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The best performance of the model was obtained on a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed lower performance than the 10-fold cross validation, but the performance remained high (87.6% accuracy and 0.752 MCC). In testing on independent datasets, the model achieved an accuracy of 82.2% and an MCC of 0.656. Testing our model and other state-of-the-art methods on the same dataset showed that our model performed better than the others. CONCLUSIONS: Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. However, a slight performance gain was obtained when using the sequence profiles along with nucleotide compositions. These are preliminary results of ongoing research, but they demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12918-017-0386-4) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5374631 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-53746312017-04-03 Predicting protein-binding regions in RNA using nucleotide profiles and compositions Choi, Daesik Park, Byungkyu Chae, Hanju Lee, Wook Han, Kyungsook BMC Syst Biol Research BACKGROUND: Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use. RESULTS: We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides and nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The best performance of the model was obtained in a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed a lower performance than the 10-fold cross validation, but the performance remains high (87.6% accuracy and 0.752 MCC). 
In testing the model on independent datasets, it achieved an accuracy of 82.2% and an MCC of 0.656. Testing of our model and other state-of-the-art methods on a same dataset showed that our model is better than the others. CONCLUSIONS: Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. But, a slight performance gain was obtained when using the sequence profiles along with nucleotide compositions. These are preliminary results of ongoing research, but demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12918-017-0386-4) contains supplementary material, which is available to authorized users. BioMed Central 2017-03-14 /pmc/articles/PMC5374631/ /pubmed/28361677 http://dx.doi.org/10.1186/s12918-017-0386-4 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Choi, Daesik Park, Byungkyu Chae, Hanju Lee, Wook Han, Kyungsook Predicting protein-binding regions in RNA using nucleotide profiles and compositions |
title | Predicting protein-binding regions in RNA using nucleotide profiles and compositions |
title_full | Predicting protein-binding regions in RNA using nucleotide profiles and compositions |
title_fullStr | Predicting protein-binding regions in RNA using nucleotide profiles and compositions |
title_full_unstemmed | Predicting protein-binding regions in RNA using nucleotide profiles and compositions |
title_short | Predicting protein-binding regions in RNA using nucleotide profiles and compositions |
title_sort | predicting protein-binding regions in rna using nucleotide profiles and compositions |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374631/ https://www.ncbi.nlm.nih.gov/pubmed/28361677 http://dx.doi.org/10.1186/s12918-017-0386-4 |
work_keys_str_mv | AT choidaesik predictingproteinbindingregionsinrnausingnucleotideprofilesandcompositions AT parkbyungkyu predictingproteinbindingregionsinrnausingnucleotideprofilesandcompositions AT chaehanju predictingproteinbindingregionsinrnausingnucleotideprofilesandcompositions AT leewook predictingproteinbindingregionsinrnausingnucleotideprofilesandcompositions AT hankyungsook predictingproteinbindingregionsinrnausingnucleotideprofilesandcompositions |
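
The description above reports sensitivity, specificity, accuracy, PPV, NPV and MCC. As a reading aid, the following minimal sketch shows how these six measures are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical: the counts describe an assumed balanced set of 1,000 binding and 1,000 non-binding regions, chosen only so that sensitivity and specificity equal the reported 91.6% and 92.4%. They are not the authors' data, and the other derived values need not match the published figures.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: binding regions predicted as binding / non-binding.
    tn/fp: non-binding regions predicted as non-binding / binding.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts (assumption, not taken from the paper).
example = binary_metrics(tp=916, fn=84, tn=924, fp=76)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the accuracy is 0.920 and the MCC rounds to 0.840, which agrees with the balanced 10-fold figures quoted in the description.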