
Prediction of RNA-binding amino acids from protein and RNA sequences

BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequen...


Bibliographic Details
Main authors: Choi, Sungwook, Han, Kyungsook
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278847/
https://www.ncbi.nlm.nih.gov/pubmed/22373313
http://dx.doi.org/10.1186/1471-2105-12-S13-S7
_version_ 1782223615434948608
author Choi, Sungwook
Han, Kyungsook
author_facet Choi, Sungwook
Han, Kyungsook
author_sort Choi, Sungwook
collection PubMed
description BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequences or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may contain redundant data when the remaining sequence contains similar subsequences within the sequence. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. RESULTS: We developed a feature vector-based method that removes data redundancy for a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. 
In an independent test dataset that does not include the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets demonstrated that our model was better than the others. CONCLUSIONS: The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using the protein features only when predicting the RNA-binding residues in a protein sequence.
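The feature vector-based redundancy reduction described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the window encoding, the single-character RNA tag, and the function names `residue_examples` and `deduplicate` are hypothetical stand-ins for the features named in the abstract (triplet interaction propensity, global protein features, RNA feature). The sketch only shows the general idea of discarding duplicates at the level of individual feature vectors instead of discarding whole sequences.

```python
from typing import Iterator, List, Sequence, Tuple

Example = Tuple[Tuple[str, str], int]  # ((residue window, RNA tag), binding label)


def residue_examples(protein: str, rna_tag: str, labels: Sequence[int],
                     size: int = 3) -> Iterator[Example]:
    """Yield one (feature_vector, label) pair per residue.

    Hypothetical encoding: the feature vector is just the residue window
    centred on the residue plus a tag for the RNA partner.
    """
    half = size // 2
    padded = "-" * half + protein + "-" * half
    for i, label in enumerate(labels):
        yield (padded[i:i + size], rna_tag), label


def deduplicate(examples: List[Example]) -> List[Example]:
    """Keep the first occurrence of each (feature_vector, label) pair.

    Assumption: two examples are redundant only when both the feature
    vector and the label are identical.
    """
    seen = set()
    unique = []
    for features, label in examples:
        key = (features, label)
        if key not in seen:
            seen.add(key)
            unique.append((features, label))
    return unique


# Two near-identical toy sequences bound to the same (hypothetical) RNA "A".
pool = list(residue_examples("MKRK", "A", [0, 1, 1, 0]))
pool += list(residue_examples("MKRR", "A", [0, 1, 1, 0]))
training_set = deduplicate(pool)
```

In this toy case a sequence similarity filter would drop one of the two sequences entirely, whereas the vector-level filter keeps the residues that differ. Because the RNA tag is part of the vector, the same window bound to a different RNA is treated as a distinct example.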
format Online
Article
Text
id pubmed-3278847
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-32788472012-02-14 Prediction of RNA-binding amino acids from protein and RNA sequences Choi, Sungwook Han, Kyungsook BMC Bioinformatics Proceedings BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequences or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may contain redundant data when the remaining sequence contains similar subsequences within the sequence. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. RESULTS: We developed a feature vector-based method that removes data redundancy for a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. 
In an independent test dataset that does not include the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets demonstrated that our model was better than the others. CONCLUSIONS: The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using the protein features only when predicting the RNA-binding residues in a protein sequence. BioMed Central 2011-11-30 /pmc/articles/PMC3278847/ /pubmed/22373313 http://dx.doi.org/10.1186/1471-2105-12-S13-S7 Text en Copyright ©2011 Choi and Han; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Proceedings
Choi, Sungwook
Han, Kyungsook
Prediction of RNA-binding amino acids from protein and RNA sequences
title Prediction of RNA-binding amino acids from protein and RNA sequences
title_full Prediction of RNA-binding amino acids from protein and RNA sequences
title_fullStr Prediction of RNA-binding amino acids from protein and RNA sequences
title_full_unstemmed Prediction of RNA-binding amino acids from protein and RNA sequences
title_short Prediction of RNA-binding amino acids from protein and RNA sequences
title_sort prediction of rna-binding amino acids from protein and rna sequences
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278847/
https://www.ncbi.nlm.nih.gov/pubmed/22373313
http://dx.doi.org/10.1186/1471-2105-12-S13-S7
work_keys_str_mv AT choisungwook predictionofrnabindingaminoacidsfromproteinandrnasequences
AT hankyungsook predictionofrnabindingaminoacidsfromproteinandrnasequences