Prediction of RNA-binding amino acids from protein and RNA sequences
BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequen...
Main authors: | Choi, Sungwook; Han, Kyungsook |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2011 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278847/ https://www.ncbi.nlm.nih.gov/pubmed/22373313 http://dx.doi.org/10.1186/1471-2105-12-S13-S7 |
_version_ | 1782223615434948608 |
---|---|
author | Choi, Sungwook Han, Kyungsook |
author_facet | Choi, Sungwook Han, Kyungsook |
author_sort | Choi, Sungwook |
collection | PubMed |
description | BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequences or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may contain redundant data when the remaining sequence contains similar subsequences within the sequence. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. RESULTS: We developed a feature vector-based method that removes data redundancy for a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. 
In an independent test dataset that did not include the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets showed that our model outperformed them. CONCLUSIONS: The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model, since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using protein features alone when predicting the RNA-binding residues in a protein sequence. |
format | Online Article Text |
id | pubmed-3278847 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-32788472012-02-14 Prediction of RNA-binding amino acids from protein and RNA sequences Choi, Sungwook Han, Kyungsook BMC Bioinformatics Proceedings BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequences or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may contain redundant data when the remaining sequence contains similar subsequences within the sequence. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. RESULTS: We developed a feature vector-based method that removes data redundancy for a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. 
In an independent test dataset that did not include the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets showed that our model outperformed them. CONCLUSIONS: The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model, since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using protein features alone when predicting the RNA-binding residues in a protein sequence. BioMed Central 2011-11-30 /pmc/articles/PMC3278847/ /pubmed/22373313 http://dx.doi.org/10.1186/1471-2105-12-S13-S7 Text en Copyright ©2011 Choi and Han; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Proceedings Choi, Sungwook Han, Kyungsook Prediction of RNA-binding amino acids from protein and RNA sequences |
title | Prediction of RNA-binding amino acids from protein and RNA sequences |
title_full | Prediction of RNA-binding amino acids from protein and RNA sequences |
title_fullStr | Prediction of RNA-binding amino acids from protein and RNA sequences |
title_full_unstemmed | Prediction of RNA-binding amino acids from protein and RNA sequences |
title_short | Prediction of RNA-binding amino acids from protein and RNA sequences |
title_sort | prediction of rna-binding amino acids from protein and rna sequences |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278847/ https://www.ncbi.nlm.nih.gov/pubmed/22373313 http://dx.doi.org/10.1186/1471-2105-12-S13-S7 |
work_keys_str_mv | AT choisungwook predictionofrnabindingaminoacidsfromproteinandrnasequences AT hankyungsook predictionofrnabindingaminoacidsfromproteinandrnasequences |
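
The abstract reports three evaluation measures for the SVM model: accuracy, F-measure, and a correlation coefficient. As a minimal sketch of how such measures are commonly computed from binary predictions (binding vs. non-binding residue), the Python below derives them from confusion-matrix counts. It assumes the correlation coefficient is the Matthews correlation coefficient, which the record does not state; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F-measure, and Matthews correlation from confusion counts.

    Hypothetical helper for illustration; the paper's own evaluation code
    is not part of this record.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Precision and recall for the positive (RNA-binding) class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Matthews correlation coefficient (assumed meaning of
    # "correlation coefficient" in the abstract).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the formulas.
    print(binary_metrics(tp=50, fp=10, tn=30, fn=10))
```

Accuracy alone can look high when non-binding residues dominate a dataset, which is one reason a correlation coefficient is usually reported alongside it.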