Prediction of RNA-binding amino acids from protein and RNA sequences

BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequen...

Bibliographic Details
Main Authors:	Choi, Sungwook; Han, Kyungsook
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278847/ https://www.ncbi.nlm.nih.gov/pubmed/22373313 http://dx.doi.org/10.1186/1471-2105-12-S13-S7

_version_	1782223615434948608
author	Choi, Sungwook Han, Kyungsook
author_facet	Choi, Sungwook Han, Kyungsook
author_sort	Choi, Sungwook
collection	PubMed
description	BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequences or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may contain redundant data when the remaining sequence contains similar subsequences within the sequence. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. RESULTS: We developed a feature vector-based method that removes data redundancy for a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. 
On an independent test dataset that does not include the 3,149 pairs and was not used in training the SVM model, the model achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets demonstrated that our model was better than the others. CONCLUSIONS: The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using the protein features only when predicting the RNA-binding residues in a protein sequence.
format	Online Article Text
id	pubmed-3278847
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-32788472012-02-14 Prediction of RNA-binding amino acids from protein and RNA sequences Choi, Sungwook Han, Kyungsook BMC Bioinformatics Proceedings BACKGROUND: Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on the sequence similarity. The sequence similarity-based method either takes a whole sequence or discards it for a training dataset. However, similar sequences or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by the sequence similarity-based method may contain redundant data when the remaining sequence contains similar subsequences within the sequence. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence even if the protein binds to different RNA molecules. RESULTS: We developed a feature vector-based method that removes data redundancy for a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA feature) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicted RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. 
On an independent test dataset that does not include the 3,149 pairs and was not used in training the SVM model, the model achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets demonstrated that our model was better than the others. CONCLUSIONS: The feature vector-based redundancy reduction method is powerful for constructing a non-redundant training dataset for a learning model since it generates a larger dataset with non-redundant data than the standard sequence similarity-based method. Including the features of both RNA and protein sequences in a feature vector results in better performance than using the protein features only when predicting the RNA-binding residues in a protein sequence. BioMed Central 2011-11-30 /pmc/articles/PMC3278847/ /pubmed/22373313 http://dx.doi.org/10.1186/1471-2105-12-S13-S7 Text en Copyright ©2011 Choi and Han; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Proceedings Choi, Sungwook Han, Kyungsook Prediction of RNA-binding amino acids from protein and RNA sequences
title	Prediction of RNA-binding amino acids from protein and RNA sequences
title_full	Prediction of RNA-binding amino acids from protein and RNA sequences
title_fullStr	Prediction of RNA-binding amino acids from protein and RNA sequences
title_full_unstemmed	Prediction of RNA-binding amino acids from protein and RNA sequences
title_short	Prediction of RNA-binding amino acids from protein and RNA sequences
title_sort	prediction of rna-binding amino acids from protein and rna sequences
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278847/ https://www.ncbi.nlm.nih.gov/pubmed/22373313 http://dx.doi.org/10.1186/1471-2105-12-S13-S7
work_keys_str_mv	AT choisungwook predictionofrnabindingaminoacidsfromproteinandrnasequences AT hankyungsook predictionofrnabindingaminoacidsfromproteinandrnasequences
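
The abstract in this record reports three evaluation figures for the SVM model: accuracy, F-measure, and a correlation coefficient. The record does not give their formulas, so the following is only a minimal sketch of how such figures are commonly computed for a binary binding/non-binding classifier. It assumes the F-measure is the harmonic mean of precision and recall and that the correlation coefficient is the Matthews correlation coefficient; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, f_measure, mcc) from confusion-matrix counts.

    Assumed definitions (not confirmed by the catalog record):
    F-measure = harmonic mean of precision and recall,
    correlation coefficient = Matthews correlation coefficient.
    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return accuracy, f_measure, mcc


# Hypothetical counts for illustration only (binding = positive class).
accuracy, f_measure, mcc = binary_metrics(tp=50, fp=10, tn=30, fn=10)
print(f"accuracy={accuracy:.3f} F-measure={f_measure:.3f} MCC={mcc:.3f}")
```

Under these assumed definitions, accuracy can stay high while the correlation coefficient stays low when one class dominates the data. This is one possible reading of the gap between the accuracy and the correlation coefficient reported for the independent test set.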
