
Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art

BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as...


Bibliographic Details
Main Authors:	Walia, Rasna R; Caragea, Cornelia; Lewis, Benjamin A; Towfic, Fadi; Terribilini, Michael; El-Manzalawy, Yasser; Dobbs, Drena; Honavar, Vasant
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3490755/ https://www.ncbi.nlm.nih.gov/pubmed/22574904 http://dx.doi.org/10.1186/1471-2105-13-89

_version_	1782248865113571328
author	Walia, Rasna R Caragea, Cornelia Lewis, Benjamin A Towfic, Fadi Terribilini, Michael El-Manzalawy, Yasser Dobbs, Drena Honavar, Vasant
author_facet	Walia, Rasna R Caragea, Cornelia Lewis, Benjamin A Towfic, Fadi Terribilini, Michael El-Manzalawy, Yasser Dobbs, Drena Honavar, Vasant
author_sort	Walia, Rasna R
collection	PubMed
description	BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition ‘code’ that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction.
RESULTS: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity-based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) structure-based classifiers that use a smoothed PSSM representation (SmoPSSMStr) outperform those that use a PSSM (PSSMStr) as well as those that use a sequence identity-based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues.
CONCLUSIONS: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs, and of including evaluations on independent test sets in the comparisons.
format	Online Article Text
id	pubmed-3490755
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-34907552012-11-08 Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art Walia, Rasna R Caragea, Cornelia Lewis, Benjamin A Towfic, Fadi Terribilini, Michael El-Manzalawy, Yasser Dobbs, Drena Honavar, Vasant BMC Bioinformatics Research Article BioMed Central 2012-05-10 /pmc/articles/PMC3490755/ /pubmed/22574904 http://dx.doi.org/10.1186/1471-2105-13-89 Text en Copyright ©2012 Walia et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Walia, Rasna R Caragea, Cornelia Lewis, Benjamin A Towfic, Fadi Terribilini, Michael El-Manzalawy, Yasser Dobbs, Drena Honavar, Vasant Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art
title	Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art
title_full	Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art
title_fullStr	Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art
title_full_unstemmed	Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art
title_short	Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art
title_sort	protein-rna interface residue prediction using machine learning: an assessment of the state of the art
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3490755/ https://www.ncbi.nlm.nih.gov/pubmed/22574904 http://dx.doi.org/10.1186/1471-2105-13-89
work_keys_str_mv	AT waliarasnar proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart AT carageacornelia proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart AT lewisbenjamina proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart AT towficfadi proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart AT terribilinimichael proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart AT elmanzalawyyasser proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart AT dobbsdrena proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart AT honavarvasant proteinrnainterfaceresiduepredictionusingmachinelearninganassessmentofthestateoftheart
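
The abstract in this record compares predictors by Matthews Correlation Coefficient (MCC), Specificity, and Sensitivity, and notes that residue-level and protein-level performance can differ markedly. The Python sketch below illustrates both points under stated assumptions. The abstract does not give the formulas, so the conventional confusion-matrix definitions are assumed here and the paper's own definitions may differ (precision is reported as well, since it is sometimes used in place of specificity). The function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common measures from confusion-matrix counts.

    Assumed conventional definitions (the paper's may differ):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      precision   = TP / (TP + FP)
      MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "mcc": _ratio(tp * tn - fp * fn, denom),
    }


def residue_level_mcc(per_protein_counts):
    """Pool the counts of all proteins, then compute one MCC over all residues."""
    tp, fp, tn, fn = (sum(column) for column in zip(*per_protein_counts))
    return binary_metrics(tp, fp, tn, fn)["mcc"]


def protein_level_mcc(per_protein_counts):
    """Compute the MCC of each protein separately, then average over proteins."""
    scores = [binary_metrics(*counts)["mcc"] for counts in per_protein_counts]
    return _ratio(sum(scores), len(scores))


# Hypothetical (tp, fp, tn, fn) counts for two proteins, for illustration only.
EXAMPLE_COUNTS = [(8, 2, 8, 2), (2, 8, 82, 8)]
```

Calling `residue_level_mcc(EXAMPLE_COUNTS)` and `protein_level_mcc(EXAMPLE_COUNTS)` on the hypothetical counts gives different values, because pooling weights every residue equally while averaging weights every protein equally. This is one plausible reading of why the two levels of evaluation can disagree; the record itself does not describe the paper's averaging procedure.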

