Representative transcript sets for evaluating a translational initiation sites predictor

BACKGROUND: Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal...

Bibliographic Details
Main authors:	Zeng, Jia; Alhajj, Reda; Demetrick, Douglas J
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2712473/ https://www.ncbi.nlm.nih.gov/pubmed/19573244 http://dx.doi.org/10.1186/1471-2105-10-206

_version_	1782169493067268096
author	Zeng, Jia Alhajj, Reda Demetrick, Douglas J
author_facet	Zeng, Jia Alhajj, Reda Demetrick, Douglas J
author_sort	Zeng, Jia
collection	PubMed
description	BACKGROUND: Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens. RESULTS: In this paper, we report a general algorithm for constructing a reliable sequence collection that only includes mRNA sequences whose corresponding protein products present an average profile of the general protein population of a given organism, with respect to three major structural parameters. Four representative transcript collections, each derived from a model organism, have been obtained following the algorithm we propose. Evaluation of these data sets shows that they are reasonable representations of the spectrum of proteins obtained from cellular proteomic studies. Six state-of-the-art predictors have been used to test the usefulness of the construction algorithm that we proposed. Comparative study which reports the predictors' performance on our data set as well as three other existing benchmark collections has demonstrated the actual merits of our data sets as benchmark testing collections. CONCLUSION: The proposed data set construction algorithm has demonstrated its property of being a general and widely applicable scheme. Our comparison with published proteomic studies has shown that the expression of our data set of transcripts generates a polypeptide population that is representative of that obtained from evaluation of biological specimens. Our data set thus represents "real world" transcripts that will allow more accurate evaluation of algorithms dedicated to identification of TISs, as well as other translational regulatory motifs within mRNA sequences. The algorithm proposed by us aims at compiling a redundancy-free data set by removing redundant copies of homologous proteins. The existence of such data sets may be useful for conducting statistical analyses of protein sequence-structure relations. At the current stage, our approach's focus is to obtain an "average" protein data set for any particular organism without posing much selection bias. However, with the three major protein structural parameters deeply integrated into the scheme, it would be a trivial task to extend the current method for obtaining a more selective protein data set, which may facilitate the study of some particular protein structure.
format	Text
id	pubmed-2712473
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2712473 2009-07-18 Representative transcript sets for evaluating a translational initiation sites predictor Zeng, Jia Alhajj, Reda Demetrick, Douglas J BMC Bioinformatics Research Article BioMed Central 2009-07-02 /pmc/articles/PMC2712473/ /pubmed/19573244 http://dx.doi.org/10.1186/1471-2105-10-206 Text en Copyright © 2009 Zeng et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Zeng, Jia Alhajj, Reda Demetrick, Douglas J Representative transcript sets for evaluating a translational initiation sites predictor
title	Representative transcript sets for evaluating a translational initiation sites predictor
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2712473/ https://www.ncbi.nlm.nih.gov/pubmed/19573244 http://dx.doi.org/10.1186/1471-2105-10-206
work_keys_str_mv	AT zengjia representativetranscriptsetsforevaluatingatranslationalinitiationsitespredictor AT alhajjreda representativetranscriptsetsforevaluatingatranslationalinitiationsitespredictor AT demetrickdouglasj representativetranscriptsetsforevaluatingatranslationalinitiationsitespredictor
