Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements

BACKGROUND: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding and non-coding conserved sequences, making so necessary...

Bibliographic Details
Main Authors:	Creanza, Teresa M; Horner, David S; D'Addabbo, Annarita; Maglietta, Rosalia; Mignone, Flavio; Ancona, Nicola; Pesole, Graziano
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2697643/ https://www.ncbi.nlm.nih.gov/pubmed/19534745 http://dx.doi.org/10.1186/1471-2105-10-S6-S2

_version_	1782168346648641536
author	Creanza, Teresa M; Horner, David S; D'Addabbo, Annarita; Maglietta, Rosalia; Mignone, Flavio; Ancona, Nicola; Pesole, Graziano
author_sort	Creanza, Teresa M
collection	PubMed
description	BACKGROUND: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding and non-coding conserved sequences, which makes a systematic statistical assessment of their differences necessary. A comprehensive study should comprise an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study in which the prediction accuracies of classifiers trained on single features and on groups of features are analyzed, conditional on the compared species and on the sequence lengths. RESULTS: In this paper we compared the distributions of a set of comparative and non-comparative features and evaluated the prediction accuracy of classifiers trained to discriminate sequence elements conserved among human, mouse and rat. The association study showed that the analyzed features are statistically different between the two classes. To study the influence of sequence length on feature performance, a predictive study was performed on different data sets, each composed of equal numbers of coding and non-coding alignments of equal length, with the average length ascending across data sets. We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Moreover, linear discriminant classifiers trained on comparative features in general outperformed classifiers based on intrinsic ones. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05). CONCLUSION: We observed distinct and consistent patterns for the individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences; this is likely because such features capture deviations from strictly neutral evolution that are expected as a consequence of the characteristics of the genetic code.
format	Text
id	pubmed-2697643
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2697643 2009-06-16 Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements Creanza, Teresa M; Horner, David S; D'Addabbo, Annarita; Maglietta, Rosalia; Mignone, Flavio; Ancona, Nicola; Pesole, Graziano BMC Bioinformatics Proceedings BioMed Central 2009-06-16 /pmc/articles/PMC2697643/ /pubmed/19534745 http://dx.doi.org/10.1186/1471-2105-10-S6-S2 Text en Copyright © 2009 Creanza et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2697643/ https://www.ncbi.nlm.nih.gov/pubmed/19534745 http://dx.doi.org/10.1186/1471-2105-10-S6-S2
