
Assessing the Effects of Data Selection and Representation on the Development of Reliable E. coli Sigma 70 Promoter Region Predictors

As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more desirable. Promoters are the key regulatory elements, which recruit the transcriptional machinery through b...


Bibliographic Details
Main Authors:	Abbas, Mostafa M.; Mohie-Eldin, Mostafa M.; EL-Manzalawy, Yasser
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4372424/ https://www.ncbi.nlm.nih.gov/pubmed/25803493 http://dx.doi.org/10.1371/journal.pone.0119721

collection	PubMed
description	As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are key regulatory elements that recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while performing very poorly on independent test data. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data were obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best-performing sequence-based classifiers outperform the best-performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top-performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
id	pubmed-4372424
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-4372424 2015-04-04. PLoS One, Research Article. Public Library of Science, 2015-03-24. © 2015 Abbas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.


Similar Items