Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites

We report an approach to predict the DNA specificity of tetracycline repressor (TetR) family transcription regulators (TFRs). First, a genome sequence-based method was streamlined, with quantitative P-values defined to identify reliable predictions. Then, a framework was introduced to incorporate s...

Bibliographic Details
Main Authors:	Long, Pengpeng; Zhang, Lu; Huang, Bin; Chen, Quan; Liu, Haiyan
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Computational Biology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7736823/ https://www.ncbi.nlm.nih.gov/pubmed/33264415 http://dx.doi.org/10.1093/nar/gkaa1134

_version_	1783622845598793728
author	Long, Pengpeng Zhang, Lu Huang, Bin Chen, Quan Liu, Haiyan
author_facet	Long, Pengpeng Zhang, Lu Huang, Bin Chen, Quan Liu, Haiyan
author_sort	Long, Pengpeng
collection	PubMed
description	We report an approach to predict the DNA specificity of tetracycline repressor (TetR) family transcription regulators (TFRs). First, a genome sequence-based method was streamlined, with quantitative P-values defined to identify reliable predictions. Then, a framework was introduced to incorporate structural data and to train a statistical energy function that scores the pairing between a TFR and a TFR binding site (TFBS) based on their sequences. When the predictions were benchmarked against experiments, the TFBSs for 29 out of 30 TFRs were correctly predicted by either the genome sequence-based or the statistical energy-based method. Using P-values or Z-scores as indicators, we estimate that 59.6% of TFRs are covered with relatively reliable predictions by at least one of the two methods, while only 28.7% are covered by the genome sequence-based method alone. Our approach predicts a large number of new TFBSs that cannot be correctly retrieved from public databases such as FootprintDB. High-throughput experimental assays suggest that the statistical energy can reliably model the TFBSs of a significant number of TFRs. Thus, the energy function may be applied to search for new TFBSs in the respective genomes. It is possible to extend our approach to other transcription factor families with sufficient structural information.
format	Online Article Text
id	pubmed-7736823
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7736823 2020-12-17 Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites Long, Pengpeng Zhang, Lu Huang, Bin Chen, Quan Liu, Haiyan Nucleic Acids Res Computational Biology We report an approach to predict the DNA specificity of tetracycline repressor (TetR) family transcription regulators (TFRs). First, a genome sequence-based method was streamlined, with quantitative P-values defined to identify reliable predictions. Then, a framework was introduced to incorporate structural data and to train a statistical energy function that scores the pairing between a TFR and a TFR binding site (TFBS) based on their sequences. When the predictions were benchmarked against experiments, the TFBSs for 29 out of 30 TFRs were correctly predicted by either the genome sequence-based or the statistical energy-based method. Using P-values or Z-scores as indicators, we estimate that 59.6% of TFRs are covered with relatively reliable predictions by at least one of the two methods, while only 28.7% are covered by the genome sequence-based method alone. Our approach predicts a large number of new TFBSs that cannot be correctly retrieved from public databases such as FootprintDB. High-throughput experimental assays suggest that the statistical energy can reliably model the TFBSs of a significant number of TFRs. Thus, the energy function may be applied to search for new TFBSs in the respective genomes. It is possible to extend our approach to other transcription factor families with sufficient structural information. Oxford University Press 2020-12-02 /pmc/articles/PMC7736823/ /pubmed/33264415 http://dx.doi.org/10.1093/nar/gkaa1134 Text en © The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Computational Biology Long, Pengpeng Zhang, Lu Huang, Bin Chen, Quan Liu, Haiyan Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites
title	Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites
title_full	Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites
title_fullStr	Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites
title_full_unstemmed	Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites
title_short	Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites
title_sort	integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites
topic	Computational Biology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7736823/ https://www.ncbi.nlm.nih.gov/pubmed/33264415 http://dx.doi.org/10.1093/nar/gkaa1134
work_keys_str_mv	AT longpengpeng integratinggenomesequenceandstructuraldataforstatisticallearningtopredicttranscriptionfactorbindingsites AT zhanglu integratinggenomesequenceandstructuraldataforstatisticallearningtopredicttranscriptionfactorbindingsites AT huangbin integratinggenomesequenceandstructuraldataforstatisticallearningtopredicttranscriptionfactorbindingsites AT chenquan integratinggenomesequenceandstructuraldataforstatisticallearningtopredicttranscriptionfactorbindingsites AT liuhaiyan integratinggenomesequenceandstructuraldataforstatisticallearningtopredicttranscriptionfactorbindingsites

Similar Items