
Predicting transcription factor binding using ensemble random forest models

Background: Understanding the location and cell-type specific binding of Transcription Factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging, because TFs often bind only to short DNA motifs and cell-type specific co-factors may work t...

Bibliographic Details
Main Authors:	Behjati Ardakani, Fatemeh, Schmidt, Florian, Schulz, Marcel H.
Format:	Online Article Text
Language:	English
Published:	F1000 Research Limited 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6823902/ https://www.ncbi.nlm.nih.gov/pubmed/31723409 http://dx.doi.org/10.12688/f1000research.16200.2

_version_	1783464617325887488
author	Behjati Ardakani, Fatemeh Schmidt, Florian Schulz, Marcel H.
author_facet	Behjati Ardakani, Fatemeh Schmidt, Florian Schulz, Marcel H.
author_sort	Behjati Ardakani, Fatemeh
collection	PubMed
description	Background: Understanding the location and cell-type specific binding of Transcription Factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging, because TFs often bind only to short DNA motifs and cell-type specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif descriptions in the form of position-specific energy matrices (PSEMs). Methods: We use TF ChIP-seq data as a gold standard for model training and evaluation. Our contribution is a novel ensemble learning approach using random forest classifiers. In the context of the ENCODE-DREAM in vivo TF binding site prediction challenge we consider different learning setups. Results: Our results indicate that the ensemble learning approach generalizes better across tissues and cell types than individual tissue-specific classifiers or a classifier built on data aggregated across tissues. Furthermore, we show that incorporating DNase1-seq peaks is essential to reduce the false positive rate of TF binding predictions compared to considering the raw DNase1 signal. Conclusions: Analysis of important features reveals that the models preferentially select motifs of other TFs that are close interaction partners in existing protein-protein interaction networks. Code generated in the scope of this project is available on GitHub: https://github.com/SchulzLab/TFAnalysis (DOI: 10.5281/zenodo.1409697).
format	Online Article Text
id	pubmed-6823902
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	F1000 Research Limited
record_format	MEDLINE/PubMed
spelling	pubmed-68239022019-11-12 Predicting transcription factor binding using ensemble random forest models Behjati Ardakani, Fatemeh Schmidt, Florian Schulz, Marcel H. F1000Res Research Article Background: Understanding the location and cell-type specific binding of Transcription Factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging, because TFs often bind only to short DNA motifs and cell-type specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif description in form of position specific energy matrices (PSEMs). Methods: We use TF ChIP-seq data as a gold-standard for model training and evaluation. Our contribution is a novel ensemble learning approach using random forest classifiers. In the context of the ENCODE-DREAM in vivo TF binding site prediction challenge we consider different learning setups. Results: Our results indicate that the ensemble learning approach is able to better generalize across tissues and cell-types compared to individual tissue-specific classifiers or a classifier built based upon data aggregated across tissues. Furthermore, we show that incorporating DNase1-seq peaks is essential to reduce the false positive rate of TF binding predictions compared to considering the raw DNase1 signal. Conclusions: Analysis of important features reveals that the models preferentially select motifs of other TFs that are close interaction partners in existing protein protein-interaction networks. Code generated in the scope of this project is available on GitHub: https://github.com/SchulzLab/TFAnalysis (DOI: 10.5281/zenodo.1409697). F1000 Research Limited 2019-09-02 /pmc/articles/PMC6823902/ /pubmed/31723409 http://dx.doi.org/10.12688/f1000research.16200.2 Text en Copyright: © 2019 Behjati Ardakani F et al. 
http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Behjati Ardakani, Fatemeh Schmidt, Florian Schulz, Marcel H. Predicting transcription factor binding using ensemble random forest models
title	Predicting transcription factor binding using ensemble random forest models
title_full	Predicting transcription factor binding using ensemble random forest models
title_fullStr	Predicting transcription factor binding using ensemble random forest models
title_full_unstemmed	Predicting transcription factor binding using ensemble random forest models
title_short	Predicting transcription factor binding using ensemble random forest models
title_sort	predicting transcription factor binding using ensemble random forest models
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6823902/ https://www.ncbi.nlm.nih.gov/pubmed/31723409 http://dx.doi.org/10.12688/f1000research.16200.2
work_keys_str_mv	AT behjatiardakanifatemeh predictingtranscriptionfactorbindingusingensemblerandomforestmodels AT schmidtflorian predictingtranscriptionfactorbindingusingensemblerandomforestmodels AT schulzmarcelh predictingtranscriptionfactorbindingusingensemblerandomforestmodels
