Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data

BACKGROUND: For many years now, binding preferences of transcription factors have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of...

Bibliographic Details
Main Authors:	Dabrowski, Michal; Dojer, Norbert; Krystkowiak, Izabella; Kaminska, Bozena; Wilczynski, Bartek
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4436866/ https://www.ncbi.nlm.nih.gov/pubmed/25927199 http://dx.doi.org/10.1186/s12859-015-0573-5

_version_	1782372148247003136
author	Dabrowski, Michal; Dojer, Norbert; Krystkowiak, Izabella; Kaminska, Bozena; Wilczynski, Bartek
author_sort	Dabrowski, Michal
collection	PubMed
description	BACKGROUND: For many years now, binding preferences of transcription factors (TFs) have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them is left with many competing methods of identifying potential binding sites in a genome of interest, and there is little published information regarding the optimality of different choices. Thanks to the availability of a large number of different motif models, as well as a number of experimental datasets describing actual binding of TFs in hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter. RESULTS: We focus on the task of identifying potential transcription factor binding sites in the human genome. Firstly, we provide a comprehensive comparison of the coverage and quality of models available in different databases, showing that the public databases have comparable TF coverage and better motif performance than commercial databases. Secondly, we compare different motif scanners, showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Thirdly, we calculate for each motif a detection threshold optimizing the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and therefore most uniformly accurate. CONCLUSION: We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0573-5) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4436866
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4436866 2015-05-20 BMC Bioinformatics Research Article BioMed Central 2015-05-01 /pmc/articles/PMC4436866/ /pubmed/25927199 http://dx.doi.org/10.1186/s12859-015-0573-5 Text en © Dabrowski et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4436866/ https://www.ncbi.nlm.nih.gov/pubmed/25927199 http://dx.doi.org/10.1186/s12859-015-0573-5