
Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data

BACKGROUND: For many years now, binding preferences of transcription factors have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of...


Bibliographic Details
Main authors: Dabrowski, Michal, Dojer, Norbert, Krystkowiak, Izabella, Kaminska, Bozena, Wilczynski, Bartek
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4436866/
https://www.ncbi.nlm.nih.gov/pubmed/25927199
http://dx.doi.org/10.1186/s12859-015-0573-5
author Dabrowski, Michal
Dojer, Norbert
Krystkowiak, Izabella
Kaminska, Bozena
Wilczynski, Bartek
collection PubMed
description BACKGROUND: For many years now, binding preferences of transcription factors (TFs) have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them is left with many competing methods of identifying potential binding sites in a genome of interest, and there is little published information regarding the optimality of different choices. Thanks to the availability of a large number of different motif models, as well as a number of experimental datasets describing actual binding of TFs in hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter. RESULTS: We focus on the task of identifying potential transcription factor binding sites in the human genome. Firstly, we provide a comprehensive comparison of the coverage and quality of models available in different databases, showing that the public databases have comparable TF coverage and better motif performance than commercial databases. Secondly, we compare different motif scanners, showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Thirdly, we calculate for each motif a detection threshold optimizing the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and therefore most uniformly accurate. CONCLUSION: We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0573-5) contains supplementary material, which is available to authorized users.
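The abstract's point about choosing motif thresholds by a common false-positive rate can be illustrated with a minimal sketch. This is not the authors' pipeline or any tool they benchmarked: the function names, the uniform random background, the pseudocount, and the forward-strand-only scan are all assumptions made here for illustration.

```python
import math
import random

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds weight matrix.
    Assumes a uniform background frequency (an illustrative simplification)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({b: math.log2(((column[b] + pseudocount) / total) / background)
                    for b in BASES})
    return pwm

def best_score(pwm, seq):
    """Highest PWM score over all forward-strand windows of seq."""
    width = len(pwm)
    return max(sum(pwm[i][seq[start + i]] for i in range(width))
               for start in range(len(seq) - width + 1))

def background_scores(pwm, n_background=2000, seq_len=50, seed=0):
    """Best scores on hypothetical uniform-random background sequences."""
    rng = random.Random(seed)
    return [best_score(pwm, "".join(rng.choice(BASES) for _ in range(seq_len)))
            for _ in range(n_background)]

def threshold_for_fpr(pwm, fpr, n_background=2000, seq_len=50, seed=0):
    """Score threshold such that at most a fraction `fpr` of background
    sequences score strictly above it (requires 0 <= fpr < 1)."""
    scores = sorted(background_scores(pwm, n_background, seq_len, seed), reverse=True)
    return scores[int(fpr * n_background)]

# Hypothetical 8-column count matrix with consensus ACGTACGT.
counts = [{b: (18 if b == c else 1) for b in BASES} for c in "ACGTACGT"]
pwm = log_odds_pwm(counts)
threshold = threshold_for_fpr(pwm, fpr=0.01)
```

A sequence would be called a predicted hit when `best_score(pwm, seq) > threshold`. Applying the same `fpr` to every motif makes the expected background hit rate comparable across motifs, whereas a fixed score cutoff would behave differently for motifs of different information content.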
format Online
Article
Text
id pubmed-4436866
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4436866 2015-05-20 Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data Dabrowski, Michal Dojer, Norbert Krystkowiak, Izabella Kaminska, Bozena Wilczynski, Bartek BMC Bioinformatics Research Article BioMed Central 2015-05-01 /pmc/articles/PMC4436866/ /pubmed/25927199 http://dx.doi.org/10.1186/s12859-015-0573-5 Text en
© Dabrowski et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data
topic Research Article