Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data
BACKGROUND: For many years now, binding preferences of transcription factors have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of...
Main authors: | Dabrowski, Michal; Dojer, Norbert; Krystkowiak, Izabella; Kaminska, Bozena; Wilczynski, Bartek |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4436866/ https://www.ncbi.nlm.nih.gov/pubmed/25927199 http://dx.doi.org/10.1186/s12859-015-0573-5 |
_version_ | 1782372148247003136 |
---|---|
author | Dabrowski, Michal Dojer, Norbert Krystkowiak, Izabella Kaminska, Bozena Wilczynski, Bartek |
author_sort | Dabrowski, Michal |
collection | PubMed |
description | BACKGROUND: For many years now, binding preferences of transcription factors (TFs) have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them is left with many competing methods of identifying potential binding sites in a genome of interest, and there is little published information regarding the optimality of different choices. Thanks to the availability of a large number of different motif models, as well as a number of experimental datasets describing actual binding of TFs in hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter. RESULTS: We focus on the task of identifying potential transcription factor binding sites in the human genome. Firstly, we provide a comprehensive comparison of the coverage and quality of models available in different databases, showing that the public databases have comparable TF coverage and better motif performance than commercial databases. Secondly, we compare different motif scanners, showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Thirdly, we calculate for each motif a detection threshold optimizing the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and therefore most uniformly accurate. CONCLUSION: We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0573-5) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4436866 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4436866 2015-05-20 Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data Dabrowski, Michal Dojer, Norbert Krystkowiak, Izabella Kaminska, Bozena Wilczynski, Bartek BMC Bioinformatics Research Article BioMed Central 2015-05-01 /pmc/articles/PMC4436866/ /pubmed/25927199 http://dx.doi.org/10.1186/s12859-015-0573-5 Text en © Dabrowski et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4436866/ https://www.ncbi.nlm.nih.gov/pubmed/25927199 http://dx.doi.org/10.1186/s12859-015-0573-5 |
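
The abstract's finding that a common false-positive rate is the least biased way to set motif thresholds can be illustrated with a small sketch. This is not the authors' pipeline: the toy count matrix, the uniform background model, the single-strand scan, and all function names below are assumptions made for illustration only. The idea shown is that each motif gets its own score cutoff, chosen so that the same fraction of random background windows would pass it.

```python
import math
import random

BASES = "ACGT"

# Hypothetical toy motif: made-up counts favouring the consensus TGACTCA.
CONSENSUS = "TGACTCA"
TOY_COUNTS = [{b: (17 if b == c else 1) for b in BASES} for c in CONSENSUS]


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds position weight matrix.

    Assumes a uniform background frequency for every base (an assumption,
    not something stated in the record above).
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append(
            {b: math.log2((column[b] + pseudocount) / total / background) for b in BASES}
        )
    return pwm


def score_windows(pwm, sequence):
    """Score every motif-length window on the given strand only.

    The sequence must contain only upper-case A, C, G, T.
    """
    width = len(pwm)
    return [
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]


def threshold_for_fpr(pwm, fpr, n_background=20000, seed=0):
    """Estimate the score cutoff for a target false-positive rate.

    Scores n_background random windows drawn from the uniform background and
    returns the k-th highest score, where k = fpr * n_background. Windows
    scoring >= the cutoff count as hits; tied scores can push the realised
    rate slightly above the target.
    """
    rng = random.Random(seed)
    width = len(pwm)
    scores = sorted(
        (sum(pwm[i][rng.choice(BASES)] for i in range(width)) for _ in range(n_background)),
        reverse=True,
    )
    k = max(1, int(fpr * n_background))
    return scores[k - 1]


if __name__ == "__main__":
    pwm = log_odds_pwm(TOY_COUNTS)
    cutoff = threshold_for_fpr(pwm, fpr=0.001)
    sequence = "AAATGACTCAAA"
    hits = [i for i, s in enumerate(score_windows(pwm, sequence)) if s >= cutoff]
    print("cutoff:", cutoff, "hit positions:", hits)
```

Applying the same `fpr` value to every motif, rather than a fixed raw score or a fixed fraction of the maximum score, is what makes the cutoff comparable across motifs of different length and information content. A genome-wide scan would also need the reverse strand and a background model fitted to the genome of interest.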