Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data
BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than t...
Main authors: | Eggeling, Ralf; Roos, Teemu; Myllymäki, Petri; Grosse, Ivo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640111/ https://www.ncbi.nlm.nih.gov/pubmed/26552868 http://dx.doi.org/10.1186/s12859-015-0797-4 |
_version_ | 1782400036031692800 |
---|---|
author | Eggeling, Ralf Roos, Teemu Myllymäki, Petri Grosse, Ivo |
author_facet | Eggeling, Ralf Roos, Teemu Myllymäki, Petri Grosse, Ivo |
author_sort | Eggeling, Ralf |
collection | PubMed |
description | BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than three decades but its simple assumptions are increasingly put into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery. RESULTS: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice. CONCLUSIONS: The traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0797-4) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4640111 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-46401112015-11-11 Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data Eggeling, Ralf Roos, Teemu Myllymäki, Petri Grosse, Ivo BMC Bioinformatics Research Article BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than three decades but its simple assumptions are increasingly put into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery. RESULTS: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice. CONCLUSIONS: The traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0797-4) contains supplementary material, which is available to authorized users. BioMed Central 2015-11-09 /pmc/articles/PMC4640111/ /pubmed/26552868 http://dx.doi.org/10.1186/s12859-015-0797-4 Text en © Eggeling et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Eggeling, Ralf Roos, Teemu Myllymäki, Petri Grosse, Ivo Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data |
title | Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data |
title_full | Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data |
title_fullStr | Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data |
title_full_unstemmed | Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data |
title_short | Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data |
title_sort | inferring intra-motif dependencies of dna binding sites from chip-seq data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640111/ https://www.ncbi.nlm.nih.gov/pubmed/26552868 http://dx.doi.org/10.1186/s12859-015-0797-4 |
work_keys_str_mv | AT eggelingralf inferringintramotifdependenciesofdnabindingsitesfromchipseqdata AT roosteemu inferringintramotifdependenciesofdnabindingsitesfromchipseqdata AT myllymakipetri inferringintramotifdependenciesofdnabindingsitesfromchipseqdata AT grosseivo inferringintramotifdependenciesofdnabindingsitesfromchipseqdata |