Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data

BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than t...

Bibliographic Details
Main authors: Eggeling, Ralf, Roos, Teemu, Myllymäki, Petri, Grosse, Ivo
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640111/
https://www.ncbi.nlm.nih.gov/pubmed/26552868
http://dx.doi.org/10.1186/s12859-015-0797-4
author Eggeling, Ralf
Roos, Teemu
Myllymäki, Petri
Grosse, Ivo
collection PubMed
description BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, was the standard model for this task for more than three decades, but its simple assumptions are increasingly called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery. RESULTS: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice. CONCLUSIONS: The traditional PWM model indeed appears to be insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0797-4) contains supplementary material, which is available to authorized users.
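The independence assumption of the PWM model mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' method: it compares a PWM with a plain position-specific first-order Markov model on fully observed toy sites, and it does not implement inhomogeneous parsimonious Markov models, latent-variable motif discovery, or the proposed model selection algorithm. All function names, the toy sites, and the pseudocount value are hypothetical choices made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def fit_pwm(sites, pseudocount=1.0):
    """Estimate one independent nucleotide distribution per motif position.

    pseudocount must be > 0 so that every probability is well defined.
    """
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm


def pwm_probability(pwm, site):
    """P(site) under the PWM: a product of per-position probabilities."""
    p = 1.0
    for dist, base in zip(pwm, site):
        p *= dist[base]
    return p


def fit_first_order(sites, pseudocount=1.0):
    """Estimate P(x_0) and position-specific P(x_i | x_{i-1}) for i >= 1."""
    width = len(sites[0])
    # The first position has no predecessor, so it is modeled like a PWM column.
    first = fit_pwm([site[0] for site in sites], pseudocount)[0]
    transitions = []
    for i in range(1, width):
        pair_counts = Counter((site[i - 1], site[i]) for site in sites)
        table = {}
        for prev in ALPHABET:
            total = sum(pair_counts[(prev, b)] for b in ALPHABET)
            total += pseudocount * len(ALPHABET)
            table[prev] = {
                b: (pair_counts[(prev, b)] + pseudocount) / total
                for b in ALPHABET
            }
        transitions.append(table)
    return first, transitions


def first_order_probability(model, site):
    """P(site) under the first-order model: each base depends on its left neighbor."""
    first, transitions = model
    p = first[site[0]]
    for i, table in enumerate(transitions, start=1):
        p *= table[site[i - 1]][site[i]]
    return p


# Toy data (hypothetical): the two positions are perfectly correlated.
sites = ["AT"] * 10 + ["CG"] * 10
pwm = fit_pwm(sites)
markov = fit_first_order(sites)

for candidate in ("AT", "AG"):
    print(candidate,
          pwm_probability(pwm, candidate),
          first_order_probability(markov, candidate))
```

On this toy data the PWM cannot distinguish the observed site AT from the never-observed site AG, because it only sees the per-position frequencies, whereas the first-order model assigns AT a clearly higher probability. The extra parameters of the first-order model are also what creates the overfitting risk that the abstract's model selection procedure is meant to control.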
format Online
Article
Text
id pubmed-4640111
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4640111 2015-11-11 Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data Eggeling, Ralf Roos, Teemu Myllymäki, Petri Grosse, Ivo BMC Bioinformatics Research Article BioMed Central 2015-11-09 /pmc/articles/PMC4640111/ /pubmed/26552868 http://dx.doi.org/10.1186/s12859-015-0797-4 Text en © Eggeling et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data
topic Research Article