
Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data

BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than t...

Bibliographic Details
Main authors:	Eggeling, Ralf; Roos, Teemu; Myllymäki, Petri; Grosse, Ivo
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640111/ https://www.ncbi.nlm.nih.gov/pubmed/26552868 http://dx.doi.org/10.1186/s12859-015-0797-4

author	Eggeling, Ralf Roos, Teemu Myllymäki, Petri Grosse, Ivo
collection	PubMed
description	BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than three decades but its simple assumptions are increasingly put into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery. RESULTS: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice. CONCLUSIONS: The traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. 
Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0797-4) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4640111
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-46401112015-11-11 Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data Eggeling, Ralf Roos, Teemu Myllymäki, Petri Grosse, Ivo BMC Bioinformatics Research Article BACKGROUND: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than three decades but its simple assumptions are increasingly put into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery. RESULTS: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice. CONCLUSIONS: The traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. 
Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0797-4) contains supplementary material, which is available to authorized users. BioMed Central 2015-11-09 /pmc/articles/PMC4640111/ /pubmed/26552868 http://dx.doi.org/10.1186/s12859-015-0797-4 Text en © Eggeling et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640111/ https://www.ncbi.nlm.nih.gov/pubmed/26552868 http://dx.doi.org/10.1186/s12859-015-0797-4
