The Limits of De Novo DNA Motif Discovery


Bibliographic Details
Main Authors:	Simcha, David; Price, Nathan D.; Geman, Donald
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2012
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3492406/ https://www.ncbi.nlm.nih.gov/pubmed/23144830 http://dx.doi.org/10.1371/journal.pone.0047836

author	Simcha, David; Price, Nathan D.; Geman, Donald
author_facet	Simcha, David; Price, Nathan D.; Geman, Donald
author_sort	Simcha, David
collection	PubMed
description	A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify “motifs” that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery–searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA “background” sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are “too null,” resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where “ground truth” is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced “over-fitting” in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary. 
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
format	Online Article Text
id	pubmed-3492406
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3492406 2012-11-09 The Limits of De Novo DNA Motif Discovery Simcha, David; Price, Nathan D.; Geman, Donald PLoS One Research Article Public Library of Science 2012-11-07 /pmc/articles/PMC3492406/ /pubmed/23144830 http://dx.doi.org/10.1371/journal.pone.0047836 Text en © 2012 Simcha et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	The Limits of De Novo DNA Motif Discovery
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3492406/ https://www.ncbi.nlm.nih.gov/pubmed/23144830 http://dx.doi.org/10.1371/journal.pone.0047836
