Towards a theoretical understanding of false positives in DNA motif finding

BACKGROUND: Detection of false-positive motifs is one of the main causes of low performance in de novo DNA motif-finding methods. Despite the substantial algorithm development effort in this area, recent comprehensive benchmark studies revealed that the performance of DNA motif-finders leaves room f...

Bibliographic Details
Main authors:	Zia, Amin; Moses, Alan M
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436861/ https://www.ncbi.nlm.nih.gov/pubmed/22738169 http://dx.doi.org/10.1186/1471-2105-13-151

collection	PubMed
description	BACKGROUND: Detection of false-positive motifs is one of the main causes of low performance in de novo DNA motif-finding methods. Despite the substantial algorithm development effort in this area, recent comprehensive benchmark studies revealed that the performance of DNA motif-finders leaves room for improvement in realistic scenarios. RESULTS: Using large-deviations theory, we derive a remarkably simple relationship that describes the dependence of false positives on dataset size for the one-occurrence-per-sequence motif-finding problem. As expected, we predict that false positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. Interestingly, we find that the false-positive strength depends more strongly on the number of sequences in the dataset than it does on the sequence length, but that the dependence on the number of sequences diminishes as the dataset grows, so that adding more sequences eventually does not reduce the false-positive rate significantly. We test our theoretical predictions by applying four popular motif-finding algorithms that solve the one-occurrence-per-sequence problem (MEME, the Gibbs Sampler, Weeder, and GIMSAN) to simulated data that contain no motifs. We find that the dependence of false positives detected by these programs on the motif-finding parameters is similar to that predicted by our formula. CONCLUSIONS: We quantify the relationship between the sequence search space and motif-finding false positives. Based on the simple formula we derive, we provide a number of intuitive rules of thumb that may be used to enhance motif-finding results in practice. Our results provide a theoretical advance in an important problem in computational biology.
format	Online Article Text
id	pubmed-3436861
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3436861 2012-09-11 Towards a theoretical understanding of false positives in DNA motif finding Zia, Amin Moses, Alan M BMC Bioinformatics Research Article BioMed Central 2012-06-27 /pmc/articles/PMC3436861/ /pubmed/22738169 http://dx.doi.org/10.1186/1471-2105-13-151 Text en Copyright ©2012 Zia and Moses; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Towards a theoretical understanding of false positives in DNA motif finding
topic	Research Article
