
A reexamination of information theory-based methods for DNA-binding site identification

BACKGROUND: Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only...


Bibliographic Details
Main authors: Erill, Ivan; O'Neill, Michael C
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2680408/
https://www.ncbi.nlm.nih.gov/pubmed/19210776
http://dx.doi.org/10.1186/1471-2105-10-57
_version_ 1782166952002715648
author Erill, Ivan
O'Neill, Michael C
collection PubMed
description BACKGROUND: Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods. RESULTS: Our results reveal that conventional benchmarking against artificial sequence data leads frequently to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions on information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results. CONCLUSION: We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than any other alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
format Text
id pubmed-2680408
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2680408 2009-05-12 A reexamination of information theory-based methods for DNA-binding site identification Erill, Ivan O'Neill, Michael C BMC Bioinformatics Research Article BioMed Central 2009-02-11 /pmc/articles/PMC2680408/ /pubmed/19210776 http://dx.doi.org/10.1186/1471-2105-10-57 Text en Copyright © 2009 Erill and O'Neill; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A reexamination of information theory-based methods for DNA-binding site identification
topic Research Article