
Theoretical and empirical quality assessment of transcription factor-binding motifs

Position-specific scoring matrices (PSSMs) are routinely used to predict transcription factor (TF)-binding sites in genome sequences. However, their reliability in predicting novel binding sites can be far from optimal, due to the use of a small number of training sites or the inappropriate choice of p...


Bibliographic Details
Main Authors: Medina-Rivera, Alejandra, Abreu-Goodger, Cei, Thomas-Chollier, Morgane, Salgado, Heladia, Collado-Vides, Julio, van Helden, Jacques
Format: Text
Language: English
Published: Oxford University Press 2011
Subjects: Computational Biology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035439/
https://www.ncbi.nlm.nih.gov/pubmed/20923783
http://dx.doi.org/10.1093/nar/gkq710
_version_ 1782197767253262336
author Medina-Rivera, Alejandra
Abreu-Goodger, Cei
Thomas-Chollier, Morgane
Salgado, Heladia
Collado-Vides, Julio
van Helden, Jacques
author_facet Medina-Rivera, Alejandra
Abreu-Goodger, Cei
Thomas-Chollier, Morgane
Salgado, Heladia
Collado-Vides, Julio
van Helden, Jacques
author_sort Medina-Rivera, Alejandra
collection PubMed
description Position-specific scoring matrices (PSSMs) are routinely used to predict transcription factor (TF)-binding sites in genome sequences. However, their reliability in predicting novel binding sites can be far from optimal, due to the use of a small number of training sites or the inappropriate choice of parameters when building the matrix or when scanning sequences with it. Measures of matrix quality such as E-value and information content rely on theoretical models, and may fail in the context of full genome sequences. We propose a method, implemented in the program ‘matrix-quality’, that combines theoretical and empirical score distributions to assess the reliability of PSSMs for predicting TF-binding sites. We applied ‘matrix-quality’ to estimate the predictive capacity of matrices for bacterial, yeast and mouse TFs. The evaluation of matrices from RegulonDB revealed some poorly predictive motifs, and allowed us to quantify the improvements obtained by applying multi-genome motif discovery. Interestingly, the method reveals differences between global and specific regulators. It also highlights the enrichment of binding sites in sequence sets obtained from high-throughput ChIP-chip (bacterial and yeast TFs) and ChIP–seq (mouse TFs) experiments. The method presented here has many applications, including selecting reliable motifs before scanning sequences, improving motif collections in TF databases, and evaluating motifs discovered using high-throughput data sets.
format Text
id pubmed-3035439
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3035439 2011-02-08 Theoretical and empirical quality assessment of transcription factor-binding motifs Medina-Rivera, Alejandra Abreu-Goodger, Cei Thomas-Chollier, Morgane Salgado, Heladia Collado-Vides, Julio van Helden, Jacques Nucleic Acids Res Computational Biology Position-specific scoring matrices (PSSMs) are routinely used to predict transcription factor (TF)-binding sites in genome sequences. However, their reliability in predicting novel binding sites can be far from optimal, due to the use of a small number of training sites or the inappropriate choice of parameters when building the matrix or when scanning sequences with it. Measures of matrix quality such as E-value and information content rely on theoretical models, and may fail in the context of full genome sequences. We propose a method, implemented in the program ‘matrix-quality’, that combines theoretical and empirical score distributions to assess the reliability of PSSMs for predicting TF-binding sites. We applied ‘matrix-quality’ to estimate the predictive capacity of matrices for bacterial, yeast and mouse TFs. The evaluation of matrices from RegulonDB revealed some poorly predictive motifs, and allowed us to quantify the improvements obtained by applying multi-genome motif discovery. Interestingly, the method reveals differences between global and specific regulators. It also highlights the enrichment of binding sites in sequence sets obtained from high-throughput ChIP-chip (bacterial and yeast TFs) and ChIP–seq (mouse TFs) experiments. The method presented here has many applications, including selecting reliable motifs before scanning sequences, improving motif collections in TF databases, and evaluating motifs discovered using high-throughput data sets. Oxford University Press 2011-02 2010-10-04 /pmc/articles/PMC3035439/ /pubmed/20923783 http://dx.doi.org/10.1093/nar/gkq710 Text en © The Author(s) 2010. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/2.5 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Computational Biology
Medina-Rivera, Alejandra
Abreu-Goodger, Cei
Thomas-Chollier, Morgane
Salgado, Heladia
Collado-Vides, Julio
van Helden, Jacques
Theoretical and empirical quality assessment of transcription factor-binding motifs
title Theoretical and empirical quality assessment of transcription factor-binding motifs
title_full Theoretical and empirical quality assessment of transcription factor-binding motifs
title_fullStr Theoretical and empirical quality assessment of transcription factor-binding motifs
title_full_unstemmed Theoretical and empirical quality assessment of transcription factor-binding motifs
title_short Theoretical and empirical quality assessment of transcription factor-binding motifs
title_sort theoretical and empirical quality assessment of transcription factor-binding motifs
topic Computational Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035439/
https://www.ncbi.nlm.nih.gov/pubmed/20923783
http://dx.doi.org/10.1093/nar/gkq710
work_keys_str_mv AT medinariveraalejandra theoreticalandempiricalqualityassessmentoftranscriptionfactorbindingmotifs
AT abreugoodgercei theoreticalandempiricalqualityassessmentoftranscriptionfactorbindingmotifs
AT thomascholliermorgane theoreticalandempiricalqualityassessmentoftranscriptionfactorbindingmotifs
AT salgadoheladia theoreticalandempiricalqualityassessmentoftranscriptionfactorbindingmotifs
AT colladovidesjulio theoreticalandempiricalqualityassessmentoftranscriptionfactorbindingmotifs
AT vanheldenjacques theoreticalandempiricalqualityassessmentoftranscriptionfactorbindingmotifs