
Stability selection for regression-based models of transcription factor–DNA binding specificity

Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For thi...


Bibliographic Details
Main Authors: Mordelet, Fantine; Horton, John; Hartemink, Alexander J.; Engelhardt, Barbara E.; Gordân, Raluca
Format: Online Article Text
Language: English
Published: Oxford University Press 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694650/
https://www.ncbi.nlm.nih.gov/pubmed/23812975
http://dx.doi.org/10.1093/bioinformatics/btt221
author Mordelet, Fantine
Horton, John
Hartemink, Alexander J.
Engelhardt, Barbara E.
Gordân, Raluca
collection PubMed
description Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF–DNA binding specificity, trained using high-resolution in vitro data from custom protein-binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max and Mad2) in their native genomic context. These high-throughput quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF–DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar position weight matrices, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step toward better sequence-based models of individual TF–DNA binding specificity. Availability: Our code is available at http://genome.duke.edu/labs/gordan/ISMB2013. The PBM data used in this article are available in the Gene Expression Omnibus under accession number GSE47026. Contact: raluca.gordan@duke.edu
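The abstract describes models whose inputs are mono-, di- and trinucleotides at specific positions in or near a binding site. The sketch below shows one way such position-specific features could be encoded as 0/1 indicators before fitting a regression model. It is an illustration only, not the authors' code: the function names (`positional_kmer_features`, `feature_space`, `encode`), the `(k, position, kmer)` feature keys and the example sequence are assumptions made here, and the feature-selection step itself is not shown.

```python
from itertools import product

ALPHABET = "ACGT"


def positional_kmer_features(seq, max_k=3):
    """Return {(k, position, kmer): 1} for every 1..max_k-mer found in seq.

    Hypothetical helper: each feature records which k-mer starts at which
    0-based position of the sequence.
    """
    seq = seq.upper()
    features = {}
    for k in range(1, max_k + 1):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set(ALPHABET):
                features[(k, pos, kmer)] = 1
    return features


def feature_space(length, max_k=3):
    """List every possible (k, position, kmer) feature for a given length."""
    return [
        (k, pos, "".join(letters))
        for k in range(1, max_k + 1)
        for pos in range(length - k + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def encode(seq, max_k=3):
    """Dense 0/1 vector aligned with feature_space(len(seq), max_k)."""
    present = positional_kmer_features(seq, max_k)
    return [1 if f in present else 0 for f in feature_space(len(seq), max_k)]


if __name__ == "__main__":
    site = "CACGTG"  # hypothetical 6-bp example sequence
    vector = encode(site)
    print(len(vector), sum(vector))
```

For a 6-bp sequence this yields 24 + 80 + 256 = 360 candidate features, of which 15 are set. The rapid growth of the feature space with sequence length and k is why a selection step that keeps only a small number of features matters for interpretability.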
format Online
Article
Text
id pubmed-3694650
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3694650 2013-06-27 Stability selection for regression-based models of transcription factor–DNA binding specificity Mordelet, Fantine Horton, John Hartemink, Alexander J. Engelhardt, Barbara E. Gordân, Raluca Bioinformatics Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany Oxford University Press 2013-07-01 2013-06-19 /pmc/articles/PMC3694650/ /pubmed/23812975 http://dx.doi.org/10.1093/bioinformatics/btt221 Text en © The Author 2013. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Stability selection for regression-based models of transcription factor–DNA binding specificity
topic Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany