Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison
Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM...
Main Authors: | Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S.; Sinha, Saurabh |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2011 |
Subjects: | Computational Biology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3239187/ https://www.ncbi.nlm.nih.gov/pubmed/21821659 http://dx.doi.org/10.1093/nar/gkr621 |
_version_ | 1782219138937126912 |
---|---|
author | Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S.; Sinha, Saurabh |
author_facet | Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S.; Sinha, Saurabh |
author_sort | Kazemian, Majid |
collection | PubMed |
description | Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. |
format | Online Article Text |
id | pubmed-3239187 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-3239187 2011-12-16 Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison Kazemian, Majid Zhu, Qiyun Halfon, Marc S. Sinha, Saurabh Nucleic Acids Res Computational Biology Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. Oxford University Press 2011-12 2011-08-05 /pmc/articles/PMC3239187/ /pubmed/21821659 http://dx.doi.org/10.1093/nar/gkr621 Text en © The Author(s) 2011. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Computational Biology Kazemian, Majid Zhu, Qiyun Halfon, Marc S. Sinha, Saurabh Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison |
title | Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison |
title_full | Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison |
title_fullStr | Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison |
title_full_unstemmed | Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison |
title_short | Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison |
title_sort | improved accuracy of supervised crm discovery with interpolated markov models and cross-species comparison |
topic | Computational Biology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3239187/ https://www.ncbi.nlm.nih.gov/pubmed/21821659 http://dx.doi.org/10.1093/nar/gkr621 |
work_keys_str_mv | AT kazemianmajid improvedaccuracyofsupervisedcrmdiscoverywithinterpolatedmarkovmodelsandcrossspeciescomparison AT zhuqiyun improvedaccuracyofsupervisedcrmdiscoverywithinterpolatedmarkovmodelsandcrossspeciescomparison AT halfonmarcs improvedaccuracyofsupervisedcrmdiscoverywithinterpolatedmarkovmodelsandcrossspeciescomparison AT sinhasaurabh improvedaccuracyofsupervisedcrmdiscoverywithinterpolatedmarkovmodelsandcrossspeciescomparison |
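
The description above says the method captures the statistical profile of variable-length words in known CRMs and then finds candidate regions that match this profile. The following is a minimal sketch of that general idea, not the authors' program: it trains one interpolated Markov model on known CRM sequences and one on background sequence, then scores a window by their log-likelihood ratio. All names (`InterpolatedMarkovModel`, `score`, `scan`), the count-based interpolation weight `seen / (seen + confidence)`, the pseudocount smoothing, and the default window sizes are assumptions chosen for illustration; the record does not specify how the published method weights the different word lengths. Orthologous CRMs from related genomes could plausibly be supplied as extra training sequences, which is also an assumption.

```python
import math

ALPHABET = "ACGT"


class InterpolatedMarkovModel:
    """Toy IMM over DNA; an illustrative assumption, not the published model."""

    def __init__(self, max_order=3, pseudocount=1.0, confidence=5.0):
        self.max_order = max_order
        self.pseudocount = pseudocount
        self.confidence = confidence  # hypothetical knob: trust in longer contexts
        # counts[k][context][base]: how often `base` follows a context of length k
        self.counts = [{} for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            seq = seq.upper()
            for i, base in enumerate(seq):
                if base not in ALPHABET:
                    continue
                for k in range(min(i, self.max_order) + 1):
                    table = self.counts[k].setdefault(seq[i - k:i], {})
                    table[base] = table.get(base, 0.0) + 1.0

    def _smoothed(self, k, context, base):
        # Pseudocount-smoothed conditional probability at a single fixed order k.
        table = self.counts[k].get(context, {})
        total = sum(table.values())
        pc = self.pseudocount
        return (table.get(base, 0.0) + pc) / (total + len(ALPHABET) * pc)

    def prob(self, context, base):
        # Start from order 0, then blend in each longer context, weighting it
        # by how often that context was seen in training.
        p = self._smoothed(0, "", base)
        for k in range(1, min(len(context), self.max_order) + 1):
            ctx = context[-k:]
            seen = sum(self.counts[k].get(ctx, {}).values())
            lam = seen / (seen + self.confidence)
            p = lam * self._smoothed(k, ctx, base) + (1.0 - lam) * p
        return p

    def log_likelihood(self, seq):
        seq = seq.upper()
        total = 0.0
        for i, base in enumerate(seq):
            if base in ALPHABET:  # ambiguous bases such as N are skipped
                context = seq[max(0, i - self.max_order):i]
                total += math.log(self.prob(context, base))
        return total


def score(window, crm_model, background_model):
    """Log-likelihood ratio: positive values look more CRM-like."""
    return crm_model.log_likelihood(window) - background_model.log_likelihood(window)


def scan(genome, crm_model, background_model, window=500, step=100):
    """Score sliding windows; returns a list of (start, score) pairs."""
    last_start = max(len(genome) - window, 0)
    return [(s, score(genome[s:s + window], crm_model, background_model))
            for s in range(0, last_start + 1, step)]
```

In use, one would call `train` on the known CRMs of a regulatory network for the first model and on non-CRM sequence for the second, then rank the output of `scan` by score. Because the interpolation weight depends only on the context, the blended probabilities over the four bases still sum to one.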