Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs

BACKGROUND: Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled by mainly non-coding elements known as cis-regulatory modules (CRMs) and epigenetic facto...

Bibliographic Details
Main Authors: Girgis, Hani Z, Ovcharenko, Ivan
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3359238/
https://www.ncbi.nlm.nih.gov/pubmed/22313678
http://dx.doi.org/10.1186/1471-2105-13-25
_version_ 1782233842664341504
author Girgis, Hani Z
Ovcharenko, Ivan
author_facet Girgis, Hani Z
Ovcharenko, Ivan
author_sort Girgis, Hani Z
collection PubMed
description BACKGROUND: Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled by mainly non-coding elements known as cis-regulatory modules (CRMs) and epigenetic factors. CRMs modulating related genes share the regulatory signature which consists of transcription factor (TF) binding sites (TFBSs). Identifying such CRMs is a challenging problem due to the prohibitive number of sequence sets that need to be analyzed. RESULTS: We formulated the challenge as a supervised classification problem even though experimentally validated CRMs were not required. Our efforts resulted in a software system named CrmMiner. The system mines for CRMs in the vicinity of related genes. CrmMiner requires two sets of sequences: a mixed set and a control set. Sequences in the vicinity of the related genes comprise the mixed set, whereas the control set includes random genomic sequences. CrmMiner assumes that a large percentage of the mixed set is made of background sequences that do not include CRMs. The system identifies pairs of closely located motifs representing vertebrate TFBSs that are enriched in the training mixed set consisting of 50% of the gene loci. In addition, CrmMiner selects a group of the enriched pairs to represent the tissue-specific regulatory signature. The mixed and the control sets are searched for candidate sequences that include any of the selected pairs. Next, an optimal Bayesian classifier is used to distinguish candidates found in the mixed set from their control counterparts. Our study proposes 62 tissue-specific regulatory signatures and putative CRMs for different human tissues and cell types. These signatures consist of assortments of ubiquitously expressed TFs and tissue-specific TFs. Under controlled settings, CrmMiner identified known CRMs in noisy sets up to 1:25 signal-to-noise ratio. 
CrmMiner was 21-75% more precise than a related CRM predictor. The sensitivity of the system to locate known human heart enhancers reached up to 83%. CrmMiner precision reached 82% while mining for CRMs specific to the human CD4(+) T cells. On several data sets, the system achieved 99% specificity. CONCLUSION: These results suggest that CrmMiner predictions are accurate and likely to be tissue-specific CRMs. We expect that the predicted tissue-specific CRMs and the regulatory signatures broaden our knowledge of gene transcription regulation.
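The idea in the abstract of finding pairs of closely located motifs that are enriched in the mixed set relative to the control set can be illustrated with a small sketch. This is not CrmMiner's implementation: motifs are treated here as exact strings rather than binding-site models, and the function names, the `max_gap` distance threshold, and the smoothed ratio used as an enrichment score are all assumptions made for illustration.

```python
from itertools import combinations


def motif_positions(sequence, motif):
    """Return every start index at which the exact string `motif` occurs."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def has_close_pair(sequence, motif_a, motif_b, max_gap):
    """True if an occurrence of motif_a starts within max_gap bases of motif_b."""
    return any(
        abs(i - j) <= max_gap
        for i in motif_positions(sequence, motif_a)
        for j in motif_positions(sequence, motif_b)
    )


def pair_enrichment(mixed, control, motifs, max_gap=50, pseudocount=1.0):
    """Score each motif pair by how much more often it co-occurs in `mixed`.

    The score is a smoothed ratio of the fraction of sequences that contain
    the pair in each set. It is a hypothetical statistic chosen for clarity,
    not the one used by CrmMiner.
    """
    scores = {}
    for a, b in combinations(sorted(motifs), 2):
        in_mixed = sum(has_close_pair(s, a, b, max_gap) for s in mixed)
        in_control = sum(has_close_pair(s, a, b, max_gap) for s in control)
        mixed_rate = (in_mixed + pseudocount) / (len(mixed) + pseudocount)
        control_rate = (in_control + pseudocount) / (len(control) + pseudocount)
        scores[(a, b)] = mixed_rate / control_rate
    return scores
```

Pairs with a score well above 1 would be the candidates for a regulatory signature in this toy setting. The classification step that the abstract describes (separating mixed-set candidates from control candidates) is not covered by this sketch.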
format Online
Article
Text
id pubmed-3359238
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3359238 2012-06-01 Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs Girgis, Hani Z Ovcharenko, Ivan BMC Bioinformatics Research Article BioMed Central 2012-02-07 /pmc/articles/PMC3359238/ /pubmed/22313678 http://dx.doi.org/10.1186/1471-2105-13-25 Text en Copyright ©2012 Girgis and Ovcharenko; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Girgis, Hani Z
Ovcharenko, Ivan
Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
title Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
title_full Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
title_fullStr Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
title_full_unstemmed Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
title_short Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
title_sort predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3359238/
https://www.ncbi.nlm.nih.gov/pubmed/22313678
http://dx.doi.org/10.1186/1471-2105-13-25
work_keys_str_mv AT girgishaniz predictingtissuespecificcisregulatorymodulesinthehumangenomeusingpairsofcooccurringmotifs
AT ovcharenkoivan predictingtissuespecificcisregulatorymodulesinthehumangenomeusingpairsofcooccurringmotifs