Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework
Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have a sp...
Main authors: | Wang, Yansong; Hou, Zilong; Yang, Yuning; Wong, Ka-chun; Li, Xiangtao |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2022 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9836277/ https://www.ncbi.nlm.nih.gov/pubmed/36520922 http://dx.doi.org/10.1371/journal.pcbi.1010779 |
_version_ | 1784868829617192960 |
---|---|
author | Wang, Yansong Hou, Zilong Yang, Yuning Wong, Ka-chun Li, Xiangtao |
author_facet | Wang, Yansong Hou, Zilong Yang, Yuning Wong, Ka-chun Li, Xiangtao |
author_sort | Wang, Yansong |
collection | PubMed |
description | Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. We also conducted an interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework. |
format | Online Article Text |
id | pubmed-9836277 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-9836277 2023-01-13 Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework Wang, Yansong Hou, Zilong Yang, Yuning Wong, Ka-chun Li, Xiangtao PLoS Comput Biol Research Article Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. We also conducted an interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework. Public Library of Science 2022-12-15 /pmc/articles/PMC9836277/ /pubmed/36520922 http://dx.doi.org/10.1371/journal.pcbi.1010779 Text en © 2022 Wang et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Wang, Yansong Hou, Zilong Yang, Yuning Wong, Ka-chun Li, Xiangtao Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework |
title | Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9836277/ https://www.ncbi.nlm.nih.gov/pubmed/36520922 http://dx.doi.org/10.1371/journal.pcbi.1010779 |
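
The description above says that multi-source biological information and dynamic semantic information are fused to represent each sequence. The record does not say which features are used or how they are combined, so the following is only a minimal sketch under stated assumptions: normalized k-mer frequencies stand in for the biological features, a precomputed vector stands in for a semantic embedding (for example, one produced by a model such as EnhancerBERT), and fusion is plain concatenation. The function names `kmer_frequencies` and `fuse_features` are hypothetical and are not taken from SMFM.

```python
from itertools import product


def kmer_frequencies(seq: str, k: int = 3) -> list[float]:
    """Return normalized k-mer frequencies over the A/C/G/T alphabet.

    Assumption: k-mer frequencies are used here only as an illustrative
    hand-crafted feature; the record does not name the actual features.
    """
    seq = seq.upper()
    # Fixed lexicographic ordering so every sequence maps to the same layout.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def fuse_features(seq: str, semantic_vector: list[float], k: int = 3) -> list[float]:
    """Concatenate k-mer features with a precomputed embedding vector.

    Assumption: concatenation is one simple way to fuse feature sources;
    it is not confirmed to be the fusion step used by the authors.
    """
    return kmer_frequencies(seq, k) + list(semantic_vector)


if __name__ == "__main__":
    # Hypothetical 4-dimensional embedding standing in for a language-model output.
    toy_embedding = [0.1, -0.2, 0.3, 0.0]
    fused = fuse_features("ACGTACGTAC", toy_embedding, k=2)
    print(len(fused))  # 4**2 k-mer features plus 4 embedding dimensions
```

A fused vector of this kind could then be passed to any downstream classifier; the choice of classifier and of stacking strategy is left open here because the record gives no detail beyond "ensemble machine learning classifier".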