Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework

Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have a sp...

Bibliographic Details
Main authors: Wang, Yansong; Hou, Zilong; Yang, Yuning; Wong, Ka-chun; Li, Xiangtao
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9836277/
https://www.ncbi.nlm.nih.gov/pubmed/36520922
http://dx.doi.org/10.1371/journal.pcbi.1010779
author Wang, Yansong
Hou, Zilong
Yang, Yuning
Wong, Ka-chun
Li, Xiangtao
collection PubMed
description Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. We also conducted an interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.
format Online
Article
Text
id pubmed-9836277
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9836277 2023-01-13 Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework Wang, Yansong Hou, Zilong Yang, Yuning Wong, Ka-chun Li, Xiangtao PLoS Comput Biol Research Article Public Library of Science 2022-12-15 /pmc/articles/PMC9836277/ /pubmed/36520922 http://dx.doi.org/10.1371/journal.pcbi.1010779 Text en © 2022 Wang et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework
topic Research Article