
Learning and interpreting the gene regulatory grammar in a deep learning framework



Bibliographic Details
Main authors: Chen, Ling; Capra, John A.
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660921/
https://www.ncbi.nlm.nih.gov/pubmed/33137083
http://dx.doi.org/10.1371/journal.pcbi.1008334
_version_ 1783609112979832832
author Chen, Ling
Capra, John A.
collection PubMed
description Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e., the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBSs outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depend on the nature of the prediction task.
format Online
Article
Text
id pubmed-7660921
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7660921 2020-11-18 Learning and interpreting the gene regulatory grammar in a deep learning framework Chen, Ling Capra, John A. PLoS Comput Biol Research Article Public Library of Science 2020-11-02 /pmc/articles/PMC7660921/ /pubmed/33137083 http://dx.doi.org/10.1371/journal.pcbi.1008334 Text en © 2020 Chen, Capra http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Learning and interpreting the gene regulatory grammar in a deep learning framework
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660921/
https://www.ncbi.nlm.nih.gov/pubmed/33137083
http://dx.doi.org/10.1371/journal.pcbi.1008334