Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution
Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements under all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters, a substantially lightweight architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling tasks and further extended to variant prioritization via a special input encoding scheme for alternative alleles followed by the addition of a convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised benchmark model, DeepSEA, and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer the underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base-resolution.
Main Authors: | Yang, Meng; Huang, Lichao; Huang, Haiping; Tang, Hui; Zhang, Nan; Yang, Huanming; Wu, Jihong; Mu, Feng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2022 |
Subjects: | Methods Online |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9371931/ https://www.ncbi.nlm.nih.gov/pubmed/35536244 http://dx.doi.org/10.1093/nar/gkac326 |
_version_ | 1784767270264766464 |
---|---|
author | Yang, Meng Huang, Lichao Huang, Haiping Tang, Hui Zhang, Nan Yang, Huanming Wu, Jihong Mu, Feng |
author_facet | Yang, Meng Huang, Lichao Huang, Haiping Tang, Hui Zhang, Nan Yang, Huanming Wu, Jihong Mu, Feng |
author_sort | Yang, Meng |
collection | PubMed |
description | Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements under all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters, a substantially lightweight architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling tasks and further extended to variant prioritization via a special input encoding scheme for alternative alleles followed by the addition of a convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised benchmark model, DeepSEA, and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer the underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base-resolution. |
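The abstract names the model's two ingredients: self-attention for long-range context and one-dimensional convolution for locality, applied to tokenized genome sequence. As a rough illustration of how such a hybrid can be wired together (this is not the authors' implementation; the k-mer size, embedding width, kernel width, and all function names here are illustrative assumptions), a minimal NumPy sketch:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (L, d) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (L, L) pairwise attention scores
    return softmax(scores) @ V                # contextualized (L, d) output

def conv1d(X, W):
    """'Same'-padded 1D convolution along the sequence axis.
    X: (L, d_in), W: (kernel, d_in, d_out) -> (L, d_out)."""
    L = X.shape[0]
    k = W.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    windows = np.stack([Xp[i:i + k] for i in range(L)])   # (L, k, d_in)
    return np.einsum("lkc,kco->lo", windows, W)

# Vocabulary of all 4^3 = 64 possible 3-mers.
K = 3
vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

seq = "ACGTACGTGGCA"                     # toy 12-nt input
tokens = kmer_tokens(seq, K)             # 10 overlapping 3-mers
d = 16
E = rng.normal(size=(len(vocab), d))     # random token embedding table
X = E[[vocab[t] for t in tokens]]        # (10, 16) embedded sequence

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)        # global context via attention

Wc = rng.normal(size=(5, d, 2))          # width-5 kernel, 2 output tracks
logits = conv1d(H, Wc)                   # local smoothing -> per-position scores
print(logits.shape)                      # (10, 2): one score pair per k-mer
```

The ordering shown (attention first, then convolution) mirrors the abstract's account of adding a convolutional module on top of the pre-trained attention layers to reintroduce locality for base-resolution variant scoring; a trained model would of course learn these weights rather than sample them randomly.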
format | Online Article Text |
id | pubmed-9371931 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-93719312022-08-12 Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution Yang, Meng Huang, Lichao Huang, Haiping Tang, Hui Zhang, Nan Yang, Huanming Wu, Jihong Mu, Feng Nucleic Acids Res Methods Online Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements under all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters, a substantially lightweight architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling tasks and further extended to variant prioritization via a special input encoding scheme for alternative alleles followed by the addition of a convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised benchmark model, DeepSEA, and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer the underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base-resolution. Oxford University Press 2022-05-10 /pmc/articles/PMC9371931/ /pubmed/35536244 http://dx.doi.org/10.1093/nar/gkac326 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Methods Online Yang, Meng Huang, Lichao Huang, Haiping Tang, Hui Zhang, Nan Yang, Huanming Wu, Jihong Mu, Feng Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution |
title | Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution |
title_full | Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution |
title_fullStr | Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution |
title_full_unstemmed | Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution |
title_short | Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution |
title_sort | integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9371931/ https://www.ncbi.nlm.nih.gov/pubmed/35536244 http://dx.doi.org/10.1093/nar/gkac326 |
work_keys_str_mv | AT yangmeng integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution AT huanglichao integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution AT huanghaiping integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution AT tanghui integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution AT zhangnan integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution AT yanghuanming integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution AT wujihong integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution AT mufeng integratingconvolutionandselfattentionimproveslanguagemodelofhumangenomeforinterpretingnoncodingregionsatbaseresolution |