
DNA language models are powerful predictors of genome-wide variant effects

The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need f...


Bibliographic Details
Main authors: Benegas, Gonzalo; Batra, Sanjit Singh; Song, Yun S.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2023
Subjects: Biological Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10622914/
https://www.ncbi.nlm.nih.gov/pubmed/37883436
http://dx.doi.org/10.1073/pnas.2311219120
author Benegas, Gonzalo
Batra, Sanjit Singh
Song, Yun S.
collection PubMed
description The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.
format Online
Article
Text
id pubmed-10622914
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-10622914 2023-11-04 DNA language models are powerful predictors of genome-wide variant effects Benegas, Gonzalo Batra, Sanjit Singh Song, Yun S. Proc Natl Acad Sci U S A Biological Sciences National Academy of Sciences 2023-10-26 2023-10-31 /pmc/articles/PMC10622914/ /pubmed/37883436 http://dx.doi.org/10.1073/pnas.2311219120 Text en Copyright © 2023 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title DNA language models are powerful predictors of genome-wide variant effects
topic Biological Sciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10622914/
https://www.ncbi.nlm.nih.gov/pubmed/37883436
http://dx.doi.org/10.1073/pnas.2311219120