GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this ch...

Bibliographic Details
Main Authors: Benegas, Gonzalo, Albors, Carlos, Aw, Alan J., Ye, Chengzhong, Song, Yun S.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10592768/
https://www.ncbi.nlm.nih.gov/pubmed/37873118
http://dx.doi.org/10.1101/2023.10.10.561776
author Benegas, Gonzalo
Albors, Carlos
Aw, Alan J.
Ye, Chengzhong
Song, Yun S.
collection PubMed
description Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, and OMIM) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.
format Online
Article
Text
id pubmed-10592768
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10592768 2023-10-24 GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction Benegas, Gonzalo Albors, Carlos Aw, Alan J. Ye, Chengzhong Song, Yun S. bioRxiv Article Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, and OMIM) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants. Cold Spring Harbor Laboratory 2023-10-11 /pmc/articles/PMC10592768/ /pubmed/37873118 http://dx.doi.org/10.1101/2023.10.10.561776 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
title GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction
topic Article