Genome-wide prediction of disease variant effects with a deep protein language model

Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
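As context for the abstract above, the snippet below is a minimal, illustrative sketch (not the authors' released workflow) of how a protein language model such as ESM1b can score a missense variant: under one common scheme, the variant position is masked and the model's log-probability of the mutant residue is compared against that of the wild-type residue, yielding a log-likelihood ratio (LLR). It assumes the open-source fair-esm package; the helper name llr_score and the example variant are hypothetical.

import torch
import esm  # pip install fair-esm

# Load the 650-million-parameter ESM1b model and its alphabet/tokenizer.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

def llr_score(sequence: str, pos: int, wt: str, mut: str) -> float:
    """Log-likelihood ratio for the substitution wt->mut at 1-based position pos.
    More negative scores suggest a more damaging variant."""
    assert sequence[pos - 1] == wt, "wild-type residue mismatch"
    # Note: ESM1b accepts at most ~1022 residues per forward pass; longer
    # proteins would need a windowing/tiling scheme such as the one the
    # paper's workflow describes.
    _, _, tokens = batch_converter([("query", sequence)])
    # Token index 0 is the BOS token, so residue pos sits at token index pos.
    tokens[0, pos] = alphabet.mask_idx
    with torch.no_grad():
        logits = model(tokens)["logits"]
    log_probs = torch.log_softmax(logits[0, pos], dim=-1)
    return (log_probs[alphabet.get_idx(mut)] - log_probs[alphabet.get_idx(wt)]).item()

# Example usage (hypothetical variant R42W):
# score = llr_score(my_protein_sequence, pos=42, wt="R", mut="W")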


Bibliographic Details
Main authors: Brandes, Nadav; Goldman, Grant; Wang, Charlotte H.; Ye, Chun Jimmie; Ntranos, Vasilis
Format: Online Article Text
Language: English
Journal: Nat Genet
Published: Nature Publishing Group US, 10 August 2023
Collection: PubMed (National Center for Biotechnology Information)
Subjects: Article
License: © The Author(s) 2023. Open Access under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10484790/
https://www.ncbi.nlm.nih.gov/pubmed/37563329
http://dx.doi.org/10.1038/s41588-023-01465-0