
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natur...


Bibliographic Details
Main authors: Rives, Alexander; Meier, Joshua; Sercu, Tom; Goyal, Siddharth; Lin, Zeming; Liu, Jason; Guo, Demi; Ott, Myle; Zitnick, C. Lawrence; Ma, Jerry; Fergus, Rob
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2021
Subjects: Biological Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8053943/
https://www.ncbi.nlm.nih.gov/pubmed/33876751
http://dx.doi.org/10.1073/pnas.2016239118
author Rives, Alexander
Meier, Joshua
Sercu, Tom
Goyal, Siddharth
Lin, Zeming
Liu, Jason
Guo, Demi
Ott, Myle
Zitnick, C. Lawrence
Ma, Jerry
Fergus, Rob
collection PubMed
description In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
format Online
Article
Text
id pubmed-8053943
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-8053943 2021-05-04 Proc Natl Acad Sci U S A Biological Sciences National Academy of Sciences 2021-04-13 2021-04-05 /pmc/articles/PMC8053943/ /pubmed/33876751 http://dx.doi.org/10.1073/pnas.2016239118 Text en
Copyright © 2021 the Author(s). Published by PNAS.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
topic Biological Sciences