
OpenProteinSet: Training data for structural biology at scale

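Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.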

Bibliographic Details
Main authors: Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian; Watkins, Andrew M.; Ra, Stephen; Bonneau, Richard; AlQuraishi, Mohammed
Format: Online Article Text
Language: English
Published: Cornell University 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10441447/
https://www.ncbi.nlm.nih.gov/pubmed/37608940
author Ahdritz, Gustaf
Bouatta, Nazim
Kadyan, Sachin
Jarosch, Lukas
Berenberg, Daniel
Fisk, Ian
Watkins, Andrew M.
Ra, Stephen
Bonneau, Richard
AlQuraishi, Mohammed
collection PubMed
description Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
format Online
Article
Text
id pubmed-10441447
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cornell University
record_format MEDLINE/PubMed
spelling pubmed-10441447 2023-08-22 OpenProteinSet: Training data for structural biology at scale. ArXiv Article. Cornell University 2023-08-10 /pmc/articles/PMC10441447/ /pubmed/37608940 Text en. This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title OpenProteinSet: Training data for structural biology at scale
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10441447/
https://www.ncbi.nlm.nih.gov/pubmed/37608940