OpenProteinSet: Training data for structural biology at scale
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large qu...
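Main authors: | Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian; Watkins, Andrew M.; Ra, Stephen; Bonneau, Richard; AlQuraishi, Mohammed |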
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cornell University, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10441447/ https://www.ncbi.nlm.nih.gov/pubmed/37608940 |
_version_ | 1785093373696147456 |
---|---|
author | Ahdritz, Gustaf Bouatta, Nazim Kadyan, Sachin Jarosch, Lukas Berenberg, Daniel Fisk, Ian Watkins, Andrew M. Ra, Stephen Bonneau, Richard AlQuraishi, Mohammed |
author_facet | Ahdritz, Gustaf Bouatta, Nazim Kadyan, Sachin Jarosch, Lukas Berenberg, Daniel Fisk, Ian Watkins, Andrew M. Ra, Stephen Bonneau, Richard AlQuraishi, Mohammed |
author_sort | Ahdritz, Gustaf |
collection | PubMed |
description | Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research. |
format | Online Article Text |
id | pubmed-10441447 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cornell University |
record_format | MEDLINE/PubMed |
spelling | pubmed-10441447 2023-08-22 OpenProteinSet: Training data for structural biology at scale Ahdritz, Gustaf Bouatta, Nazim Kadyan, Sachin Jarosch, Lukas Berenberg, Daniel Fisk, Ian Watkins, Andrew M. Ra, Stephen Bonneau, Richard AlQuraishi, Mohammed ArXiv Article Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research. Cornell University 2023-08-10 /pmc/articles/PMC10441447/ /pubmed/37608940 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
spellingShingle | Article Ahdritz, Gustaf Bouatta, Nazim Kadyan, Sachin Jarosch, Lukas Berenberg, Daniel Fisk, Ian Watkins, Andrew M. Ra, Stephen Bonneau, Richard AlQuraishi, Mohammed OpenProteinSet: Training data for structural biology at scale |
title | OpenProteinSet: Training data for structural biology at scale |
title_full | OpenProteinSet: Training data for structural biology at scale |
title_fullStr | OpenProteinSet: Training data for structural biology at scale |
title_full_unstemmed | OpenProteinSet: Training data for structural biology at scale |
title_short | OpenProteinSet: Training data for structural biology at scale |
title_sort | openproteinset: training data for structural biology at scale |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10441447/ https://www.ncbi.nlm.nih.gov/pubmed/37608940 |
work_keys_str_mv | AT ahdritzgustaf openproteinsettrainingdataforstructuralbiologyatscale AT bouattanazim openproteinsettrainingdataforstructuralbiologyatscale AT kadyansachin openproteinsettrainingdataforstructuralbiologyatscale AT jaroschlukas openproteinsettrainingdataforstructuralbiologyatscale AT berenbergdaniel openproteinsettrainingdataforstructuralbiologyatscale AT fiskian openproteinsettrainingdataforstructuralbiologyatscale AT watkinsandrewm openproteinsettrainingdataforstructuralbiologyatscale AT rastephen openproteinsettrainingdataforstructuralbiologyatscale AT bonneaurichard openproteinsettrainingdataforstructuralbiologyatscale AT alquraishimohammed openproteinsettrainingdataforstructuralbiologyatscale |