
ProteinGLUE multi-task benchmark suite for self-supervised protein modeling

Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether the models capture generally useful properties. We introduce the ProteinGLUE benchmark for the evaluation of protein representations: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models with hyperparameters specifically trained for these benchmarks. Pre-training was done on two tasks, masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets from https://github.com/ibivu/protein-glue.
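The masked symbol prediction objective named in the abstract follows the BERT-style masked language modeling recipe applied to amino-acid sequences. The sketch below is illustrative only, not the authors' implementation: the function name `mask_sequence`, the 15% masking rate, and the 80/10/10 mask/replace/keep split are assumptions taken from standard BERT practice, not from the paper.

```python
import random

# The 20 standard amino-acid one-letter codes.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def mask_sequence(seq, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """BERT-style masking: pick ~mask_rate of positions as prediction
    targets; of those, 80% become the mask token, 10% a random residue,
    10% stay unchanged. Returns the corrupted tokens and the labels."""
    rng = rng or random.Random(0)
    tokens = list(seq)
    targets = {}  # position -> original residue (the label to predict)
    for i, res in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = res
            r = rng.random()
            if r < 0.8:
                tokens[i] = mask_token
            elif r < 0.9:
                tokens[i] = rng.choice(AMINO_ACIDS)
            # else: leave the residue unchanged (but still predict it)
    return tokens, targets

masked, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

A model pre-trained this way is then asked to recover each entry of `labels` from the corrupted sequence; the 10% unchanged positions keep the model from relying solely on the mask token.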


Bibliographic Details
Main Authors: Capel, Henriette; Weiler, Robin; Dijkstra, Maurits; Vleugels, Reinier; Bloem, Peter; Feenstra, K. Anton
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK, 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9512797/
https://www.ncbi.nlm.nih.gov/pubmed/36163232
http://dx.doi.org/10.1038/s41598-022-19608-4
author Capel, Henriette
Weiler, Robin
Dijkstra, Maurits
Vleugels, Reinier
Bloem, Peter
Feenstra, K. Anton
collection PubMed
description Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether the models capture generally useful properties. We introduce the ProteinGLUE benchmark for the evaluation of protein representations: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models with hyperparameters specifically trained for these benchmarks. Pre-training was done on two tasks, masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets from https://github.com/ibivu/protein-glue.
format Online
Article
Text
id pubmed-9512797
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9512797 2022-09-28 ProteinGLUE multi-task benchmark suite for self-supervised protein modeling. Sci Rep, Article.
Nature Publishing Group UK 2022-09-26 /pmc/articles/PMC9512797/ /pubmed/36163232 http://dx.doi.org/10.1038/s41598-022-19608-4 Text en © The Author(s) 2022. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. Images or other third-party material in this article are included in the article's Creative Commons licence unless indicated otherwise in a credit line to the material. If material is not included in the licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
title ProteinGLUE multi-task benchmark suite for self-supervised protein modeling
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9512797/
https://www.ncbi.nlm.nih.gov/pubmed/36163232
http://dx.doi.org/10.1038/s41598-022-19608-4