
The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from ma...


Bibliographic Details
Main authors: van den Bent, Irene; Makrodimitris, Stavros; Reinders, Marcel
Format: Online Article Text
Language: English
Published: SAGE Publications 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8647222/
https://www.ncbi.nlm.nih.gov/pubmed/34880594
http://dx.doi.org/10.1177/11769343211062608
author van den Bent, Irene
Makrodimitris, Stavros
Reinders, Marcel
collection PubMed
description Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
format Online
Article
Text
id pubmed-8647222
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher SAGE Publications
record_format MEDLINE/PubMed
spelling pubmed-8647222 2021-12-07 The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction van den Bent, Irene Makrodimitris, Stavros Reinders, Marcel Evol Bioinform Online Original Research
SAGE Publications 2021-12-03 /pmc/articles/PMC8647222/ /pubmed/34880594 http://dx.doi.org/10.1177/11769343211062608 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/), which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
title The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction
title_sort power of universal contextualized protein embeddings in cross-species protein function prediction
topic Original Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8647222/
https://www.ncbi.nlm.nih.gov/pubmed/34880594
http://dx.doi.org/10.1177/11769343211062608