
Superior protein thermophilicity prediction with protein language model embeddings


Bibliographic Details
Main authors: Haselbeck, Florian; John, Maura; Zhang, Yuqi; Pirnay, Jonathan; Fuenzalida-Werner, Juan Pablo; Costa, Rubén D; Grimm, Dominik G
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10566323/
https://www.ncbi.nlm.nih.gov/pubmed/37829176
http://dx.doi.org/10.1093/nargab/lqad087
author Haselbeck, Florian
John, Maura
Zhang, Yuqi
Pirnay, Jonathan
Fuenzalida-Werner, Juan Pablo
Costa, Rubén D
Grimm, Dominik G
collection PubMed
description Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
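The description reports results in terms of the Matthews correlation coefficient (MCC). As a minimal sketch of how that metric is defined for a binary task such as thermophilic versus nonthermophilic, the function below computes MCC from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the article; returning 0.0 when the denominator is zero is a common convention assumed here, not something the record specifies.

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Compute the Matthews correlation coefficient from binary confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for illustration only (not results from the article):
# 90 thermophilic proteins correctly identified, 10 missed,
# 85 nonthermophilic proteins correctly identified, 15 misclassified.
example_mcc = matthews_corrcoef_from_counts(tp=90, tn=85, fp=15, fn=10)
print(f"MCC = {example_mcc:.3f}")
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and it uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when classes are imbalanced.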
format Online
Article
Text
id pubmed-10566323
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10566323 2023-10-12 NAR Genom Bioinform Standard Article Oxford University Press 2023-10-11 /pmc/articles/PMC10566323/ /pubmed/37829176 http://dx.doi.org/10.1093/nargab/lqad087 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Superior protein thermophilicity prediction with protein language model embeddings
topic Standard Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10566323/
https://www.ncbi.nlm.nih.gov/pubmed/37829176
http://dx.doi.org/10.1093/nargab/lqad087