Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction

Machine learning is increasingly used in protein engineering, and predicting the effects of protein mutations has attracted growing attention. So far, the best results have come from methods based on protein language models, w...

Bibliographic Details
Main Authors:	Qu, Yang; Niu, Zitong; Ding, Qiaojiao; Zhao, Taowa; Kong, Tong; Bai, Bing; Ma, Jianwei; Zhao, Yitian; Zheng, Jianping
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10671426/ https://www.ncbi.nlm.nih.gov/pubmed/38003686 http://dx.doi.org/10.3390/ijms242216496

author	Qu, Yang; Niu, Zitong; Ding, Qiaojiao; Zhao, Taowa; Kong, Tong; Bai, Bing; Ma, Jianwei; Zhao, Yitian; Zheng, Jianping
author_sort	Qu, Yang
collection	PubMed
description	Machine learning is increasingly used in protein engineering, and predicting the effects of protein mutations has attracted growing attention. So far, the best results have come from methods based on protein language models, which are trained on large numbers of unlabeled protein sequences to capture the evolutionary rules hidden in them and can therefore predict fitness from sequence. Although many such models and methods have been applied successfully in practical protein engineering, most studies have focused on building more complex language models that capture richer sequence features and on using those features for unsupervised fitness prediction. Considerable potential in the existing models remains untapped, such as whether integrating different models can further improve prediction accuracy. Furthermore, because the relationship between protein fitness and the quantification of specific functionalities is nonlinear, how to use large-scale models to predict mutational effects on quantifiable protein properties has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting mutational effects that integrates protein sequence features extracted from multiple large protein language models with evolutionary coupling features extracted from homologous sequences, and we compare linear regression and deep learning models in mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutational scans and showed that the integrated approach combined with linear regression gives the models higher prediction accuracy and better generalization. We further illustrated the reliability of the integrated approach by examining differences in predictive performance across species and protein sequence lengths, and by visualizing the clustering of ensemble and non-ensemble features.
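The integration step summarized in the description (features from several protein language models plus an evolutionary coupling feature, mapped to a measured property by a linear model) can be illustrated with a minimal sketch. This is not the authors' implementation: the random arrays stand in for real embeddings, the names `plm_a`, `plm_b`, `coupling`, `ensemble_features`, and `fit_ridge` are hypothetical, and the ridge penalty is an assumption added for numerical stability (the record only says "linear regression").

```python
import numpy as np

def standardize(block):
    """Z-score each column so no feature source dominates by scale."""
    mu = block.mean(axis=0)
    sigma = block.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant columns
    return (block - mu) / sigma

def ensemble_features(blocks):
    """Concatenate per-variant feature blocks from several sources."""
    return np.hstack([standardize(b) for b in blocks])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def predict(X, w, b):
    return X @ w + b

# Synthetic stand-ins (assumption): one row per mutant variant.
rng = np.random.default_rng(0)
n_variants = 200
plm_a = rng.normal(size=(n_variants, 16))    # embedding from hypothetical model A
plm_b = rng.normal(size=(n_variants, 8))     # embedding from hypothetical model B
coupling = rng.normal(size=(n_variants, 1))  # hypothetical evolutionary coupling score

# For brevity the blocks are standardized over all variants; a real
# pipeline would compute these statistics on the training split only.
X = ensemble_features([plm_a, plm_b, coupling])
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + 0.1 * rng.normal(size=n_variants)  # synthetic "measured" property

train, test = slice(0, 150), slice(150, None)
w, b = fit_ridge(X[train], y[train], alpha=1.0)
corr = np.corrcoef(predict(X[test], w, b), y[test])[0, 1]
print(f"held-out Pearson r = {corr:.3f}")
```

Because the synthetic target is linear in the features by construction, the held-out correlation here only confirms that the code runs end to end; it says nothing about performance on real deep mutational scanning data.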
format	Online Article Text
id	pubmed-10671426
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10671426 2023-11-18 Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction Qu, Yang; Niu, Zitong; Ding, Qiaojiao; Zhao, Taowa; Kong, Tong; Bai, Bing; Ma, Jianwei; Zhao, Yitian; Zheng, Jianping Int J Mol Sci Article MDPI 2023-11-18 /pmc/articles/PMC10671426/ /pubmed/38003686 http://dx.doi.org/10.3390/ijms242216496 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction
topic	Article

Similar Items