A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides

Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately ide...

Bibliographic Details
Main authors: Charoenkwan, Phasit, Chotpatiwetchkul, Warot, Lee, Vannajan Sanghiran, Nantasenamat, Chanin, Shoombuatong, Watshara
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8664844/
https://www.ncbi.nlm.nih.gov/pubmed/34893688
http://dx.doi.org/10.1038/s41598-021-03293-w
_version_ 1784613927866335232
author Charoenkwan, Phasit
Chotpatiwetchkul, Warot
Lee, Vannajan Sanghiran
Nantasenamat, Chanin
Shoombuatong, Watshara
author_facet Charoenkwan, Phasit
Chotpatiwetchkul, Warot
Lee, Vannajan Sanghiran
Nantasenamat, Chanin
Shoombuatong, Watshara
author_sort Charoenkwan, Phasit
collection PubMed
description Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906–0.910) and 2–17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
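The abstract describes scoring a sequence by combining g-gap dipeptide composition with per-dipeptide propensity scores. The following is a minimal sketch of that general idea only, not the authors' implementation: the function names, the example propensity values, and the threshold are hypothetical, and the record does not state how SCMTPP estimates or optimizes its scores.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence, gap=0):
    """Frequency of each residue pair whose members are `gap` positions apart.

    gap=0 gives ordinary (adjacent) dipeptides.
    """
    sequence = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


def scm_score(sequence, propensity, gap=0):
    """Weighted sum of dipeptide frequencies and propensity scores.

    `propensity` maps dipeptides to scores; missing dipeptides count as 0
    (an assumption made for this sketch).
    """
    composition = g_gap_dipeptide_composition(sequence, gap)
    return sum(freq * propensity.get(pair, 0.0)
               for pair, freq in composition.items())


def predict_is_thermophilic(sequence, propensity, threshold, gap=0):
    """Classify as thermophilic when the score exceeds a chosen threshold."""
    return scm_score(sequence, propensity, gap) > threshold


# Illustrative values only; these are not scores from the paper.
example_propensity = {"AC": 900.0, "CA": 300.0}
example_score = scm_score("ACAC", example_propensity)
```

Because the score is a plain weighted sum, each dipeptide's contribution to a prediction can be read directly from its propensity value, which is the kind of interpretability the abstract emphasizes.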
format Online
Article
Text
id pubmed-8664844
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8664844 2021-12-13 A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides Charoenkwan, Phasit Chotpatiwetchkul, Warot Lee, Vannajan Sanghiran Nantasenamat, Chanin Shoombuatong, Watshara Sci Rep Article Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906–0.910) and 2–17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs. Nature Publishing Group UK 2021-12-10 /pmc/articles/PMC8664844/ /pubmed/34893688 http://dx.doi.org/10.1038/s41598-021-03293-w Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Article
Charoenkwan, Phasit
Chotpatiwetchkul, Warot
Lee, Vannajan Sanghiran
Nantasenamat, Chanin
Shoombuatong, Watshara
A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides
title A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides
title_sort novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8664844/
https://www.ncbi.nlm.nih.gov/pubmed/34893688
http://dx.doi.org/10.1038/s41598-021-03293-w