
Predicting antifreeze proteins with weighted generalized dipeptide composition and multi-regression feature selection ensemble


Bibliographic Details
Main authors: Wang, Shunfang, Deng, Lin, Xia, Xinnan, Cao, Zicheng, Fei, Yu
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8220696/
https://www.ncbi.nlm.nih.gov/pubmed/34162327
http://dx.doi.org/10.1186/s12859-021-04251-z
author Wang, Shunfang
Deng, Lin
Xia, Xinnan
Cao, Zicheng
Fei, Yu
collection PubMed
description BACKGROUND: Antifreeze proteins (AFPs) are a group of proteins that inhibit the growth of ice crystals in body fluids and thus improve an organism's antifreeze ability. They are vital to the survival of living organisms in extremely cold environments. However, little research has addressed sequence feature extraction and selection for antifreeze protein classification, although this is of great significance for structure and function prediction. RESULTS: To predict antifreeze proteins, this paper proposes a feature representation, weighted generalized dipeptide composition (W-GDipC), and a two-stage, multi-regression ensemble feature selection method (LRMR-Ri). In the first stage of LRMR-Ri, four feature selection algorithms (Lasso regression, Ridge regression, maximal information coefficient (MIC) and Relief) each select a feature set. If the four sets share a common feature subset, that subset is taken as the optimal subset; otherwise, in the second stage, Ridge regression selects the optimal subset from the set pooled from the four sets. LRMR-Ri combined with W-GDipC was evaluated on an antifreeze protein dataset (binary classification) and on a membrane protein dataset (multi-class classification). Experimental results show that the method performs well with support vector machine (SVM), decision tree (DT) and stochastic gradient descent (SGD) classifiers. With the antifreeze protein dataset and the SVM classifier, LRMR-Ri with W-GDipC reached ACC, RE and MCC values of 95.56%, 97.06% and 0.9105, respectively, much higher than those of each single method (Lasso, Ridge, MIC and Relief), with ACC nearly 13% higher than that of Lasso alone.
CONCLUSION: The experimental results show that the proposed LRMR-Ri and W-GDipC method can significantly improve the accuracy of antifreeze protein prediction compared with similar single feature selection methods. In addition, the method also achieved good results in the classification and prediction of membrane proteins, which supports its broader reliability to a certain extent.
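
The two-stage decision rule summarized in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the first-stage sets are produced, how the sets are pooled, or how Ridge regression picks the final subset. Here the first-stage sets are simply passed in, "pooled" is taken to mean the union of the four sets, and the second stage keeps the k pooled features with the largest absolute closed-form ridge coefficients. The names two_stage_select and ridge_coefficients and the parameters k and alpha are hypothetical.

```python
import numpy as np


def ridge_coefficients(X, y, alpha=1.0):
    """Closed-form ridge weights on centered data: (X^T X + alpha*I)^-1 X^T y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, Xc.T @ yc)


def two_stage_select(candidate_sets, X, y, k, alpha=1.0):
    """Return sorted column indices chosen by the two-stage rule.

    candidate_sets: feature-index sets from the first stage (in the paper,
    one each from Lasso, Ridge, MIC and Relief; here they are given).
    """
    sets = [set(s) for s in candidate_sets]

    # Case 1: the first-stage sets share features, so use the common subset.
    common = set.intersection(*sets)
    if common:
        return sorted(common)

    # Case 2 (assumption): pool the sets by union, then rank the pooled
    # features by absolute ridge coefficient and keep the top k.
    pooled = sorted(set.union(*sets))
    weights = ridge_coefficients(X[:, pooled], y, alpha)
    top = np.argsort(-np.abs(weights))[:k]
    return sorted(int(pooled[i]) for i in top)
```

Calling two_stage_select with four index sets, a feature matrix X of shape (samples, features) and a numeric label vector y returns the shared indices when the sets overlap, and otherwise the k pooled indices ranked highest by the ridge step. Ranking by raw coefficient magnitude implicitly assumes the features are on comparable scales.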
format Online
Article
Text
id pubmed-8220696
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8220696 2021-06-23 BMC Bioinformatics Research BioMed Central 2021-06-23 /pmc/articles/PMC8220696/ /pubmed/34162327 http://dx.doi.org/10.1186/s12859-021-04251-z Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Predicting antifreeze proteins with weighted generalized dipeptide composition and multi-regression feature selection ensemble
topic Research