
Identifying Heat Shock Protein Families from Imbalanced Data by Using Combined Features

Heat shock proteins (HSPs) are ubiquitous in living organisms. HSPs are an essential component for cell growth and survival; the main function of HSPs is controlling the folding and unfolding process of proteins. According to molecular function and mass, HSPs are categorized into six different famil...


Bibliographic Details
Main authors: Jing, Xiao-Yang, Li, Feng-Min
Format: Online Article Text
Language: English
Published: Hindawi 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7530508/
https://www.ncbi.nlm.nih.gov/pubmed/33029195
http://dx.doi.org/10.1155/2020/8894478
author Jing, Xiao-Yang
Li, Feng-Min
collection PubMed
description Heat shock proteins (HSPs) are ubiquitous in living organisms. They are essential for cell growth and survival; their main function is to control the folding and unfolding of proteins. According to molecular function and mass, HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100. This paper proposes improved methods for HSP prediction: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected as features to predict HSPs with a support vector machine (SVM). To overcome the imbalanced-data classification problem, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. With the balanced dataset, the overall accuracy in the jackknife test was 99.72% using the optimized feature combination SAAC+DC+CTF+PseACS, which was 4.81% higher than that obtained with the imbalanced dataset and the same feature combination. The sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) for the HSP families in the predictive model were higher than those of existing methods. This improved method may be helpful for protein function prediction.
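As a rough illustration of two ideas named in the description, the sketch below computes a dipeptide composition (DC) vector and generates SMOTE-style synthetic minority samples by interpolating between neighbouring minority points. It is a minimal, generic sketch and not the authors' implementation: the function names, the handling of non-standard residues, and the parameters `k` and `seed` are all assumptions made here, and the record does not state the exact feature definitions or settings used in the paper.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # Rank the other minority samples by squared Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != idx]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

In a pipeline like the one described, vectors from several feature types would be concatenated per protein, the under-represented families would be oversampled, and the balanced set would then be passed to a classifier. A maintained SMOTE implementation, such as the one in the imbalanced-learn package, would normally be preferred over this hand-rolled version.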
format Online
Article
Text
id pubmed-7530508
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-7530508 2020-10-06 Identifying Heat Shock Protein Families from Imbalanced Data by Using Combined Features. Jing, Xiao-Yang; Li, Feng-Min. Comput Math Methods Med. Research Article. Hindawi 2020-09-23 /pmc/articles/PMC7530508/ /pubmed/33029195 http://dx.doi.org/10.1155/2020/8894478 Text en Copyright © 2020 Xiao-Yang Jing and Feng-Min Li. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Identifying Heat Shock Protein Families from Imbalanced Data by Using Combined Features
topic Research Article