A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data

The Golgi Apparatus (GA) is a major collection and dispatch station for numerous proteins destined for secretion, plasma membranes and lysosomes. The dysfunction of GA proteins can result in neurodegenerative diseases. Therefore, accurate identification of protein subGolgi localizations may assist i...

Bibliographic Details
Main Authors: Yang, Runtao, Zhang, Chengjin, Gao, Rui, Zhang, Lina
Format: Online Article Text
Language: English
Published: MDPI 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4783950/
https://www.ncbi.nlm.nih.gov/pubmed/26861308
http://dx.doi.org/10.3390/ijms17020218
_version_ 1782420185335988224
author Yang, Runtao
Zhang, Chengjin
Gao, Rui
Zhang, Lina
author_facet Yang, Runtao
Zhang, Chengjin
Gao, Rui
Zhang, Lina
author_sort Yang, Runtao
collection PubMed
description The Golgi Apparatus (GA) is a major collection and dispatch station for numerous proteins destined for secretion, plasma membranes and lysosomes. The dysfunction of GA proteins can result in neurodegenerative diseases. Therefore, accurate identification of protein subGolgi localizations may assist in drug development and in understanding the mechanisms of the GA involved in various cellular processes. In this paper, a new computational method is proposed for distinguishing cis-Golgi proteins from trans-Golgi proteins. Based on the concept of Common Spatial Patterns (CSP), a novel feature extraction technique is developed to extract evolutionary information from protein sequences. To deal with the imbalanced benchmark dataset, the Synthetic Minority Over-sampling Technique (SMOTE) is adopted. A feature selection method called Random Forest-Recursive Feature Elimination (RF-RFE) is employed to search for the optimal features among the CSP-based features and the g-gap dipeptide composition. Based on the optimal features, a Random Forest (RF) module is used to distinguish cis-Golgi proteins from trans-Golgi proteins. In jackknife cross-validation, the proposed method achieves a sensitivity of 0.889, a specificity of 0.880, an accuracy of 0.885, and a Matthews Correlation Coefficient (MCC) of 0.765, which remarkably outperforms previous methods. Moreover, when tested on a common independent dataset, the method also achieves a significantly improved performance. These results highlight the promising performance of the proposed method in identifying Golgi-resident protein types. Furthermore, the CSP-based feature extraction method may provide guidelines for protein function prediction.
format Online
Article
Text
id pubmed-4783950
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-4783950 2016-03-14 A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data Yang, Runtao Zhang, Chengjin Gao, Rui Zhang, Lina Int J Mol Sci Article
MDPI 2016-02-06 /pmc/articles/PMC4783950/ /pubmed/26861308 http://dx.doi.org/10.3390/ijms17020218 Text en © 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Yang, Runtao
Zhang, Chengjin
Gao, Rui
Zhang, Lina
A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data
title A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data
title_full A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data
title_fullStr A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data
title_full_unstemmed A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data
title_short A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data
title_sort novel feature extraction method with feature selection to identify golgi-resident protein types from imbalanced data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4783950/
https://www.ncbi.nlm.nih.gov/pubmed/26861308
http://dx.doi.org/10.3390/ijms17020218
work_keys_str_mv AT yangruntao anovelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
AT zhangchengjin anovelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
AT gaorui anovelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
AT zhanglina anovelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
AT yangruntao novelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
AT zhangchengjin novelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
AT gaorui novelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
AT zhanglina novelfeatureextractionmethodwithfeatureselectiontoidentifygolgiresidentproteintypesfromimbalanceddata
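
The abstract in this record reports four evaluation measures: sensitivity, specificity, accuracy, and the Matthews Correlation Coefficient (MCC). As a minimal sketch of how these are conventionally derived from a binary confusion matrix, the Python below uses the standard textbook definitions. The function name `binary_metrics` and the example counts are hypothetical illustrations; they are not the authors' code and not the confusion matrix behind the reported values.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and MCC for a binary classifier.

    tp, fn, tn, fp are counts of true positives, false negatives,
    true negatives and false positives (hypothetical inputs here).
    """
    sensitivity = tp / (tp + fn)            # fraction of positives recovered
    specificity = tn / (tn + fp)            # fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the formulas.
    for name, value in binary_metrics(tp=8, fn=2, tn=9, fp=1).items():
        print(f"{name}: {value:.3f}")
```

MCC is often preferred on imbalanced data because it uses all four cells of the confusion matrix, whereas accuracy alone can look high when one class dominates.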