NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data

Non-classically secreted proteins (NCSPs) are proteins that are located in the extracellular environment, although there is a lack of known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of their b...

Bibliographic Details
Main Authors: Wang, Chao; Wu, Jin; Xu, Lei; Zou, Quan
Format: Online Article Text
Language: English
Published: Microbiology Society 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8116686/
https://www.ncbi.nlm.nih.gov/pubmed/33245691
http://dx.doi.org/10.1099/mgen.0.000483
author Wang, Chao
Wu, Jin
Xu, Lei
Zou, Quan
author_facet Wang, Chao
Wu, Jin
Xu, Lei
Zou, Quan
author_sort Wang, Chao
collection PubMed
description Non-classically secreted proteins (NCSPs) are proteins that are found in the extracellular environment despite lacking known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of these functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms; however, existing methods for NCSP identification have performed unsatisfactorily and, in particular, suffer from data deficiency and possible overfitting. Further improvement is desirable, especially in addressing the lack of informative features and in mining subset-specific features from imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in Gram-positive bacteria. First, to address the possible prediction bias caused by data imbalance, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation for each training subdataset. Third, a subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred, which achieved an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthews correlation coefficient of 87.68 % and an area under the curve of 0.9975 in ten-fold cross-validation. On the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
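The first two steps named in the description (balanced subdatasets, then F-score feature ranking) and the Matthews correlation coefficient it reports can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that each balanced subdataset is built by random undersampling to the size of the smaller class, and that "F-score" means the Fisher-style score of Chen and Lin (between-class separation of the means divided by the sum of within-class sample variances). All function names are hypothetical, and the sequential forward search and the classifier itself are omitted.

```python
import math
import random
from statistics import mean, variance


def balanced_subsets(pos, neg, n_subsets=10, seed=0):
    """Build n_subsets class-balanced (positives, negatives) pairs.

    Assumption: each subset is drawn by random undersampling of both
    classes down to the size of the smaller one.
    """
    rng = random.Random(seed)
    k = min(len(pos), len(neg))
    return [(rng.sample(pos, k), rng.sample(neg, k)) for _ in range(n_subsets)]


def f_score(pos_values, neg_values):
    """Fisher-style F-score of a single feature (assumed definition)."""
    overall = mean(pos_values + neg_values)
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)  # sample variances (n - 1)
    return between / within if within > 0 else 0.0


def rank_features(pos_rows, neg_rows):
    """Return feature indices ordered from most to least discriminative."""
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data (invented for illustration): feature 0 separates the classes,
    # feature 1 does not.
    pos = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0]]
    neg = [[3.0, 2.0], [3.1, 4.0], [2.9, 3.0], [3.2, 1.0], [3.0, 5.0]]
    for sub_pos, sub_neg in balanced_subsets(pos, neg, n_subsets=3):
        print(rank_features(sub_pos, sub_neg))
```

In a pipeline of the kind the description outlines, the ranking computed on each balanced subset would feed a forward search that adds features one at a time while cross-validated performance improves, and the per-subset models would then be combined into one ensemble.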
format Online
Article
Text
id pubmed-8116686
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-8116686 2021-05-13 NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data. Wang, Chao; Wu, Jin; Xu, Lei; Zou, Quan. Microb Genom. Method. Microbiology Society 2020-11-27 /pmc/articles/PMC8116686/ /pubmed/33245691 http://dx.doi.org/10.1099/mgen.0.000483 Text en © 2020 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License.
spellingShingle Method
Wang, Chao
Wu, Jin
Xu, Lei
Zou, Quan
NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data
title NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8116686/
https://www.ncbi.nlm.nih.gov/pubmed/33245691
http://dx.doi.org/10.1099/mgen.0.000483