
A SNARE Protein Identification Method Based on iLearnPlus to Efficiently Solve the Data Imbalance Problem

Machine learning has been widely used to solve complex problems in engineering applications and scientific fields, and many machine learning-based methods have achieved good results in different fields. SNAREs are key elements of membrane fusion and required for the fusion process of stable intermed...


Bibliographic Details
Main Authors: Ma, Dong, Chen, Zhihua, He, Zhanpeng, Huang, Xueqin
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8832978/
https://www.ncbi.nlm.nih.gov/pubmed/35154261
http://dx.doi.org/10.3389/fgene.2021.818841
collection PubMed
description Machine learning has been widely used to solve complex problems in engineering applications and scientific fields, and many machine learning-based methods have achieved good results in different fields. SNAREs are key elements of membrane fusion and required for the fusion process of stable intermediates. They are also associated with the formation of some psychiatric disorders. This study processes the original sequence data with the synthetic minority oversampling technique (SMOTE) to solve the problem of data imbalance and produces the most suitable machine learning model with the iLearnPlus platform for the identification of SNARE proteins. Ultimately, a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the cross-validation dataset, and a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the independent dataset (the adaptive skip dipeptide composition descriptor was used for feature extraction, and LightGBM with proper parameters was used as the classifier). These results demonstrate that this combination can perform well in the classification of SNARE proteins and is superior to other methods.
format Online
Article
Text
id pubmed-8832978
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8832978 2022-02-12 A SNARE Protein Identification Method Based on iLearnPlus to Efficiently Solve the Data Imbalance Problem Ma, Dong Chen, Zhihua He, Zhanpeng Huang, Xueqin Front Genet Genetics Frontiers Media S.A. 2022-01-28 /pmc/articles/PMC8832978/ /pubmed/35154261 http://dx.doi.org/10.3389/fgene.2021.818841 Text en
Copyright © 2022 Ma, Chen, He and Huang. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
topic Genetics