
Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization

The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, to improve prediction results in terms of both accuracy an...


Bibliographic Details
Main authors: Wattanapornprom, Warin, Thammarongtham, Chinae, Hongsthong, Apiradee, Lertampaiporn, Supatcha
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8066735/
https://www.ncbi.nlm.nih.gov/pubmed/33808227
http://dx.doi.org/10.3390/life11040293
author Wattanapornprom, Warin
Thammarongtham, Chinae
Hongsthong, Apiradee
Lertampaiporn, Supatcha
collection PubMed
description Accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for predicting protein subcellular localization in plants, based on multiple classifiers, that improves both the accuracy and the reliability of the predictions. Predicting plant protein subcellular localization is challenging because the underlying problem is not only multiclass but also multilabel. Plant proteins can generally be found in 10–14 locations/compartments. The number of proteins in some compartments (nucleus, cytoplasm, and mitochondria) is generally much greater than in others (vacuole, peroxisome, Golgi, and cell wall), so the data are usually imbalanced. To address this, we propose an ensemble machine learning method based on average voting among heterogeneous classifiers. We first extracted various types of features suitable for each type of protein localization, forming a total of 479 feature spaces. Feature selection methods were then used to reduce the dimensionality to smaller, informative feature subsets. The reduced feature subset was used to train three different individual models. We combined the outputs of these three classifiers by average voting to obtain the final probability prediction. Based on the voting probability, the method can predict both single- and multilabel subcellular localizations. Experimental results indicated that the proposed ensemble method achieved an overall accuracy of 84.58% for 11 compartments on the testing dataset.
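The average-voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the three classifiers, the compartments used in the example, or the decision rule for multilabel output, so the probabilities, the 0.5 threshold, and the fallback to the single best compartment below are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Maps a compartment name to a predicted probability (hypothetical type alias).
Probs = Dict[str, float]


def average_vote(model_outputs: List[Probs]) -> Probs:
    """Average the per-compartment probabilities across all classifiers."""
    if not model_outputs:
        raise ValueError("need at least one classifier output")
    labels = model_outputs[0].keys()
    n = len(model_outputs)
    return {lab: sum(out[lab] for out in model_outputs) / n for lab in labels}


def predict_locations(avg: Probs, threshold: float = 0.5) -> List[str]:
    """Return every compartment whose averaged probability reaches the threshold.

    The threshold value and the fallback to the single most probable
    compartment are assumptions made for this sketch.
    """
    chosen = sorted(lab for lab, p in avg.items() if p >= threshold)
    if not chosen:
        chosen = [max(avg, key=avg.get)]
    return chosen


# Made-up outputs from three hypothetical classifiers for one protein.
outputs = [
    {"nucleus": 0.75, "cytoplasm": 0.5, "vacuole": 0.0},
    {"nucleus": 0.5, "cytoplasm": 0.75, "vacuole": 0.25},
    {"nucleus": 1.0, "cytoplasm": 0.25, "vacuole": 0.0},
]
avg = average_vote(outputs)
locations = predict_locations(avg)
print(avg)
print(locations)
```

In this toy case two compartments reach the threshold after averaging, so the protein receives a multilabel prediction; a protein with only one compartment above the threshold would receive a single label.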
format Online
Article
Text
id pubmed-8066735
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8066735 2021-04-25 Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization Wattanapornprom, Warin Thammarongtham, Chinae Hongsthong, Apiradee Lertampaiporn, Supatcha Life (Basel) Article MDPI 2021-03-30 /pmc/articles/PMC8066735/ /pubmed/33808227 http://dx.doi.org/10.3390/life11040293 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8066735/
https://www.ncbi.nlm.nih.gov/pubmed/33808227
http://dx.doi.org/10.3390/life11040293