
Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors

BACKGROUND: Classifying proteins with specific functions is important for biological research. Approaches for encoding protein sequences during feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification o...


Bibliographic Details
Main authors: Zhang, Jian; Lv, Lixin; Lu, Donglei; Kong, Denan; Al-Alashaari, Mohammed Abdoh Ali; Zhao, Xudong
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7590791/
https://www.ncbi.nlm.nih.gov/pubmed/33109082
http://dx.doi.org/10.1186/s12859-020-03826-6
author Zhang, Jian
Lv, Lixin
Lu, Donglei
Kong, Denan
Al-Alashaari, Mohammed Abdoh Ali
Zhao, Xudong
collection PubMed
description BACKGROUND: Classifying proteins with specific functions is important for biological research. Approaches for encoding protein sequences during feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used to classify protein sequences under various encoding approaches. Commonly, protein sequences carry labels corresponding to different categories of biological function (e.g., bacterial type IV secreted effector or not), which makes protein prediction a fantasy. For protein prediction, a kernel set of protein sequences whose labels are certified by biological experiments should exist in advance. However, such a set is rarely seen in the prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. For protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. In addition, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. RESULTS: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking as a case a benchmark dataset of 1947 protein sequences, composed of 399 bacterial type IV secreted effectors (T4SE) and 1548 non-T4SE, experiments are conducted to identify T4SE from protein sequences. Comparable, quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method. CONCLUSIONS: Certain variables, rather than the whole encoded feature they belong to, are effective for discriminating between different types of proteins. In addition, ensemble classifiers with automatic assignment of different base classifiers achieve a better classification result.
format Online
Article
Text
id pubmed-7590791
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7590791 2020-10-27 BMC Bioinformatics Methodology Article BioMed Central 2020-10-27 /pmc/articles/PMC7590791/ /pubmed/33109082 http://dx.doi.org/10.1186/s12859-020-03826-6 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors
topic Methodology Article