
Protein sequences classification by means of feature extraction with substitution matrices

BACKGROUND: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors to enable using well-known machine-learning classifiers which require...


Bibliographic Details
Main authors: Saidi, Rabie; Maddouri, Mondher; Mephu Nguifo, Engelbert
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2868007/
https://www.ncbi.nlm.nih.gov/pubmed/20377887
http://dx.doi.org/10.1186/1471-2105-11-175
author Saidi, Rabie
Maddouri, Mondher
Mephu Nguifo, Engelbert
collection PubMed
description BACKGROUND: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. RESULTS: To demonstrate the efficiency of this approach, we compare several encoding methods using several machine-learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the variance of the accuracies obtained with different substitution matrices is slight. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. CONCLUSIONS: The outcomes of our comparative experiments confirm the efficiency of our encoding method in representing protein sequences for classification tasks.
format Text
id pubmed-2868007
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2868007 2010-05-12 BMC Bioinformatics Research article BioMed Central 2010-04-08 /pmc/articles/PMC2868007/ /pubmed/20377887 http://dx.doi.org/10.1186/1471-2105-11-175 Text en Copyright ©2010 Saidi et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Protein sequences classification by means of feature extraction with substitution matrices
topic Research article