Protein sequences classification by means of feature extraction with substitution matrices
BACKGROUND: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors to enable using well-known machine-learning classifiers which require...
Main authors: | Saidi, Rabie; Maddouri, Mondher; Mephu Nguifo, Engelbert |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | Research article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2868007/ https://www.ncbi.nlm.nih.gov/pubmed/20377887 http://dx.doi.org/10.1186/1471-2105-11-175 |
_version_ | 1782181025245298688 |
---|---|
author | Saidi, Rabie Maddouri, Mondher Mephu Nguifo, Engelbert |
author_facet | Saidi, Rabie Maddouri, Mondher Mephu Nguifo, Engelbert |
author_sort | Saidi, Rabie |
collection | PubMed |
description | BACKGROUND: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. RESULTS: In order to demonstrate the efficiency of such an approach, we compare several encoding methods using several machine-learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the accuracies obtained vary only slightly across substitution matrices. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. CONCLUSIONS: The outcomes of our comparative experiments confirm the efficiency of our encoding method in representing protein sequences in classification tasks. |
format | Text |
id | pubmed-2868007 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2868007 2010-05-12 Protein sequences classification by means of feature extraction with substitution matrices Saidi, Rabie Maddouri, Mondher Mephu Nguifo, Engelbert BMC Bioinformatics Research article BACKGROUND: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. RESULTS: In order to demonstrate the efficiency of such an approach, we compare several encoding methods using several machine-learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the accuracies obtained vary only slightly across substitution matrices. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. CONCLUSIONS: The outcomes of our comparative experiments confirm the efficiency of our encoding method in representing protein sequences in classification tasks. BioMed Central 2010-04-08 /pmc/articles/PMC2868007/ /pubmed/20377887 http://dx.doi.org/10.1186/1471-2105-11-175 Text en Copyright ©2010 Saidi et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research article Saidi, Rabie Maddouri, Mondher Mephu Nguifo, Engelbert Protein sequences classification by means of feature extraction with substitution matrices |
title | Protein sequences classification by means of feature extraction with substitution matrices |
title_full | Protein sequences classification by means of feature extraction with substitution matrices |
title_fullStr | Protein sequences classification by means of feature extraction with substitution matrices |
title_full_unstemmed | Protein sequences classification by means of feature extraction with substitution matrices |
title_short | Protein sequences classification by means of feature extraction with substitution matrices |
title_sort | protein sequences classification by means of feature extraction with substitution matrices |
topic | Research article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2868007/ https://www.ncbi.nlm.nih.gov/pubmed/20377887 http://dx.doi.org/10.1186/1471-2105-11-175 |
work_keys_str_mv | AT saidirabie proteinsequencesclassificationbymeansoffeatureextractionwithsubstitutionmatrices AT maddourimondher proteinsequencesclassificationbymeansoffeatureextractionwithsubstitutionmatrices AT mephunguifoengelbert proteinsequencesclassificationbymeansoffeatureextractionwithsubstitutionmatrices |