Mining for class-specific motifs in protein sequence classification

BACKGROUND: In protein sequence classification, identification of the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to exp...

Bibliographic Details
Main Authors:	Srinivasan, Satish M; Vural, Suleyman; King, Brian R; Guda, Chittibabu
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3610217/ https://www.ncbi.nlm.nih.gov/pubmed/23496846 http://dx.doi.org/10.1186/1471-2105-14-96

author	Srinivasan, Satish M; Vural, Suleyman; King, Brian R; Guda, Chittibabu
collection	PubMed
description	BACKGROUND: In protein sequence classification, identifying the sequence motifs or n-grams that precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features contribute to the accuracy. We hypothesize that short subsequences (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class. RESULTS: We present a scoring function based on discriminative n-grams that can effectively discriminate between classes. The scoring function first harvests the entire set of 4- to 8-grams from the protein sequences of the different classes in the dataset. Similar n-grams of the same size are combined to form new n-grams, where similarity is defined by positive amino acid substitution scores in the BLOSUM62 matrix. Substitution resulted in a large increase in the number of discriminative n-grams harvested. Because the dataset is unbalanced, the n-gram frequencies are normalized using a dampening factor, which gives more weight to n-grams that appear in fewer classes and less weight to those that appear in more. After normalization, the scoring function identifies, for each class, the discriminative 4- to 8-grams that are frequent enough to exceed a selection threshold. By mapping these discriminative n-grams back to the protein sequences, we obtained contiguous n-grams that represent short class-specific motifs in protein sequences. Our method fared well compared to an existing motif-finding method known as Wordspy. We have validated our enriched set of class-specific motifs against the functionally important motifs obtained from the NLSdb, Prosite and ELM databases. We demonstrate that this method is generic and can thus be widely applied to detect class-specific motifs in many protein sequence classification tasks. CONCLUSION: The proposed scoring function and methodology are able to identify class-specific motifs using discriminative n-grams derived from the protein sequences. The use of amino acid substitution scores for similarity detection and of the dampening factor to normalize the unbalanced datasets has a significant effect on the performance of the scoring function. Our multipronged validation tests demonstrate that this method can detect class-specific motifs from a wide variety of protein sequence classes, with a potential application to detecting proteome-specific motifs of different organisms.
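The harvesting, normalization and thresholding steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the dampening factor here is assumed to be an IDF-style term, log(number of classes / number of classes containing the n-gram), and the frequency is assumed to be the count divided by the number of sequences in the class. The BLOSUM62-based merging of similar n-grams is omitted, and the function names `harvest_ngrams` and `discriminative_ngrams` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def harvest_ngrams(sequences, n_min=4, n_max=8):
    """Count every contiguous n-gram of length n_min..n_max in the sequences."""
    counts = Counter()
    for seq in sequences:
        for n in range(n_min, n_max + 1):
            for i in range(len(seq) - n + 1):
                counts[seq[i:i + n]] += 1
    return counts


def discriminative_ngrams(classes, threshold, n_min=4, n_max=8):
    """Return {class label: {n-gram: score}} for scores at or above threshold.

    ASSUMPTION (not taken from the paper): the score is
        (count / sequences in class) * log(total classes / classes containing it)
    so an n-gram found in every class gets a dampening factor of zero.
    """
    per_class = {
        label: harvest_ngrams(seqs, n_min, n_max)
        for label, seqs in classes.items()
    }

    # Number of classes in which each n-gram occurs at least once.
    class_spread = Counter()
    for counts in per_class.values():
        class_spread.update(counts.keys())

    n_classes = len(classes)
    selected = defaultdict(dict)
    for label, counts in per_class.items():
        n_sequences = len(classes[label])
        for gram, count in counts.items():
            dampening = math.log(n_classes / class_spread[gram])
            score = (count / n_sequences) * dampening
            if score >= threshold:
                selected[label][gram] = score
    return dict(selected)
```

Called as `discriminative_ngrams({"nuclear": [...], "other": [...]}, threshold=0.5)` on lists of amino acid strings, the sketch returns, for each class, the n-grams whose dampened frequency reaches the threshold. Dividing by class size is one simple way to keep a large class from dominating a small one; the published method may normalize differently.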
format	Online Article Text
id	pubmed-3610217
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3610217 2013-04-01 Mining for class-specific motifs in protein sequence classification Srinivasan, Satish M; Vural, Suleyman; King, Brian R; Guda, Chittibabu BMC Bioinformatics Methodology Article BioMed Central 2013-03-15 /pmc/articles/PMC3610217/ /pubmed/23496846 http://dx.doi.org/10.1186/1471-2105-14-96 Text en Copyright ©2013 Srinivasan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Mining for class-specific motifs in protein sequence classification
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3610217/ https://www.ncbi.nlm.nih.gov/pubmed/23496846 http://dx.doi.org/10.1186/1471-2105-14-96