
CLUSS: Clustering of protein sequences based on a new similarity measure

BACKGROUND: The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowled...


Bibliographic Details
Main authors: Kelil, Abdellali; Wang, Shengrui; Brzezinski, Ryszard; Fleury, Alain
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1976428/
https://www.ncbi.nlm.nih.gov/pubmed/17683581
http://dx.doi.org/10.1186/1471-2105-8-286
collection PubMed
description BACKGROUND: The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions". RESULTS: To show the effectiveness of CLUSS, we performed an extensive clustering on the COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity. CONCLUSION: We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for the alignment-dependent algorithms.
format Text
id pubmed-1976428
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1976428 2007-09-14 CLUSS: Clustering of protein sequences based on a new similarity measure Kelil, Abdellali Wang, Shengrui Brzezinski, Ryszard Fleury, Alain BMC Bioinformatics Software BioMed Central 2007-08-04 /pmc/articles/PMC1976428/ /pubmed/17683581 http://dx.doi.org/10.1186/1471-2105-8-286 Text en Copyright © 2007 Kelil et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Software