Partitioning clustering algorithms for protein sequence data sets

BACKGROUND: Genome-sequencing projects are currently producing an enormous number of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families, known as clustering, has become one of the principal research obje...

Bibliographic Details
Main authors: Fayech, Sondes, Essoussi, Nadia, Limam, Mohamed
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2678123/
https://www.ncbi.nlm.nih.gov/pubmed/19341454
http://dx.doi.org/10.1186/1756-0381-2-3
author Fayech, Sondes
Essoussi, Nadia
Limam, Mohamed
collection PubMed
description BACKGROUND: Genome-sequencing projects are currently producing an enormous number of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families, known as clustering, has become one of the principal research objectives in structural and functional genomics. Computer programs that automatically and accurately classify sequences into families have become a necessity. A significant number of methods have addressed the clustering of protein sequences, and most of them fall into three major groups: hierarchical, graph-based and partitioning methods. Among the sequence clustering methods in the literature, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are widely used in other fields, few applications have been reported in protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data or whether they are efficient compared with published clustering methods. METHODS: We developed four partitioning clustering approaches that use the Smith-Waterman local-alignment algorithm to determine pairwise sequence similarities. Four different sets of protein sequences were used as evaluation data sets for the proposed methods. RESULTS: We show that these methods outperform several other published clustering methods in terms of correctly predicting a classifier and, especially, in terms of the correctness of the predictions they provide. The software is available to academic users from the authors upon request.
format Text
id pubmed-2678123
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2678123 2009-05-07 Partitioning clustering algorithms for protein sequence data sets Fayech, Sondes Essoussi, Nadia Limam, Mohamed BioData Min Research BioMed Central 2009-04-02 /pmc/articles/PMC2678123/ /pubmed/19341454 http://dx.doi.org/10.1186/1756-0381-2-3 Text en Copyright © 2009 Fayech et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Partitioning clustering algorithms for protein sequence data sets
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2678123/
https://www.ncbi.nlm.nih.gov/pubmed/19341454
http://dx.doi.org/10.1186/1756-0381-2-3
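
The METHODS summary in the description field pairs Smith-Waterman local-alignment scores with partitioning clustering. The record does not say which four partitioning approaches the authors built or which scoring scheme they used, so the following is only a minimal sketch of the general idea under stated assumptions: a plain match/mismatch score with a linear gap penalty stands in for a substitution matrix, and a basic k-medoids loop stands in for the authors' methods. The names `smith_waterman` and `partition_by_medoids`, and all parameter values, are hypothetical.

```python
import random


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score of strings a and b.

    Assumption: simple match/mismatch scoring and a linear gap penalty,
    not a substitution matrix or affine gaps.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def partition_by_medoids(seqs, k, iters=20, seed=0):
    """Hypothetical k-medoids partitioning on pairwise alignment scores."""
    n = len(seqs)
    sim = [[smith_waterman(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]
    medoids = random.Random(seed).sample(range(n), k)

    def assign(current):
        # Each sequence joins the cluster of its most similar medoid.
        return [max(range(k), key=lambda c: sim[i][current[c]]) for i in range(n)]

    for _ in range(iters):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                updated.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member most similar to the rest of its cluster.
            updated.append(max(members, key=lambda m: sum(sim[m][i] for i in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids), medoids


if __name__ == "__main__":
    toy = ["MKVLAAGIV", "MKVLAAGIA", "MKVLCAGIV",
           "WWPPQQRRT", "WWPPQQRRS", "WWPPQHRRT"]
    labels, medoids = partition_by_medoids(toy, k=2)
    print(labels, [toy[m] for m in medoids])
```

Raw alignment scores grow with sequence length, so a real pipeline would likely normalize them before clustering; that step is left out here to keep the sketch short.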