Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing

BACKGROUND: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing tec...

Bibliographic Details
Main authors:	Wittkop, Tobias; Baumbach, Jan; Lobo, Francisco P; Rahmann, Sven
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2147039/ https://www.ncbi.nlm.nih.gov/pubmed/17941985 http://dx.doi.org/10.1186/1471-2105-8-396

author	Wittkop, Tobias; Baumbach, Jan; Lobo, Francisco P; Rahmann, Sven
collection	PubMed
description	BACKGROUND: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy and at increased speed. RESULTS: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. CONCLUSION: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at .
format	Text
id	pubmed-2147039
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2147039 2007-12-19 BMC Bioinformatics Research Article BioMed Central 2007-10-17 /pmc/articles/PMC2147039/ /pubmed/17941985 http://dx.doi.org/10.1186/1471-2105-8-396 Text en Copyright © 2007 Wittkop et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2147039/ https://www.ncbi.nlm.nih.gov/pubmed/17941985 http://dx.doi.org/10.1186/1471-2105-8-396
