Cargando…

On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences

Given a set S of n locally aligned sequences, it is a needed prerequisite to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new method of clustering which partitions S into subsets suc...

Descripción completa

Detalles Bibliográficos
Autores principales:	Lehmann, Katharina A, Kaufmann, Michael, Steigele, Stephan, Nieselt, Kay
Formato:	Texto
Lenguaje:	English
Publicado:	BioMed Central 2006
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1524773/ https://www.ncbi.nlm.nih.gov/pubmed/16737529 http://dx.doi.org/10.1186/1748-7188-1-9

_version_	1782128842474782720
author	Lehmann, Katharina A Kaufmann, Michael Steigele, Stephan Nieselt, Kay
author_facet	Lehmann, Katharina A Kaufmann, Michael Steigele, Stephan Nieselt, Kay
author_sort	Lehmann, Katharina A
collection	PubMed
description	Given a set S of n locally aligned sequences, it is a needed prerequisite to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new method of clustering which partitions S into subsets such that the overlap of each pair of sequences within a subset is at least a given percentage c of the lengths of the two sequences. We show that this problem can be reduced to finding all maximal cliques in a special kind of max-tolerance graph which we call a c-max-tolerance graph. Previously we have shown that finding all maximal cliques in general max-tolerance graphs can be done efficiently in O(n(3 )+ out). Here, using a new kind of sweep-line algorithm, we show that the restriction to c-max-tolerance graphs yields a better runtime of O(n(2 )log n + out). Furthermore, we present another algorithm which is much easier to implement, and though theoretically slower than the first one, is still running in polynomial time. We then experimentally analyze the number and structure of all maximal cliques in a c-max-tolerance graph, depending on the chosen c-value. We apply our simple algorithm to artificial and biological data and we show that this implementation is much faster than the well-known application Cliquer. By introducing a new heuristic that uses the set of all maximal cliques to partition S, we finally show that the computed partition gives a reasonable clustering for biological data sets.
format	Text
id	pubmed-1524773
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-15247732006-07-29 On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences Lehmann, Katharina A Kaufmann, Michael Steigele, Stephan Nieselt, Kay Algorithms Mol Biol Research Given a set S of n locally aligned sequences, it is a needed prerequisite to partition it into groups of very similar sequences to facilitate subsequent computations, such as the generation of a phylogenetic tree. This article introduces a new method of clustering which partitions S into subsets such that the overlap of each pair of sequences within a subset is at least a given percentage c of the lengths of the two sequences. We show that this problem can be reduced to finding all maximal cliques in a special kind of max-tolerance graph which we call a c-max-tolerance graph. Previously we have shown that finding all maximal cliques in general max-tolerance graphs can be done efficiently in O(n(3 )+ out). Here, using a new kind of sweep-line algorithm, we show that the restriction to c-max-tolerance graphs yields a better runtime of O(n(2 )log n + out). Furthermore, we present another algorithm which is much easier to implement, and though theoretically slower than the first one, is still running in polynomial time. We then experimentally analyze the number and structure of all maximal cliques in a c-max-tolerance graph, depending on the chosen c-value. We apply our simple algorithm to artificial and biological data and we show that this implementation is much faster than the well-known application Cliquer. By introducing a new heuristic that uses the set of all maximal cliques to partition S, we finally show that the computed partition gives a reasonable clustering for biological data sets. BioMed Central 2006-05-31 /pmc/articles/PMC1524773/ /pubmed/16737529 http://dx.doi.org/10.1186/1748-7188-1-9 Text en Copyright © 2006 Lehmann et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Lehmann, Katharina A Kaufmann, Michael Steigele, Stephan Nieselt, Kay On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
title	On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
title_full	On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
title_fullStr	On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
title_full_unstemmed	On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
title_short	On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
title_sort	on the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1524773/ https://www.ncbi.nlm.nih.gov/pubmed/16737529 http://dx.doi.org/10.1186/1748-7188-1-9
work_keys_str_mv	AT lehmannkatharinaa onthemaximalcliquesincmaxtolerancegraphsandtheirapplicationinclusteringmolecularsequences AT kaufmannmichael onthemaximalcliquesincmaxtolerancegraphsandtheirapplicationinclusteringmolecularsequences AT steigelestephan onthemaximalcliquesincmaxtolerancegraphsandtheirapplicationinclusteringmolecularsequences AT nieseltkay onthemaximalcliquesincmaxtolerancegraphsandtheirapplicationinclusteringmolecularsequences

On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences

Ejemplares similares