Cargando…

Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory

Analysis of protein data sets often requires prior removal of redundancy, so that data is not biased by containing similar proteins. This is usually achieved by pairwise comparison of sequences, followed by purging so that no two pairs have similarities above a chosen threshold. From a starting set,...

Descripción completa

Detalles Bibliográficos
Autores principales:	Bull, Simon C., Muldoon, Mark R., Doig, Andrew J.
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Public Library of Science 2013
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564766/ https://www.ncbi.nlm.nih.gov/pubmed/23393584 http://dx.doi.org/10.1371/journal.pone.0055484

_version_	1782258350456569856
author	Bull, Simon C. Muldoon, Mark R. Doig, Andrew J.
author_facet	Bull, Simon C. Muldoon, Mark R. Doig, Andrew J.
author_sort	Bull, Simon C.
collection	PubMed
description	Analysis of protein data sets often requires prior removal of redundancy, so that data is not biased by containing similar proteins. This is usually achieved by pairwise comparison of sequences, followed by purging so that no two pairs have similarities above a chosen threshold. From a starting set, such as the PDB or a genome, one should remove as few sequences as possible, to give the largest possible non-redundant set for subsequent analysis. Protein redundancy can be represented as a graph, with proteins as nodes connected by undirected edges, if they have a pairwise similarity above the chosen threshold. The problem is then equivalent to finding the maximum independent set (MIS), where as few nodes are removed as possible to remove all edges. We tested seven MIS algorithms, three of which are new. We applied the methods to the PDB, subsets of the PDB, various genomes and the BHOLSIB benchmark datasets. For PDB subsets of up to 1000 proteins, we could compare to the exact MIS, found by the Cliquer algorithm. The best algorithm was the new method, Leaf. This works by adding clique members that have no edges to nodes outside the clique to the MIS, starting with the smallest cliques. For PDB subsets of up to 1000 members, it usually finds the MIS and is fast enough to apply to data sets of tens of thousands of proteins. Leaf gives sets that are around 10% larger than the commonly used PISCES algorithm, that are of identical quality. We therefore suggest that Leaf should be the method of choice for generating non-redundant protein data sets, though it is ineffective on dense graphs, such as the BHOLSIB benchmarks. The Leaf algorithm is available at: https://github.com/SimonCB765/Leaf, and sets from genomes and the PDB are available at: http://www.bioinf.manchester.ac.uk/leaf/.
format	Online Article Text
id	pubmed-3564766
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-35647662013-02-07 Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory Bull, Simon C. Muldoon, Mark R. Doig, Andrew J. PLoS One Research Article Analysis of protein data sets often requires prior removal of redundancy, so that data is not biased by containing similar proteins. This is usually achieved by pairwise comparison of sequences, followed by purging so that no two pairs have similarities above a chosen threshold. From a starting set, such as the PDB or a genome, one should remove as few sequences as possible, to give the largest possible non-redundant set for subsequent analysis. Protein redundancy can be represented as a graph, with proteins as nodes connected by undirected edges, if they have a pairwise similarity above the chosen threshold. The problem is then equivalent to finding the maximum independent set (MIS), where as few nodes are removed as possible to remove all edges. We tested seven MIS algorithms, three of which are new. We applied the methods to the PDB, subsets of the PDB, various genomes and the BHOLSIB benchmark datasets. For PDB subsets of up to 1000 proteins, we could compare to the exact MIS, found by the Cliquer algorithm. The best algorithm was the new method, Leaf. This works by adding clique members that have no edges to nodes outside the clique to the MIS, starting with the smallest cliques. For PDB subsets of up to 1000 members, it usually finds the MIS and is fast enough to apply to data sets of tens of thousands of proteins. Leaf gives sets that are around 10% larger than the commonly used PISCES algorithm, that are of identical quality. We therefore suggest that Leaf should be the method of choice for generating non-redundant protein data sets, though it is ineffective on dense graphs, such as the BHOLSIB benchmarks. The Leaf algorithm is available at: https://github.com/SimonCB765/Leaf, and sets from genomes and the PDB are available at: http://www.bioinf.manchester.ac.uk/leaf/. Public Library of Science 2013-02-05 /pmc/articles/PMC3564766/ /pubmed/23393584 http://dx.doi.org/10.1371/journal.pone.0055484 Text en © 2013 Bull et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Bull, Simon C. Muldoon, Mark R. Doig, Andrew J. Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory
title	Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory
title_full	Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory
title_fullStr	Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory
title_full_unstemmed	Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory
title_short	Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory
title_sort	maximising the size of non-redundant protein datasets using graph theory
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564766/ https://www.ncbi.nlm.nih.gov/pubmed/23393584 http://dx.doi.org/10.1371/journal.pone.0055484
work_keys_str_mv	AT bullsimonc maximisingthesizeofnonredundantproteindatasetsusinggraphtheory AT muldoonmarkr maximisingthesizeofnonredundantproteindatasetsusinggraphtheory AT doigandrewj maximisingthesizeofnonredundantproteindatasetsusinggraphtheory

Maximising the Size of Non-Redundant Protein Datasets Using Graph Theory

Ejemplares similares