The LabelHash algorithm for substructure matching

BACKGROUND: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consum...

Bibliographic Details
Main Authors:	Moll, Mark; Bryant, Drew H; Kavraki, Lydia E
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996407/ https://www.ncbi.nlm.nih.gov/pubmed/21070651 http://dx.doi.org/10.1186/1471-2105-11-555

author	Moll, Mark; Bryant, Drew H; Kavraki, Lydia E
collection	PubMed
description	BACKGROUND: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. RESULTS: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs at http://labelhash.kavrakilab.org. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose. CONCLUSIONS: LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.
format	Text
id	pubmed-2996407
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2996407 2011-01-05 The LabelHash algorithm for substructure matching Moll, Mark; Bryant, Drew H; Kavraki, Lydia E BMC Bioinformatics Methodology Article BioMed Central 2010-11-11 /pmc/articles/PMC2996407/ /pubmed/21070651 http://dx.doi.org/10.1186/1471-2105-11-555 Text en Copyright ©2010 Moll et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	The LabelHash algorithm for substructure matching
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996407/ https://www.ncbi.nlm.nih.gov/pubmed/21070651 http://dx.doi.org/10.1186/1471-2105-11-555
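
The two-phase scheme summarized in the description field above (preprocess the proteins for fast lookup of partial matches, then expand partial matches to complete ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the residue representation, the function names (build_index, expand, match), the use of residue pairs as the partial match, and the pairwise-distance tolerance test are all assumptions made for illustration. The published method's hashing scheme and match criteria are not given in this record.

```python
from collections import defaultdict
from itertools import permutations
from math import dist

# Assumed data model: a residue is (label, (x, y, z)); a protein or a motif
# is a list of residues. Labels stand in for residue types.

def build_index(proteins, max_dist=10.0):
    """Phase 1: hash every ordered residue pair closer than max_dist
    under its pair of labels, so partial matches are a dictionary lookup."""
    index = defaultdict(list)
    for pid, residues in proteins.items():
        for i, j in permutations(range(len(residues)), 2):
            (la, pa), (lb, pb) = residues[i], residues[j]
            if dist(pa, pb) <= max_dist:
                index[(la, lb)].append((pid, i, j))
    return index

def compatible(motif, residues, assigned, k, cand, tol):
    """True if residue `cand` can play motif point k: same label, not yet
    used, and distances to already-assigned residues agree within tol."""
    label, pos = residues[cand]
    if label != motif[k][0] or cand in assigned:
        return False
    return all(
        abs(dist(pos, residues[a][1]) - dist(motif[k][1], motif[m][1])) <= tol
        for m, a in enumerate(assigned)
    )

def expand(motif, residues, assigned, tol):
    """Phase 2: depth-first extension of a partial match to all motif points."""
    k = len(assigned)
    if k == len(motif):
        yield tuple(assigned)
        return
    for cand in range(len(residues)):
        if compatible(motif, residues, assigned, k, cand, tol):
            yield from expand(motif, residues, assigned + [cand], tol)

def match(motif, proteins, index, tol=1.0):
    """Look up partial matches for the first two motif points, then expand.
    Assumes the motif has at least two points no farther apart than the
    max_dist used when the index was built."""
    key = (motif[0][0], motif[1][0])
    d01 = dist(motif[0][1], motif[1][1])
    for pid, i, j in index.get(key, []):
        residues = proteins[pid]
        if abs(dist(residues[i][1], residues[j][1]) - d01) <= tol:
            for full in expand(motif, residues, [i, j], tol):
                yield pid, full
```

Calling build_index once over a protein collection and then match for each motif mirrors the split between a one-time preprocessing step and a per-motif query. Comparing pairwise distances is a simplification chosen here; it cannot distinguish a motif from its mirror image, which a superposition-based check would.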

Similar Items