
A hybrid clustering approach to recognition of protein families in 114 microbial genomes

BACKGROUND: Grouping proteins into sequence-based clusters is a fundamental step in many bioinformatic analyses (e.g., homology-based prediction of structure or function). Standard clustering methods such as single-linkage clustering capture a history of cluster topologies as a function of threshold...


Bibliographic Details
Main Authors: Harlow, Timothy J; Gogarten, J Peter; Ragan, Mark A
Format: Text
Language: English
Published: BioMed Central 2004
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC420232/
https://www.ncbi.nlm.nih.gov/pubmed/15115543
http://dx.doi.org/10.1186/1471-2105-5-45
_version_ 1782121467235794944
author Harlow, Timothy J
Gogarten, J Peter
Ragan, Mark A
collection PubMed
description BACKGROUND: Grouping proteins into sequence-based clusters is a fundamental step in many bioinformatic analyses (e.g., homology-based prediction of structure or function). Standard clustering methods such as single-linkage clustering capture a history of cluster topologies as a function of threshold, but in practice their usefulness is limited because unrelated sequences join clusters before biologically meaningful families are fully constituted, e.g., as the result of matches to so-called promiscuous domains. Use of the Markov Cluster algorithm avoids this non-specificity but does not preserve topological or threshold information about protein families.
RESULTS: We describe a hybrid approach to sequence-based clustering of proteins that combines the advantages of standard and Markov clustering. We have implemented this hybrid approach over a relational database environment, and describe its application to clustering a large subset of PDB and to 328577 proteins from 114 fully sequenced microbial genomes. To demonstrate utility with difficult problems, we show that hybrid clustering allows us to constitute the paralogous family of ATP synthase F1 rotary motor subunits into a single, biologically interpretable hierarchical grouping that was not accessible using either single-linkage or Markov clustering alone. We describe validation of this method by hybrid clustering of PDB and mapping SCOP families and domains onto the resulting clusters.
CONCLUSION: Hybrid (Markov followed by single-linkage) clustering combines the advantages of the Markov Cluster algorithm (avoidance of non-specific clusters resulting from matches to promiscuous domains) and single-linkage clustering (preservation of topological information as a function of threshold). Within the individual Markov clusters, single-linkage clustering is a more precise instrument, discerning sub-clusters of biological relevance. Our hybrid approach thus provides a computationally efficient means of automated recognition of protein families for phylogenomic analysis.
format Text
id pubmed-420232
institution National Center for Biotechnology Information
language English
publishDate 2004
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-420232 2004-06-06 A hybrid clustering approach to recognition of protein families in 114 microbial genomes Harlow, Timothy J Gogarten, J Peter Ragan, Mark A BMC Bioinformatics Methodology Article BioMed Central 2004-04-29 /pmc/articles/PMC420232/ /pubmed/15115543 http://dx.doi.org/10.1186/1471-2105-5-45 Text en Copyright © 2004 Harlow et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
title A hybrid clustering approach to recognition of protein families in 114 microbial genomes
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC420232/
https://www.ncbi.nlm.nih.gov/pubmed/15115543
http://dx.doi.org/10.1186/1471-2105-5-45