UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap...

Bibliographic Details
Main Authors:	Suzek, Baris E., Wang, Yuqi, Huang, Hongzhan, McGarvey, Peter B., Wu, Cathy H.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2015
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/ https://www.ncbi.nlm.nih.gov/pubmed/25398609 http://dx.doi.org/10.1093/bioinformatics/btu739

collection	PubMed
description	Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters. Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation. Availability and implementation: Web access and file download from UniProt website at http://www.uniprot.org/uniref and ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. BLAST searches against UniRef are available at http://www.uniprot.org/blast/ Contact: huang@dbi.udel.edu
id	pubmed-4375400
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-4375400 2015-04-15 UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches Suzek, Baris E. Wang, Yuqi Huang, Hongzhan McGarvey, Peter B. Wu, Cathy H. Bioinformatics Original Papers Oxford University Press 2015-03-15 2014-11-13 /pmc/articles/PMC4375400/ /pubmed/25398609 http://dx.doi.org/10.1093/bioinformatics/btu739 Text en © The Author 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
