FastBLAST: Homology Relationships for Millions of Proteins

BACKGROUND: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-vers...


Bibliographic Details
Main Authors:	Price, Morgan N.; Dehal, Paramvir S.; Arkin, Adam P.
Format:	Text
Language:	English
Published:	Public Library of Science 2008
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2571987/ https://www.ncbi.nlm.nih.gov/pubmed/18974889 http://dx.doi.org/10.1371/journal.pone.0003589

_version_	1782160227452321792
author	Price, Morgan N. Dehal, Paramvir S. Arkin, Adam P.
author_facet	Price, Morgan N. Dehal, Paramvir S. Arkin, Adam P.
author_sort	Price, Morgan N.
collection	PubMed
description	BACKGROUND: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding. METHODOLOGY/PRINCIPAL FINDINGS: We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database (“NR”), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query. CONCLUSIONS/SIGNIFICANCE: FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast.
format	Text
id	pubmed-2571987
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-2571987 2008-10-31 FastBLAST: Homology Relationships for Millions of Proteins Price, Morgan N. Dehal, Paramvir S. Arkin, Adam P. PLoS One Research Article Public Library of Science 2008-10-31 /pmc/articles/PMC2571987/ /pubmed/18974889 http://dx.doi.org/10.1371/journal.pone.0003589 Text en Price et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Price, Morgan N. Dehal, Paramvir S. Arkin, Adam P. FastBLAST: Homology Relationships for Millions of Proteins
title	FastBLAST: Homology Relationships for Millions of Proteins
title_full	FastBLAST: Homology Relationships for Millions of Proteins
title_fullStr	FastBLAST: Homology Relationships for Millions of Proteins
title_full_unstemmed	FastBLAST: Homology Relationships for Millions of Proteins
title_short	FastBLAST: Homology Relationships for Millions of Proteins
title_sort	fastblast: homology relationships for millions of proteins
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2571987/ https://www.ncbi.nlm.nih.gov/pubmed/18974889 http://dx.doi.org/10.1371/journal.pone.0003589
work_keys_str_mv	AT pricemorgann fastblasthomologyrelationshipsformillionsofproteins AT dehalparamvirs fastblasthomologyrelationshipsformillionsofproteins AT arkinadamp fastblasthomologyrelationshipsformillionsofproteins
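
The abstract describes a second stage that finds a query's homologs from family alignments before generating pairwise alignments. The Python sketch below illustrates only that general idea: restrict pairwise comparisons to proteins that share a family with the query instead of comparing against the whole database. It is not FastBLAST's implementation. The function names, the toy sequences, the family labels, the k-mer Jaccard score (a stand-in for a real alignment score), and the 0.2 threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy stand-in for an alignment score: Jaccard index of k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def build_family_index(memberships):
    """Map each family to its members, given (protein_id, family_id) pairs."""
    index = defaultdict(set)
    for protein_id, family_id in memberships:
        index[family_id].add(protein_id)
    return index


def candidate_homologs(query_id, memberships, index):
    """Collect every protein that shares at least one family with the query."""
    families = {f for p, f in memberships if p == query_id}
    candidates = set()
    for family_id in families:
        candidates |= index[family_id]
    candidates.discard(query_id)
    return candidates


def find_homologs(query_id, sequences, memberships, min_score=0.2):
    """Score only the family-derived candidates, best hit first."""
    index = build_family_index(memberships)
    hits = []
    for other in candidate_homologs(query_id, memberships, index):
        score = similarity(sequences[query_id], sequences[other])
        if score >= min_score:
            hits.append((other, score))
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Hypothetical toy data: four short sequences and their family assignments.
sequences = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",
    "p3": "GSHMLEDPVV",
    "p4": "MKTAYLAKQR",
}
memberships = [("p1", "famA"), ("p2", "famA"), ("p3", "famB"), ("p4", "famA")]

hits = find_homologs("p1", sequences, memberships)
print([protein_id for protein_id, _ in hits])
```

In this toy example the query p1 is scored against p2 and p4 only, because p3 belongs to a different family and is never compared. The saving grows with database size, since the number of comparisons depends on family sizes rather than on the total number of proteins.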
