
Compressive genomics for protein databases

Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acc...


Bibliographic Details
Main authors:	Daniels, Noah M.; Gallant, Andrew; Peng, Jian; Cowen, Lenore J.; Baym, Michael; Berger, Bonnie
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2013
Subjects:	ISMB/ECCB 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851851/ https://www.ncbi.nlm.nih.gov/pubmed/23812995 http://dx.doi.org/10.1093/bioinformatics/btt214

author	Daniels, Noah M.; Gallant, Andrew; Peng, Jian; Cowen, Lenore J.; Baym, Michael; Berger, Bonnie
author_sort	Daniels, Noah M.
collection	PubMed
description	Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools. Results: We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than and comparably accurate with all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP’s runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search. Availability: CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/ Contact: bab@mit.edu
format	Online Article Text
id	pubmed-3851851
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3851851 2013-12-05 Bioinformatics Oxford University Press 2013-07-01 2013-06-19 /pmc/articles/PMC3851851/ /pubmed/23812995 http://dx.doi.org/10.1093/bioinformatics/btt214 Text en © The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Compressive genomics for protein databases
topic	ISMB/ECCB 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851851/ https://www.ncbi.nlm.nih.gov/pubmed/23812995 http://dx.doi.org/10.1093/bioinformatics/btt214
