Fast batch searching for protein homology based on compression and clustering

BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. It is inefficient since i...

Bibliographic Details
Main Authors:	Ge, Hongwei; Sun, Liang; Yu, Jinghong
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697088/ https://www.ncbi.nlm.nih.gov/pubmed/29162030 http://dx.doi.org/10.1186/s12859-017-1938-8

author	Ge, Hongwei; Sun, Liang; Yu, Jinghong
collection	PubMed
description	BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. RESULTS: We propose a compression- and cluster-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn through redundancy analysis, redundancy removal, and distinction recording. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Next, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the aim of mitigating the effect of the growing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed, and memory usage. CONCLUSIONS: C2-BLASTP achieves competitive results compared with some state-of-the-art methods.
format	Online Article Text
id	pubmed-5697088
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5697088 2017-12-01 Fast batch searching for protein homology based on compression and clustering Ge, Hongwei Sun, Liang Yu, Jinghong BMC Bioinformatics Research Article
BioMed Central 2017-11-21 /pmc/articles/PMC5697088/ /pubmed/29162030 http://dx.doi.org/10.1186/s12859-017-1938-8 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Fast batch searching for protein homology based on compression and clustering
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697088/ https://www.ncbi.nlm.nih.gov/pubmed/29162030 http://dx.doi.org/10.1186/s12859-017-1938-8
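
The abstract in this record mentions two steps that lend themselves to a short illustration: mapping sequences onto a reduced amino acid alphabet and grouping subsequences by Hamming distance. The Python sketch below is only a minimal illustration under stated assumptions and is not the C2-BLASTP implementation. The grouping in REDUCED_GROUPS is a hypothetical example rather than one of the ten alphabets used in the article, and the greedy first-fit clustering in cluster_kmers, along with every function name, is an assumption made for the example.

```python
# Hypothetical reduced alphabet for illustration only; NOT one of the ten
# groupings evaluated in the article. Each group is represented by its
# first letter.
REDUCED_GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
REDUCE = {aa: group[0] for group in REDUCED_GROUPS for aa in group}


def reduce_sequence(seq):
    """Map each residue to its group representative; unknown residues become 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def cluster_kmers(kmers, max_dist):
    """Greedy first-fit clustering (an assumed strategy, not the paper's).

    Each subsequence is reduced, then joins the first cluster whose reduced
    representative lies within max_dist; otherwise it starts a new cluster.
    Returns a list of (reduced_representative, members) pairs.
    """
    clusters = []
    for kmer in kmers:
        reduced = reduce_sequence(kmer)
        for representative, members in clusters:
            if hamming(representative, reduced) <= max_dist:
                members.append(kmer)
                break
        else:
            # No existing cluster was close enough.
            clusters.append((reduced, [kmer]))
    return clusters


if __name__ == "__main__":
    subsequences = ["AVLST", "LIMTS", "DEKRH", "EDRKH", "GPGPG"]
    for representative, members in cluster_kmers(subsequences, max_dist=1):
        print(representative, members)
```

Reducing the alphabet before measuring distance lets subsequences that differ only by substitutions within a group fall into the same cluster, which is the general motivation for reduced alphabets in sensitivity-oriented search.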