Fast batch searching for protein homology based on compression and clustering

BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. It is inefficient since i...

Bibliographic Details
Main authors: Ge, Hongwei; Sun, Liang; Yu, Jinghong
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697088/
https://www.ncbi.nlm.nih.gov/pubmed/29162030
http://dx.doi.org/10.1186/s12859-017-1938-8
_version_ 1783280541451157504
author Ge, Hongwei
Sun, Liang
Yu, Jinghong
author_facet Ge, Hongwei
Sun, Liang
Yu, Jinghong
author_sort Ge, Hongwei
collection PubMed
description BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. RESULTS: We propose a compression- and cluster-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn by procedures of redundancy analysis, redundancy removal, and distinction record. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Following this, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed, and memory usage. CONCLUSIONS: C2-BLASTP achieves results competitive with some state-of-the-art methods.
format Online
Article
Text
id pubmed-5697088
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-56970882017-12-01 Fast batch searching for protein homology based on compression and clustering Ge, Hongwei Sun, Liang Yu, Jinghong BMC Bioinformatics Research Article BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. RESULTS: We propose a compression- and cluster-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn by procedures of redundancy analysis, redundancy removal, and distinction record. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Following this, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed, and memory usage. CONCLUSIONS: C2-BLASTP achieves results competitive with some state-of-the-art methods.
BioMed Central 2017-11-21 /pmc/articles/PMC5697088/ /pubmed/29162030 http://dx.doi.org/10.1186/s12859-017-1938-8 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Ge, Hongwei
Sun, Liang
Yu, Jinghong
Fast batch searching for protein homology based on compression and clustering
title Fast batch searching for protein homology based on compression and clustering
title_full Fast batch searching for protein homology based on compression and clustering
title_fullStr Fast batch searching for protein homology based on compression and clustering
title_full_unstemmed Fast batch searching for protein homology based on compression and clustering
title_short Fast batch searching for protein homology based on compression and clustering
title_sort fast batch searching for protein homology based on compression and clustering
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697088/
https://www.ncbi.nlm.nih.gov/pubmed/29162030
http://dx.doi.org/10.1186/s12859-017-1938-8
work_keys_str_mv AT gehongwei fastbatchsearchingforproteinhomologybasedoncompressionandclustering
AT sunliang fastbatchsearchingforproteinhomologybasedoncompressionandclustering
AT yujinghong fastbatchsearchingforproteinhomologybasedoncompressionandclustering