Fast batch searching for protein homology based on compression and clustering
BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. It is inefficient since i...
Main authors: | Ge, Hongwei, Sun, Liang, Yu, Jinghong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697088/ https://www.ncbi.nlm.nih.gov/pubmed/29162030 http://dx.doi.org/10.1186/s12859-017-1938-8 |
_version_ | 1783280541451157504 |
---|---|
author | Ge, Hongwei Sun, Liang Yu, Jinghong |
author_facet | Ge, Hongwei Sun, Liang Yu, Jinghong |
author_sort | Ge, Hongwei |
collection | PubMed |
description | BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. RESULTS: We propose a compression- and cluster-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn by procedures of redundancy analysis, redundancy removal and distinction record. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Following this, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed and memory usage. CONCLUSIONS: C2-BLASTP achieves competitive results compared with some state-of-the-art methods. |
format | Online Article Text |
id | pubmed-5697088 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-56970882017-12-01 Fast batch searching for protein homology based on compression and clustering Ge, Hongwei Sun, Liang Yu, Jinghong BMC Bioinformatics Research Article BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. RESULTS: We propose a compression- and cluster-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn by procedures of redundancy analysis, redundancy removal and distinction record. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Following this, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed and memory usage. CONCLUSIONS: C2-BLASTP achieves competitive results compared with some state-of-the-art methods. 
BioMed Central 2017-11-21 /pmc/articles/PMC5697088/ /pubmed/29162030 http://dx.doi.org/10.1186/s12859-017-1938-8 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Ge, Hongwei Sun, Liang Yu, Jinghong Fast batch searching for protein homology based on compression and clustering |
title | Fast batch searching for protein homology based on compression and clustering |
title_full | Fast batch searching for protein homology based on compression and clustering |
title_fullStr | Fast batch searching for protein homology based on compression and clustering |
title_full_unstemmed | Fast batch searching for protein homology based on compression and clustering |
title_short | Fast batch searching for protein homology based on compression and clustering |
title_sort | fast batch searching for protein homology based on compression and clustering |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697088/ https://www.ncbi.nlm.nih.gov/pubmed/29162030 http://dx.doi.org/10.1186/s12859-017-1938-8 |
work_keys_str_mv | AT gehongwei fastbatchsearchingforproteinhomologybasedoncompressionandclustering AT sunliang fastbatchsearchingforproteinhomologybasedoncompressionandclustering AT yujinghong fastbatchsearchingforproteinhomologybasedoncompressionandclustering |
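
Note on the description field: the abstract mentions clustering database subsequences by Hamming distance after mapping them to reduced amino acid alphabets. The following is only a minimal sketch of that general idea, not the C2-BLASTP implementation. The residue grouping, the function names, the greedy first-fit clustering strategy and the distance threshold are all assumptions made for illustration; the record does not specify the ten alphabets or the clustering procedure the authors use.

```python
from typing import Dict, List

# Hypothetical reduced alphabet (rough physicochemical classes).
# These are NOT the groupings used in the paper.
ILLUSTRATIVE_GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def build_reduction_map(groups: List[str]) -> Dict[str, str]:
    """Map every residue to the first letter of its group."""
    return {residue: group[0] for group in groups for residue in group}


def reduce_sequence(seq: str, mapping: Dict[str, str]) -> str:
    """Rewrite a sequence in the reduced alphabet; unknown residues are kept."""
    return "".join(mapping.get(residue, residue) for residue in seq)


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(subseqs: List[str], max_dist: int) -> List[List[str]]:
    """Assumed strategy: put each subsequence into the first cluster whose
    representative (its first member) is within max_dist, else open a new one."""
    clusters: List[List[str]] = []
    for s in subseqs:
        for cluster in clusters:
            if hamming(cluster[0], s) <= max_dist:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


if __name__ == "__main__":
    mapping = build_reduction_map(ILLUSTRATIVE_GROUPS)
    words = ["AVKD", "LIRE", "FWSG"]  # made-up 4-residue subsequences
    reduced = [reduce_sequence(w, mapping) for w in words]
    print(greedy_cluster(reduced, max_dist=1))
```

In this sketch, subsequences that differ only by substitutions within one residue group collapse to the same reduced word, so they fall into one cluster even when the original strings differ at several positions.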