
Using distances between Top-n-gram and residue pairs for protein remote homology detection

BACKGROUND: Protein remote homology detection is one of the central problems in bioinformatics, which is important for both basic research and practical application. Currently, discriminative methods based on Support Vector Machines (SVMs) achieve the state-of-the-art performance. Exploring feature...

Bibliographic Details
Main Authors:	Liu, Bin, Xu, Jinghao, Zou, Quan, Xu, Ruifeng, Wang, Xiaolong, Chen, Qingcai
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015815/ https://www.ncbi.nlm.nih.gov/pubmed/24564580 http://dx.doi.org/10.1186/1471-2105-15-S2-S3

_version_	1782315405297057792
author	Liu, Bin Xu, Jinghao Zou, Quan Xu, Ruifeng Wang, Xiaolong Chen, Qingcai
author_sort	Liu, Bin
collection	PubMed
description	BACKGROUND: Protein remote homology detection is one of the central problems in bioinformatics, which is important for both basic research and practical application. Currently, discriminative methods based on Support Vector Machines (SVMs) achieve state-of-the-art performance. Exploring feature vectors incorporating the position information of amino acids or other protein building blocks is a key step to improve the performance of the SVM-based methods. RESULTS: Two new methods for protein remote homology detection were proposed, called SVM-DR and SVM-DT. SVM-DR is a sequence-based method, in which the feature vector representation for a protein is based on the distances between residue pairs. SVM-DT is a profile-based method, which considers the distances between Top-n-gram pairs. Top-n-gram can be viewed as a profile-based building block of proteins, which is calculated from the frequency profiles. These two methods are position-dependent approaches incorporating the sequence-order information of protein sequences. Various experiments were conducted on a benchmark dataset containing 54 families and 23 superfamilies. Experimental results showed that these two new methods are very promising. Compared with the position-independent methods, the performance improvement is obvious. Furthermore, the proposed methods can also provide useful insights for studying the features of protein families. CONCLUSION: The better performance of the proposed methods demonstrates that the position-dependent approaches are efficient for protein remote homology detection. Another advantage of our methods arises from the explicit feature space representation, which can be used to analyze the characteristic features of protein families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/index.jsp
format	Online Article Text
id	pubmed-4015815
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4015815 2014-05-23 Using distances between Top-n-gram and residue pairs for protein remote homology detection Liu, Bin Xu, Jinghao Zou, Quan Xu, Ruifeng Wang, Xiaolong Chen, Qingcai BMC Bioinformatics Proceedings BioMed Central 2014-01-24 /pmc/articles/PMC4015815/ /pubmed/24564580 http://dx.doi.org/10.1186/1471-2105-15-S2-S3 Text en Copyright © 2014 Liu et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Proceedings Liu, Bin Xu, Jinghao Zou, Quan Xu, Ruifeng Wang, Xiaolong Chen, Qingcai Using distances between Top-n-gram and residue pairs for protein remote homology detection
title	Using distances between Top-n-gram and residue pairs for protein remote homology detection
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015815/ https://www.ncbi.nlm.nih.gov/pubmed/24564580 http://dx.doi.org/10.1186/1471-2105-15-S2-S3
