
A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

BACKGROUND: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. A key step in improving the performance of the SVM-...


Bibliographic Details
Main Authors:	Liu, Bin; Wang, Xiaolong; Lin, Lei; Dong, Qiwen; Wang, Xuan
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613933/ https://www.ncbi.nlm.nih.gov/pubmed/19046430 http://dx.doi.org/10.1186/1471-2105-9-510

author	Liu, Bin Wang, Xiaolong Lin, Lei Dong, Qiwen Wang, Xuan
collection	PubMed
description	BACKGROUND: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. A key step in improving the performance of SVM-based methods is finding a suitable representation of protein sequences. RESULTS: In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from protein sequence frequency profiles. The frequency profiles are calculated from the multiple sequence alignments output by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each Top-n-gram. The training vectors are input to an SVM to train classifiers, which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results than related methods. CONCLUSION: The method based on Top-n-grams significantly outperforms methods based on many other building blocks, including N-grams, patterns, motifs and binary profiles. Therefore, the Top-n-gram is a good building block of protein sequences and can be widely used in many tasks in computational biology, such as sequence alignment, the prediction of domain boundaries, the design of knowledge-based potentials and the prediction of protein binding sites.
format	Text
id	pubmed-2613933
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2613933 2009-01-12 BMC Bioinformatics Research Article BioMed Central 2008-12-01 /pmc/articles/PMC2613933/ /pubmed/19046430 http://dx.doi.org/10.1186/1471-2105-9-510 Text en Copyright © 2008 Liu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613933/ https://www.ncbi.nlm.nih.gov/pubmed/19046430 http://dx.doi.org/10.1186/1471-2105-9-510
