
Word correlation matrices for protein sequence analysis and remote homology detection

BACKGROUND: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminati...


Bibliographic Details
Main authors:	Lingner, Thomas; Meinicke, Peter
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2438326/ https://www.ncbi.nlm.nih.gov/pubmed/18522726 http://dx.doi.org/10.1186/1471-2105-9-259

collection	PubMed
description	BACKGROUND: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences usually are computationally expensive. RESULTS: In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely-used benchmark setup for protein remote homology detection. CONCLUSION: Our word correlation approach provides highly competitive performance as compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.
id	pubmed-2438326
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-2438326 2008-06-25 Word correlation matrices for protein sequence analysis and remote homology detection. Lingner, Thomas; Meinicke, Peter. BMC Bioinformatics. Research Article. BioMed Central 2008-06-03. /pmc/articles/PMC2438326/ /pubmed/18522726 http://dx.doi.org/10.1186/1471-2105-9-259 Text en Copyright © 2008 Lingner and Meinicke; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
