Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'

Bibliographic Details
Main authors:	Dai, Qi; Wang, Tianming
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2571980/ https://www.ncbi.nlm.nih.gov/pubmed/18811946 http://dx.doi.org/10.1186/1471-2105-9-394

author	Dai, Qi; Wang, Tianming
collection	PubMed
description	BACKGROUND: Many proposed statistical measures can efficiently compare protein sequences to infer protein structure, function and evolutionary information. They share the idea of using the k-word frequencies of protein sequences. Until now, however, information on the sequences related to a given protein has not been used for protein sequence comparison. This paper proposes a scheme for constructing a protein 'sequence space' from the protein sequences related to the given protein, and compares the performance of statistical measures with and without the information in this 'sequence space'. It also presents two statistical measures for proteins: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure). RESULTS: We tested statistical measures with and without protein 'sequence space' on three data sets. This not only offers a systematic and quantitative experimental assessment of these statistical measures, but also complements the available comparisons of statistical measures based on protein sequence alone. Moreover, we compared our statistical measures with alignment-based measures and existing statistical measures. The experiments were grouped into two sets. The first, performed via ROC (Receiver Operating Characteristic) analysis, assesses the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second assesses how well our measure performs in phylogenetic analysis. From these experiments, several conclusions can be drawn, and from them novel guidelines for the use of protein 'sequence space' and statistical measures were obtained. CONCLUSION: Alignment-based measures have a clear advantage when the data are highly redundant. The most efficient statistical measure is the novel gsm.k introduced in this article, followed by cos.k. When the data become less redundant, the proposed gre.k achieves better performance, but all the other measures perform poorly on classification tasks. Almost all the statistical measures improve by exploiting the information in 'sequence space' as word length increases, especially for less redundant data. The reasonable results of the phylogenetic analysis confirm that Gdis.k based on 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploiting the information in 'sequence space' is a promising way to improve the ability of statistical measures to compare proteins.
format	Text
id	pubmed-2571980
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2571980 2008-10-23 Comparison study on k-word statistical measures for protein: From sequence to 'sequence space' Dai, Qi Wang, Tianming BMC Bioinformatics Methodology Article BioMed Central 2008-09-23 /pmc/articles/PMC2571980/ /pubmed/18811946 http://dx.doi.org/10.1186/1471-2105-9-394 Text en Copyright © 2008 Dai and Wang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2571980/ https://www.ncbi.nlm.nih.gov/pubmed/18811946 http://dx.doi.org/10.1186/1471-2105-9-394
